I'm trying to transfer files on my NAS, but I get this error: "The name of a file or a folder within an encrypted shared folder cannot exceed 143 English characters or 47 Asian (CJK) characters". Is there a shell command to find every file that exceeds that limit?
Is it 130 (as per subject) or 143 (as per body)? – Stéphane Chazelas Sep 07 '21 at 05:37
Please [do not cross-post](https://meta.stackexchange.com/a/64069/355310). For the record, another copy is [on Super User](https://superuser.com/q/1674690/432690). – Kamil Maciorowski Sep 07 '21 at 05:44
2 Answers
find path | grep -P '\/[^\/]{130,}[^\/]$'
Based on this source on stackoverflow.com: Find files that are too long for Synology encrypted shares
I added the $ at the end to just capture files and not folders.
You could perhaps match the CJK characters with a Unicode range. I don't think grep can do that; ugrep might.
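If the limit is really 143 bytes (as the error message in the question suggests) rather than 130, a variant of the same approach would be the following — a sketch, assuming GNU grep and that no path contains a newline:

```shell
# Match path components longer than 143 bytes; LC_ALL=C makes [^/]
# match single bytes rather than locale-defined characters.
find . -print | LC_ALL=C grep -E '/[^/]{144,}$'
```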
That assumes file paths don't contain newline characters. That also assumes you're in the C locale. Otherwise `grep`'s `[^/]` would match on characters as defined in the user's locale, not necessarily bytes. Also note that `/` should not be escaped (with `-P`, those backslashes are thankfully harmless, but they wouldn't be without). – Stéphane Chazelas Sep 07 '21 at 16:01
I think they're trying to say that file names, when UTF-8 encoded, cannot contain more than 143 bytes on the assumption that many Asian characters (whatever they meant) are encoded on 3 bytes in UTF-8 (you'll notice that 48 x 3 is 144) and most English characters are encoded on one byte¹.
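You can verify the byte counts yourself — a quick illustration, assuming a UTF-8 locale:

```shell
# Character count vs byte count under UTF-8
printf %s 'abc' | wc -c    # 3 bytes: ASCII characters take 1 byte each
printf %s '漢字' | wc -c   # 6 bytes: each CJK character takes 3 bytes
```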
So to find those:
In zsh:
set +o multibyte -o extendedglob
print -rC1 -- **/?(#c144,)(ND)
For those that are above the limit and
set +o multibyte -o extendedglob
print -rC1 -- **/?(#c1,143)(ND)
For those that are under the limit.
To see more easily the ones that are of type directory, you can add the M (for Mark) glob qualifier (print -rC1 -- **/?(#c144,)(NDM)), which will append a / to directories.
Or with find:
LC_ALL=C find . -name '????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????*'
For those above the limit.
That's 144 ?s above; you can also construct the pattern with:
pattern=$(printf %145s '*' | tr ' ' '?')
LC_ALL=C find . -name "$pattern"
Or in zsh:
pattern=${(l[145][?]):-*}
Replace -name with ! -name to get those filenames that are under the limit (are made of 143 bytes or less).
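To sanity-check the generated pattern (a quick check, not part of the method itself):

```shell
# printf %145s right-pads '*' to width 145, giving 144 spaces plus '*';
# tr then turns each space into a '?'.
pattern=$(printf %145s '*' | tr ' ' '?')
echo "${#pattern}"    # 145 characters: 144 '?'s followed by one '*'
```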
With the GNU implementation of find, you can also do:
LC_ALL=C find . -regextype posix-extended -regex '.*/[^/]{144,}'
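A quick way to try this out — a demo in a throwaway directory, assuming GNU find (the `-regextype` option is a GNU extension):

```shell
tmp=$(mktemp -d)
long=$(printf '%144s' '' | tr ' ' a)   # 144-byte name, one byte over the limit
touch "$tmp/$long"
found=$(LC_ALL=C find "$tmp" -regextype posix-extended -regex '.*/[^/]{144,}')
printf '%s\n' "$found"                 # prints the path of the long-named file
rm -rf "$tmp"
```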
¹ The UTF-8 encoding encodes characters on 1 to 4 bytes (initially the algorithm was designed to encode code points up to U+7FFFFFFF on up to 6 bytes, but Unicode code points have since been restricted to U+10FFFF); only the characters from the US-ASCII set (code points U+0000 to U+007F) are encoded on one byte. The ones encoded on 3 bytes are those from U+0800 to U+FFFF.