39

In an 80 GB directory containing approximately 700,000 files, some file names contain non-English characters. Other than laboriously trawling through the file list, is there:

  • An easy way to list or otherwise identify these file names?
  • A way to generate printable non-English language characters - those characters that are not listed in the printable range of man ascii (so I can test that these files are being identified)?
Jeff Schaller
suspectus

4 Answers

52

Assuming that "foreign" means "not an ASCII character", you can use find with a pattern to list every file whose name contains at least one character outside the printable ASCII range:

LC_ALL=C find . -name '*[! -~]*'

(The space is the first printable character listed on http://www.asciitable.com/, ~ is the last.)

Setting LC_ALL=C is required (strictly speaking, it is LC_CTYPE=C and LC_COLLATE=C that matter); otherwise the character range is interpreted according to the current locale and matches the wrong characters. See also the manual page glob(7). Because LC_ALL=C makes find treat strings as ASCII, it prints the bytes of multi-byte characters (such as π) as question marks when writing to a terminal. To avoid this, pipe the output to another program (e.g. cat) or redirect it to a file.

Instead of specifying character ranges, [:print:] can also be used to select "printable characters". Be sure to set the C locale or you get quite (seemingly) arbitrary behavior.

Example:

$ touch $(printf '\u03c0') "$(printf 'x\ty')"
$ ls -F
dir/  foo  foo.c  xrestop-0.4/  xrestop-0.4.tar.gz  π
$ find -name '*[! -~]*'       # this is broken (LC_COLLATE=en_US.UTF-8)
./x?y
./dir
./π
... (a lot more)
./foo.c
$ LC_ALL=C find . -name '*[! -~]*'
./x?y
./??
$ LC_ALL=C find . -name '*[! -~]*' | cat
./x y
./π
$ LC_ALL=C find . -name '*[![:print:]]*' | cat
./x y
./π
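
To check that the pattern flags the right names, the test can be reproduced in a scratch directory. This is only a minimal sketch: the directory comes from mktemp, the file names are arbitrary examples, and π is written as its UTF-8 bytes in octal (\317\200) because not every printf understands \u escapes.

```shell
#!/bin/sh
# Sketch: build a scratch directory with known names, then run the find test.
# All file names below are illustrative only.
tmpdir=$(mktemp -d) || exit 1
cd "$tmpdir" || exit 1

touch plain.txt 'with space.txt'   # printable ASCII only: should not be listed
touch "$(printf '\317\200')"       # π as UTF-8 bytes 0xCF 0x80: should be listed
touch "$(printf 'x\ty')"           # contains a TAB (control character): should be listed

# Piping the output keeps the raw bytes instead of '?' placeholders.
matches=$(LC_ALL=C find . -name '*[! -~]*' | LC_ALL=C sort)
printf '%s\n' "$matches"

# Clean up the scratch directory.
cd / && rm -rf "$tmpdir"
```

Only the last two names should appear in the output; if plain.txt shows up as well, the locale override is probably not taking effect.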
Stéphane Chazelas
Lekensteyn
  • 1
    Be aware that you may have file names that use foreign character sets incompatible with UTF-8 or ASCII. In those cases, you may see question marks instead of characters. – Lekensteyn Jan 17 '14 at 11:23
  • 1
    +1, but I would use `LC_ALL=C` instead of `LC_COLLATE=C`, as it doesn't make much sense to set LC_COLLATE to C without also setting `LC_CTYPE`, and `LC_ALL=C` still works even when the LC_ALL variable is already set in the environment. – Stéphane Chazelas Jan 17 '14 at 11:47
  • If `SPC` is _printable_, then what about `TAB` and `LF` which are also typically found in text files? – Stéphane Chazelas Jan 17 '14 at 11:56
  • 1
    Thanks - this found six files, which had long hyphen, short hyphen and a variant of single quote. These had all originated from MS Word. No difference in the files listed between LC_ALL and LC_COLLATE. LC_COLLATE displayed the non-ASCII chars correctly whereas LC_ALL displayed ??? instead. Excellent answer! – suspectus Jan 17 '14 at 12:35
  • 1
    @suspectus I updated my answer based on suggestions from Stephane. For `LC_COLLATE` and `LC_CTYPE`, see also the `find(1)` manpage. – Lekensteyn Jan 17 '14 at 12:44
5

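If you pass each file name through tr -d '\200-\377' (run in the C locale so that tr works on bytes) and compare the result with the original name, any file name containing non-ASCII characters will differ from its stripped version. The range is written without surrounding brackets because many tr implementations, GNU tr among them, treat brackets in the set as literal characters and would delete them too.

(This assumes that by "foreign" you mean non-ASCII.)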
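
The comparison described above could be sketched as a small shell function. This is an illustration rather than the answerer's own code: the function name has_non_ascii is hypothetical, and the range is given without brackets and run under LC_ALL=C so that tr deletes only the bytes 0x80 to 0xFF.

```shell
#!/bin/sh
# has_non_ascii NAME: succeed (exit status 0) if NAME contains any byte in
# the range 0x80-0xFF, i.e. any non-ASCII byte. Hypothetical helper name.
has_non_ascii() {
    # The appended 'x' keeps trailing newlines from being stripped by $(...).
    stripped=$(printf '%sx' "$1" | LC_ALL=C tr -d '\200-\377')
    [ "$stripped" != "${1}x" ]
}

# Usage: print the entries of the current directory that contain non-ASCII bytes.
for name in ./*; do
    if has_non_ascii "$name"; then
        printf '%s\n' "$name"
    fi
done
```

Unlike the find pattern in the accepted answer, this check does not flag ASCII control characters such as TAB, since only bytes above 0x7F are removed.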

Timo
3

You can use tr to delete any foreign character from a filename and compare the result with the original filename to see if it contained foreign characters.

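find . -type f > filenames
while IFS= read -r filename; do
      # Delete every byte that is not alphanumeric, whitespace or punctuation (C locale = ASCII only)
      stripped="$(printf '%s\n' "$filename" | LC_ALL=C tr -d -C '[:alnum:][:space:][:punct:]')"
      # A name that changed contained at least one such byte, so print it
      test "$filename" = "$stripped" || printf '%s\n' "$filename"
done < filenames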
Ernest A
  • 5
    That is a nice extension to my answer, but it is too simple: file names can contain newlines, and then your script will not work. – Timo Jan 17 '14 at 11:16
  • 1
    If you want to post-process `find` output, use NUL-terminated output/input as shown in [this answer](http://superuser.com/a/702461/47108). – Lekensteyn Jan 17 '14 at 11:20
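
One way to keep the tr comparison without parsing find output at all is to let find pass the names straight to a small inline script via -exec ... {} +. This is a minimal sketch under stated assumptions, not the approach from the linked answer: it relies only on POSIX find and sh rather than NUL-terminated pipes, and the wrapper name list_non_ascii is hypothetical.

```shell
#!/bin/sh
# list_non_ascii [DIR]: print regular files under DIR (default ".") whose
# base name contains bytes outside the ASCII alnum/space/punct classes.
# Names arrive as arguments, so newlines inside them are harmless.
list_non_ascii() {
    find "${1:-.}" -type f -exec sh -c '
        for name do
            base=${name##*/}    # test the file name only, not the leading path
            # The appended x protects trailing newlines; x is alnum, so tr keeps it.
            stripped=$(printf "%sx" "$base" | LC_ALL=C tr -cd "[:alnum:][:space:][:punct:]")
            [ "$stripped" = "${base}x" ] || printf "%s\n" "$name"
        done
    ' sh {} +
}

# Usage (path is a placeholder): list_non_ascii /path/to/dir
```

The output itself is still newline-separated, so it is meant for reading rather than for feeding into another program. As in the loop above, whitespace such as TAB counts as acceptable here because [:space:] is in the kept set.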
1

The accepted answer is helpful, but if your filenames are already in the encoding specified in LANG/LC_CTYPE, it's better to just do:

LC_COLLATE=C find . -name '*[! -~]*'

Character classes are affected by LC_CTYPE, but the above command uses only a range, not character classes, so leaving LC_CTYPE at its normal value simply prevents the unusual characters from being replaced by question marks.

SamB