Some code I am working with has a bunch of comments written in Japanese and I am working on translating them to English. Is there some way to "grep" for all lines containing Japanese characters or at least any non-ascii characters?
-
Does it have to be grep? Perl has more extensive Unicode support, I think, e.g. `print if /\P{ASCII}/` or possibly `print if /\p{Hiragana}/`, `print if /\p{Katakana}/` etc. See for example [How Can I Run a Regex that Tests Text for Characters in a Particular Alphabet or Script?](http://stackoverflow.com/a/8334213) – steeldriver Apr 01 '15 at 00:55
-
@steeldriver: Perl is OK. But how do I run that search for every file in a directory, recursively? And is it going to print file names and line numbers like grep does? (You can put that as an answer, btw) – hugomg Apr 01 '15 at 02:22
-
OK, my perl-fu is not strong, but I will try to put together an answer: in the meantime, I found this near-duplicate that you may find helpful [grep: Find all lines that contain Japanese kanjis](http://unix.stackexchange.com/q/65715/65304) – steeldriver Apr 01 '15 at 02:38
-
If the characters you're looking for form invalid byte sequences in your current encoding, then you can probably find them with `grep -xv '.*' *`, because `.*` will only match a line head to tail if the line consists entirely of valid characters. – mikeserv Apr 01 '15 at 06:19
4 Answers
Grepping for non-ASCII characters is easy: set a locale where only ASCII characters are valid, search for invalid characters.
LC_CTYPE=C grep '[^[:print:]]' myfile
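Since the question also asks for a recursive search with file names and line numbers, here is a minimal sketch of that (assuming GNU grep's `-r`; the file names are hypothetical, and `[:space:]` is added so tabs and newlines don't trigger false positives):

```shell
# Build a tiny tree with one ASCII file and one containing Japanese
# (hypothetical sample data).
mkdir -p src
printf 'plain ascii line\n' > src/a.c
printf '/* コメント: initialize buffer */\n' > src/b.c

# In the C locale only bytes 0x20-0x7e are [:print:], so every byte of a
# multibyte character matches [^[:print:]]; [:space:] keeps tabs and
# newlines out of the matches. -r recurses, -n prints line numbers.
LC_ALL=C grep -rn '[^[:print:][:space:]]' src
```

With `-r` and multiple files, grep prefixes each match with `file:line:`, which is the output format the question asks for.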
If you want to search for Japanese characters, it's a bit more complicated. With grep, you'll need to make sure that your LC_CTYPE locale setting matches the encoding of the files. You'll also need to make sure that your LC_COLLATE setting is set to Japanese if you want to use a character range expression. For example, on Linux (I determined the first and last character that's considered Japanese by looking at the LC_COLLATE section of /usr/share/i18n/locales/ja_JP):
LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=ja_JP.UTF-8 egrep '[。-龥]' myfile
or if you want to stick to ASCII in your script
LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=ja_JP.UTF-8 egrep $'[\uff61-\u9fa5]' myfile
This includes a few punctuation characters that are also used in English such as ⓒ and ×.
Perl has built-in features to classify characters. You can use the \p character class to match characters based on Unicode properties. Pass the command line switch -CSD to tell Perl that everything is in Unicode with the UTF-8 encoding.
perl -CSD -ne 'print if /\p{Hiragana}|\p{Katakana}/' myfile
If your files aren't encoded in UTF-8, you'll have to call binmode explicitly to tell Perl about their encoding. That's too advanced a perllocale usage for me. Alternatively you can first recode the line into UTF-8.
Alternatively, in Perl, you can use numerical character ranges. For example, to search for characters in the Hiragana and Katakana Unicode blocks:
perl -CSD -ne 'print if /[\x{3040}-\x{30ff}]/' myfile
-
The `grep [^[:print:]]` version is also printing tab characters. Is there a way to avoid that? BTW you were right about the file encodings, turns out it was actually EUCJP – hugomg Apr 02 '15 at 01:00
-
@hugomg Add a tab inside the outer brackets: `grep '[^[:print:]TAB]' myfile` or `grep '[^TAB[:print:]]' myfile` or `grep $'[^[:print:]\t]' myfile` or `grep $'[^\t[:print:]]' myfile` (with an actual tab character instead of TAB). – Gilles 'SO- stop being evil' Apr 02 '15 at 01:11
-
In [this answer](http://unix.stackexchange.com/a/193686/32477) @janis suggests using `grep '[^[:print:][:space:]]'` to handle tab and space characters. – Christian Long Aug 30 '16 at 23:29
Try this:
grep '[^[:print:][:space:]]'
(Depending on your locale setting, you may have to prefix the command with LANG=C.)
-
This ends up with lots of false positives because it's also printing lines with tabs ("\t") on them. – hugomg Apr 01 '15 at 00:33
-
If you don't mind using perl, it has more extensive Unicode support in the form of classes such as \p{Katakana} and \p{Hiragana}, which I don't think are currently available even in those versions of grep that provide some PCRE support. However, it does appear to require explicit UTF-8 decoding, e.g.
perl -MEncode -ne 'print if decode("UTF-8",$_) =~ /\p{Hiragana}/' somefile
To traverse directories like grep's -R, you could use the find command, something like
find -type f -exec perl -MEncode -ne 'print if decode("UTF-8",$_) =~ /\p{Hiragana}/' {} \;
or to mimic recursive grep's default filename:match labeled output format,
find -type f -exec perl -MEncode -lne 'printf "%s:%s\n",$ARGV,$_ if decode("UTF-8",$_) =~ /\p{Hiragana}/' {} \;
-
Sadly, none of these worked for me, maybe because the file is encoded in iso-8859-1 (though fiddling with LC_CTYPE and the parameter to decode didn't seem to help). I managed to find a solution to my problem in the thread you linked to though :) – hugomg Apr 01 '15 at 03:17
-
@hugomg A file encoded in ISO 8859-1 cannot contain any Japanese characters. It's probably UTF-8, EUCJP or a JIS variant. – Gilles 'SO- stop being evil' Apr 01 '15 at 23:40
My files were encoded in iso-8859-1 so anything that tried to read the input in my default locale (utf-8) would not recognize the Japanese characters. In the end I managed to solve my problem with the following command:
env LC_CTYPE=iso-8859-1 grep -nP '[\x80-\xff]' ./*
-P allows the Perl-like syntax for character ranges.
-n prints the line number next to each matching line.
\x80 to \xff are the "non ascii" characters
Changing the LC_CTYPE environment variable to iso-8859-1 makes grep read my files byte by byte and lets me detect any "extended ASCII" bytes as possible Japanese characters. If I use the default system encoding of UTF-8, grep exits with an "invalid UTF-8 byte sequence in input" error.
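If your grep was built without PCRE support (`-P` is a GNU extension), I believe the same byte range can be written with the shell's `$'…'` quoting, which other answers here already use; a sketch with hypothetical sample data:

```shell
# Hypothetical file containing stray high bytes (written here as raw bytes).
printf 'plain\n\xb0\xa1 legacy bytes\n' > legacy.txt

# $'...' expands \x80 and \xff into literal bytes, so plain grep can match
# the same range without -P; the C locale keeps the range byte-ordered.
LC_ALL=C grep -n $'[\x80-\xff]' legacy.txt
```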