42

I'm brand new to UNIX and I am using Kirk McElhearn's "The Mac OS X Command Line" to teach myself some commands.

I am attempting to use tr and grep so that I can search for text strings in a regular MS-Office Word Document.

$ tr '\r' '\n' < target-file | grep search-string

But all it returns is:

Illegal byte sequence.

robomechanoid:Position-Paper-Final-Draft robertjralph$ tr '\r' '\n' < Position-Paper-Final-Version.docx | grep DeCSS
tr: Illegal byte sequence
robomechanoid:Position-Paper-Final-Draft robertjralph$ 

I've actually run the same line on a script that I created in vi and it does the search correctly.

polym
user74886
  • I can't see why tr would complain; did you type the same as you put in the question? grep will not find what you want: docx is an ill-defined standard. No one really knows what is in those files; people have reverse engineered it, and apparently the standard was of no help. – ctrl-alt-delor Jul 08 '14 at 22:36

2 Answers

46

grep is a text processing tool: it expects its input to be a text file. It seems that the same goes for tr on macOS (even though tr is supposed to support binary files).

Computers store data as sequences of bytes. A text is a sequence of characters. There are several ways to encode characters as bytes, called character encodings. The de facto standard character encoding in most of the world, especially on OSX, is UTF-8, which is an encoding for the Unicode character set. There are only 256 possible bytes, but over a million possible Unicode characters, so most characters are encoded as multiple bytes. UTF-8 is a variable-length encoding: depending on the character, it can take from one to four bytes to encode a character. Some sequences of bytes do not represent any character in UTF-8. Therefore, there are sequences of bytes which are not valid UTF-8 text files.

tr is complaining because it encountered such a byte sequence. It expects to see a text file encoded in UTF-8, but it sees binary data which is not valid UTF-8.
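As a rough illustration of valid versus invalid byte sequences, here is a minimal sketch. It assumes iconv is installed (it normally is on macOS and on most Linux systems); the helper name is_valid_utf8 is made up for this example and is not a standard utility. It only asks iconv to decode standard input as UTF-8 and reports whether that worked.

```shell
# is_valid_utf8 is a hypothetical helper name, not a standard utility.
# It reads bytes on stdin and succeeds only if iconv can decode them as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "é" is encoded as the two bytes 0xC3 0xA9 (octal 303 251): valid UTF-8.
printf 'caf\303\251\n' | is_valid_utf8 && echo "valid UTF-8"

# The bytes 0xFF and 0xFE (octal 377 and 376) never occur in UTF-8.
# Binary files are full of sequences like this one.
printf '\377\376\n' | is_valid_utf8 || echo "not valid UTF-8"
```

Running the same check on a .docx file should fail in the same way as the second case, since a zip archive is not UTF-8 text.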

A Microsoft Word document is not a text file: it's a word processing document. Word processing formats encode not only text but also formatting, embedded images, and so on, so the bytes in the file are not plain UTF-8 text.

You can instruct text processing tools to operate on bytes by changing the locale. Specifically, select the “C” locale, which basically means “nothing fancy”. On the command line, you can choose locale settings with environment variables.

export LC_CTYPE=C
tr '\r' '\n' < target-file | grep search-string

This will not emit any error, but it won't do anything useful either since target-file is still a binary file which is unlikely to contain most search strings that you'll specify.

Incidentally, tr '\r' '\n' is not a very useful command unless you have text files left over from Mac OS 9 or older. \r (carriage return) was the newline separator in Mac OS before Mac OS X. Since OSX, the newline separator is \n (line feed, the unix standard) and text files do not contain carriage returns. Windows uses the two-character sequence CR-LF to represent line breaks; tr -d '\r' would convert a Windows text file into a Unix/Linux/OSX text file.
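If you don't know which of the two conventions a file uses, both cases can be handled in one pass. The following is only a sketch: normalize_newlines is a hypothetical helper name, and it assumes an awk that understands \r in regular expressions (the common implementations do).

```shell
# normalize_newlines is a hypothetical helper: reads stdin, writes stdout.
# LC_ALL=C makes awk work on bytes, so odd byte sequences are passed through.
normalize_newlines() {
    LC_ALL=C awk '{
        sub(/\r$/, "")      # CR-LF (Windows): drop the CR before the line feed
        gsub(/\r/, "\n")    # remaining lone CR (Mac OS 9 and older): make it a line feed
        print
    }'
}

# Windows-style input
printf 'one\r\ntwo\r\n' | normalize_newlines

# Old Mac-style input
printf 'one\rtwo\r' | normalize_newlines
```

Both calls should print the same two lines, each ending in a plain line feed.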

So how can you search in a Word document from the command line? A .docx Word document is actually a zip archive containing several files, the main ones being in XML.

unzip -l Position-Paper-Final-Version.docx

Mac OS X includes the zipgrep utility to search inside zip files.

zipgrep DeCSS Position-Paper-Final-Version.docx

The result is not going to be very readable because XML files in the docx format mostly consist of one huge line. If you want to search inside the main body text of the document, extract the file word/document.xml from the archive. Note that in addition to the document text, this file contains XML markup which represents the structure of the document. You can massage the XML markup a bit with sed to split it into manageable lines.

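# Backslash-newline is the portable way to insert a newline with sed; \n in the replacement is not specified by POSIX.
unzip -p Position-Paper-Final-Version.docx word/document.xml |
sed -e 's/></>\
</g' |
grep DeCSS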
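If you only care about the text and not the markup, another option is to drop the tags entirely before searching. This is a sketch under a few assumptions: xml_to_text is a hypothetical helper name, paragraphs are assumed to end with </w:p> as they do in word/document.xml, no tag is assumed to span several lines, and XML entities such as &amp; are left undecoded. The sample string is made up for the demonstration.

```shell
# xml_to_text is a hypothetical helper: reads Word XML on stdin, writes rough text.
xml_to_text() {
    awk '{
        gsub(/<\/w:p>/, "\n")   # end of a Word paragraph becomes a line break
        gsub(/<[^>]*>/, "")     # remove every remaining tag
        print
    }'
}

# Made-up sample: the word "DeCSS" is split across two runs, as Word sometimes does.
sample='<w:p><w:r><w:t>Notes on De</w:t></w:r><w:r><w:t>CSS</w:t></w:r></w:p><w:p><w:r><w:t>Other paragraph</w:t></w:r></w:p>'
printf '%s\n' "$sample" | xml_to_text | grep DeCSS
```

With a real document this would be used as unzip -p Position-Paper-Final-Version.docx word/document.xml | xml_to_text | grep DeCSS. Removing the tags also rejoins words that are split across several runs, which a search on the raw XML would miss.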
Gilles 'SO- stop being evil'
  • 2
    +1 for good summary and extra bits. I have one thing to say though. To format the XML, you can use `xml_pp`; it is in the package `xml-twig-tools` on Debian GNU/Linux (I don't know about Mac). – ctrl-alt-delor Jul 09 '14 at 08:23
  • 2
    Excel for Mac 2011 saves CSV files with \r line endings so this tr invocation is in fact quite relevant and useful. – Noah Yetter Feb 19 '15 at 23:33
  • 1
    As does Outlook for Mac 2011 when you export a tab delimited contacts list. – Ivan X Oct 21 '16 at 02:02
  • 1
    Well, I do not have enough reputation to downvote this, but this answer is utterly incorrect. It starts with "`tr` [...] expect their input to be text files."; while [the POSIX specification clearly states "The standard input can be any type of file."](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html). Please correct your answer. – 7heo.tk Sep 26 '17 at 11:05
  • 1
    @7heo.tk “this answer is utterly incorrect” is a gross exaggeration, but you're right, `tr` is *supposed* to process binary input (in particular, it's supposed to process null bytes correctly). POSIX doesn't clearly specify how it's supposed to deal with input that isn't a sequence of characters, though. (If I was an implementer, I'd pass invalid byte sequences through unmodified (or remove them with `-s`), *and* raise a defect with the standard committee.) Evidently, macOS's tr complains about them. – Gilles 'SO- stop being evil' Sep 26 '17 at 16:21
  • @Gilles It's not such an exaggeration, since the first thing that people read in your answer is incorrect. In most cases, that's most of what they will remember, and that's how you end up with engineers being "100% sure" about something completely false. Your answer has more upvotes than the question; take that into account when you say it's a "gross exaggeration". Also, why treat the invalid bytes differently when using `-s`? Even if invalid, it's trivial to determine if bytes are repeating. Anyway, I would upvote your answer if you corrected it. – 7heo.tk Sep 26 '17 at 22:47
  • @7heo.tk Oops, I meant `-c`, not `-s`. I have corrected my answer. – Gilles 'SO- stop being evil' Sep 26 '17 at 22:56
  • @Gilles Thank you for correcting your answer! :) Also yes with `-c` (and `-C`) it makes total sense, I get what you meant. +1. Just an additional remark while I'm at it: on my colleague's MacBook, I noticed that with `LC_ALL=C`, the `tr` binary behaved as specified by POSIX (i.e. does not halt on bytes in range `0xF0..0xFF`), so I'm not sure if "It seems that the same goes for tr on macOS (even though tr is supposed to support binary files)." is accurate, since it accepts non-text input with the right locale ;) – 7heo.tk Sep 27 '17 at 10:56
25

I suppose that the charmap of your locale is UTF-8, so you'll have problems with binary files. Just switch to the C locale:

LC_ALL=C tr '\r' '\n' < target-file | LC_ALL=C grep search-string
vinc17
  • You can use brackets to avoid specifying the language twice: `LC_ALL=C ( tr '\r' '\n' < target-file | grep search-string )`. However, the docx is not in the C locale. It is utf16 and zipped and complex and anyone's guess. I would look at using a tool that can convert it to a different format that you can process, e.g. html or odt (odt is also zipped, but well defined and easy to interpret). – ctrl-alt-delor Jul 08 '14 at 22:41
  • 1
    The syntax with the brackets (parentheses) doesn't work with all shells (not bash, not zsh, not dash). Then, concerning the MS Word file, it depends. I have some such files where the `strings` command gives clear text. – vinc17 Jul 08 '14 at 23:03
  • Alternatively, `( export LC_ALL=C; tr '\r' '\n' < target-file | grep search-string; )` should work. – vinc17 Jul 08 '14 at 23:05
  • 1
    `strings` has super powers: it can read files that are not just utf-8 or ascii text. – ctrl-alt-delor Jul 08 '14 at 23:06
  • Sorry about the `()` thing I thought that would work, thanks to @vinc17 for a fix. – ctrl-alt-delor Jul 08 '14 at 23:08
  • Yes, but `strings` won't find clear text in zipped files. So, this means that doc files are not necessarily zipped, so that `tr` and `grep` could work with them (under some locales). But I recommend to use `strings` first. – vinc17 Jul 08 '14 at 23:10
  • I just tried `strings` on a docx, it came up with the file names in the zip, plus a few other, but not the text in the doc. – ctrl-alt-delor Jul 08 '14 at 23:18
  • OK, so, since this is docx and not the original doc, you first need to convert it to text, e.g. with [`docx2txt`](http://docx2txt.sourceforge.net/). – vinc17 Jul 08 '14 at 23:22