
A colleague generated a dependency tree (via gradle :dependencies > dependencies.txt) and emailed it to me. I wanted to know the version of one particular library, so I ran:

grep log4j dependencies.txt

but got zero matches and my shell just printed a new prompt. Since it was a long file and I trusted grep, I didn't open it and check. Then after a lot of back-and-forth discussion I was told that the file was created on a Windows machine. Even then I was surprised that grep wouldn't work - the search string isn't interrupted by newlines. But after executing:

dos2unix dependencies.txt

grep started showing the matches I wanted.

Obviously my understanding of how grep works was incorrect. Why would grep not behave the same way on file contents on different operating systems when the search term occurs without any newlines in between?

Further info

  • file dependencies.txt returns dependencies.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
  • LC_ALL=C grep log4j dependencies.txt returns nothing
  • grep o dependencies.txt returned Binary file dependencies.txt matches
  • grep --text log4j dependencies.txt returned nothing
Sridhar Sarnobat
  • I'll get this right eventually! `dependencies.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators` – Sridhar Sarnobat Mar 03 '21 at 19:04
  • "UTF-16", now there we are! – ilkkachu Mar 03 '21 at 19:05
  • `LC_ALL=C grep log4j dependencies.txt` still no matches. But I see we are getting somewhere – Sridhar Sarnobat Mar 03 '21 at 19:06
  • Yes, `iconv -f utf-16 -t utf-8 dependencies.txt | grep log4j` works. I've seen in the past that there are about 8 env variables one can set to fix some wrong behavior, but I am wondering if that will work here. Hopefully something like that can be used in a set-and-forget fashion – Sridhar Sarnobat Mar 03 '21 at 19:10
  • Probably it does. But it would take me more experience to understand why NUL bytes are in the Windows version (according to my Mac). – Sridhar Sarnobat Mar 03 '21 at 19:16
  • @SridharSarnobat, using UTF-16/UCS-2 is more common in Windows. IIRC the internal Windows APIs like to use it, and so on. – ilkkachu Mar 03 '21 at 19:20
  • Related: [Why is it not possible to search through text file contents encoded in UTF-16?](https://unix.stackexchange.com/questions/363946/why-is-it-not-possible-to-search-through-text-file-contents-encoded-in-utf-16) – ilkkachu Mar 03 '21 at 19:20
  • That link is very useful. Probably that is the actual answer to my question. But thanks to everyone – Sridhar Sarnobat Mar 03 '21 at 19:22

1 Answer


UTF-16 text consists of 16-bit pieces, so each letter is stored in at least two bytes. If it's just ASCII characters, every other byte is a zero byte (NUL byte, \0, not the character zero). Your Mac is very likely not set up to deal with that.
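To see the layout described above, here is a small sketch that encodes the ASCII string log4j as little-endian UTF-16 and dumps the bytes in hex. It assumes `iconv`, `od` and `tr` are available; the variable name `utf16_hex` is just for illustration.

```shell
# Encode "log4j" as UTF-16LE (no byte-order mark) and show each byte in hex.
# od -An drops the offset column, -tx1 prints one hex byte at a time;
# tr removes the spaces and newline so the result is one compact string.
utf16_hex=$(printf 'log4j' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')

# Every letter (6c 6f 67 34 6a) is followed by a 00 byte.
echo "$utf16_hex"
```

The five letters therefore never sit next to each other in the file, which is why a byte-oriented search for log4j finds nothing.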

In particular, the NUL bytes are taken as string terminators in C, so many tools may not be able to deal with them at all. Even if they could deal with them, they might take each NUL as a distinct character, so you'd need something like l.o.g.4.j to match that string.
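Rather than relying on how a particular grep treats NUL inside a pattern like l.o.g.4.j, a crude alternative is to delete the NUL bytes before searching. This is only a sketch and only reasonable when the content is plain ASCII stored as UTF-16; the dependency line below is made up, and a real file may also carry a byte-order mark that this does not remove.

```shell
# Build one line of UTF-16LE text with a CRLF ending (hypothetical content),
# delete every NUL byte with tr, then count the lines that contain "log4j".
# "|| true" keeps the assignment from failing when grep finds no match.
stripped_count=$(printf '+--- com.example:log4j-demo:1.0\r\n' \
    | iconv -f UTF-8 -t UTF-16LE \
    | tr -d '\000' \
    | grep -c 'log4j' || true)

echo "matching lines after stripping NULs: $stripped_count"
```

Properly decoding the file with iconv is the safer route, since deleting bytes would corrupt any non-ASCII characters.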

The funny thing is that NUL bytes aren't visible when printed, so if you were to e.g. cat the file to the terminal, it might look perfectly normal...
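One way to make the hidden bytes visible is `cat -v`, which prints control characters in caret notation, so a NUL shows up as `^@`. A minimal sketch, assuming a `cat` that supports `-v` (both the GNU and BSD versions do); the variable name `visible` is just for illustration.

```shell
# Encode "log4j" as UTF-16LE and let cat -v render the NUL bytes as ^@.
visible=$(printf 'log4j' | iconv -f UTF-8 -t UTF-16LE | cat -v)

echo "$visible"
```

Running `cat -v dependencies.txt | head` on the real file should show the same `^@` after every character.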

The NULs are also the reason grep considers the file binary.
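The workaround that ended up working in the comments on the question is to decode the file before searching. The sketch below reproduces that on a made-up one-line sample instead of the real dependencies.txt; `make_sample` and the dependency coordinates are hypothetical. It names UTF-16LE explicitly because the sample has no byte-order mark, whereas `iconv -f utf-16` on the real file relies on one being present.

```shell
# Emit a one-line stand-in for the Windows-generated file: UTF-16LE text
# with a CRLF line ending. The dependency coordinates are made up.
make_sample() {
    printf '+--- com.example:log4j-demo:1.0\r\n' | iconv -f UTF-8 -t UTF-16LE
}

# Searching the raw UTF-16 bytes: "log4j" never appears as five adjacent bytes.
raw_count=$(make_sample | LC_ALL=C grep -c 'log4j' || true)

# Decoding to UTF-8 first gives grep the byte sequence it is looking for.
decoded_count=$(make_sample | iconv -f UTF-16LE -t UTF-8 | grep -c 'log4j' || true)

echo "raw: $raw_count, decoded: $decoded_count"
```

For the original file the equivalent would be `iconv -f utf-16 -t utf-8 dependencies.txt | grep log4j`.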

See also: What makes grep consider a file to be binary?

ilkkachu
  • Thanks for the answer. FYI cat does indeed print the contents normally, but piping that output to grep doesn't help. Also, opening with vim and `/`-searching works. I was a bit embarrassed when the vim part happened – Sridhar Sarnobat Mar 03 '21 at 19:14
  • @SridharSarnobat, yes, `cat` doesn't change the contents, so it doesn't matter if you do `cat file | grep`, or just `grep file` (or `grep < file`). The NULs come to your terminal, the terminal ignores them. `less` shows that as `U^@T^@F^@-^@1^@6^@ ^@t^@e^@x^@t^@ ^@c^@o^@n^@s^@i^@s^@t^@s` etc. with the `^@` in inverse color marking the NULs. – ilkkachu Mar 03 '21 at 19:16
  • Indeed `more` shows the padding characters (though not `less`, curiously). – Sridhar Sarnobat Mar 03 '21 at 19:17