10

Is there a way to do it, particularly on a set of multiple epub/mobi files in one directory?

InquilineKea

5 Answers

8

You can grep these files directly by passing the -a option, which makes grep treat binary files as if they were text:

grep -a "author" *.epub *.mobi

The above works on all of my 1000+ EPUB and MOBI files, giving the expected results.

EPUB and MOBI are both container formats. EPUB is essentially a .zip file with some structural requirements; MOBI is a Palm Database Format file. Both formats allow compressed or uncompressed data to be stored in the container.
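As a rough illustration of the container point, the two formats can be told apart by looking at fixed byte offsets. This is only a sketch: it assumes a spec-conformant EPUB (whose first ZIP member is the uncompressed `mimetype` file, so the member name starts at byte offset 30) and a MOBI file carrying the usual `BOOKMOBI` type/creator marker at offset 60 of the Palm Database header. `container_type` is a hypothetical helper name, not an existing tool.

```shell
# container_type FILE: print "epub", "mobi", or "unknown" based on fixed offsets.
# Assumption: the file follows the usual layouts described above.
container_type() {
  # Offsets 30-57 of a conformant EPUB: member name "mimetype" + its 20-byte content.
  epub_sig=$(dd if="$1" bs=1 skip=30 count=28 2>/dev/null | tr -d '\000')
  # Offsets 60-67 of a MOBI file: PDB type "BOOK" followed by creator "MOBI".
  mobi_sig=$(dd if="$1" bs=1 skip=60 count=8 2>/dev/null | tr -d '\000')
  if [ "$epub_sig" = "mimetypeapplication/epub+zip" ]; then
    echo epub
  elif [ "$mobi_sig" = "BOOKMOBI" ]; then
    echo mobi
  else
    echo unknown
  fi
}
```

Something like `for f in *.epub *.mobi; do echo "$f: $(container_type "$f")"; done` would then show which unpacking approach each file needs.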

If the data you are looking for is in a "file" within the container, and that file is compressed, you need to search for the compressed byte sequence, not the expanded, uncompressed string. In particular, a word 'abcde' that you just read on an ebook reader will generally not be found by running grep -a 'abcde' on all EPUB and MOBI files, because the contents of the book are likely stored in compressed "files" in the container (likely, but not necessarily: compression is only an efficiency measure).

This is not a problem of grep being incapable of searching in these files, but of you not providing the correct search string. The same would happen if you read a file with Japanese text using some Japanese to English translation software and then hoped you could find the English words by grepping the original file. With -a and the correct Japanese (binary) word patterns, grep would work just fine.
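The effect of compression on a literal search can be reproduced without any ebook at all. The sketch below uses gzip as a stand-in for the deflate compression found inside an EPUB, which is an assumption made for convenience; `found_raw` and `found_unpacked` are hypothetical helper names.

```shell
# Both helpers take a search string ($1) and a gzip-compressed file ($2).

# found_raw: search the compressed bytes directly, as grep -a does on a container.
found_raw() { grep -aq -- "$1" "$2"; }

# found_unpacked: decompress to stdout first, then search the original text.
found_unpacked() { gzip -dc -- "$2" | grep -aq -- "$1"; }
```

With a sample made by `printf 'a line containing abcde\n' | gzip -c > sample.gz`, `found_raw abcde sample.gz` should fail while `found_unpacked abcde sample.gz` should succeed, which is the same mismatch between the stored bytes and the text you read.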

Anthon
  • It seems `-a` treats all files as ASCII-encoded, so it does not work with non-English epub books. – Moon soon Oct 11 '19 at 13:32
  • epub books are zipped HTML files, you cannot grep on those at all without unpacking – Anthon Oct 11 '19 at 19:26
  • One can grep on dynamically unzipped directories with find and zipgrep. – Leandro Jul 04 '20 at 20:12
  • @Leandro That is what [mosh](https://unix.stackexchange.com/a/389119/33055) answered a few years ago. And since .mobi files are modified .pdb files and not related to .zip (or .epub) zipgrep won't work on them – Anthon Jul 05 '20 at 05:14
  • This did not work for me. Try [mosh's](https://unix.stackexchange.com/a/389119/206574) answer. – Ahmad Ismail Jul 18 '21 at 15:08
6

This worked on Windows 7 with Cygwin; it searches the text inside the zip archive.

c:\> zipgrep "regex" file.epub

zipgrep is a shell script (c:/cygwin/bin/zipgrep). This also works:

c:\> unzip -p "*.epub" | grep -a --color regex

The -p option makes unzip write the extracted data to stdout, so it can be piped into grep.

The grep-epub.sh script:

#!/bin/bash
# Search the text of EPUB files for a Perl-compatible regex (needs grep -P).
PAT=${1:?"Usage: grep-epub PAT *.epub files to grep"}
shift
: "${1:?Need epub files to grep}"
for i in "$@"; do                         # "$@" keeps file names with spaces intact
  echo "$0 $i"
  unzip -p "$i" "*.htm*" "*.xml" "*.opf" | # unzip only html and content files to stdout
    perl -lpe 's![<][^>]{1,200}?[>]!!g;' | # get rid of small html tags such as <b>
    grep -Pinaso ".{0,60}$PAT.{0,60}" |    # keep some context around matches
    grep -Pi --color "$PAT"                # color the matches
done
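When the question is only which books contain a pattern, rather than where, a shorter variant may be enough. The following is a sketch, not part of the script above: `epub_files_matching` is a hypothetical function name, and it assumes `unzip` is on the PATH.

```shell
# epub_files_matching PATTERN FILE...: print each EPUB whose unpacked text matches.
epub_files_matching() {
  pat=$1
  shift
  for f in "$@"; do
    # unzip -p streams every member to stdout; grep -q stops at the first hit,
    # -a treats the stream as text, -i ignores case.
    if unzip -p "$f" 2>/dev/null | grep -aiq -- "$pat"; then
      printf '%s\n' "$f"
    fi
  done
}
```

For example, `epub_files_matching 'white whale' *.epub` would list the matching archives one per line. Unlike the script above it does not strip HTML tags, so a phrase interrupted by markup can be missed.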
mosh
1

The epub format is a compressed binary file, so you must uncompress it before trying to parse the text. The MOBI format doesn't appear to be plain text either, so, no, I would say that epub and mobi files can't be grepped directly, since they are not plain text files. Use calibre or another reader that allows in-file searches.

Braiam
  • ePub is not a compressed binary file, but a compressed directory tree; thus, it can be grepped. Same for Mobi, which is but a modification of an early version of ePub. – Leandro Jul 04 '20 at 20:08
  • @Leandro when I said binary I meant that it isn't possible to read the text without processing (i.e. you can't directly grep it). That's what I said in the very next sentence. – Braiam Jul 05 '20 at 11:47
1

To search a compressed file you can use zgrep. This should work for epub since it is a compressed file. Here is some additional information on zgrep: http://manpages.ubuntu.com/manpages/oneiric/man1/zgrep.1.html

Andrew Stern
  • `The supported compressors are bzip2, gzip, lzip and xz.` Neither MOBI nor EPUB files are in any of these formats. `zgrep -a` doesn't find anything more than plain `grep -a` would. – Anthon May 02 '14 at 06:35
  • 1
    This page seems to indicate that the epub is in zip format: http://www.mobileread.com/forums/showthread.php?t=31040 . Also gzip supports the zip format. – Andrew Stern May 02 '14 at 13:15
  • 1
    Of course EPUB is a zip file. gzip, however, only supports extracting zip files with a single member (read the gzip man page). Since the first file in an EPUB file according to the standard has to be the "mimetype" file (which has as its content the 20-byte string `application/epub+zip`), how is that going to help unless you search for any of those 3 words? – Anthon May 02 '14 at 13:32
  • epub is compressed in a zip format. It seems that gzip doesn't uncompress this file but unzip will. After decompression I found the text of the book in index_split_001.xhtml but I don't know if that is true of every epub. It should be possible to unzip the contents of the file then recompress the contents into a .gz file so that zgrep would work. I haven't found a simple one line command to do this conversion. – Andrew Stern May 02 '14 at 13:44
  • One does not need to unzip first, one can do it dynamically with find and zipgrep. – Leandro Jul 04 '20 at 20:09
0

One can combine the earlier answers with find:

find . -name "*.epub" -exec zipgrep pattern {} \;

This way one can search in a directory tree, obviating the need for all files to be on the same directory level.
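A variant of the same idea prints the path of each matching archive instead of the matching lines. It is a sketch under a few assumptions: `find_epubs_matching` is a hypothetical function name, `unzip` is on the PATH, and `unzip -p` piped into `grep -a` is swapped in for zipgrep.

```shell
# find_epubs_matching PATTERN [DIR]: list EPUB files under DIR whose text matches.
find_epubs_matching() {
  # -exec acts as a test here: -print runs only when the inner pipeline succeeds.
  # Inside sh -c, $1 is the pattern and $2 is the file that find substitutes for {}.
  find "${2:-.}" -type f -name '*.epub' \
    -exec sh -c 'unzip -p "$2" 2>/dev/null | grep -aq -- "$1"' sh "$1" {} \; \
    -print
}
```

Calling `find_epubs_matching pattern ~/books` walks the whole tree; the directory defaults to the current one when omitted.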

Leandro