I have a bunch of technical books and have been using pdfgrep for a while, but searching all of them takes a substantial amount of time.

Can somebody recommend a CLI tool for searching PDF files very fast?

It should have an underlying database for caching purposes, similar to the locate command but just for keywords in PDFs.

Thank you all! :)

JammingThebBits
  • How do you ordinarily use `pdfgrep`? Do you use the `--cache` or `--page-range` options, for example? Or do you often want to find the _first_ match? – Kusalananda Aug 15 '18 at 11:29
  • Oh! According to the pdfgrep manual on my system (pdfgrep version 1.4.1), there is no cache option. In which version was the cache feature added? – JammingThebBits Aug 15 '18 at 11:42
  • I was reading the latest manual at https://pdfgrep.org/doc.html and the option seems to have been added in release 2.0 – Kusalananda Aug 15 '18 at 14:02

1 Answer

As an alternative to pdfgrep you can use rga.

rga performs a recursive search with caching enabled by default.

I did a quick comparison with a 15 GB PDF collection stored on an SSD.

$ gtime --format "%Es" pdfgrep --recursive --cache --ignore-case conclusion
2:15:26s # initial run
3:05.30s # with cache

$ gtime --format "%Es" rga --type pdf conclusion
33:26.96s # initial run
1:18.70s  # with cache

$ gdu -sh --apparent-size ~/.cache/pdfgrep
697M    /Users/sschmidt/.cache/pdfgrep

$ gdu -sh --apparent-size ~/Library/Caches/rga
186M    /Users/sschmidt/Library/Caches/rga

So rga was about 4x faster than pdfgrep on the initial run and about 2x faster on the cached run. In addition, the rga cache was only about a quarter of the size of the pdfgrep cache. This is of course just my specific setup, so results may vary depending on your configuration.
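
If you want to re-derive the ratios from the timings above, here is a minimal sketch. It assumes the `[hours:]minutes:seconds` layout that the `%E` format of GNU time prints; the helper names `to_seconds` and `speedup` are made up for illustration and are not part of pdfgrep or rga.

```shell
# Hypothetical helper: convert a GNU time %E value
# ([hours:]minutes:seconds) to plain seconds.
to_seconds() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Hypothetical helper: print how many times faster the
# second timing is than the first, to one decimal place.
speedup() {
  awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 2:15:26 33:26.96   # initial run, pdfgrep vs rga
speedup 3:05.30 1:18.70    # cached run, pdfgrep vs rga
```

Pass the values without the trailing `s` that my format string appends. The two calls print 4.0 and 2.4, which is where the "about 4x" and "about 2x" figures come from.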