
I am looking for ways to make this search faster and/or better; my current approaches mainly use fgrep or ag. The following code searches for a word case-insensitively under $HOME and redirects the list of matching files to vim:

find -L "$HOME" -xtype f -name "*.tex" \
   -exec fgrep -l -i "and" {} + 2>/dev/null | vim -R -

It is faster with ag, presumably because of its parallelism (ack is a similar alternative):

find -L "$HOME" -xtype f -name "*.tex" \
   -exec ag -l -i "and" {} + 2>/dev/null | vim -R -

Statistics

Average timings from a small number of runs for each approach:

        fgrep   ag     terdon1  terdon2  terdon3  muru 
user    0.41s   0.32s  0.14s    0.22s    0.18s    0.12s
sys     0.46s   0.44s  0.26s    0.28s    0.30s    0.32s
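The averages above were collected by hand; a minimal timing harness along these lines (a sketch, not the exact method used for the table, and `bench.sh` is a hypothetical name) can reproduce such numbers:

```shell
#!/bin/bash
# Sketch of a timing harness: run a command several times and report
# the average wall-clock time in milliseconds. The default command is
# a placeholder; substitute one of the pipelines being compared.
runs=5
cmd=${1:-true}
total=0
for _ in $(seq "$runs"); do
  start=$(date +%s%N)              # nanoseconds (GNU date)
  eval "$cmd" > /dev/null 2>&1
  end=$(date +%s%N)
  total=$(( total + end - start ))
done
echo "average over $runs runs: $(( total / runs / 1000000 )) ms"
```

Invoked as `./bench.sh 'find -L "$HOME" -xtype f -name "*.tex" -exec fgrep -li "and" {} +'`, it prints one averaged figure per run of the script.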

The terdon1 and terdon3 cases can be equally fast; I see large fluctuations with those two. A rough ranking by sys time (not the best criterion!):

  1. terdon1
  2. terdon2
  3. terdon3
  4. muru
  5. ag
  6. fgrep

Abbreviations

  • terdon1 = terdon-many-find-grep
  • terdon2 = terdon-many-find-fgrep
  • terdon3 = terdon-many-find-ag (without -F, because ag has no such option)

Other approaches

muru's proposal in comments

grep -RFli "and" "$HOME" --include="*.tex" | vim -R -
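For readability, muru's one-liner can be spelled out with GNU grep's long options (same behaviour; option names per GNU grep):

```shell
# Same search as muru's proposal, with the short flags expanded:
# -R = --dereference-recursive, -F = --fixed-strings,
# -l = --files-with-matches, -i = --ignore-case.
grep --dereference-recursive --fixed-strings --files-with-matches \
     --ignore-case --include="*.tex" "and" "$HOME" | vim -R -
```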

OS: Debian 8.5
Hardware: Asus Zenbook UX303UA

Léo Léopold Hertz 준영

2 Answers


Since you're using ack and The Silver Searcher (ag), it seems that you are OK with using additional tools.

A new tool in this space is ripgrep (rg). It is designed to be fast in both finding files to search (like ag) and also fast in searching files themselves (like plain old GNU grep).

For the example in your question, you might use it something like this:

rg --files-with-matches --ignore-case --glob "*.tex" "and" "$HOME"

The author of ripgrep posted a detailed analysis of how the different searching tools work, along with benchmark comparisons.

One of the benchmarks, linux-literal-casei, is somewhat similar to the task you describe. It searches over a large number of files in a lot of nested directories (the Linux codebase), searching for a case-insensitive string literal.

In that benchmark, rg was fastest when using a whitelist (like your "*.tex" example). The ucg tool also did well on this benchmark.

rg (ignore)         0.345 +/- 0.073 (lines: 370)
rg (ignore) (mmap)  1.612 +/- 0.011 (lines: 370)
ag (ignore) (mmap)  1.609 +/- 0.015 (lines: 370)
pt (ignore)        17.204 +/- 0.126 (lines: 370)
sift (ignore)       0.805 +/- 0.005 (lines: 370)
git grep (ignore)   0.343 +/- 0.007 (lines: 370)
rg (whitelist)      0.222 +/- 0.021 (lines: 370)+
ucg (whitelist)     0.217 +/- 0.006 (lines: 370)* 

* - Best mean time. + - Best sample time.

The author excluded ack from the benchmarks because it was much slower than the others.

RJHunter
  • I opened a ticket about a Debian installation option for `rg` here: https://github.com/BurntSushi/ripgrep/issues/291 Let's hope there will be something soon. – Léo Léopold Hertz 준영 Dec 25 '16 at 09:18
  • You can download self contained binaries here: https://github.com/BurntSushi/ripgrep/releases --- Debian packages are being worked on. Ripgrep should be one of the first Rust programs packaged! – BurntSushi5 Dec 25 '16 at 12:35
  • @BurntSushi5 Please, let us know when the apt-get package of `rg` is ready. I want to try it as soon as possible. – Léo Léopold Hertz 준영 Dec 30 '16 at 12:54
  • You can try it now by just downloading the binary on github. Just drop it into your $HOME/bin and you're good to go! (But sure, I'll try to remember to leave a comment here when it is actually packaged.) – BurntSushi5 Dec 30 '16 at 14:03

You could probably make it a little bit faster by running multiple find calls in parallel. For example, first get all top-level directories and run N find calls, one for each directory. If you run them in a subshell, you can collect the output and pass it to vim or anything else:

shopt -s dotglob ## So the glob also finds hidden dirs
( for dir in "$HOME"/*/; do 
    find -L "$dir" -xtype f -name "*.tex" -exec grep -Fli and {} + & 
  done
) | vim -R -

Or, to be sure you only start getting output once all the finds have finished:

( for dir in "$HOME"/*/; do 
    find -L "$dir" -xtype f -name "*.tex" -exec grep -Fli and {} + & 
  done; wait
) | vim -R -

I ran a few tests and the speed for the above was indeed slightly faster than the single find. On average, over 10 runs, the single find call took 0.898 seconds and the subshell above, running one find per directory, took 0.628 seconds.

I assume the details will always depend on how many directories you have in $HOME, how many of them could contain .tex files and how many might match, so your mileage may vary.
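Another common way to get similar parallelism (a sketch, not part of the answer above) is to let xargs fan the grep calls out over several worker processes; the -P and -0 flags are GNU xargs options:

```shell
# Sketch: parallelize the grep step with xargs -P instead of one
# background find per top-level directory. -print0/-0 keep odd
# filenames safe; -P 4 runs up to four grep processes at once,
# each on batches of up to 100 files (-n 100).
find -L "$HOME" -xtype f -name "*.tex" -print0 2>/dev/null \
  | xargs -0 -r -P 4 -n 100 grep -Fli "and" 2>/dev/null \
  | vim -R -
```

Here the directory traversal itself is still single-threaded; only the searching runs in parallel, which is roughly what ag and rg do internally.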

terdon
  • I think `-l` implies `-m1`. – muru Dec 24 '16 at 13:24
  • @muru ah, yes it does. Which explains why I saw no difference when I added it :) Thanks. – terdon Dec 24 '16 at 13:35
  • @masi none, as far as I know. It's just that there will be a trade-off between running multiple finds and having the disk seek multiple locations. This sort of thing will always depend on the specific setup you have. – terdon Dec 24 '16 at 23:55
  • I tested some different algorithms in your proposal with `fgrep` and `ag`. Which of those approaches should be fastest in theory? There are great fluctuations with your approach and your approach with `ag`. I think your approach may be the fastest, but your approach with `ag` is almost as fast; I cannot confirm the result yet. – Léo Léopold Hertz 준영 Dec 25 '16 at 09:33
  • Note that ripgrep will automatically parallelize directory traversal for you. – BurntSushi5 Dec 25 '16 at 12:33