How can I find files that contain a specific pattern on a specific line number? Let's assume I have a directory with a bunch of text files each containing 3 lines, such as:

Title A
Category X
Description Y

How can I grep / filter every file that has Category X on line 2? How can I find files that have Title A on line 1?

I looked at the grep man page, ripgrep, and alternatives, but I'm not sure you can limit the search for a pattern to specific line numbers.
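For anyone who wants to experiment, a minimal reproduction could look like this (the directory and file names are made up):

```shell
# Create two sample files in the 3-line format described above
mkdir -p testdir
printf 'Title A\nCategory X\nDescription Y\n' > testdir/file1
printf 'Title B\nCategory Z\nDescription W\n' > testdir/file2
```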

Pierre-Jean

5 Answers


You can use awk like this:

awk 'FNR == 2 && /Category X/ {print FILENAME}' *
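For example, with sample files in the question's format (the names here are made up), only the file matching on line 2 is printed:

```shell
mkdir -p demo
printf 'Title A\nCategory X\nDescription Y\n' > demo/a.txt
printf 'Title B\nCategory Z\nDescription Y\n' > demo/b.txt
# FNR is the line number within the current file, so this tests only line 2
awk 'FNR == 2 && /Category X/ {print FILENAME}' demo/*
# prints: demo/a.txt
```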
Arkadiusz Drabczyk
  • Nice one! Do you think performance wise it is crazy to do so / not elegant or you believe it's ok? – Pierre-Jean Feb 13 '22 at 18:45
  • It should be ok on today's computers unless you really have millions of millions huge files. – Arkadiusz Drabczyk Feb 13 '22 at 18:46
  • It's always amazing to me the things `awk` can do with a minuscule amount of code. – Seamus Feb 13 '22 at 18:49
  • @Seamus: do you mean that `*` is not recursive in shell? – Arkadiusz Drabczyk Feb 13 '22 at 18:49
  • @pierre-jean: you could use `nextfile` on FNR==3, if your `awk` version provides it. – RudiC Feb 13 '22 at 18:51
  • I imagine that prefixing with find and xargs we could achieve a recursive approach instead of wildcard? – Pierre-Jean Feb 13 '22 at 18:52
  • @Pierre-Jean: yes, it could. In bash you could also use `**` – Arkadiusz Drabczyk Feb 13 '22 at 18:53
  • @ArkadiuszDrabczyk: ha ha ha :) I didn't see the `*` - or maybe I mistook for dirt on my screen :) So yes, good answer! – Seamus Feb 13 '22 at 18:53
  • @ArkadiuszDrabczyk Not millions of files: the constraint is the length of the filenames in the args array (which is provided on-stack to the process). Usual limit on that is 2MB on actual pathnames (not file count). As we can break at line 3 of each file (with `nextfile`), the file size is of no relevance, as it does not get read. – Paul_Pedant Feb 13 '22 at 20:50
  • @Paul_Pedant: but OP asked about the performance - the more files there are the more time it will take to process them but the relation doesn't have to be linear on today's multi-core systems. – Arkadiusz Drabczyk Feb 13 '22 at 20:54
  • @ArkadiuszDrabczyk If it fails to run because the args list exceeds the limits, then performance is of no relevance at all. It is that little `*` thingy at the end that can cause all the trouble. `find` with the `{} '+'` action deals with that issue, and the recursion. As for multi-core systems, unless you specifically parallelise the process somehow, awk will flog along on a single CPU (at least it can flick between cores but no more than one at a time). `find -exec awk` will use about 1.2 cores, and it does not have a `-parallel` option. – Paul_Pedant Feb 13 '22 at 21:16
  • @Paul_Pedant: sorry but I don't know why are you bringing this up? I understand the problem but how is it related to OP's question? `*` can always expand to something that won't fit in args array, you suggest it shouldn't be used at all? Why using `find` if you can do everything in `awk`? Kernel can always migrate and preempt any thread it wants at any time. Where does `about 1.2` number come from? – Arkadiusz Drabczyk Feb 13 '22 at 21:28
  • Because the OP asked the performance question in a comment to your specific solution, and you said it should not be a problem with a huge number of files, yet your solution *fails* in the case of millions of files, and is grossly *inefficient* in the case of large files. And then you doubled down by suggesting a multi-core complication would work, instead of considering that only reading two lines of each file might be more reasonable. And ... – Paul_Pedant Feb 13 '22 at 21:45
  • Sure, Kernel can pre-emp any thread, but that does not mean it runs multicore, it just jumps around between cores a lot -- it still only has one thread. The 1.2 is experience that find generally takes 20% of one CPU to keep the -exec task 100% busy on another CPU. – Paul_Pedant Feb 13 '22 at 21:46
  • @Paul_Pedant: _and you said it should not be a problem with a huge number of files_ - I said something completely opposite. _ And then you doubled down by suggesting a multi-core complication would work_ - sorry, what complication? Can you explain who are you having this conversation with? – Arkadiusz Drabczyk Feb 13 '22 at 21:55
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/134095/discussion-between-paul-pedant-and-arkadiusz-drabczyk). – Paul_Pedant Feb 13 '22 at 21:58

You can use find with awk, exiting as soon as the 2nd line has been processed, whether or not the pattern was found there.

find . -type f -name 'xyz*.txt' -exec \
    awk 'NR==2{ if(/pattern/) print FILENAME; exit }' {} \;
αғsнιη
  • why not `awk 'NR == 2 && /pattern/' {print FILENAME; exit}` ? – DanieleGrassini Feb 13 '22 at 22:10
  • @DanieleGrassini Because if Line 2 does not match the pattern, it will not take the action, will not exit, and will read the other million lines of the file waiting hopelessly for NR == 2 to happen again. – Paul_Pedant Feb 13 '22 at 22:22

grep only, for the fun of it:

PAT="Category X"
LN=2
grep -n "$PAT" file* | grep ":$LN:$PAT$" | grep -o "^[^:]*"
file1
file2
RudiC
  • I didn't realize that using the line number option plus piping several search could be an answer to my problem! I need to check the comment above for the issue raised but I like the initial idea at least! Nice one! – Pierre-Jean Feb 13 '22 at 18:57

To test just the files in your current directory (assuming there are no sub-dirs or unreadable files in it, and not so many files as to exceed ARG_MAX):

awk 'FNR==2{ if (/Category X/) print FILENAME; nextfile }' *

but from your comments it sounds like you want to descend a hierarchy which would be:

find . -type f -exec \
    awk 'FNR==2{ if (/Category X/) print FILENAME; nextfile }' {} +

The use of + in the find command (POSIX, though some older finds lack it) will cause it to run awk on batches of files instead of one at a time, and the use of nextfile (if your awk supports it - many do, some don't) will cause awk to stop reading the current file and move on to the next one once the 2nd line is read. Since your input files are each only 3 lines long, it'll be very efficient whether your awk supports nextfile or not.

Ed Morton

GNU grep can be used for your use case:

$ grep -Plzr '^(?:.*\n){1}.*Category X' .

grep normally works on a per-line basis, but GNU grep has added the -z option, wherein it treats the whole file as a single line, because it separates records on a character not found in text files (\0).

So now we can apply the regex to the whole file. Your requirement is to search only the second line, hence we drive past one line without doing anything: ^(?:.*\n){1}

The caret ^ anchors the regex to the beginning of the file. The dot cannot span lines because it doesn't match a newline.

Then .*Category X starts looking in the next line, i.e. the second; since it can't span lines, it matches only if the pattern is found on the second line.

If there's a match, the -l option lists the filename on STDOUT.

The -r option will make grep run recursively (GNU feature).

The -P will enable to write Perl style regexes (GNU feature).
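Not part of the original answer, but the same trick generalizes to the question's other case (Title A on line 1): with zero lines to skip, the `(?:.*\n){1}` repetition drops out entirely and ^ anchors directly at the start of the file. A sketch, assuming GNU grep (file names made up):

```shell
mkdir -p g
printf 'Title A\nCategory X\nDescription Y\n' > g/yes
printf 'Title B\nCategory X\nDescription Y\n' > g/no
# Line 1: nothing to skip, so ^ matches straight at "Title A";
# only g/yes is listed. For line N, skip N-1 lines instead,
# e.g. line 3: grep -Plzr '^(?:.*\n){2}.*Description Y' g
grep -Plzr '^Title A' g
```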


Here is another stab at the problem, with GNU find+sed combo:

$ find . -type f -exec sed -ns '2{/Category X/F;}' {} +

GNU find + GNU xargs feed into Perl can also do it:

find . -type f ! -size 0 -print0 |
xargs -r0 perl -lne '
  (eof||$.==2)&&do{
    print $ARGV if $.==2 && /Category X/;
    close  ARGV; undef $.;
  };
'
guest_7