0

I have a file that contains more than a hundred thousand IDs. Each ID is composed of 8~16 hexadecimal digits:

178540899f7b40a3
6c56068d
8c45235e9c
8440809982cc
6cb8fef5e5
7aefb0a014a448f
8c47b72e1f824b
ca4e88bec
...

I need to find the related files in a directory tree that contains around 2×10⁹ files.

Given an ID like 6c56068d219144dd, I can find its corresponding files with:

find /dir -type f -name '* 6[cC]56068[dD]219144[dD][dD] *'

But that takes at least two days to complete...

What I would like to do is to call find with as many -o -iname GLOB triplets as ARG_MAX allows.

Here's what I've thought of doing:

sed -e 's/.*/-o -iname "* & *"/' ids.txt |
xargs find /dir -type f -name .

My problem is that I can't force xargs to take in only complete triplets.

How can I do it?

Fravadona
  • Apologies to the OP and to ilkkachu. I thought I knew how `xargs` worked, but I was obviously wrong. Yet another reminder not to touch that utility again :-) – Kusalananda Aug 31 '23 at 20:13
  • Your idea was good, just missing an additional step that I'm conceiving right now – Fravadona Aug 31 '23 at 20:15
  • @ilkkachu Each call takes a long time, with almost no difference with any given number of arguments; if I can do the job with the least number of calls then it would be great. – Fravadona Aug 31 '23 at 20:42
  • @ilkkachu That's what I wrote as an answer after analysing Kusalananda's idea – Fravadona Aug 31 '23 at 20:43
  • 2
    Thanks for editing, but please don't put placeholder text since that makes the question completely useless. It's OK, we'll wait until you have finished editing. – terdon Sep 01 '23 at 12:32
  • 1
    With the edit, this does seem like a case where `find | grep` might make sense – muru Sep 01 '23 at 13:45
  • @muru Indeed, it seems simpler with `grep` – Fravadona Sep 01 '23 at 13:59
  • Similar: [Find command not working in for loop](//unix.stackexchange.com/q/703960) – Stéphane Chazelas Sep 01 '23 at 14:39

3 Answers

2

That's the wrong approach. If the point is to find all the files whose names contain one of those IDs as one of their space-delimited words, you could do:

find /dir -type f -print0 |
  gawk '
    !ids_processed {ids[$0]; next}
    {
      n = split(tolower($NF), words, " ")
      for (i = 1; i <= n; i++)
        if (words[i] in ids) {
          print
          break
        }
    }' ids.txt ids_processed=1 RS='\0' FS=/ -

That way you process the file list only once, and checking each word of a file name against the 100k IDs is just a hash-table lookup instead of 100k regex/wildcard matches.
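As a rough illustration of the same hash-lookup idea on a throwaway directory, here is a sketch that uses plain awk on a newline-delimited list, so it assumes no file name contains a newline. The function name match_ids and the sample file names are made up for the demo.

```shell
#!/bin/sh
# match_ids IDS_FILE: read newline-delimited paths on stdin and print those
# whose basename contains one of the IDs as a space-delimited word.
# Assumptions: IDS_FILE is not empty and no path contains a newline
# (the gawk version with -print0 and RS='\0' has no such restriction).
match_ids() {
  awk '
    NR == FNR { ids[tolower($0)]; next }   # first file: load the IDs into a hash
    {
      n = split($0, parts, "/")            # parts[n] is the basename
      m = split(tolower(parts[n]), words, " ")
      for (i = 1; i <= m; i++)
        if (words[i] in ids) { print; break }
    }' "$1" -
}

# Demo on a temporary tree (hypothetical file names).
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
printf '%s\n' 6c56068d219144dd 8c45235e9c > "$demo_dir/ids.txt"
mkdir "$demo_dir/tree"
: > "$demo_dir/tree/report 6C56068D219144DD final.txt"
: > "$demo_dir/tree/unrelated 12345678 file.txt"

find "$demo_dir/tree" -type f | match_ids "$demo_dir/ids.txt"
```

Only the path of the file named "report 6C56068D219144DD final.txt" should be printed: the cost per file is one split plus a few hash lookups, however many IDs were loaded.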

Stéphane Chazelas
1

What I would do:

Write a script to save all the file names to a temporary file:

# maybe run this from cron or behind inotifywait
find dir -type f -print > /tmp/filelist

Then do a lookup as needed using your input file:

grep -F -if hexids /tmp/filelist

I might suggest using -wif instead of -if, but from the other comments it's not clear that the question gives accurate information about the data. See man grep for more information.
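To illustrate what -w changes, here is a small sketch on a made-up file list. The paths, the ID subset and the helper name lookup_ids are assumptions for the demo, not the real data.

```shell
#!/bin/sh
# Demo: fixed-string, case-insensitive, whole-word lookup of many IDs at once.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

printf '%s\n' 6c56068d 8c45235e9c > "$work/hexids"
cat > "$work/filelist" <<'EOF'
/dir/a/report 6C56068D final.txt
/dir/b/x6c56068dy embedded.txt
/dir/c/other 8c45235e9c
/dir/d/nothing here.txt
EOF

# -F: patterns are fixed strings, -i: ignore case,
# -w: match whole words only, -f: read the patterns from a file.
lookup_ids() {
  grep -F -w -i -f "$1" "$2"
}

lookup_ids "$work/hexids" "$work/filelist"
```

With -w, the line containing x6c56068dy is skipped because the ID is embedded in a longer word; without -w it would be reported too. Note that grep looks at the whole path, so an ID appearing in a directory name would also match.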

rand'Chris
0

Thanks to @Kusalananda, I thought of one possible solution:

The first step is to make xargs treat each -a -b X triplet as a single argument, so that a batch can never end in the middle of a triplet. Then you re-split those single-argument triplets in an inline sh script and call the utility from there.

... |
awk '{ printf("%s%c", $0, 0) }' |
xargs -0 sh -c '[ "$#" -gt 0 ] && { printf %s\\n "$@" | xargs "$0"; }' my_command
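A self-contained sketch of that pipeline, for illustration only: show_args is a hypothetical stand-in for my_command that prints each argument it receives, and run_triplets is a made-up wrapper name. Note the ; before the closing brace, which sh requires to end the brace group.

```shell
#!/bin/sh
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical stand-in for my_command: print each argument on its own line.
cat > "$work/show_args" <<'EOF'
#!/bin/sh
for arg do printf '<%s>\n' "$arg"; done
EOF
chmod +x "$work/show_args"

printf '%s\n' 6c56068d 8c45235e9c > "$work/ids.txt"

# run_triplets COMMAND IDS_FILE
# 1. sed writes one triplet per line, in the quoting syntax xargs understands.
# 2. tr + xargs -0 make each line a single argument, so batches hold whole triplets.
# 3. The inner xargs re-splits every line into its three words and runs COMMAND.
run_triplets() {
  sed -e 's/.*/-o -iname "* & *"/' "$2" |
  tr '\n' '\0' |
  xargs -0 sh -c '[ "$#" -gt 0 ] && { printf "%s\n" "$@" | xargs "$0"; }' "$1"
}

run_triplets "$work/show_args" "$work/ids.txt"
```

Each glob should arrive as one argument with its spaces intact, e.g. <* 6c56068d *>, preceded by <-o> and <-iname>. The inner xargs applies its own size limits, so with batches close to the limit it could in principle split again; this sketch does not guard against that.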
Fravadona
  • 1
    if you're on GNU, `xargs -d '\n'` would do to look at the lines as units. Or if not, `tr '\n' '\0' | xargs -0` is a bit shorter than the awk. And yeah, my idea for the splitting would have been something like `sh -c 'set -f; my_command $@' _`. – ilkkachu Aug 31 '23 at 21:00
  • 1
    Though now I do wonder if it's always safe wrt. the limit to transform the single arg `a b c` into the three `a`, `b` and `c`. If the [pointers to the arg strings count](https://unix.stackexchange.com/a/110301/170373) (the ones passed to the `main()` of the executed program), then the effective size of the string would increase when split. Though I suppose what with the first xargs filling the available space pretty well, I guess you'd see it quickly if that issue shows up – ilkkachu Aug 31 '23 at 21:01
  • I can't use `my_command $@` as there are spaces in the third component of the triplet; that's also the reason for using `awk`, as I can do further escaping with it – Fravadona Aug 31 '23 at 21:28
  • Well, in that case, I would say your example data isn't representative (I'm also not sure I'd call something with spaces a "word" in general). There's some difference between splitting on all spaces, vs. splitting max N times, vs. handling quotes while doing it. – ilkkachu Sep 01 '23 at 06:08
  • @ilkkachu I simplified the problem on purpose, because it isn't relevant when the splitting is done with `xargs`; also, I thought that `xargs` might have an obscure option to do the job easily. – Fravadona Sep 01 '23 at 08:48
  • yes, it's not relevant _if_ the splitting is done with `xargs`. But while xargs does support quotes and escaping, it does so with a syntax that's (slightly) different than e.g. the shell syntax, and there are way more tools that would lend themselves nicely to whitespace-separated inputs. Also, with questions posted on the site, it's often the case that the easiest tool to use is not the one the poster originally tried. But if the question doesn't represent the real data, it's impossible to know which tools would be valid. – ilkkachu Sep 01 '23 at 09:20
  • 2
    Someone could have spent time working on a solution with e.g. Perl, just to have you tell them that a-ha! the data is actually different and they wasted their time. Trying to help you. For free. So yeah, please try to avoid setting up even the opportunity for that to happen. If the data is actually something like `-a -b "foo bar"`, and known to be aimed for xargs, just say it. It only needs one sentence: "The data uses quotes and escapes as interpreted by xargs". – ilkkachu Sep 01 '23 at 09:23
  • @ilkkachu Well, such a solution wouldn't help me but it would still be useful to a lot of people with simpler requirements. – Fravadona Sep 01 '23 at 09:40