Let's say I have a text file with billions of text lines sorted alphabetically, like

Bar=10
Foo=6
Naz=42

How can I find the line starting with Foo as efficiently as possible (the file contains billions of variables like this), given that the lines are sorted alphabetically and that the line I want must start with (or "contain", if that is easier to search for) a specific text?


Edit:

This question can be considered a duplicate of https://askubuntu.com/q/423886/10473. The answer is to use `look`, which is fast enough for this kind of search.
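For reference, here is a minimal sketch of the `look` approach on a tiny sample file. The file contents and the `sorted` variable are illustrative assumptions, not my real data. It assumes `look` is installed (it is not present on every system) and that the file is sorted in byte order, which is why the sample is built with `LC_ALL=C sort`.

```shell
# Build a small sample file; LC_ALL=C gives the plain byte-order sort
# that look's binary search relies on when no -d/-f flags are used.
sorted=$(mktemp)
printf '%s\n' 'Naz=42' 'Foo=6' 'Bar=10' 'Foobiz=1' | LC_ALL=C sort > "$sorted"

# look prints every line that begins with the given string.
# Including "=" in the prefix keeps Foobiz from matching.
if command -v look >/dev/null 2>&1; then
    look 'Foo=' "$sorted"
else
    echo 'look is not installed on this system' >&2
fi
```

Searching for `Foo` without the trailing `=` would also return the `Foobiz` line, which is acceptable for my use case but worth knowing.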

Xenos
  • What do you want out of the search? A "yes" or "no" or the actual line that matches, or just the number after `=`? Will you only be searching with a single string or with many separate strings (expecting many answers)? Do you care for substring matches (so that `Foo` matches not only `Foo` but also `AhFoo` and `Foobiz`, or `Hoo=Foo` etc.)? Are these variables that would be valid in a shell? Are there duplicated lines, or duplicated variable names? – Kusalananda Jan 08 '21 at 23:19
  • @Kusalananda I want the line (since I also want the variable value). I search for only one string at a time (say Foo or Bar or Naz). I won't search for "Naz=" nor "42" nor "Naz=21" nor "Naz=42". I actually want a "full match" from the start of the line (Foo matches Foo but not AhFoo nor Hoo=Foo); I don't care if it matches Foobiz: I'm not looking for it, but if it makes the command easier, it's fine – Xenos Jan 08 '21 at 23:24
  • See https://askubuntu.com/q/423886/10473 and https://unix.stackexchange.com/q/499306/4778 – ctrl-alt-delor Jan 08 '21 at 23:41
  • [Binary search in a sorted text file](https://unix.stackexchange.com/questions/247508/binary-search-in-a-sorted-text-file) – Eduardo Trápani Jan 08 '21 at 23:45
  • @ctrl-alt-delor Thanks, I didn't know `look` was actually what I looked for. I made it using `... | xargs -I "{}" look -f "{}" "sorted.txt"` which returns the result within a second. You may make an answer if you want me to accept it and get the reputation from it ;) Thanks again! – Xenos Jan 11 '21 at 16:05
  • I did not know either. Shall we just mark it as a duplicate? Add a comment that says which question it is a duplicate of, then click close. (You will then get your points back.) – ctrl-alt-delor Jan 12 '21 at 08:07
  • @ctrl-alt-delor This question can be considered a duplicate of https://askubuntu.com/q/423886/10473, but when I flag it as "duplicate" and try to put the URL in "What question is this a duplicate of? ... is a duplicate of: https://askubuntu.com/q/423886/10473 " I get "The duplicate question must exist on Unix & Linux Stack Exchange" :/ What procedure must I follow? – Xenos Jan 13 '21 at 12:54
  • Ahh yes. Good point. – ctrl-alt-delor Jan 13 '21 at 20:50

1 Answer

I don't know how this will scale to the volumes you're talking about, but it seems to work with a file containing this:

Foo=123
Foobar=646
Foobar=85489
Noo=8654
Noobar=8262

awk -F= '{if ($1 > "Foobar") { exit } ; if ($1 == "Foobar") { print $0 } }' sorted.txt

This is just a proof of concept. It would be simple to adapt it so that the term you are matching against is passed in as a parameter.
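As a rough sketch of that adaptation, the proof of concept above could be wrapped in a shell function that passes the term to awk with `-v`. The function name `find_var` and the awk variable `key` are hypothetical names chosen for illustration. The scan is still linear, so it only saves time when the match is near the start of the file. Note also that `-v` interprets backslash escapes in the value, so this assumes keys without backslashes.

```shell
# find_var KEY FILE: print the lines whose name (the text before "=")
# equals KEY, and stop reading once the names sort after KEY.
find_var() {
    LC_ALL=C awk -F= -v key="$1" '
        # Appending "" forces a string comparison even for numeric-looking names.
        ($1 "") > (key "")  { exit }   # past KEY in a sorted file: stop early
        ($1 "") == (key "")            # exact match on the name: print the line
    ' "$2"
}
```

It would be called as `find_var Foobar sorted.txt`. `LC_ALL=C` is set so that awk compares strings in byte order, which assumes the file was sorted the same way.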

bxm
  • It didn't scale well, as it takes minutes to run. I ended up using `look`, which I didn't know about, thanks to the comments on the question. Thanks anyway! – Xenos Jan 11 '21 at 16:03
  • Cool, glad you got there. – bxm Jan 12 '21 at 22:16