Optimal command-line search inside a sorted text file

Question

Let's say I have a text file with billions of text lines sorted alphabetically, like

Bar=10
Foo=6
Naz=42

How can I search for the line starting with Foo as efficiently as possible (the file contains billions of variables like this), given that the lines are sorted alphabetically and that the line I want must start with (or "contain", if that is easier to search for) a specific text?

Edit:

This question can be considered a duplicate of https://askubuntu.com/q/423886/10473. The answer is to use `look`, which is fast enough for this kind of search.
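For readers who have not used `look` before, here is a minimal sketch of how it might be wrapped for this kind of file. The function name `lookup_key` is made up for illustration, and the sketch assumes the file was sorted in plain byte order (for example with `LC_ALL=C sort`); a file sorted with other options would need the matching `look` flags such as `-f` or `-d`.

```shell
# Hypothetical wrapper around look(1); the name lookup_key is illustrative.
# Assumption: the file is sorted in byte order (LC_ALL=C sort), which is
# what look's binary search expects when no -d/-f flag is given.
lookup_key() {
    key=$1
    file=$2
    # Appending '=' restricts the match to the exact variable name,
    # so searching for Foo does not also return Foobiz.
    LC_ALL=C look "${key}=" "$file"
}

# Several keys, one per line on stdin:
# printf 'Foo\nNaz\n' | while IFS= read -r k; do lookup_key "$k" sorted.txt; done
```

Calling `lookup_key Foo sorted.txt` prints every line that begins with `Foo=`. According to its manual page, `look` exits with status 1 when nothing matches, so the result can also be used as a yes/no test.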

What do you want out of the search? A "yes" or "no" or the actual line that matches, or just the number after `=`? Will you only be searching with a single string or with many separate strings (expecting many answers)? Do you care for substring matches (so that `Foo` matches not only `Foo` but also `AhFoo` and `Foobiz`, or `Hoo=Foo` etc.)? Are these variables that would be valid in a shell? Are there duplicated lines, or duplicated variable names? — Kusalananda, Jan 08 '21 at 23:19
@Kusalananda I want the line (since I also want the variable value). I search for only one string at a time (say Foo or Bar or Naz). I won't search for "Naz=" nor "42" nor "Naz=21" nor "Naz=42". I am actually looking for a full match from the start of the line (Foo matches Foo but not AhFoo nor Hoo=Foo); I don't care if it matches Foobiz: I'm not looking for it, but if it makes the command easier, it's fine — Xenos, Jan 08 '21 at 23:24
see https://askubuntu.com/q/423886/10473 and https://unix.stackexchange.com/q/499306/4778 — ctrl-alt-delor, Jan 08 '21 at 23:41
[Binary search in a sorted text file](https://unix.stackexchange.com/questions/247508/binary-search-in-a-sorted-text-file) — Eduardo Trápani, Jan 08 '21 at 23:45
@ctrl-alt-delor Thanks, I didn't know `look` was actually what I was looking for. I got it working with `... | xargs -I "{}" look -f "{}" "sorted.txt"`, which returns the result within a second. You may post an answer if you want me to accept it and get the reputation from it ;) Thanks again! — Xenos, Jan 11 '21 at 16:05
I did not know either. Shall we just mark it as a duplicate? Add a comment that says which question it is a duplicate of, then click close. (You will then get your points back.) — ctrl-alt-delor, Jan 12 '21 at 08:07
@ctrl-alt-delor This question can be considered a duplicate of https://askubuntu.com/q/423886/10473 but when I flag it as "duplicate" and try to put the URL in "What question is this a duplicate of? ... is a duplicate of: https://askubuntu.com/q/423886/10473 " I get "The duplicate question must exist on Unix & Linux Stack Exchange" :/ What procedure should I follow? — Xenos, Jan 13 '21 at 12:54

score 0 · Answer 1 · answered Jan 09 '21 at 11:29

I don't know how this will scale to the volumes you're talking about, but it seems to work with a file containing this:

Foo=123
Foobar=646
Foobar=85489
Noo=8654
Noobar=8262

awk -F= '{if ($1 > "Foobar") { exit } ; if ($1 == "Foobar") { print $0 } }' sorted.txt

This is just a proof of concept. It would be a simple matter to adapt so the term you are matching against is passed in.
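One possible way to pass the search term in is awk's `-v` option. The sketch below is only an illustration: the function name `search_sorted` is hypothetical, and it assumes the file is sorted in byte order, hence the `LC_ALL=C`. It is still a linear scan from the top of the file, so it stops early but does not do a binary search.

```shell
# Hypothetical helper: print the lines whose key (the text before '=')
# equals the given term, reading a file assumed sorted in byte order.
search_sorted() {
    term=$1
    file=$2
    # Concatenating "" forces string comparison even for numeric-looking keys.
    LC_ALL=C awk -F= -v term="$term" '
        ($1 "") >  (term "") { exit }   # sorted input: no later line can match
        ($1 "") == (term "") { print }  # exact match on the key
    ' "$file"
}
```

With the sample data above, `search_sorted Foobar sorted.txt` prints the two `Foobar` lines and stops at `Noo`. Note that awk interprets backslash escapes in `-v` values, so terms containing backslashes would need different handling.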

bxm


It didn't scale well: it takes minutes to run. I ended up using `look`, which I didn't know about, thanks to the comments on the question. Thanks anyway! – Xenos Jan 11 '21 at 16:03
Cool, glad you got there. – bxm Jan 12 '21 at 22:16
