Binary search in a sorted text file

Question

I have a big sorted file with billions of lines of variable length. Given a new line, I would like to know the byte number at which it would start if it were included in the sorted file.

Example

a\n
c\n
d\n
f\n
g\n

Given the input 'foo' I would get the output 9, counting bytes from 1: 'foo' sorts between f and g, and g starts at byte 9.

This is easy to do by reading through the whole file, but with billions of variable-length lines a binary search would be much faster.
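For reference, the linear scan could look roughly like the Ruby sketch below. The function name insertion_byte_linear is made up for illustration, and the sketch assumes the file is sorted bytewise (as with LC_ALL=C sort) and uses "\n" line endings.

```ruby
# Linear-scan baseline (hypothetical helper name).
# Returns the 1-based byte number at which `key` would start if it were
# inserted into the sorted file at `path`.
# Assumption: bytewise ordering and "\n"-terminated lines.
def insertion_byte_linear(path, key)
  key = key.b                      # compare raw bytes, not characters
  offset = 0                       # 0-based offset of the current line
  File.open(path, "rb") do |f|
    f.each_line do |line|
      # Stop at the first line that does not sort before the key.
      break if line.delete_suffix("\n") >= key
      offset += line.bytesize
    end
  end
  offset + 1
end
```

This reads every byte before the insertion point, which is exactly the cost a binary search avoids.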

Does such a text processing tool already exist?

Edit:

It does now: https://gitlab.com/ole.tange/tangetools/blob/master/2search

How long is the line that you are searching for (in characters)? And how many such lines do you need to search for? — gogoud, Dec 05 '15 at 09:49
@gogoud I am not looking for a limited tool, but one that works on any text file (no matter the line length or number of lines). — Ole Tange, Dec 05 '15 at 19:23
For those who would like to generate such gigantic input: http://unix.stackexchange.com/a/279098/9689 — Grzegorz Wierzowiecki, Apr 26 '16 at 07:44

JJoao · Answer 1 · 2015-12-05T10:43:14.110

6

(This is not a correct answer to your question, just a starting point.)

I used sgrep (sorted grep) in a similar situation.

Unfortunately, in its current state it does not output the byte offset, but I think that could easily be added.

edited Dec 05 '15 at 10:43

answered Dec 05 '15 at 10:32

JJoao


This seems to be very fast – Metamorphic May 15 '20 at 09:48

michas · Accepted Answer · 2015-12-05T13:13:30.277

4

I'm not aware of any standard tool that does this. However, you can write your own. For example, the following Ruby script should do the job.

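# Arguments: the sorted file, then the key to look up
file, key = ARGV.shift, ARGV.shift
# min and max are byte offsets that bracket the insertion point
min, max = 0, File.size(file)

File.open(file) do |f|
  while max-min>1 do
    middle = (max+min)/2
    f.seek middle
    f.readline                 # skip the (usually partial) line we landed in
    # compare the next full line with the key; hitting EOF counts as "too far"
    if f.eof? or f.readline>=key
      max = middle
    else
      min = middle
    end
  end
  f.seek max
  f.readline                   # move to the start of the following line
  p f.pos+1                    # print its byte number, counting from 1
end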

It's a bit tricky because after the seek you are usually in the middle of some line and therefore need to do one readline to get to the beginning of the following line, which you can read and compare to your key.
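The seek-then-read-a-line idea described above can also be written as a reusable function. The following is a minimal sketch under stated assumptions, not the script from this answer: the names insertion_byte and line_start_at_or_after are hypothetical, the file is assumed to be sorted bytewise (as with LC_ALL=C sort), and lines are assumed to end in "\n". It seeks to one byte before the probe position, so a probe that already sits on a line start is not skipped; this is what lets it handle a key that sorts before the first line.

```ruby
# Hypothetical helper: 0-based offset of the first line that begins at or
# after byte offset `pos` (returns the file size if there is none).
def line_start_at_or_after(f, pos)
  return 0 if pos.zero?
  f.seek(pos - 1)
  f.gets          # reads through the next "\n"; just "\n" if pos is a line start
  f.pos
end

# Returns the 1-based byte number at which `key` would start if inserted.
# Assumption: file sorted bytewise, lines terminated by "\n".
def insertion_byte(path, key)
  key = key.b                               # compare raw bytes
  File.open(path, "rb") do |f|
    lo, hi = 0, f.size
    # Find the smallest probe offset whose following line is >= key.
    while lo < hi
      mid = (lo + hi) / 2
      f.seek(line_start_at_or_after(f, mid))
      line = f.gets                         # nil when the probe ran into EOF
      if line.nil? || line.delete_suffix("\n") >= key
        hi = mid
      else
        lo = mid + 1
      end
    end
    line_start_at_or_after(f, lo) + 1
  end
end
```

For the five-line example file from the question, insertion_byte(path, "foo") gives 9. Each probe reads at most two lines, so the number of seeks grows with the logarithm of the file size rather than with the number of lines.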

edited Dec 05 '15 at 13:13

answered Dec 05 '15 at 12:33

michas


Can it be altered to accept -n/-r to process files sorted by `sort -r` and `sort -n`? – Ole Tange Dec 05 '15 at 19:40
The code above is mainly to show the idea. It is far from perfect. (For example it fails if key goes to first place.) Feel free to adapt to your needs. – michas Dec 05 '15 at 19:52
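One possible way to adapt the idea to other sort orders is to pass the comparison in as a block. This is only a sketch: insertion_byte_by, line_start_at_or_after, NUMERIC and REVERSE are hypothetical names, and the numeric comparison via to_f only approximates what sort -n does.

```ruby
# Hypothetical helper: 0-based start of the first line beginning at or after pos.
def line_start_at_or_after(f, pos)
  return 0 if pos.zero?
  f.seek(pos - 1)
  f.gets
  f.pos
end

# Returns the 1-based byte number of the first line for which the block
# returns >= 0, i.e. the first line that does not sort before `key`.
# The block receives (line_without_newline, key) and returns -1, 0 or 1.
def insertion_byte_by(path, key)
  File.open(path, "rb") do |f|
    lo, hi = 0, f.size
    while lo < hi
      mid = (lo + hi) / 2
      f.seek(line_start_at_or_after(f, mid))
      line = f.gets
      if line.nil? || yield(line.delete_suffix("\n"), key) >= 0
        hi = mid
      else
        lo = mid + 1
      end
    end
    line_start_at_or_after(f, lo) + 1
  end
end

# Assumed comparators: rough stand-ins for `sort -n` and `sort -r`.
NUMERIC = ->(line, key) { line.to_f <=> key.to_f }
REVERSE = ->(line, key) { key.b <=> line }
```

A call would then look like insertion_byte_by(path, "20", &NUMERIC) for a numerically sorted file, or insertion_byte_by(path, "foo", &REVERSE) for a file sorted in descending bytewise order.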

Ole Tange · Answer 3 · 2020-05-15T10:40:18.143

2

Based on michas's solution, here is a more complete program:

https://gitlab.com/ole.tange/tangetools/-/tree/master/2search

edited May 15 '20 at 10:40

answered Aug 06 '17 at 06:46

Ole Tange


score 1 · Answer 4 · answered Sep 25 '22 at 04:52

From a very large, sorted log file I often want to extract all records after a given date. Reading the whole file looking for the date linearly takes much too long.

A dozen years ago I hastily modified look to have two new options to make this easy:

-a: print all lines after the target line
-n: print nearest match if target is not found

Assuming the log file is sorted, look -b -a -n can do a very fast binary search to a given date (or to the line closest to that date) and then output all the records from that point to the end of the file.
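The same "search, then print everything after" behaviour can be approximated in a few lines of Ruby. This is a sketch of the general technique, not the modified look: the function name copy_from_key is hypothetical, and it assumes each line starts with a timestamp that sorts bytewise (for example ISO 8601 dates) and that lines end in "\n".

```ruby
# Hypothetical helper: binary-search a sorted log for the first line that is
# >= key (bytewise), then copy everything from that line to the end to `out`.
def copy_from_key(path, key, out = $stdout)
  key = key.b
  File.open(path, "rb") do |f|
    # 0-based start of the first line beginning at or after pos
    start_of = lambda do |pos|
      next 0 if pos.zero?
      f.seek(pos - 1)
      f.gets
      f.pos
    end

    lo, hi = 0, f.size
    while lo < hi
      mid = (lo + hi) / 2
      f.seek(start_of.call(mid))
      line = f.gets
      if line.nil? || line.delete_suffix("\n") >= key
        hi = mid
      else
        lo = mid + 1
      end
    end

    # Stream the tail in chunks so a huge file is never held in memory.
    f.seek(start_of.call(lo))
    while (chunk = f.read(65_536))
      out.write(chunk)
    end
  end
end
```

If the key is not present, output starts at the first later record, which roughly corresponds to the nearest-match behaviour described above.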

Surely in the past dozen years someone else has done this better than I did?
