17

I have a big sorted file with billions of lines of variable length. Given a new line, I would like to know the byte position at which it would start if it were included in the sorted file.

Example

a\n
c\n
d\n
f\n
g\n

Given the input 'foo', I would get the output 9: 'foo' sorts between f and g, and g starts at byte 9 (counting bytes from 1).

This is easy to do by simply going through the whole file, but with billions of lines of variable length, a binary search would be much faster.
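
For reference, the slow linear version could look roughly like the sketch below. It only pins down what the expected output is. The name `linear_insert_position` is made up for illustration, and the sketch assumes the file is sorted in plain byte order (as with `LC_ALL=C sort`) and that an equal line counts as "at or after" the key.

```ruby
# Linear-time reference: returns the 1-based byte position at which
# `key` would start if it were inserted into the sorted file at `path`.
# Assumption: lines are sorted in plain byte order.
def linear_insert_position(path, key)
  key = key.b                          # compare as raw bytes
  offset = 0
  File.open(path, "rb") do |f|
    f.each_line do |line|
      break if line.chomp >= key       # first line that sorts at or after key
      offset += line.bytesize          # key goes after this line
    end
  end
  offset + 1                           # bytes are counted from 1
end
```

With the five-line example file above, `linear_insert_position(path, "foo")` returns 9.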

Does such a text processing tool already exist?

Edit:

It does now: https://gitlab.com/ole.tange/tangetools/blob/master/2search

Ole Tange

4 Answers

6

(This is not a correct answer to your question, just a starting point.)

I used sgrep (sorted grep) in a similar situation.

Unfortunately, in its current state it does not have a byte-offset output, but I think that could easily be added.

JJoao
4

I'm not aware of any standard tool that does this. However, you can write your own. For example, the following Ruby script should do the job.

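# Usage: ruby script.rb FILE KEY
file, key = ARGV.shift, ARGV.shift
# Binary search over byte offsets (not line numbers).
# Limitation: the first line is never compared, so a key that belongs
# before it is misplaced.
min, max = 0, File.size(file)

File.open(file) do |f|
  while max - min > 1
    middle = (max + min) / 2
    f.seek middle
    f.readline                     # discard the (usually partial) line we landed in
    if f.eof? || f.readline >= key # no next line, or next line sorts at/after key
      max = middle
    else
      min = middle
    end
  end
  f.seek max
  f.readline                       # move to the start of the first line >= key
  p f.pos + 1                      # print the position, counting bytes from 1
end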

It's a bit tricky because after the seek you are usually in the middle of some line and therefore need to do one readline to get to the beginning of the following line, which you can read and compare to your key.
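
The same idea can be sketched so that the first line is also compared. This is only a sketch under assumptions, not a finished tool: the names `next_line_start` and `insert_position` are made up, the file is assumed to be sorted in plain byte order (as with `LC_ALL=C sort`), and the optional block is a hypothetical hook for files sorted in some other order. The trick is to seek to `pos - 1` before reading, so that an offset sitting exactly on a line start is kept rather than skipped.

```ruby
# Offset of the first line that starts at or after byte `pos`
# (equals the file size if there is no such line).
def next_line_start(f, pos)
  return 0 if pos.zero?
  f.seek(pos - 1)
  f.gets                 # consume the rest of the line holding byte pos-1
  f.pos
end

# 1-based byte position at which `key` would be inserted.
# The optional block compares (line, key) like <=> and must match the
# order the file is sorted in; the default is plain byte order.
def insert_position(path, key, &cmp)
  cmp ||= ->(line, k) { line <=> k }
  key = key.b
  File.open(path, "rb") do |f|
    lo, hi = 0, f.size
    # Find the smallest offset whose following line is >= key (or missing).
    while lo < hi
      mid = (lo + hi) / 2
      f.seek(next_line_start(f, mid))
      line = f.gets                      # nil when we are at end of file
      if line.nil? || cmp.call(line.chomp, key) >= 0
        hi = mid
      else
        lo = mid + 1
      end
    end
    next_line_start(f, lo) + 1           # bytes are counted from 1
  end
end
```

For the example file in the question, `insert_position(path, "foo")` gives 9, and a key that sorts before the first line gives 1. For a file sorted in reverse order, a block such as `{ |line, k| k <=> line }` could be passed. Each probe reads at most two lines, so the number of reads grows with the logarithm of the file size.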

michas
  • Can it be altered to accept -n/-r to process files sorted by `sort -r` and `sort -n`? – Ole Tange Dec 05 '15 at 19:40
  • The code above is mainly to show the idea. It is far from perfect. (For example it fails if key goes to first place.) Feel free to adapt to your needs. – michas Dec 05 '15 at 19:52
2

Based on michas's solution, here is a more complete program:

https://gitlab.com/ole.tange/tangetools/-/tree/master/2search

Ole Tange
1

From a very large, sorted log file I often want to extract all records after a given date. Reading the whole file looking for the date linearly takes much too long.

A dozen years ago I hastily modified look to have two new options to make this easy:

-a: print all lines after the target line
-n: print nearest match if target is not found

Assuming the log file is sorted, look -b -a -n can do a very fast binary search to a given date (or to the line closest to the date), and then output all the records from that point to the end of the file.
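
For anyone without a patched look, the same behaviour can be approximated in a few lines of Ruby. This is a rough sketch, not the modified look itself: `each_line_from` is a made-up name, the log is assumed to be sorted in plain byte order with each record starting with its timestamp, and "nearest match" is taken here to mean the first line that sorts at or after the key.

```ruby
# Yields every line from the first line that sorts >= `key` to end of file.
# Assumption: the file is sorted in plain byte order.
def each_line_from(path, key)
  return enum_for(:each_line_from, path, key) unless block_given?
  key = key.b
  File.open(path, "rb") do |f|
    # Position the file at the first line starting at or after byte `pos`.
    goto_line_start = lambda do |pos|
      if pos.zero?
        f.rewind
      else
        f.seek(pos - 1)
        f.gets                     # finish the line holding byte pos-1
      end
    end

    lo, hi = 0, f.size
    # Binary search for the smallest offset whose following line is >= key.
    while lo < hi
      mid = (lo + hi) / 2
      goto_line_start.call(mid)
      line = f.gets                # nil at end of file
      if line.nil? || line.chomp >= key
        hi = mid
      else
        lo = mid + 1
      end
    end

    goto_line_start.call(lo)
    f.each_line { |line| yield line }   # stream the rest of the file
  end
end
```

Something like `each_line_from("app.log", "2015-12-04") { |l| print l }` would then print all records dated 2015-12-04 or later (the file name is only an example).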

Surely in the past dozen years someone else has done this better than I did?

Ian D. Allen