6

I have a big 4 TB text file exported from Teradata records, and I want to know how many records (= lines, in my case) there are in that file.

How may I do this quickly and efficiently?

AdminBee
Santosh Garole

6 Answers

8

If this information is not already present as meta data in a separate file (or embedded in the data, or available through a query to the system that you exported the data from) and if there is no index file of some description available, then the quickest way to count the number of lines is by using wc -l on the file.
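A minimal invocation looks like this (the filename is a placeholder for your export):

```shell
# Count lines; reading from stdin makes wc print only the number,
# without repeating the filename
wc -l < export.txt
```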

You cannot really do it any quicker.

To count the number of records in the file, you will have to know what record separator is in use, and use something like awk to count the records. Again, that is if this information is not already stored elsewhere as meta data and if it's not available through a query to the originating system, and if the records themselves are not already enumerated and sorted within the file.
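As a sketch, here is how that could look with awk; the separator (the ASCII record-separator character, octal 036) and the filename are assumptions you would need to verify against your actual export:

```shell
# Count records delimited by the ASCII RS character (octal 036);
# the separator here is an assumption, check your export format
awk 'BEGIN { RS = "\036" } END { print NR }' export.txt
```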

Kusalananda
3

So here is a speed test between awk and wc -l:

67G test.tsv

time awk 'END {print NR}' test.tsv; time wc -l test.tsv

809162924

real    2m22.713s 
user    1m46.712s 
sys     0m19.618s 

809162924 test.tsv

real    0m20.222s 
user    0m9.629s 
sys     0m10.592s

Another file: 72G Sample.sam

time awk 'END {print NR}' Sample.sam; time wc -l Sample.sam
180824516

real    1m18.022s
user    1m5.775s
sys     0m12.238s

180824516 Sample.sam

real    0m22.534s
user    0m4.599s
sys     0m17.921s
Onkar
  • For me `awk` is slower but `gawk` works fine and faster. Here are timings for a 27G file on Solaris Unix: `time wc -l data.txt` gives real 1m39.916s, user 1m32.976s, sys 0m6.939s; and for gawk, `time gawk 'END {print NR}' data.txt` gives real 0m24.858s, user 0m17.220s, sys 0m7.641s. – EsmaeelE Dec 11 '22 at 06:22
1

You should not use line based utilities such as awk and sed. These utilities will issue a read() system call for every line in the input file (see that answer on why this is so). If you have lots of lines, this will be a huge performance loss.

Since your file is 4TB in size, I guess that there are a lot of lines. So even wc -l will produce a lot of read() system calls, since it reads only 16384 bytes per call (on my system). Even so, this would be an improvement over awk and sed. The best method - unless you write your own program - might be simply

cat file | wc -l

This is no useless use of cat, because cat reads chunks of 131072 bytes per read() system call (on my system); wc -l will still issue more read() calls, but on the pipe rather than on the file directly. However, cat tries to read as much as possible per system call.
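If you want to experiment with the read size yourself, one option is to put dd with an explicit block size in front of the pipe. The 1 MiB figure is just a guess to benchmark on your own system, the bs=1M spelling is GNU dd syntax, and the filename is a placeholder:

```shell
# Read the file in 1 MiB chunks and pipe the data into wc -l;
# bs controls how much dd requests per read() system call
dd if=bigfile.txt bs=1M 2>/dev/null | wc -l
```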

chaos
  • Won't an io redirect be faster than `cat` and pipe ? – pLumo Mar 07 '19 at 12:22
  • @RoVo Could be, have you tried it? – chaos Mar 07 '19 at 12:25
  • 2
    Short test with 10 iterations of `wc -l` with a 701MB file: `wc -l file` 1.7s ;; `wc -l < file` 1.7s ;; `cat file | wc -l` 2.6s. – pLumo Mar 07 '19 at 12:28
  • 1
    "These utilities will issue a read() system call for every line in the input file" -- That can't be true. `read()` only reads a bunch of bytes, it doesn't know how to read a line. The utilities might differ in the size of a buffer they use for `read()`, but that's not the same. It's likely that most utilities will read at least a couple of kB in one go, and that's usually enough for a few lines at minimum. – ilkkachu Nov 29 '19 at 09:41
  • But with `cat file | wc -l`, `wc` will still do its 16k `read()`s, this time on a pipe, and `cat` will do extra writes to that pipe, and the kernel will have to do extra work to shove bytes through that pipe, I can't see how that can improve matters. – Stéphane Chazelas Dec 22 '21 at 08:12
1

Looping over files is a job for AWK ... nothing can beat this speed

LINECOUNT=`awk '{next}; END { print FNR }' "$FILE"`

[root@vmd28527 bin]# time LINECOUNT=`awk '{next}; END { print FNR }' $FILE`; echo $LINECOUNT

real    0m0.005s
user    0m0.001s
sys     0m0.004s
7168

5 msec for 7168 lines ... not bad ...

roaima
  • Personally, I find `awk` one of the slower tools. You may find that `wc -l "$FILE"` is significantly faster (almost double the speed in my tests) – roaima Jul 14 '21 at 05:19
  • You are right. For the simple purpose just counting lines of line-oriented file 'wc -l' is unbeatable in speed. But if it comes to traversing files and doing complex things then AWK is the 'tool-of-choice'. AWK can outperform a shell solution (even BASH with a lot of built-in functions) by a factor of 100 or even far more depending on the complexity of the task ... – Heinz-Peter Heidinger Jul 14 '21 at 09:32
1

I also did a speed comparison on a large VCF text file. Here is what I found:

216GB VCF text file (on a single SSD)

$ time wc -l <my_big_file>
16695620 

real    1m26.912s
user    0m2.896s
sys     1m23.002s
$ tail -5 <my_big_file>
$ time fgrep -n <last_line_pattern>  <my_big_file>
16695620:<last_line_pattern>

real    2m10.154s
user    0m46.938s
sys     1m22.492s
$ tail -5 <my_big_file>
$ LC_ALL=C && time fgrep -n <last_line_pattern>  <my_big_file>
16695620:<last_line_pattern>

real    1m38.153s
user    0m45.863s
sys     0m51.944s

And, finally:

$ time awk 'END {print NR}' <my_big_file>
16695620

real    1m44.074s
user    1m11.275s
sys     0m32.780s

CONCLUSION 1:

  • wc -l seems fastest with SSD.

216GB VCF text file (on a RAID10 setup with 8 HDDs)

$ time wc -l <my_big_file>
16695620 

real    7m22.397s
user    0m10.562s
sys     4m1.888s
$ tail -5 <my_big_file>
$ time fgrep -n <last_line_pattern>  <my_big_file>
16695620:<last_line_pattern>

real    7m7.812s
user    1m58.242s
sys     3m12.355s
$ tail -5 <my_big_file>
$ LC_ALL=C && time fgrep -n <last_line_pattern>  <my_big_file>
16695620:<last_line_pattern>

real    4m34.522s
user    1m26.764s
sys     1m58.247s

Finally:

$ time awk 'END {print NR}' <my_big_file>
16695620

real    6m50.240s
user    2m37.574s
sys     2m43.498s

CONCLUSION 2:

  • wc -l seems fairly comparable to others.
  • The lower time of LC_ALL=C && time fgrep -n <last_line_pattern> may well be due to caching, as subsequent runs of wc -l also showed lower timings.
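To rule caching out when repeating timings like these, the page cache can be dropped between runs. This is a sketch assuming Linux (the procfs knob is Linux-specific) and root privileges; the filename is a placeholder:

```shell
# Flush dirty pages, then drop the page cache so the next timing
# starts from a cold cache (Linux only, requires root)
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time wc -l my_big_file
```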
ak19
-2

Below is what worked for me: tail -5 the file, then grep for the text of the last line with grep's -n option, which prints the matching line number...

tail -5 "filename"

LC_ALL=C fgrep -n "text in yourlast line" "filename"
Stephen Kitt
The Noob
  • 2
    How does this help speed up counting the number of lines in a huge file? You’re reading the file *twice*. – Stephen Kitt Nov 29 '19 at 08:53
  • @StephenKitt, I'm not so sure about that. `tail` might well be smart enough to start reading from the end of the file. Of course that doesn't make it any more useful to do all that unnecessary work with the `grep`, or help with the fact that the text in the last line might also appear elsewhere in the file. – ilkkachu Nov 29 '19 at 09:43
  • @ilkkachu ah yes, `tail` can indeed work backwards (and the GNU version does). If the text appears multiple times, `fgrep` will show multiple matches, but will still show the last line. The results might not be accurate if the last line’s contents aren’t obvious from `tail`’s output (*e.g.* an empty line or a line containing only whitespace). – Stephen Kitt Nov 29 '19 at 09:56