I have a 4TB big text file Exported from Teradata records, and I want to know how many records (= lines in my case) there are in that file.
How may I do this quickly and efficiently?
If this information is not already present as metadata in a separate file (or embedded in the data, or available through a query to the system that you exported the data from), and if no index file of some description is available, then the quickest way to count the number of lines is by using wc -l on the file.
You cannot really do it any quicker.
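As a minimal illustration (reading from a pipe, or via `<`, makes wc print just the number, without the file name appended):

```shell
# wc -l counts newline characters; on a pipe it prints
# only the count, with no file name after it.
printf 'one\ntwo\n' | wc -l
```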
To count the number of records in the file, you will have to know which record separator is in use and use something like awk to count them. Again, that is if this information is not already stored elsewhere as metadata, if it is not available through a query to the originating system, and if the records themselves are not already enumerated and sorted within the file.
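As a sketch of record counting with awk, assuming a made-up record separator of ';' (substitute whatever separator your Teradata export actually uses):

```shell
# Count records instead of lines by setting awk's record
# separator RS; the ';' here is only an illustrative assumption.
printf 'rec1;rec2;rec3;' | awk 'BEGIN { RS = ";" } END { print NR }'
```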
So here is a speed test between awk and wc
67G test.tsv
time awk 'END {print NR}' test.tsv; time wc -l test.tsv
809162924
real 2m22.713s
user 1m46.712s
sys 0m19.618s
809162924 test.tsv
real 0m20.222s
user 0m9.629s
sys 0m10.592s
Another file 72G Sample.sam
time awk 'END {print NR}' Sample.sam; time wc -l Sample.sam
180824516
real 1m18.022s
user 1m5.775s
sys 0m12.238s
180824516 Sample.sam
real 0m22.534s
user 0m4.599s
sys 0m17.921s
You should not use line-based utilities such as awk and sed. These utilities issue a read() system call for every line of the input file (see that answer on why this is so). If you have lots of lines, this will be a huge performance loss.
Since your file is 4TB in size, I guess that there are a lot of lines, so even wc -l will produce a lot of read() system calls, since it reads only 16384 bytes per call (on my system). Still, this would be an improvement over awk and sed. The best method - unless you write your own program - might be just
cat file | wc -l
This is no useless use of cat, because cat reads chunks of 131072 bytes per read() system call (on my system); wc -l will still issue more read() calls, but against the pipe instead of the file directly. Either way, cat tries to read as much as possible per system call.
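If you want to pick the read size yourself, a hedged variant is to front wc -l with dd using a large block size; bs=16M below is just an illustrative, untuned choice, and the demo file stands in for the real one:

```shell
# Create a small demo file (stand-in for the real 4TB file).
printf 'a\nb\nc\n' > /tmp/demo.txt
# Stream it in 16 MiB blocks (an assumed, untuned size) and let
# wc -l count newlines on the pipe rather than on the file.
dd if=/tmp/demo.txt bs=16M status=none | wc -l
```

Note that status=none is a GNU dd option; drop it (or redirect stderr) on other systems.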
Looping over files is a job for AWK ... nothing can beat this speed
LINECOUNT=`awk '{next}; END { print FNR }' $FILE`
[root@vmd28527 bin]# time LINECOUNT=`awk '{next}; END { print FNR }' $FILE`; echo $LINECOUNT
real 0m0.005s
user 0m0.001s
sys 0m0.004s
7168
5 msec for 7168 lines ... not bad ...
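One caveat worth hedging on: the script above prints FNR, which resets for every input file, whereas NR is cumulative across all files. With a single file they agree; with several they do not. A throw-away sketch (file names invented here):

```shell
# Two tiny files to show the FNR / NR difference.
printf 'a\nb\n' > /tmp/f1.txt
printf 'c\n'    > /tmp/f2.txt
awk 'END { print NR }'  /tmp/f1.txt /tmp/f2.txt   # total lines: 3
awk 'END { print FNR }' /tmp/f1.txt /tmp/f2.txt   # last file only: 1
```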
I also did a speed comparison on a large VCF text file. Here is what I found:
216GB VCF text file (on a single SSD)
$ time wc -l <my_big_file>
16695620
real 1m26.912s
user 0m2.896s
sys 1m23.002s
$ tail -5 <my_big_file>
$ time fgrep -n <last_line_pattern> <my_big_file>
16695620:<last_line_pattern>
real 2m10.154s
user 0m46.938s
sys 1m22.492s
$ tail -5 <my_big_file>
$ LC_ALL=C && time fgrep -n <last_line_pattern> <my_big_file>
16695620:<last_line_pattern>
real 1m38.153s
user 0m45.863s
sys 0m51.944s
And, finally:
$ time awk 'END {print NR}' <my_big_file>
16695620
real 1m44.074s
user 1m11.275s
sys 0m32.780s
CONCLUSION 1:
wc -l seems fastest with SSD.
216GB VCF text file (on a RAID10 setup with 8 HDDs)
$ time wc -l <my_big_file>
16695620
real 7m22.397s
user 0m10.562s
sys 4m1.888s
$ tail -5 <my_big_file>
$ time fgrep -n <last_line_pattern> <my_big_file>
16695620:<last_line_pattern>
real 7m7.812s
user 1m58.242s
sys 3m12.355s
$ tail -5 <my_big_file>
$ LC_ALL=C && time fgrep -n <last_line_pattern> <my_big_file>
16695620:<last_line_pattern>
real 4m34.522s
user 1m26.764s
sys 1m58.247s
Finally:
$ time awk 'END {print NR}' <my_big_file>
16695620
real 6m50.240s
user 2m37.574s
sys 2m43.498s
CONCLUSION 2:
wc -l seems fairly comparable to the others.
The lower timing of LC_ALL=C fgrep -n <last_line_pattern> may well be due to caching, as subsequent wc -l runs also show lower timings.

Below is what worked for me: tail -5 the file, then fgrep for the text of your last line with the -n option, so that the match is printed together with its line number...
tail -5 "filename"
LC_ALL=C fgrep -n "text in yourlast line" "filename"
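A minimal end-to-end sketch of this trick, with an invented three-line file standing in for the real one:

```shell
# Build a sample file, peek at its tail, then locate the last
# line with fgrep -n; the leading number is the line count.
printf 'alpha\nbeta\ngamma\n' > /tmp/sample.txt
tail -5 /tmp/sample.txt
LC_ALL=C fgrep -n 'gamma' /tmp/sample.txt   # prints 3:gamma
```

This only yields the line count if the pattern you grep for matches the last line and nothing before it.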