
The 5 GB files I have are streams of data rows in this form:

    {datarow1...},{datarow2...},...,{datarowN...}

So one could say that there are "lines" ({...}) and even line separators, but the separator is the three-character sequence },{ rather than a newline.

I want to do two things:

  1. print the "lines" that contain the string "error":

    grep -o -P '{[^{}]+?error.+?}' ES01.log > ES01.err.log
    
  2. make the file more "friendly" by writing a copy that uses newlines as record separators:

    <ES01.log sed -e 's/},{/}\n{/g' > ESnl01.log
    

While the above works for relatively small files (up to ~100 MB), my files are much bigger, so both commands run out of memory:

    grep: memory exhausted
    sed: couldn't re-allocate memory

Both grep and sed process their input line by line, which in this case (no newline separators) means loading the whole file into memory.
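A quick way to confirm that diagnosis is to check how many lines a line-oriented tool actually sees. The sketch below uses a tiny made-up file, sample.log (a hypothetical stand-in for ES01.log), with no newline in it at all.

```shell
# Hypothetical three-record sample in the same },{ layout, with no newline at all.
printf '%s' '{id=1,status=ok},{id=2,status=error},{id=3,status=ok}' > sample.log

# wc -l counts newline characters: there are none.
newline_count=$(wc -l < sample.log | tr -d ' ')

# awk still sees one (unterminated) line, so it has to buffer the entire file.
# Harmless on this sample; this buffering is what breaks on multi-gigabyte input.
line_records=$(awk 'END { print NR }' sample.log)

echo "newlines: $newline_count, lines seen by awk: $line_records"
```

Any tool that buffers a full line before processing it will hit the same limit, which is why the fix has to change the record separator rather than the regex.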

Any idea how to approach this using some other smart one-liner?

Gilles 'SO- stop being evil'
msciwoj

4 Answers


With gawk:

    gawk -v 'RS=},{' '{sub(",", "\n", RT); printf "%s", $0 RT}' < file

perl equivalent:

    perl -pe 'BEGIN{$/="},{"}; s/,\{$/\n{/' < file

Otherwise, POSIXly:

    tr , '\n' < file | awk '{
      if (/^{/ && e) print ""
      printf "%s", $0
      if (/}$/) e=1
      else {e=0; printf ","}}
      END {print ""}'

Pipe those to grep error to see the records with errors, and to paste -sd, - to restore to original format.
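As a rough end-to-end sketch of that pipeline, the following runs the POSIX variant on a tiny made-up file. The file names (sample.log and friends) and the split_records helper are hypothetical, and the braces are written as bracket expressions ([{] and [}]) only as a precaution for awk implementations that treat a bare { in a regex as an interval.

```shell
# Hypothetical three-record sample in the same },{ layout (no trailing newline).
printf '%s' '{id=1,status=ok},{id=2,status=error},{id=3,status=ok}' > sample.log

# Split on every comma, then re-join the pieces that belong to one record.
split_records() {
  tr , '\n' < "$1" | awk '{
    if (/^[{]/ && e) print ""        # previous piece closed a record: start a new line
    printf "%s", $0
    if (/[}]$/) e = 1                # this piece ends a record
    else { e = 0; printf "," }       # comma was inside a record: put it back
  }
  END { print "" }'
}

split_records sample.log > sample.nl.log          # one record per line
grep error sample.nl.log > sample.err.log         # only the records with errors
paste -sd, sample.nl.log > sample.restored.log    # back to the original layout
```

Because tr emits a line at every comma, no stage ever holds more than a fragment of one record, so memory use should stay flat regardless of file size. The restored file differs from the input only by the trailing newline that paste adds.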

Stéphane Chazelas

You could also do this in Perl:

    perl -ne 'BEGIN{$/="},{"} chomp;
              s/\n$//; s/^\{//; s/}$//;
              print "{$_}\n"; ' file

This uses the same principle as the gawk solution that Stéphane Chazelas suggested. In Perl, $/ is the input record separator, so we set it to },{ to read the records correctly and then print each one followed by a newline.

You could easily expand this to do both of the operations you ask for:

    perl -i -ne 'BEGIN{$/="},{"}
                 chomp;
                 s/\n$//; s/^\{//; s/}$//; print "{$_}\n";
                 print STDERR "{$_}\n" if /error/' file 2> ES01.err.log

terdon

If you are willing to try a program that is probably not yet installed on your system, try gsar, explained in this answer to the same problem.

gsar is a search and (optionally) replace utility that operates on binary files. It cannot however search with regular expressions.

This command:

    gsar '-s},{' '-r}:x0A{' ES01.log > ESnl01.log

replaces the comma between }{ with a newline character, reading from ES01.log and redirecting output to ESnl01.log.

The search (-s) and replacement (-r) strings do not need to be of the same length.

MattBianco

You could do this simply in Perl with a regex that uses lookarounds:

    perl -pe 's/(?<=}),(?=\{)/\n/g' file

Avinash Raj