I have a 25GB text file that needs a string replaced on only a few lines. I can use sed successfully, but it takes a really long time to run.
sed -i 's|old text|new text|g' gigantic_file.sql
Is there a quicker way to do this?
You can try:
sed -i '/old text/ s//new text/g' gigantic_file.sql
From this ref:
OPTIMIZING FOR SPEED: If execution speed needs to be increased (due to large input files or slow processors or hard disks), substitution will be executed more quickly if the "find" expression is specified before giving the "s/.../.../" instruction.
Here is a comparison over a 10G file. Before:
$ time sed -i 's/original/ketan/g' wiki10gb
real 5m14.823s
user 1m42.732s
sys 1m51.123s
After:
$ time sed -i '/ketan/ s//original/g' wiki10gb
real 4m33.141s
user 1m20.940s
sys 1m44.451s
The short answer is "No": your limiting factor on this sort of operation is disk IO. There is no way to stream 25GB off a disk any faster. You might get a minor improvement if you don't edit in place and instead write the result of the sed to a separate drive (if you have one available), because that way you can be reading from one drive whilst writing to the other, and there's slightly less contention as a result.
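As a rough sketch of the "separate drive" idea above: drop -i and redirect sed's output to a file on the other disk. The function name and the destination path are placeholders of mine, and whether this helps at all depends on the two paths really being on different physical devices.

```shell
# replace_to_other_drive SRC DEST
# Streams SRC through sed and writes the result to DEST.
# Assumption: DEST lives on a different physical disk than SRC.
replace_to_other_drive() {
    src=$1
    dest=$2
    # No -i here: sed reads from one disk while the redirect writes to the other.
    sed 's|old text|new text|g' "$src" > "$dest"
}

# Hypothetical usage (the /mnt/other_disk path is a placeholder):
# replace_to_other_drive gigantic_file.sql /mnt/other_disk/gigantic_file.sql
```

The original file is left untouched, so you can check the output before swapping it in.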
You might be able to speed it up a bit by not using the regex engine on every line. For example, using perl (I'm pretty sure you can do this with sed but I don't know the syntax), the following only applies the substitution to lines after line 10,000:
perl -pe '$. > 10_000 && s/old_text/new_text/g'
And if there are any complications in the regex (metacharacters), then minimising those will slightly improve the efficiency of the regex engine.
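For what it's worth, sed can express the same line restriction with an address range of the form N,$ placed in front of the s command. A minimal sketch, wrapped in a function whose name is my own; like the perl one-liner it writes to standard output rather than editing in place:

```shell
# replace_after_line N FILE
# Applies the substitution only to lines after line N (i.e. from N+1 to the end),
# matching the perl condition "$. > N".
replace_after_line() {
    start=$(( $1 + 1 ))
    # "START,$" is a sed address range: from line START to the last line.
    sed "${start},\$ s/old_text/new_text/g" "$2"
}

# Hypothetical usage:
# replace_after_line 10000 gigantic_file.sql > patched.sql
```

Lines before the range are still read and copied, so this saves regex work, not IO.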
If the new and old texts are the same length, you can seek into the file and write only the changed bytes, instead of copying the whole file. Otherwise you are trapped into moving lots of data.
Note: this is tricky and involves writing custom code.
See the man page for fseek if you're working in C or C++, or your favored language wrappers for the seek and write system calls.
If you insist on using the command line only, and you can get the byte offsets of the text, you can write the replacement text in place with carefully written "dd" commands.
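Here is one way the dd approach might look, as a sketch rather than a tested recipe for your data. It assumes GNU grep (for -b together with -o, which prints the byte offset of each match), a search string that is fixed text without newlines, and replacement text of exactly the same byte length. The function name is made up for illustration. Try it on a copy first, since it modifies the file directly.

```shell
# patch_in_place FILE OLD NEW
# Overwrites every occurrence of OLD with NEW inside FILE, touching only
# the matched bytes. OLD and NEW must be the same length in bytes.
patch_in_place() {
    file=$1
    old=$2
    new=$3
    old_len=$(( $(printf '%s' "$old" | wc -c) ))
    new_len=$(( $(printf '%s' "$new" | wc -c) ))
    if [ "$old_len" -ne "$new_len" ]; then
        echo "old and new text must be the same length" >&2
        return 1
    fi
    # Collect all offsets before writing anything.
    # -a: treat the file as text, -F: fixed string, output is "OFFSET:MATCH".
    offsets=$(LC_ALL=C grep -a -b -o -F -- "$old" "$file" | cut -d: -f1)
    for offset in $offsets; do
        # bs=1 makes seek count bytes; conv=notrunc keeps the rest of the file.
        printf '%s' "$new" |
            dd of="$file" bs=1 seek="$offset" conv=notrunc 2>/dev/null
    done
}

# Hypothetical usage ("old text" and "new text" are both 8 bytes):
# patch_in_place gigantic_file.sql 'old text' 'new text'
```

Finding the offsets still means reading the whole file once, but nothing is rewritten except the matched bytes, so the 25GB copy that sed -i makes is avoided.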