Replace text quickly in very large file

Question

I have a 25GB text file in which a string needs to be replaced on only a few lines. I can use sed successfully, but it takes a really long time to run.

sed -i 's|old text|new text|g' gigantic_file.sql

Is there a quicker way to do this?

Do you know the line numbers where the text to replace is? If not, your only option for speeding it up is to get a faster computer. Having a large amount of data means it will take a large amount of time to search through it. – David King, Jan 14 '16 at 19:27
You can also use multiple CPU cores to speed it up - http://www.rankfocus.com/use-cpu-cores-linux-commands/ – ahaswer, Feb 14 '16 at 20:30
Don't use sed for large files. Take a look at [vi or vim](https://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files) instead. – MikeJRamsey56, Feb 26 '19 at 19:27

score 48 · Answer 1 · edited Mar 24 '17 at 14:43

You can try:

sed -i '/old text/ s//new text/g' gigantic_file.sql

From this ref:

OPTIMIZING FOR SPEED: If execution speed needs to be increased (due to large input files or slow processors or hard disks), substitution will be executed more quickly if the "find" expression is specified before giving the "s/.../.../" instruction.
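
To make the quoted advice concrete, here is a small sketch comparing the two forms on a throwaway file. The sample data and variable names are made up for illustration. It only shows that both forms produce the same output; it says nothing about which one is faster on a given machine.

```shell
# Build a small sample file (hypothetical data) to compare the two forms.
tmp=$(mktemp)
printf 'keep this line\nold text here, old text there\nanother line\n' > "$tmp"

# Plain substitution: the s command is attempted on every line.
plain=$(sed 's/old text/new text/g' "$tmp")

# Address form: /old text/ selects the matching lines first, and the
# empty pattern in s//.../ reuses the last regular expression used.
addressed=$(sed '/old text/ s//new text/g' "$tmp")

# Both variables should now hold the same three lines.
[ "$plain" = "$addressed" ] && echo "outputs are identical"
rm -f "$tmp"
```

No `-i` is used here, so the sample file itself is never modified.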

Here is a comparison over a 10G file. Before:

$ time sed -i 's/original/ketan/g' wiki10gb
real    5m14.823s
user    1m42.732s
sys     1m51.123s

After:

$ time sed -i '/ketan/ s//original/g' wiki10gb
real    4m33.141s
user    1m20.940s
sys     1m44.451s

answered Feb 14 '16 at 20:38 by Ketan Maheshwari · edited Mar 24 '17 at 14:43 by xhienne

The last `sed` is misspelled. I edited this post yesterday to fix the last `sed` command which should be `time sed -i '/original/ s//ketan/g' wiki10gb` and not `time sed -i '/ketan/ s//original/g' wiki10gb`. I'm reverting my edit today because 1. the times no longer match the command and 2. I have done the same test with GNU sed on a 3+ GB file and I do not observe any difference between the two `sed` alternatives. I suspect that the difference in times is due to the misspelling. – xhienne Mar 24 '17 at 14:42
@xhienne I am not sure what you mean by misspelling. In the first run, I am substituting the word 'original' with 'ketan', and in the second one I am substituting the term 'ketan' with the term 'original', resulting in an equal number of substitutions in either case. – Ketan Maheshwari Mar 24 '17 at 17:59
I was applying a "fix" reported by a new user without enough reputation. Now I understand what you did. However, if you want to prove that one syntax is better than another, you have to do the exact same operation, which is not the case here (CPU-wise, looking for a 5-char string is not the same as looking for an 8-char string). Moreover, this kind of test on a 10GB file is heavily dependent on your machine load (CPU, disk). I personally saw a lot of fluctuation in the `time` results, but all in all, there was no difference in time. – xhienne Mar 24 '17 at 18:58
I believe this is related -- see the accepted answer here, https://stackoverflow.com/questions/11145270/how-to-replace-an-entire-line-in-a-text-file-by-line-number >> sed streams the entire file, but as noted in this answer, specifying the line number (if known) helps: in my case, a ~2-fold increase in execution speed (GNU sed 4.5). You can use grep -n or ripgrep (rg) to find line numbers, based on pattern searches. In effect, specifying the line number is like having a search result on that file, per the answer above. – Victoria Stuart Apr 24 '18 at 19:05
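
The line-number approach from the last comment could look roughly like the sketch below. The sample file and variable names are hypothetical. Note that it still reads the file twice (once with grep, once with sed), so any gain depends on the tools and data involved.

```shell
# Hypothetical sample file standing in for the large one.
file=$(mktemp)
printf 'a\nold text 1\nb\nc\nold text 2\n' > "$file"

# Pass 1: collect the numbers of the lines that contain the fixed string,
# then turn each number N into a sed command "N s/old text/new text/g".
script=$(grep -n -F 'old text' "$file" | cut -d: -f1 \
         | sed 's|$| s/old text/new text/g|')

# Pass 2: run the substitution only on those line numbers.
result=$(sed "$script" "$file")
printf '%s\n' "$result"
rm -f "$file"
```

For a real run, the second sed call would take `-i` (or write to a new file) instead of capturing the output in a variable.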

score 2 · Answer 2 · answered Jan 14 '16 at 19:29

The short answer is "No" - your limiting factor on this sort of operation is disk IO. There is no way to stream 25GB off a disk any faster. You might get a minor improvement if you don't edit in place and instead write the result of the sed to a separate drive (if you have one available), because that way you can be reading from one drive whilst writing to another, and there's slightly less contention as a result.
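
A minimal sketch of the "don't edit in place" variant follows. The paths are stand-ins created with mktemp; in practice the destination would sit on a different physical drive, which is the part that may reduce contention.

```shell
# Stand-in paths for illustration; in practice "$dst" would live on
# a different physical drive than "$src".
src=$(mktemp)
dst=$(mktemp)
printf 'first line\nsome old text here\nlast line\n' > "$src"

# No -i: sed reads "$src" and the shell redirects its output to "$dst".
sed 's|old text|new text|g' "$src" > "$dst"

# Swap the result into place only after sed has succeeded, e.g.:
#   mv "$dst" "$src"
```

The source file is left untouched until the final move, so a failed run does not damage the original.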

You might be able to speed it up a bit by not running the regex engine on every line. For example, using perl (I'm pretty sure you can do this with sed, but I don't know the syntax), this only attempts the substitution after line 10,000:

perl -pe '$. > 10_000 && s/old_text/new_text/g'

And if there are any complications in the RE (metacharacters), then minimising those will slightly improve the efficiency of the regex engine.

answered Jan 14 '16 at 19:29 by Sobrique

In sed that would be `sed -i '10000,$ s/old_text/new_text/g' gigantic_file.sql` – Dani_l Jan 14 '16 at 19:40
Lovely. I don't know how `sed` compares - I assume marginally faster, but not much because of the file size. – Sobrique Jan 14 '16 at 20:00
I'd assume perl is faster than sed, but sed is somewhat less cryptic, or rather requires less of an initial learning curve. – Dani_l Jan 14 '16 at 20:02
See, now I'd have said the opposite - you can (almost) write `sed` in `perl`, but the latter also lets you write more verbose scripts too. – Sobrique Jan 14 '16 at 20:05

score 0 · Answer 3 · answered Mar 24 '17 at 03:52

If the new and old texts are the same length, you can seek into the file and write only the changed bytes, instead of copying the whole file. Otherwise you are stuck with moving lots of data.

Note: this is tricky and involves writing custom code.

See the man page for fseek if you're working in C or C++, or your favored language wrappers for the seek and write system calls.

If you insist on using the command line only, and you can get the byte offsets of the text, you can write the replacement text in place with carefully written "dd" commands.
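
A rough sketch of the dd approach, assuming GNU grep (for `-b` combined with `-o`, which prints the byte offset of each match) and plain ASCII strings of equal length. The sample file and variable names are hypothetical. Try it on a copy first, since it overwrites bytes in the file directly.

```shell
# Hypothetical sample file standing in for the large one.
file=$(mktemp)
printf 'line one\nfoo=old_text;\nline three\nold_text again\n' > "$file"

old='old_text'
new='new_text'   # must be exactly as long as $old (ASCII assumed)

if [ "${#old}" -eq "${#new}" ]; then
  # Collect the 0-based byte offset of every match before writing anything.
  offsets=$(grep -b -o -F -- "$old" "$file" | cut -d: -f1)

  for offset in $offsets; do
    # conv=notrunc keeps dd from truncating the file after the write;
    # bs=1 makes seek count in bytes.
    printf '%s' "$new" | dd of="$file" bs=1 seek="$offset" conv=notrunc 2>/dev/null
  done
fi

patched=$(cat "$file")
printf '%s\n' "$patched"
rm -f "$file"
```

Only the matched bytes are rewritten, so the cost is one read pass by grep plus a few tiny writes, rather than a full copy of the file.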
