Replace string containing newline in huge file

Question

Anyone know of a non-line-based tool to "binary" search/replace strings in a somewhat memory-efficient way? See this question too.

I have a text file larger than 2 GB that I would like to process in a way similar to what this appears to do:

sed -e 's/>\n/>/g'

That means, I want to remove all newlines that occur after a >, but not anywhere else, so that rules out tr -d.

This command (which I got from the answer to a similar question) fails with "couldn't re-allocate memory":

sed --unbuffered ':a;N;$!ba;s/>\n/>/g'

So, are there any other methods without resorting to C? I hate perl, but am willing to make an exception in this case :-)

I don't know for sure of any character that does not occur in the data, so temporarily replacing \n with another character is something I'd like to avoid if possible.

Any good ideas, anyone?

What is wrong with the first sed command? The second seems to read everything into pattern space, though I don't know what the `$!` is. I expect this will need a **LOT** of memory. – ctrl-alt-delor Jun 16 '14 at 12:35
The problem is that sed reads everything as lines. That's why the first command doesn't remove the newlines: it outputs the text line by line again. The second command is just a workaround. I think `sed` is not the proper tool in this case. – MattBianco Jun 16 '14 at 12:40
`sed` is the perfect tool for this case, but `$!ba` `b`ranches back to `:a` on every line except the last, so the whole file ends up in pattern space. Look at steeldriver's answer: his keeps 2 lines in memory as opposed to 2 GB. – mikeserv Jun 16 '14 at 13:42
@MattBianco, if you are looking for a different solution, you would do better to add a separate question. – Graeme Jun 16 '14 at 14:21
I ended up using `gsar` [like this](http://unix.stackexchange.com/a/137600/5923). – MattBianco Jun 17 '14 at 12:20
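
A small sketch of the point made in the comments above: sed hands the script each line with its trailing newline already removed, so `\n` has nothing to match. The function name `naive_sed` is made up for this illustration; the sed expression is the first one from the question.

```shell
# naive_sed is a hypothetical wrapper around the question's first command.
naive_sed() {
    sed -e 's/>\n/>/g'
}

# Each line reaches the script without its newline, so nothing is joined
# and the two lines come out unchanged.
printf 'a>\nb\n' | naive_sed
```

Running it on a two-line sample like this is a cheap way to see why the command is a no-op before trying it on the large file.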

terdon · Accepted Answer · 2014-06-16T12:51:33.707

14

This really is trivial in Perl, you shouldn't hate it!

perl -i.bak -pe 's/>\n/>/' file

Explanation

-i : edit the file in place, and create a backup of the original called file.bak. If you don't want a backup, just use perl -i -pe instead.
-pe : read the input file line by line and print each line after applying the script given as -e.
s/>\n/>/ : the substitution, just like sed.

And here's an awk approach:

awk '{if(/>$/){printf "%s",$0}else{print}}' file
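
The awk command can be tried as a filter on sample data. The sketch below wraps it in a function together with a spelled-out form of the `ORS` variant that glenn jackman suggests in the comments; the function names `join_awk` and `join_awk_ors` are invented for the example.

```shell
# Hypothetical wrappers; both read stdin and write stdout.
join_awk() {
    # Print without a newline when the line ends in ">", normally otherwise.
    awk '{ if (/>$/) printf "%s", $0; else print }'
}

join_awk_ors() {
    # Set the output record separator per line, then let the pattern "1"
    # trigger the default print action.
    awk '{ ORS = (/>$/) ? "" : "\n" } 1'
}

printf 'one>\ntwo>\nthree\nfour\n' | join_awk
printf 'one>\ntwo>\nthree\nfour\n' | join_awk_ors
```

Both forms should produce the same output; the second just moves the decision into `ORS`.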

edited Jun 16 '14 at 12:51

answered Jun 16 '14 at 12:43

terdon


3

+1. awk golf: `awk '{ORS=/>$/?"":"\n"}1'` – glenn jackman Jun 16 '14 at 13:03
1

Why I dislike perl in general is the same reason why I chose this answer (or actually your comment to Gnouc's answer): readability. Using perl -pe with a simple "sed pattern" is way more readable than a complex sed-expression. – MattBianco Jun 16 '14 at 13:05
3

@MattBianco fair enough but, just so you know, that has nothing to do with Perl. The lookbehind that Gnouc used is a feature of some regular expression languages (including, but not limited to, PCREs), not Perl's fault at all. Also, after featuring this sed monstrosity `':a;N;$!ba;s/>\n/>/g'` in your question, you've waived your right to complain about readability! :P – terdon Jun 16 '14 at 13:21
@glennjackman nice! I was playing with the `foo ? bar : baz` construct but couldn't get it to work. – terdon Jun 16 '14 at 13:22
@terdon: Yep, my mistake. Deleted it. – cuonglm Jun 16 '14 at 13:44
@terdon I never claimed to understand the sed monstrosity I put in the question. What I wanted was the first sed expression, which works fine with perl. However, even perl seems to run out of memory sometimes. Does it not work on "streams"? Very strange. I just got `Out of memory!` :-( This is when invoked in a pipe, without `-i` – MattBianco Jun 16 '14 at 14:00
@MattBianco huh, that is strange. The out of memory is probably due to however your system is buffering the pipe though. The perl command reads line by line so there should be no memory issues there. – terdon Jun 16 '14 at 14:07
@terdon well.. I continued building my script, and replaced all other newlines with a string as a next step. It was when dealing with that file I ran out of memory, since the line was then very very long. Are there no simple search-and-replace string tools from the unix days that are not line-oriented? – MattBianco Jun 16 '14 at 14:16
@MattBianco not that I know of (but there may be regardless). However, I really don't see how this perl snippet could possibly run out of memory since it never holds more than a single line in memory. I'm guessing it's your shell that is running out because of the way the pipe is being buffered. You might want to post a question explaining your entire workflow so we can help you with your final objective rather than each small step. – terdon Jun 16 '14 at 14:48
Monstrous is right - it was 2.5gbs! – mikeserv Jun 16 '14 at 22:17
I accept this answer because it was simple, readable, and did what I asked for. But I ended up using `gsar` which I needed for my other problem, explained in [this answer](http://unix.stackexchange.com/a/137600/5923). – MattBianco Jun 17 '14 at 12:19

cuonglm · Answer 2 · 2014-06-16T13:23:53.993

7

A perl solution:

$ perl -pe 's/(?<=>)\n//'

Explanation

s/// is used for string substitution.
(?<=>) is a lookbehind pattern.
\n matches a newline.

The whole pattern means: remove every newline that has a > before it.

edited Jun 16 '14 at 13:23

answered Jun 16 '14 at 12:30

cuonglm


2

Care to comment on what the parts of the program do? I'm always looking to learn. – MattBianco Jun 16 '14 at 12:32
2

Why bother with the lookbehind? Why not just `s/>\n/>/`? – terdon Jun 16 '14 at 12:44
1

or `s/>\K\n//` would also work – glenn jackman Jun 16 '14 at 13:00
@terdon: Just the first thing I thought of: remove instead of replace. – cuonglm Jun 16 '14 at 13:13
@glennjackman: good point! – cuonglm Jun 16 '14 at 13:14
@MattBianco: Sorry, I had some work to do while writing my answer. I've updated it. – cuonglm Jun 16 '14 at 13:24

Graeme · Answer 3 · 2014-06-16T13:20:42.857

3

How about this:

sed ':loop
  />$/ { N
    s/\n//
    b loop
  }' file

For GNU sed, you can also try adding the -u (--unbuffered) option as per the question. GNU sed is also happy with this as a simple one-liner:

sed ':loop />$/ { N; s/\n//; b loop }' file
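
For seds that are stricter than GNU sed about `}` and label names, the comments below suggest splitting the script into separate `-e` expressions. A minimal sketch of that form, with a made-up function name `join_sed_loop` and sample input containing consecutive lines ending in `>`:

```shell
# join_sed_loop is a hypothetical wrapper. Each -e fragment is joined to
# the next with a newline, so the label and the closing brace each end up
# on a line of their own.
join_sed_loop() {
    sed -e ':loop' -e '/>$/{' -e 'N' -e 's/\n//' -e 'b loop' -e '}'
}

# "one>" pulls in "two>", which still ends in ">", so the loop runs again
# and pulls in "three" before the joined line is printed.
printf 'one>\ntwo>\nthree\nfour\n' | join_sed_loop
```

Note that a long run of lines ending in `>` accumulates in pattern space until a line without a trailing `>` is reached.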

edited Jun 16 '14 at 13:20

answered Jun 16 '14 at 12:41

Graeme


That doesn't remove the last `\n` if the file ends in `>\n`, but that's probably preferable anyway. – Stéphane Chazelas Jun 16 '14 at 12:49
@StéphaneChazelas, why does the closing `}` need to be in a separate expression? will this not work as a multiline expression? – Graeme Jun 16 '14 at 12:56
1

That will work in POSIX seds with `b loop\n}` or `-e 'b loop' -e '}'` but not as `b loop;}` and certainly not as `b loop}` because `}` and `;` are valid in label names (though nobody in their right mind would use it. And that means GNU sed is not POSIX conformant) and the `}` command needs to be separated from the `b` command. – Stéphane Chazelas Jun 16 '14 at 13:00
@StéphaneChazelas, GNU `sed` is happy with all of the above even with `--posix`! The standard also has the following for brace expressions - `The list of sed functions shall be surrounded by braces and separated by <newline>s`. Does this not mean that semicolons should only be used outside of braces? – Graeme Jun 16 '14 at 13:15
@mikeserv, the loop is needed to handle consecutive lines ending in `>`. The original never had one, this was pointed out by Stéphane. – Graeme Jun 16 '14 at 13:58
@mikeserv, that's not the problem. The problem is that when you do the `N` the next line is removed from the input, so the only way to catch consecutive lines ending in `>` is to apply the regex to the pattern buffer again. Try it and see `echo -e 'one\ntwo>\nthree>\nfour\nfive' | sed '/>$/!b;N;s/\n//'`. – Graeme Jun 16 '14 at 14:09
I know - that's why I deleted it. It will work if you swap it around though and use hold space: sed `H;s/.*//;x;/>$/{s/\n//;h;d}` - but the hold space would require you to clean it. You're better off that way. – mikeserv Jun 16 '14 at 19:05
@mikeserv, I'm not sure how that one is working but it doesn't catch the last in the sequence of lines to be joined. Also I am getting blank lines inserted - `echo -e 'one>\ntwo>\nthree\nfour\nfive>\nsix' | sed 'H;s/.*//;x;/>$/{s/\n//;h;d}'` – Graeme Jun 16 '14 at 19:25

steeldriver · Answer 4 · 2014-06-16T14:38:02.467

You should be able to use sed with the N command, but the trick will be to delete one line from the pattern space each time that you add another (so that the pattern space always contains only 2 consecutive lines, instead of trying to read in the whole file) - try

sed ':a;$!N;s/>\n/>/;P;D;ba'

EDIT: after re-reading Peteris Krumins' Famous Sed One-Liners Explained I believe a better sed solution would be

sed -e :a -e '/>$/N; s/\n//; ta'

which appends the following line only when the current line already ends in >, and conditionally loops back to handle consecutive matching lines. It is Krumins' one-liner 39, "Append a line to the next if it ends with a backslash "\"", except that > replaces \ as the join character and the join character is retained in the output.
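
The difference between the two commands shows up on input with two consecutive lines ending in `>`. The sketch below assumes GNU sed (both commands use GNU syntax); the function names `sed_window` and `sed_conditional` are invented for the comparison.

```shell
# Hypothetical wrappers around the two commands from this answer.
sed_window() {
    # Two-line window: after the substitution removes the newline, P prints
    # the whole pattern space plus a newline, so a second ">" line in a row
    # is not joined to what follows it.
    sed ':a;$!N;s/>\n/>/;P;D;ba'
}

sed_conditional() {
    # Append the next line only when the current one ends in ">", and
    # loop with "ta" while a join actually happened.
    sed -e :a -e '/>$/N; s/\n//; ta'
}

printf 'one>\ntwo>\nthree\n' | sed_window
printf 'one>\ntwo>\nthree\n' | sed_conditional
```

On this sample the first command leaves `three` on its own line, while the second joins all three lines.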

That doesn't work if 2 consecutive lines end in `>` (that's also GNU specific) — Stéphane Chazelas, Jun 16 '14 at 12:51

score 1 · Answer 5 · answered Jun 17 '14 at 00:49

1

sed doesn't provide a way to emit output without a final newline. Your approach using N fundamentally works, but stores incomplete lines in memory, and thus can fail if the lines become too long (sed implementations aren't typically designed to handle extremely long lines).

You can use awk instead.

awk '{if (/>$/) printf "%s", $0; else print}'

An alternative approach is to use tr to swap the newline character with a “boring”, frequently-occurring character. Space might work here — pick a character that tends to appear on every line or at least a large proportion of lines in your data.

tr ' \n' '\n ' | sed 's/> />/g' | tr '\n ' ' \n'
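
A sketch of the swap pipeline on a small sample, to make the three stages visible. It assumes GNU sed, which leaves a final line without a trailing newline as it is; other seds may append one, which would show up as a stray character after the swap back. The function name `swap_join` is made up.

```shell
# swap_join is a hypothetical wrapper around the pipeline above.
swap_join() {
    # 1. spaces become newlines and newlines become spaces
    # 2. a ">" followed by a space is a ">" that was followed by a newline
    # 3. swap the two characters back
    tr ' \n' '\n ' | sed 's/> />/g' | tr '\n ' ' \n'
}

printf 'a b>\nc d\n' | swap_join
```

After the first `tr`, sed sees short "lines" split at every space, which is why the choice of a frequently occurring character matters.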

answered Jun 17 '14 at 00:49

Gilles 'SO- stop being evil'


Both methods are already demonstrated here to better effect in other answers. And his approach with `sed` does not work without a 2.5gigabyte buffer. – mikeserv Jun 17 '14 at 02:29
Did anybody mention awk? Oh, I missed it, I'd only noticed perl in terdon's answer for some reason. Nobody mentioned the `tr` approach — mikeserv, you posted a different (valid, but less generic) approach that happens to also use `tr`. – Gilles 'SO- stop being evil' Jun 17 '14 at 07:55
*valid, but less generic* sounds to me like youve just called it a *working, targeted solution.* i think its hard to argue that such a thing isnt *useful* which is odd because it has 0 upvotes. The biggest difference i can see between my own solution and your *more generic* offering, is that mine *specifically* solves a problem, whereas yours *might generally.* That might make it worthwhile - and i may even reverse my vote - but theres also the pesky matter of the 7 hours between them and the *recurring* theme of your answers mimicking others. Can you explain this? – mikeserv Jun 17 '14 at 16:48

andrej · Answer 6 · 2014-11-19T09:22:31.967

1

what about using ed?

ed -s test.txt <<< $'/fruits/s/apple/banana/g\nw'

(via http://wiki.bash-hackers.org/howto/edit-ed)

edited Nov 19 '14 at 09:22

answered Oct 10 '14 at 13:48

andrej


edited, there is no dependency on website anymore – andrej Nov 19 '14 at 09:23

score 0 · Answer 7 · edited Apr 13 '17 at 12:36

0

I ended up using gsar as described in this answer like this:

gsar -F '-s>:x0A' '-r>'

edited Apr 13 '17 at 12:36

Community

1

answered Sep 19 '14 at 14:06

MattBianco


score -1 · Answer 8 · answered Jun 16 '14 at 18:59

-1

There are a lot of ways to do this, and most here are really good, but I think this one's my favorite:

tr '>\n' '\n>' | sed 's/^>*//;H;/./!d;x;y/\n>/>\n/'

Or even:

tr '>\n' '\n>' | sed 's/^>*//' | tr '\n>' '>\n'

answered Jun 16 '14 at 18:59

mikeserv


I can’t get your first answer to work at all. While I admire the elegance of the second one, I believe that you need to remove the `*`. The way it is now, it will delete any blank lines following a line that ends with a `>`. … Hmm. Looking back at the question, I see that it’s a little ambiguous. The question says, “I want to remove all newlines that occur after a `>`, …” I interpret that to mean that `>\n\n\n\n\nfoo` should be changed to `\n\n\n\nfoo`, but I suppose `foo` might be the desired output. – Scott - Слава Україні Jun 17 '14 at 18:55
@Scott - I tested with variations on the following: `printf '>\n>\n\n>>\n>\n>>>\n>\nf\n\nff\n>\n' | tr '>\n' '\n>' | sed 's/^>*//;H;/./!d;x;y/\n>/>\n/'` - that results in `>>>>>>>>>>f\n\nff\n\n` for me with the first answer. I am curious though what you're doing to break it though, because I'd like to fix it. As to the second point - I don't agree that it is ambiguous. The OP does not ask to remove *all* `>` *preceding* a `\n`ewline, but instead to remove *all* `\n`ewlines *following* a `>`. – mikeserv Jun 17 '14 at 19:01
1

Yes, but a valid interpretation is that, in `>\n\n\n\n\n`, only the first newline is after a `>`; all the others are following other newlines. Note that the OP’s “this is what I want, if only it worked” suggestion was `sed -e 's/>\n/>/g'`, not `sed -e 's/>\n*/>/g'`. – Scott - Слава Україні Jun 17 '14 at 19:09
1

@Scott - the suggestion did not work and never could. I don't believe that the code suggestion of someone who does not fully understand the code can be considered as valid an interpreting point as the plain language that person also uses. And besides, the output - if it actually worked - of `s/>\n/>/` on `>\n\n\n\n\n` would still be something that `s/>\n/>/` would edit. – mikeserv Jun 17 '14 at 19:11
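
The two readings debated in these comments can be compared directly. The sketch below assumes GNU sed (which keeps a final unterminated line unterminated); `strip_all` is the second command from the answer and `strip_one` is the variant without the `*`, both under invented names.

```shell
# Hypothetical wrappers for the two interpretations.
strip_all() {
    # After the swap, every leading ">" on a line was a newline that
    # followed a ">" in the original, so ">*" removes the whole run.
    tr '>\n' '\n>' | sed 's/^>*//' | tr '\n>' '>\n'
}

strip_one() {
    # Without "*", only the single newline directly after ">" is removed.
    tr '>\n' '\n>' | sed 's/^>//' | tr '\n>' '>\n'
}

printf '>\n\nfoo\n' | strip_all
printf '>\n\nfoo\n' | strip_one
```

On this sample `strip_all` joins `foo` directly to the `>`, while `strip_one` leaves one newline between them.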
