17

Anyone know of a non-line-based tool to "binary" search/replace strings in a somewhat memory-efficient way? See this question too.

I have a +2GB text file that I would like to process similar to what this appears to do:

sed -e 's/>\n/>/g'

That means, I want to remove all newlines that occur after a >, but not anywhere else, so that rules out tr -d.
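
To see why the line-based command has no effect, here is a small sketch on made-up sample input. The function name `naive_join` is only an illustration and not part of the question.

```shell
# Hypothetical helper wrapping the command from the question.
naive_join() {
  sed -e 's/>\n/>/g'
}

# sed strips the trailing newline from each line before running the script,
# so the pattern space never contains ">\n" and the input passes through
# unchanged.
printf 'a>\nb\n' | naive_join
# prints:
# a>
# b
```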

This command (that I got from the answer of a similar question) fails with couldn't re-allocate memory :

sed --unbuffered ':a;N;$!ba;s/>\n/>/g'

So, are there any other methods without resorting to C? I hate perl, but am willing to make an exception in this case :-)

I don't know for sure of any character that does not occur in the data, so temporary replacing \n with another character is something I'd like to avoid if possible.

Any good ideas, anyone?

MattBianco
  • Have you tried option `--unbuffered`? – ctrl-alt-delor Jun 16 '14 at 12:26
  • With or without `--unbuffered` runs out of memory – MattBianco Jun 16 '14 at 12:28
  • What does `$!` do? – ctrl-alt-delor Jun 16 '14 at 12:33
  • What is wrong with the first sed command? The second seems to read everything into the pattern space, though I don't know what the `$!` is. I expect this will need a **LOT** of memory. – ctrl-alt-delor Jun 16 '14 at 12:35
  • The problem is that sed reads everything as lines, that's why the first command doesn't remove the newlines, since it outputs the text row-by-row again. The second command is just a workaround. I think `sed` is not the proper tool in this case. – MattBianco Jun 16 '14 at 12:40
  • `sed` is the perfect tool for this case - but `$!` loops back to `b`ranch `:a` until it reaches the last line. Look at steeldriver's answer - his keeps 2 lines in memory as opposed to 2gbs. – mikeserv Jun 16 '14 at 13:42
  • @MattBianco, if you are looking for a different solution, you are better to add a separate question. – Graeme Jun 16 '14 at 14:21
  • I ended up using `gsar` [like this](http://unix.stackexchange.com/a/137600/5923). – MattBianco Jun 17 '14 at 12:20

8 Answers

14

This really is trivial in Perl, you shouldn't hate it!

perl -i.bak -pe 's/>\n/>/' file

Explanation

  • -i : edit the file in place, and create a backup of the original called file.bak. If you don't want a backup, just use perl -i -pe instead.
  • -pe : read the input file line by line and print each line after applying the script given as -e.
  • s/>\n/>/ : the substitution, just like sed.

And here's an awk approach:

awk '{if(/>$/){printf "%s",$0}else{print}}' file
terdon
  • 3
    +1. awk golf: `awk '{ORS=/>$/?"":"\n"}1'` – glenn jackman Jun 16 '14 at 13:03
  • 1
    Why I dislike perl in general is the same reason why I chose this answer (or actually your comment to Gnouc's answer): readability. Using perl -pe with a simple "sed pattern" is way more readable than a complex sed-expression. – MattBianco Jun 16 '14 at 13:05
  • 3
    @MattBianco fair enough but, just so you know, that has nothing to do with Perl. The lookbehind that Gnouc used is a feature of some regular expression languages (including, but not limited to, PCREs), not Perl's fault at all. Also, after featuring this sed monstrosity `':a;N;$!ba;s/>\n/>/g'` in your question, you've waived your right to complain about readability! :P – terdon Jun 16 '14 at 13:21
  • @glennjackman nice! I was playing with the `foo ? bar : baz` construct but couldn't get it to work. – terdon Jun 16 '14 at 13:22
  • @terdon: Yeap, my mistake. Delete it. – cuonglm Jun 16 '14 at 13:44
  • @terdon I never claimed to understand the sed monstrosity I put in the question. What I wanted was the first sed expression, which works fine with perl. However, even perl seems to run out of memory sometimes. Does it not work on "streams"? Very strange. I just got `Out of memory!` :-( This is when invoked in a pipe, without `-i` – MattBianco Jun 16 '14 at 14:00
  • @MattBianco huh, that is strange. The out of memory is probably due to however your system is buffering the pipe though. The perl command reads line by line so there should be no memory issues there. – terdon Jun 16 '14 at 14:07
  • @terdon well.. I continued building my script, and replaced all other newlines with a string as a next step. It was when dealing with that file I ran out of memory, since the line was then very very long. Are there no simple search-and-replace string tools from the unix days that are not line-oriented? – MattBianco Jun 16 '14 at 14:16
  • @MattBianco not that I know of (but there may be regardless). However, I really don't see how this perl snippet could possibly run out of memory since it never holds more than a single line in memory. I'm guessing it's your shell that is running out because of the way the pipe is being buffered. You might want to post a question explaining your entire workflow so we can help you with your final objective rather than each small step. – terdon Jun 16 '14 at 14:48
  • Monstrous is right - it was 2.5gbs! – mikeserv Jun 16 '14 at 22:17
  • I accept this answer because it was simple, readable, and did what I asked for. But I ended up using `gsar` which I needed for my other problem, explained in [this answer](http://unix.stackexchange.com/a/137600/5923). – MattBianco Jun 17 '14 at 12:19
7

A perl solution:

$ perl -pe 's/(?<=>)\n//'

Explanation

  • s/// is used for string substitution.
  • (?<=>) is lookbehind pattern.
  • \n matches newline.

The whole pattern means: remove every newline that is preceded by a >.

cuonglm
3

How about this:

sed ':loop
  />$/ { N
    s/\n//
    b loop
  }' file

For GNU sed, you can also try adding the -u (--unbuffered) option as per the question. GNU sed is also happy with this as a simple one-liner:

sed ':loop />$/ { N; s/\n//; b loop }' file
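
A small check of the loop on made-up input with consecutive lines ending in >, as a sketch. The function name `sed_loop_join` is hypothetical, and the script is split into separate `-e` expressions, which should behave like the multi-line form above:

```shell
# Hypothetical helper: the loop from above, one command per -e expression.
sed_loop_join() {
  sed -e ':loop' -e '/>$/{' -e 'N' -e 's/\n//' -e 'b loop' -e '}'
}

# "two>" pulls in "three>", which still ends in ">", so the loop runs
# again and pulls in "four" as well.
printf 'one\ntwo>\nthree>\nfour\nfive\n' | sed_loop_join
# prints:
# one
# two>three>four
# five
```

Only the lines currently being joined sit in the pattern space, so memory use depends on the length of the joined line, not on the size of the file.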
Graeme
  • That doesn't remove the last `\n` if the file ends in `>\n`, but that's probably preferable anyway. – Stéphane Chazelas Jun 16 '14 at 12:49
  • @StéphaneChazelas, why does the closing `}` need to be in a separate expression? will this not work as a multiline expression? – Graeme Jun 16 '14 at 12:56
  • 1
    That will work in POSIX seds with `b loop\n}` or `-e 'b loop' -e '}'` but not as `b loop;}` and certainly not as `b loop}` because `}` and `;` are valid in label names (though nobody in their right mind would use it. And that means GNU sed is not POSIX conformant) and the `}` command needs to be separated from the `b` command. – Stéphane Chazelas Jun 16 '14 at 13:00
  • @StéphaneChazelas, GNU `sed` is happy with all of the above even with `--posix`! The standard also has the following for brace expressions - `The list of sed functions shall be surrounded by braces and separated by <newline>s`. Does this not mean that semicolons should only be used outside of braces? – Graeme Jun 16 '14 at 13:15
  • @mikeserv, the loop is needed to handle consecutive lines ending in `>`. The original never had one, this was pointed out by Stéphane. – Graeme Jun 16 '14 at 13:58
  • @mikeserv, that's not the problem. The problem is that when you do the `N` the next line is removed from the input, so the only way to catch consecutive lines ending in `>` is to apply the regex to the pattern buffer again. Try it and see `echo -e 'one\ntwo>\nthree>\nfour\nfive' | sed '/>$/!b;N;s/\n//'`. – Graeme Jun 16 '14 at 14:09
  • I know - that's why I deleted it. It will work if you swap it around though and use hold space: sed `H;s/.*//;x;/>$/{s/\n//;h;d}` - but the hold space would require you to clean it. You're better off that way. – mikeserv Jun 16 '14 at 19:05
  • @mikeserv, I'm not sure how that one is working but it doesn't catch the last in the sequence of lines to be joined. Also I am getting blank lines inserted - `echo -e 'one>\ntwo>\nthree\nfour\nfive>\nsix' | sed 'H;s/.*//;x;/>$/{s/\n//;h;d}'` – Graeme Jun 16 '14 at 19:25
1

You should be able to use sed with the N command, but the trick will be to delete one line from the pattern space each time that you add another (so that the pattern space always contains only 2 consecutive lines, instead of trying to read in the whole file) - try

sed ':a;$!N;s/>\n/>/;P;D;ba'

EDIT: after re-reading Peteris Krumins' Famous Sed One-Liners Explained I believe a better sed solution would be

sed -e :a -e '/>$/N; s/\n//; ta'

which appends the following line only when the current one ends in >, and conditionally loops back to handle consecutive matching lines. It is Krumins' one-liner 39, "Append a line to the next if it ends with a backslash "\"", except that > replaces \ as the join character and the join character is retained in the output.
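
Here is a sketch of how the conditional branch plays out on made-up input (the function name `sed_t_join` is hypothetical). `ta` jumps back to `:a` only if a substitution has succeeded since the last line was read or the last `t` was taken, so the loop stops as soon as no newline was removed:

```shell
# Hypothetical helper wrapping the second command from this answer.
sed_t_join() {
  sed -e :a -e '/>$/N; s/\n//; ta'
}

# "x>" pulls in "y>", the substitution succeeds, and ta branches back;
# "x>y>" still ends in ">", so "z" is pulled in too. After that no
# newline is left to remove, ta falls through, and the line is printed.
printf 'x>\ny>\nz\nw\n' | sed_t_join
# prints:
# x>y>z
# w
```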

steeldriver
1

sed doesn't provide a way to emit output without a final newline. Your approach using N fundamentally works, but stores incomplete lines in memory, and thus can fail if the lines become too long (sed implementations aren't typically designed to handle extremely long lines).

You can use awk instead.

awk '{if (/>$/) printf "%s", $0; else print}'
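
A quick check of the awk idea on made-up input, written to match a trailing > as in the question. This is only a sketch and the function name `awk_join` is hypothetical:

```shell
# Hypothetical helper: print lines ending in ">" without a newline,
# and every other line with one.
awk_join() {
  awk '{ if (/>$/) printf "%s", $0; else print }'
}

printf '<a>\ntext\n<b>\n<c>\nend\n' | awk_join
# prints:
# <a>text
# <b><c>end
```

Each record is written out as soon as it is read, so awk does not have to build up the joined line in memory.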

An alternative approach is to use tr to swap the newline character with a “boring”, frequently-occurring character. Space might work here — pick a character that tends to appear on every line or at least a large proportion of lines in your data.

tr ' \n' '\n ' | sed 's/> />/g' | tr '\n ' ' \n'
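
A sketch of the swap on made-up input (the function name `swap_join` is hypothetical). It assumes GNU sed, which leaves a missing final newline alone; other implementations may add one, which would show up as a stray trailing space after the last `tr`:

```shell
# Hypothetical helper wrapping the pipeline above.
swap_join() {
  tr ' \n' '\n ' | sed 's/> />/g' | tr '\n ' ' \n'
}

# After the first tr, original newlines are spaces, so "> " marks a
# newline that followed ">". A space that originally followed ">" is a
# newline at that point and is left alone.
printf 'a b>\nc d\ne> f\n' | swap_join
# prints:
# a b>c d
# e> f
```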
Gilles 'SO- stop being evil'
  • Both methods are already demonstrated here to better effect in other answers. And his approach with `sed` does not work without a 2.5gigabyte buffer. – mikeserv Jun 17 '14 at 02:29
  • Did anybody mention awk? Oh, I missed it, I'd only noticed perl in terdon's answer for some reason. Nobody mentioned the `tr` approach — mikeserv, you posted a different (valid, but less generic) approach that happens to also use `tr`. – Gilles 'SO- stop being evil' Jun 17 '14 at 07:55
  • *valid, but less generic* sounds to me like youve just called it a *working, targeted solution.* i think its hard to argue that such a thing isnt *useful* which is odd because it has 0 upvotes. The biggest difference i can see between my own solution and your *more generic* offering, is that mine *specifically* solves a problem, whereas yours *might generally.* That might make it worthwhile - and i may even reverse my vote - but theres also the pesky matter of the 7 hours between them and the *recurring* theme of your answers mimicking others. Can you explain this? – mikeserv Jun 17 '14 at 16:48
1

what about using ed?

ed -s test.txt <<< $'/fruits/s/apple/banana/g\nw'

(via http://wiki.bash-hackers.org/howto/edit-ed)

andrej
0

I ended up using gsar as described in this answer like this:

gsar -F '-s>:x0A' '-r>'
MattBianco
-1

There are a lot of ways to do this, and most here are really good, but I think this one's my favorite:

tr '>\n' '\n>' | sed 's/^>*//;H;/./!d;x;y/\n>/>\n/'

Or even:

tr '>\n' '\n>' | sed 's/^>*//' | tr '\n>' '>\n'
mikeserv
  • I can’t get your first answer to work at all. While I admire the elegance of the second one, I believe that you need to remove the `*`. The way it is now, it will delete any blank lines following a line that ends with a `>`. … Hmm. Looking back at the question, I see that it’s a little ambiguous. The question says, “I want to remove all newlines that occur after a `>`, …” I interpret that to mean that `>\n\n\n\n\nfoo` should be changed to `\n\n\n\nfoo`, but I suppose `foo` might be the desired output. – Scott - Слава Україні Jun 17 '14 at 18:55
  • @Scott - I tested with variations on the following: `printf '>\n>\n\n>>\n>\n>>>\n>\nf\n\nff\n>\n' | tr '>\n' '\n>' | sed 's/^>*//;H;/./!d;x;y/\n>/>\n/'` - that results in `>>>>>>>>>>f\n\nff\n\n` for me with the first answer. I am curious what you're doing to break it, though, because I'd like to fix it. As to the second point - I don't agree that it is ambiguous. The OP does not ask to remove *all* `>` *preceding* a `\n`ewline, but instead to remove *all* `\n`ewlines *following* a `>`. – mikeserv Jun 17 '14 at 19:01
  • 1
    Yes, but a valid interpretation is that, in `>\n\n\n\n\n`, only the first newline is after a `>`; all the others are following other newlines. Note that the OP’s “this is what I want, if only it worked” suggestion was `sed -e 's/>\n/>/g'`, not `sed -e 's/>\n*/>/g'`. – Scott - Слава Україні Jun 17 '14 at 19:09
  • 1
    @Scott - the suggestion did not work and never could. I don't believe that the code suggestion of someone who does not fully understand the code can be considered as valid an interpreting point as the plain language that person also uses. And besides, the output - if it actually worked - of `s/>\n/>/` on `>\n\n\n\n\n` would still be something that `s/>\n/>/` would edit. – mikeserv Jun 17 '14 at 19:11
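
The two readings discussed in these comments can be compared side by side. This is a sketch on made-up input: the function names are hypothetical, and it assumes GNU sed, which keeps a missing final newline as is.

```shell
# Hypothetical helpers: the second pipeline from the answer, and the
# variant without "*" suggested in the comments.
join_greedy() { tr '>\n' '\n>' | sed 's/^>*//' | tr '\n>' '>\n'; }
join_single() { tr '>\n' '\n>' | sed 's/^>//'  | tr '\n>' '>\n'; }

# With "*", every newline in the run after ">" is removed.
printf 'a>\n\n\nfoo\n' | join_greedy
# prints:
# a>foo

# Without "*", only the newline directly after ">" is removed.
printf 'a>\n\n\nfoo\n' | join_single
# prints:
# a>
# (an empty line)
# foo
```

After the first `tr`, the data is split into "lines" at every >, so sed's memory use depends on the distance between > characters rather than on the original line lengths.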