-1

I have a task similar to the one in the thread "Print Matching line and nth line from the matched line".

I need to match a specific line, print it, remove the line that immediately follows it, and then print the rest until the specific line matches again, and so on.

In other words, I need to remove only the </s> lines that directly follow a line starting with <doc.

My file:

<doc>
</s>
<s>
Bla
bla
bla
.
</s>
<s>
Bla
bla
bla
.
</s>
</doc>
<doc>
</s>
...

My required output:

<doc>
<s>
Bla
bla
bla
.
</s>
<s>
Bla
bla
bla
.
</s>
</doc>
<doc>
...
Philippos
Rodrigo
  • 4
    It's not a code-writing service. You have to show us that you tried to do something – mrc02_kr Sep 07 '17 at 07:56
  • Of course the comment of @mrc02_kr is right. But +1 for giving a precise definition of the question, even with input and output. That's unusual. – Volker Siegel Sep 09 '18 at 11:45

3 Answers

2

This is not too hard to figure out with basic sed knowledge:

sed '/<doc>/{n;/<\/s>/d;}'

For a line with <doc>, print it and read the next line with n; then, if this following line contains </s> (the slash needs to be escaped), delete it with d.

More verbose explanation: /expression/{command;command;...;} means to execute the commands only on lines that match the pattern, so all other lines simply get printed as they are, while for the <doc> line, n is executed. This command prints the current line and reads the next one, so the following commands are executed on the next line. Next comes another command (d) with an "address" (/<\/s>/), so the line is deleted only if it contains </s>; otherwise it is printed. In either case the script continues with the following line.
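To see the explanation above in action, here is a minimal sketch that feeds a shortened copy of the question's sample through the same sed script. The variable names `input` and `result` are only illustrative and do not come from the thread.

```shell
# Shortened sample modelled on the question's file (assumption: the real
# file has the same shape, with </s> directly after each <doc> line).
input='<doc>
</s>
<s>
Bla
.
</s>
</doc>
<doc>
</s>'

# n prints the <doc> line and loads the next line;
# /<\/s>/d then deletes that line only if it contains </s>.
result=$(printf '%s\n' "$input" | sed '/<doc>/{n;/<\/s>/d;}')
printf '%s\n' "$result"
```

Note that the closing </s> before </doc> is kept, because it does not follow a <doc> line.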

Philippos
  • thanks, this works perfectly! I did not know about using "{n". – Rodrigo Sep 11 '17 at 06:46
  • Actually, `{n` is not a command. I added a more detailed explanation for you. If the script solves your question, please consider marking it as answered for future readers. – Philippos Sep 11 '17 at 07:13
  • thanks, for the detailed explanation, now I completely understand. – Rodrigo Sep 11 '17 at 08:33
1

With GNU sed:

sed -z -i 's:<doc>\n</s>:<doc>:g' infile.txt

This replaces <doc> followed by </s> on the next line with just <doc>. The -i flag edits the file in place, and the g flag replaces all occurrences. -z makes sed use NUL characters instead of newlines as line separators, so the whole file is read as a single record and \n in the pattern can match.
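As a rough illustration of why -z matters here, the following sketch runs the same substitution with and without it, on standard input instead of in place. It assumes GNU sed (the -z option is a GNU extension); the variable names are made up for this example.

```shell
# Small sample in the shape of the question's file.
sample='<doc>
</s>
<s>
Bla
</s>
</doc>'

# With -z the whole input is a single record, so \n can match the
# line break between <doc> and </s>.
joined=$(printf '%s\n' "$sample" | sed -z 's:<doc>\n</s>:<doc>:g')
printf '%s\n' "$joined"

# Without -z sed sees one line at a time, the pattern space never
# contains a newline, and the input passes through unchanged.
unchanged=$(printf '%s\n' "$sample" | sed 's:<doc>\n</s>:<doc>:g')
```

Dropping -i while experimenting keeps the original file untouched until the result looks right.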

αғsнιη
  • `\n` in the pattern is a POSIX requirement supported by all `sed` implementations I ever met. But as `sed` works linewise this can't ever match unless you join lines first or use `-z` option with GNU `sed` – Philippos Sep 07 '17 at 08:47
  • 1) making `-z` default would break 99% of all scripts! 2) Without `-z` your script doesn't work with GNU `sed` 4.4 (like it doesn't with any other version). 3) I suppose you are talking about `\n` in the replacement string. In the matching pattern I can't think of any version not supporting it. – Philippos Sep 07 '17 at 10:23
  • The second version will still not work on any `sed` version. – Philippos Sep 08 '17 at 05:44
  • @Philippos Correct, I removed that part from my answer. But may I ask why `sed` doesn't recognize a newline in the search part, so that I have to use `-z` again? Thank you – αғsнιη Sep 08 '17 at 09:09
  • 2
    Because the default behaviour of `sed` is to read one line of input to the buffer ("pattern space"), apply the script on it and proceed with the next line. So there will never be a newline in the pattern space unless you use a command like `N`, which appends the next line to the pattern space, letting you have two lines with a newline in between. You can run `N;P;D` patterns to always have two consecutive lines in the buffer, but that's beyond what can be explained in a comment. – Philippos Sep 08 '17 at 09:17
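The `N;P;D` idea mentioned in the comment above could look roughly like the sketch below. It is one possible way to keep a two-line window in the pattern space, not a script posted in the thread, and the variable names are hypothetical.

```shell
# Sample in the shape of the question's file.
text='<doc>
</s>
<s>
Bla
</s>
</doc>
<doc>
</s>'

# $!N  appends the next line (except on the last line), so the pattern
#      space holds "line1\nline2"
# s    runs only if the pair is exactly <doc> followed by </s>, and then
#      drops the newline and the second line
# P    prints up to the first newline; D deletes that part and restarts
#      the cycle with the remainder instead of reading fresh input
window=$(printf '%s\n' "$text" | sed '$!N;/^<doc>\n<\/s>$/s/\n.*//;P;D')
printf '%s\n' "$window"
```

Because two consecutive lines are in the pattern space at the same time, `\n` in the address can match here without the `-z` option.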
0

As you tagged this shell_script, I would suggest an awk approach:

awk '/^<doc>/{ print; if ((getline nl) > 0 && nl!~/^<\/s>/) print nl; next }1' file

The output:

<doc>
<s>
Bla
bla
bla
.
</s>
<s>
Bla
bla
bla
.
</s>
</doc>
<doc>
...
RomanPerekhrest
  • This is not the real output of your script. Your script also removes the last line like all other lines following ``, although the question says *I need to remove only lines with </s> which follows the line starting with <doc* – Philippos Sep 07 '17 at 08:58
  • @Philippos, see my update – RomanPerekhrest Sep 07 '17 at 09:14
  • I suppose this works (I can't verify this as some shell magic gives me a `!~/^: event not found` error for your script), and it confirms my preference for `sed` over `awk` for most tasks. (-; – Philippos Sep 07 '17 at 10:01
  • @Philippos, strange, which shell is giving you that error? – RomanPerekhrest Sep 07 '17 at 10:03
  • `bash` 4.3.30 as well as 4.4.12, almost no custom settings. But I've seen it messing with `!` even inside single quotes before. It's rare, but nasty. Not related to your script though. – Philippos Sep 07 '17 at 10:15
  • Oh! I just realized this only happens if I `echo` the text and pipe it to your `awk` script. No problems when working on a file. But your script gives me a trailing `` line for some reason (-: – Philippos Sep 07 '17 at 10:32
  • @Philippos, you may look https://ibb.co/bsoLYa – RomanPerekhrest Sep 07 '17 at 10:35
  • Yes, see my last comment. It happens only with piped input. – Philippos Sep 07 '17 at 10:37
  • I did analyze the problem. I suspect it's a bug, see [my question here](https://unix.stackexchange.com/questions/390931/bash-history-expansion-inside-single-quotes-after-a-double-quote-inside-the-sam) – Philippos Sep 08 '17 at 05:45
  • @Philippos, yes, you have posted an interesting question. I wish it was successfully answered – RomanPerekhrest Sep 08 '17 at 07:43