Delete string between two regex patterns

Question

I have a file with the following contents:

..\..\src\modules\core\abc\abc.cpp
..\..\src\modules\core\something\xyz\xyz.cpp
..\..\src\other_modules\new_core\something\pqr\pqr.cpp
..\..\src\other_modules\new_core\something\pqr\abc.cpp

The result I am expecting is

..\..\src\abc\abc.cpp
..\..\src\xyz\xyz.cpp
..\..\src\pqr\pqr.cpp
..\..\src\pqr\abc.cpp

How can I achieve this using sed?

I am unable to write a regular expression that captures two groups at the same time.

initial group (..\..\src) - this will be the same in all the lines
variable group (abc\abc.cpp) or (xyz\xyz.cpp) or (pqr\pqr.cpp) or (pqr\abc.cpp)

heemayl · Accepted Answer · 2016-10-04T09:55:40.683

2

With BSD sed or recent versions of GNU sed (for older versions, replace -E with -r):

sed -E 's#(.*\\src).*(\\[^\]+\\[^\]+$)#\1\2#' file.txt

# is used as the delimiter for the substitution (s) command of sed, to avoid confusion with the \ characters in the input
(.*\\src) matches from the start of the line up to src and stores the match in capture group 1
(\\[^\]+\\[^\]+$) matches the last two \-separated components up to the end of the line and stores them in capture group 2; the .* before it matches everything between the first and second capture groups
The replacement \1\2 joins the two capture groups, which drops everything between them
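
To see what each group captures, the groups can be printed in brackets instead of being joined. This is a minimal sketch, assuming a sed that supports -E; the sample line is taken from the question, and the g1=/g2= labels and the variable names are only for illustration:

```shell
# Show what each capture group holds for one sample line from the question.
line='..\..\src\modules\core\something\xyz\xyz.cpp'
groups=$(printf '%s\n' "$line" |
  sed -E 's#(.*\\src).*(\\[^\]+\\[^\]+$)#g1=[\1] g2=[\2]#')
printf '%s\n' "$groups"
```

For this line, group 1 should hold ..\..\src and group 2 should hold \xyz\xyz.cpp; the \modules\core\something part is consumed by the middle .* and never reaches the replacement.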

With BRE, without -E (the \+ used here is a GNU extension rather than strict POSIX):

sed 's#\(.*\\src\).*\(\\[^\]\+\\[^\]\+$\)#\1\2#' file.txt
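
The \+ in the BRE command above is understood by GNU sed but is not part of POSIX basic regular expressions. A sketch of a stricter variant, using the POSIX interval \{1,\} in its place, could look like this; the function name strip_middle is hypothetical and only wraps the command for reuse:

```shell
# Same substitution, with the POSIX interval \{1,\} instead of the GNU-only \+.
strip_middle() {
  sed 's#\(.*\\src\).*\(\\[^\]\{1,\}\\[^\]\{1,\}$\)#\1\2#' "$@"
}

# Try it on one sample line from the question.
printf '%s\n' '..\..\src\modules\core\abc\abc.cpp' | strip_middle
```

Called with no arguments it filters standard input; called as strip_middle file.txt it reads the file, just like the original command.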

Example:

% cat file.txt
..\..\src\modules\core\abc\abc.cpp
..\..\src\modules\core\something\xyz\xyz.cpp
..\..\src\other_modules\new_core\something\pqr\pqr.cpp
..\..\src\other_modules\new_core\something\pqr\abc.cpp

% sed -E 's#(.*\\src).*(\\[^\]+\\[^\]+$)#\1\2#' file.txt
..\..\src\abc\abc.cpp
..\..\src\xyz\xyz.cpp
..\..\src\pqr\pqr.cpp
..\..\src\pqr\abc.cpp

edited Oct 04 '16 at 09:55

answered Oct 04 '16 at 05:22

heemayl


Could you let me know why you used `sed -E`? – dhiraj suvarna Oct 04 '16 at 05:39

@phoenix The `-E` option lets us use ERE (Extended RegEx) patterns; otherwise we have to use BRE (Basic RegEx) patterns. With BRE, we need to escape the `()`s of the capture groups, the `+` token, and also `{}` and `?` if they are used as regex tokens; otherwise these are treated literally. Without `-E`: `sed 's#\(.*\\src\).*\(\\[^\]\+\\[^\]\+$\)#\1\2#' file.txt` – heemayl Oct 04 '16 at 05:43

@phoenix, not only that, but `\+` (as in above comment) is much less portable than using `-E` and `+`. `-E` is [fairly standard](http://unix.stackexchange.com/q/310446/135943). – Wildcard Oct 04 '16 at 05:50
can also use `sed -E 's#(.*\\src).*((\\[^\]+){2})$#\1\2#'` to avoid repeating regex pattern, can easily change number as well if requirement changes.... – Sundeep Oct 04 '16 at 07:36

score 0 · Answer 2 · edited Oct 04 '16 at 09:20

Create a file with the data:

-rwxr-xr-x. 1 sasi   webApp  190 Oct  4 13:42 file.txt

Run the command below:

[sasi@localhost temp]$ sed -E 's#(.*\\src).*(\\[^\]+\\[^\]+$)#\1\2#' file.txt
..\..\src\abc\abc.cpp
..\..\src\xyz\xyz.cpp
..\..\src\pqr\pqr.cpp
..\..\src\pqr\abc.cpp

edited Oct 04 '16 at 09:20

Sparhawk


answered Oct 04 '16 at 05:47

sasikaran


it would be helpful if you format your answer, also I see an error in the regular expression that you are suggesting – dhiraj suvarna Oct 04 '16 at 05:50
sed -E 's#(.\src).(\[^]+\[^]+$)#\1\2#' file.txt – sasikaran Oct 04 '16 at 05:56
run above command with file.txt – sasikaran Oct 04 '16 at 05:58
that doesn't work – dhiraj suvarna Oct 04 '16 at 06:06
Which OS are you using? I get a result with the same command. – sasikaran Oct 04 '16 at 06:50
I am using cygwin in Windows OS. The pattern you mentioned in your answer and in your comment are different. Please check. – dhiraj suvarna Oct 04 '16 at 07:52
Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/46276/discussion-between-phoenix-and-sasikaran). – dhiraj suvarna Oct 04 '16 at 08:08

score 0 · Answer 3 · answered Oct 04 '16 at 07:46

Alternate solutions:

With GNU grep and paste

grep extracts the two patterns .*\\src or (\\[^\]+){2}$ and prints them on separate lines. The output is then combined using paste

$ grep -oE '.*\\src|(\\[^\]+){2}$' ip.txt | paste -d '' - -
..\..\src\abc\abc.cpp
..\..\src\xyz\xyz.cpp
..\..\src\pqr\pqr.cpp
..\..\src\pqr\abc.cpp
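
Running only the grep stage makes it easier to see why paste is needed. The following is a sketch, assuming a grep that supports -o (such as GNU grep); the variable name matches is only for illustration:

```shell
# Run only the grep stage to see the two matches it prints per input line.
matches=$(printf '%s\n' '..\..\src\modules\core\abc\abc.cpp' |
  grep -oE '.*\\src|(\\[^\]+){2}$')
printf '%s\n' "$matches"
```

Each input line should yield two output lines, first the ..\..\src prefix and then the trailing \abc\abc.cpp part, which paste - - then joins back into one line.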

With perl

$ perl -pe 's/.*\\src\K.*(?=(\\[^\\]+){2}$)//' ip.txt 
..\..\src\abc\abc.cpp
..\..\src\xyz\xyz.cpp
..\..\src\pqr\pqr.cpp
..\..\src\pqr\abc.cpp

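Here the text between the patterns .*\\src and (\\[^\\]+){2}$ is deleted: \K keeps everything matched so far out of the replaced text, and the positive lookahead (?=...) requires the trailing part to follow without consuming it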

Thank you Sundeep, learnt grepping two patterns and also paste command. :) — dhiraj suvarna, Oct 04 '16 at 08:23

score 0 · Answer 4 · answered Oct 05 '16 at 00:16

Why bash this with regex? Path munging doesn't require regular expressions; OS kernels don't use regexes to follow paths.

With Awk, we just use backslash as a separator and components become fields:

awk 'BEGIN { FS = OFS = "\\" } { print $1, $2, $3, $(NF-1), $NF }'
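
As a quick check, the one-liner above can be fed a sample path from the question. This sketch assumes, as the one-liner itself does, that the first three fields are always .., .. and src; the variable name result is only for illustration:

```shell
# Feed one sample path from the question through the awk one-liner.
result=$(printf '%s\n' '..\..\src\other_modules\new_core\something\pqr\abc.cpp' |
  awk 'BEGIN { FS = OFS = "\\" } { print $1, $2, $3, $(NF-1), $NF }')
printf '%s\n' "$result"
```

With backslash as both input and output separator, the first three fields and the last two are printed and everything in between is skipped, so no regular expression is involved.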
