Matching and pasting to line

Question

I have two very large text files consisting of lines like these:

First:

Robert:Dillain:Other:Other:Other
Julian:Brude:Other:Other:Other
Megan:Flikk:Other:Other:Other
Samantha:Minot:Other:Other:Other
Jesus:Kimmel:Other:Other:Other

Second:

Sb:Minot:amsen
Jbb:Kimmel:verlin
R:Dillain:bodent
Mb:Flikk:kentin
Jb:Brude:kemin

I would like to match them on the second column (Dillain, Brude, etc.) and paste the matching lines together like so:

OUTPUT:

Robert:Dillain:Other:Other:Other:R:Dillain:bodent
Jesus:Kimmel:Other:Other:Other:Jbb:Kimmel:verlin
Samantha:Minot:Other:Other:Other:Sb:Minot:amsen
etc...
etc...

I was thinking of using sed for this, but anything Unix based would be great. I have had no luck trying to come up with a way to do this myself.

Your output record order does not seem to follow that of either input file - is that significant? — steeldriver, Feb 25 '15 at 02:04
How do you mean? do you mean that it does not seem to be the desired output? Cause it looks like it does but i may have missed something — user104391, Feb 25 '15 at 02:10

devnull · Answer 1 · 2015-02-25T04:26:39.107

9

This sounds like a task for join:

join -t":" -o "1.1,1.2,1.3,1.4,1.5,2.1,2.2,2.3" \
   -j 2 <(sort -k2,2 -t: test1) <(sort -k2,2 -t: test2)

Output:

Julian:Brude:Other:Other:Other:Jb:Brude:kemin
Robert:Dillain:Other:Other:Other:R:Dillain:bodent
Megan:Flikk:Other:Other:Other:Mb:Flikk:kentin
Jesus:Kimmel:Other:Other:Other:Jbb:Kimmel:verlin
Samantha:Minot:Other:Other:Other:Sb:Minot:amsen

Breakdown:

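-t ":" sets the field delimiter to :
-o sets the output format: a list of FILENUM.FIELD entries naming the fields to print
-j 2 joins on field 2 of both files
<(sort -k2,2 -t: file) pre-sorts each file on its second field (-k2,2) with : as the field delimiter (-t:), because join expects both inputs to be sorted on the join field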
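To make the sort-then-join idea concrete, here is a minimal Python sketch of a sort-merge join over the sample data from the question. It only illustrates the technique and is not how join itself is implemented; the function name merge_join and its parameters are made up for this example, and every line is assumed to have at least two colon-separated fields.

```python
def merge_join(lines1, lines2, field=1, sep=":"):
    """Join two lists of lines on one field by sorting both and merging."""
    key = lambda line: line.split(sep)[field]
    a = sorted(lines1, key=key)   # like sort -k2,2 -t: on the first file
    b = sorted(lines2, key=key)   # like sort -k2,2 -t: on the second file
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        ka, kb = key(a[i]), key(b[j])
        if ka < kb:
            i += 1                # a[i] has no partner in b
        elif ka > kb:
            j += 1                # b[j] has no partner in a
        else:
            # Pair a[i] with every line of b that shares its key.
            k = j
            while k < len(b) and key(b[k]) == ka:
                out.append(a[i] + sep + b[k])
                k += 1
            i += 1
    return out

first = [
    "Robert:Dillain:Other:Other:Other",
    "Julian:Brude:Other:Other:Other",
    "Megan:Flikk:Other:Other:Other",
    "Samantha:Minot:Other:Other:Other",
    "Jesus:Kimmel:Other:Other:Other",
]
second = [
    "Sb:Minot:amsen",
    "Jbb:Kimmel:verlin",
    "R:Dillain:bodent",
    "Mb:Flikk:kentin",
    "Jb:Brude:kemin",
]
result = merge_join(first, second)
print("\n".join(result))
```

Because both inputs are sorted on the key first, the result comes out ordered by the second column, which is why the join output above does not follow the order of either input file.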

edited Feb 25 '15 at 04:26

answered Feb 25 '15 at 02:39

@mikeserv Good point. I updated it with `-k2,2`. Yeah `join` is a funny useful tool that most tend to forget is on the system -- (I know I do), since not every day I need to join 2 files together. I tend to use `join` as frequently as there are solar eclipses. lol – devnull Feb 25 '15 at 03:58

jimmij · Answer 2 · 2015-02-25T02:06:59.147

5

This is a simple task for awk:

awk -F':' -vOFS=':' 'NR==FNR{a[$2]=$0;next}{print $0,a[$2]}' file2 file1

First we set : as the field separator for both input (with -F) and output (with OFS). While the first file on the command line (file2) is being processed, NR==FNR is true, so we store each whole line in the array a, indexed by its second field. When the next file (file1) is processed, we print each of its lines followed by the line from the previous file stored in a[$2].
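For readers less familiar with awk, the same lookup-table logic can be sketched in Python. This is only an illustration of what the one-liner does, not a replacement for it; the function name lookup_join is invented for the example, and it assumes every line has at least two colon-separated fields.

```python
def lookup_join(lines1, lines2, sep=":"):
    """Mimic the awk one-liner: index lines2 by field 2, then extend lines1."""
    table = {}
    for line in lines2:
        # Same as a[$2] = $0: a later line with the same key overwrites an earlier one.
        table[line.split(sep)[1]] = line
    # Same as print $0, a[$2]: a missing key contributes an empty string.
    return [line + sep + table.get(line.split(sep)[1], "") for line in lines1]

joined = lookup_join(
    ["Robert:Dillain:Other:Other:Other", "Nobody:Nomatch:Other:Other:Other"],
    ["R:Dillain:bodent", "Jb:Brude:kemin"],
)
print("\n".join(joined))
```

Two properties of the awk command show up here: the output follows the order of file1, and a line of file1 with no partner in file2 is still printed, just with a trailing colon.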

edited Feb 25 '15 at 02:06

answered Feb 25 '15 at 02:01

mikeserv · Answer 3 · 2015-02-27T06:00:58.773

With sed you can probably do:

sed 's|[^:]*:\([^:]*\).*|/^[^:]*:\1:/s/$/:&/;t|' file2 | sed -f - file1

...which would involve one sed process reading the second file and writing a sed script for editing the first into a second sed's stdin. As near as I can tell you shouldn't have any problem with directly injecting the contents verbatim into a regexp like that. If there is the possibility of meta-characters in input, there are plenty of answers on this site which discuss means of escaping them. If it might be required, though, the following would be enough:

sed 's|[]&\./*[]|\\&|g;s|...' ... | sed -f - file1

Still, probably the eponymous join is the better solution - this is just to demonstrate how to do it w/ sed because you mentioned it.

Anyway, the script that the second sed applies to file1 winds up looking like this (with a line similar to the one below for every line in file2):

/^[^:]*:Dillain:/s/$/:R:Dillain:bodent/;t

...which means that if it encounters a line matching Dillain for the second colon-delimited field, then it should append the :R:Dillain:bodent string to its tail. Because there's probably no sense in continuing to attempt to match a line in file1 once a line from file2 has already been appended, the trailing t (test) command branches to the end of the script as soon as a substitution has succeeded, skipping the remaining commands for that line.
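If it helps to see the script-generation step in isolation, here is a small Python sketch of what the first sed does to each line of file2. It is an analogue written for illustration only; the function name sed_command_for is hypothetical, and like the original command it does no escaping of regexp metacharacters.

```python
import re

def sed_command_for(line):
    """Build the sed command that the first sed would emit for one line of file2."""
    # Python analogue of: s|[^:]*:\([^:]*\).*|/^[^:]*:\1:/s/$/:&/;t|
    match = re.match(r"[^:]*:([^:]*).*", line)
    if match is None:
        return line  # sed would print a non-matching line unchanged
    key, whole = match.group(1), match.group(0)
    return "/^[^:]*:" + key + ":/s/$/:" + whole + "/;t"

script = [sed_command_for(l) for l in ["R:Dillain:bodent", "Jb:Brude:kemin"]]
print("\n".join(script))
```

Each line of file2 becomes one address-plus-substitution command, so the generated script has as many lines as file2 does.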

Weird. I was awarded the `awk` tag badge for this: I don't even know how to use `awk`. — mikeserv, Feb 26 '15 at 03:35

score 0 · Answer 4 · answered Feb 25 '15 at 11:58

Using Python 3:

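#!/usr/bin/python3
import sys
from collections import defaultdict

file1, file2 = sys.argv[1], sys.argv[2]

# Index the lines of file2 by their second colon-separated field.
by_key = defaultdict(list)
with open(file2) as second:
    for line2 in second:
        line2 = line2.strip()
        by_key[line2.split(':')[1]].append(line2)

# Print each line of file1 joined with every matching line of file2.
with open(file1) as first:
    for line1 in first:
        line1 = line1.strip()
        for line2 in by_key.get(line1.split(':')[1], []):
            print(line1 + ':' + line2)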

Save the script above in a file called script.py, then run it from the terminal with the command below.

python3 script.py file1 file2