7

So I have 2 very big text files, consisting of lines like so:

First:

Robert:Dillain:Other:Other:Other
Julian:Brude:Other:Other:Other
Megan:Flikk:Other:Other:Other
Samantha:Minot:Other:Other:Other
Jesus:Kimmel:Other:Other:Other

Second:

Sb:Minot:amsen
Jbb:Kimmel:verlin
R:Dillain:bodent
Mb:Flikk:kentin
Jb:Brude:kemin

I would like to match them both by the second column (Dillain, Brude, etc) and paste them to lines like so:

OUTPUT:

Robert:Dillain:Other:Other:Other:R:Dillain:bodent
Jesus:Kimmel:Other:Other:Other:Jbb:Kimmel:verlin
Samantha:Minot:Other:Other:Other:Sb:Minot:amsen
etc...
etc...

I was thinking of using sed for this, but anything Unix based would be great. I have had no luck trying to come up with a way to do this myself.

don_crissti
user104391

4 Answers

9

This sounds like a task for join:

join -t":" -o "1.1,1.2,1.3,1.4,1.5,2.1,2.2,2.3" \
   -j 2 <(sort -k2,2 -t: test1) <(sort -k2,2 -t: test2)

Output:

Julian:Brude:Other:Other:Other:Jb:Brude:kemin
Robert:Dillain:Other:Other:Other:R:Dillain:bodent
Megan:Flikk:Other:Other:Other:Mb:Flikk:kentin
Jesus:Kimmel:Other:Other:Other:Jbb:Kimmel:verlin
Samantha:Minot:Other:Other:Other:Sb:Minot:amsen

Breakdown:

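  • -t":" sets the field delimiter to :
  • -o sets the output format (which fields of which file to print, as FILE.FIELD)
  • -j 2 joins on field number 2 in both files
  • <(sort -k2,2 -t: file) pre-sorts each file on its second field (-k2,2) with : as the field delimiter (-t:), because join expects both inputs to be sorted on the join field. The <(...) syntax is process substitution, which needs a shell such as bash, zsh or ksh.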
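To make the sort-then-join idea above concrete, here is a minimal Python sketch of a sort-merge join. It is not how join is implemented; the function name merge_join and its parameters are made up for illustration, and it assumes each key appears at most once per file (join itself would print every pairing for repeated keys).

```python
def merge_join(left_lines, right_lines, key_index=1, sep=":"):
    """Sort both inputs on the key field, then walk them in step.

    Illustrative only: assumes keys are unique within each input.
    """
    def key(line):
        return line.split(sep)[key_index]

    left = sorted(left_lines, key=key)
    right = sorted(right_lines, key=key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1          # no partner for this left line
        elif lk > rk:
            j += 1          # no partner for this right line
        else:
            out.append(left[i] + sep + right[j])
            i += 1
            j += 1
    return out


first = [
    "Robert:Dillain:Other:Other:Other",
    "Julian:Brude:Other:Other:Other",
    "Megan:Flikk:Other:Other:Other",
    "Samantha:Minot:Other:Other:Other",
    "Jesus:Kimmel:Other:Other:Other",
]
second = [
    "Sb:Minot:amsen",
    "Jbb:Kimmel:verlin",
    "R:Dillain:bodent",
    "Mb:Flikk:kentin",
    "Jb:Brude:kemin",
]

result = merge_join(first, second)
for line in result:
    print(line)
```

As with join, the result comes out ordered by the join field rather than in the order of the first file, and lines without a partner are dropped.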
devnull
  • 1
    @mikeserv Good point. I updated it with `-k2,2`. Yeah `join` is a funny useful tool that most tend to forget is on the system -- (I know I do), since not every day I need to join 2 files together. I tend to use `join` as frequently as there are solar eclipses. lol – devnull Feb 25 '15 at 03:58
5

This is a simple task for awk:

awk -F':' -vOFS=':' 'NR==FNR{a[$2]=$0;next}{print $0,a[$2]}' file2 file1

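First we set : as the field separator for both input (with -F) and output (with OFS). While the first file on the command line (file2) is being read, NR==FNR is true, so each whole line is stored in the array a, indexed by its second field. When the next file (file1) is processed, we print each of its lines followed by the line from the previous file stored in a[$2]. Lines of file1 without a match in file2 are still printed, with an empty field appended.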
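For readers who do not use awk much, the same lookup-table idea can be sketched in Python. This is only an illustration of the logic, not a replacement for the one-liner; hash_join and its parameter names are invented here.

```python
def hash_join(lookup_lines, main_lines, key_index=1, sep=":"):
    """Mirror the awk logic: index one file by key, then append to the other."""
    table = {}
    for line in lookup_lines:
        # Same role as a[$2] = $0: a later duplicate key overwrites an earlier one.
        table[line.split(sep)[key_index]] = line
    joined = []
    for line in main_lines:
        # Same role as print $0, a[$2]: a missing key yields an empty string.
        joined.append(line + sep + table.get(line.split(sep)[key_index], ""))
    return joined


file2_lines = [
    "Sb:Minot:amsen",
    "Jbb:Kimmel:verlin",
    "R:Dillain:bodent",
]
file1_lines = [
    "Robert:Dillain:Other:Other:Other",
    "Samantha:Minot:Other:Other:Other",
    "Nobody:Unmatched:Other:Other:Other",
]

result = hash_join(file2_lines, file1_lines)
for line in result:
    print(line)
```

Only the lookup file has to fit in memory, and the output keeps the order of the main file, which is the practical difference from the sort-based approach.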

jimmij
2

With sed you can probably do:

sed 's|[^:]*:\([^:]*\).*|/^[^:]*:\1:/s/$/:&/;t|' file2 | sed -f - file1

...which would involve one sed process reading the second file and writing a sed script for editing the first into a second sed's stdin. As near as I can tell you shouldn't have any problem with directly injecting the contents verbatim into a regexp like that. If there is the possibility of meta-characters in input, there are plenty of answers on this site which discuss means of escaping them. If it might be required, though, the following would be enough:

sed 's|[]&\./*[]|\\&|g;s|...' ... | sed -f - file1

Still, probably the eponymous join is the better solution - this is just to demonstrate how to do it w/ sed because you mentioned it.

Anyway, the script that the second sed applies to file1 winds up looking like this (with a line similar to the one below for every line in file2):

/^[^:]*:Dillain:/s/$/:R:Dillain:bodent/;t

...which means that if it encounters a line matching Dillain for the second colon-delimited field, then it should append the :R:Dillain:bodent string to its tail. Because there's probably no sense in continuing to attempt to match a line in file1 if a line from file2 has already been appended, the trailing test command just branches away any successful substitution as soon as it is complete.
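The rule-per-line idea, including the escaping concern mentioned earlier, can be sketched in Python as follows. This is an illustration of the logic rather than a translation of the sed commands; build_rules and apply_rules are hypothetical names, and re.escape stands in for the manual escaping step.

```python
import re


def build_rules(second_lines):
    """One (pattern, suffix) rule per line of the second file."""
    rules = []
    for line in second_lines:
        key = line.split(":")[1]
        # re.escape keeps metacharacters in the key from acting as regex syntax.
        pattern = re.compile(r"^[^:]*:" + re.escape(key) + ":")
        rules.append((pattern, line))
    return rules


def apply_rules(rules, first_lines):
    """Append the first matching suffix to each line, then stop trying rules."""
    out = []
    for line in first_lines:
        for pattern, suffix in rules:
            if pattern.match(line):
                line = line + ":" + suffix
                break  # plays the role of sed's t: skip the remaining rules
        out.append(line)  # unmatched lines pass through unchanged
    return out


rules = build_rules(["R:Dillain:bodent", "X:O.Neil:abc"])
result = apply_rules(rules, [
    "Robert:Dillain:Other:Other:Other",
    "Olga:OxNeil:Other:Other:Other",
    "Owen:O.Neil:Other:Other:Other",
])
for line in result:
    print(line)
```

The OxNeil line is left alone because the dot in O.Neil is escaped; without escaping it would match any character.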

mikeserv
0

With Python 3:

#!/usr/bin/python3
import sys

file1, file2 = sys.argv[1], sys.argv[2]
with open(file2) as second, open(file1) as first:
    second_list = second.readlines()
    first_list = first.readlines()
# Compare every line of file1 with every line of file2 on the second field.
for line1 in first_list:
    for line2 in second_list:
        if line1.split(':')[1] == line2.split(':')[1]:
            # Join with ':' so the output keeps the colon-delimited format.
            print(line1.strip() + ':' + line2.strip())

Save the above script in a file called script.py, then run it from the terminal with the command below. Note that it reads both files into memory and compares every pair of lines, so it may be slow on very big files.

python3 script.py file1 file2
Avinash Raj