I have one list with many duplicates, for example:

AARC
AARC
AARC
TNGT
TNGT
TNGT
CAAC
CAAC

And one list without any duplicates, for example:

AARC
TNGT
YUGT
BATR

etc.

All of the entries in the first list will appear in the second list, but not vice versa.

I want to compare the lists and find out how many entries are in both; however, I want to retain and recognize duplicates. For example, the output could be either:

AARC
AARC
AARC
TNGT
TNGT
TNGT

Or

AARC\tAARC
AARC\tAARC
AARC\tAARC
TNGT\tTNGT
TNGT\tTNGT
TNGT\tTNGT

The issue I'm having is that comm grabs the first duplicate and moves on, counting subsequent entries as not being shared. Every article I can find online references removing duplicates, not retaining them. There used to be a database I could use for this, but they recently changed their default behavior to removing duplicates, and with thousands of entries I can't do it by hand :/

don_crissti
    You say “All of the entries in the first list will appear in the second list”, but if that were true, the answer would be `cat file1`. The question becomes non-trivial because `CAAC` is in the first list but not in the second list. – G-Man Says 'Reinstate Monica' Mar 13 '18 at 01:55

1 Answer

If I understand correctly, you want to filter out all the words from the first list that are not in the second list.

You can use grep for that. This command:

grep -w -f list2.txt list1.txt

Will output:

AARC
AARC
AARC
TNGT
TNGT
TNGT
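
A slightly stricter variant may be worth considering, since `-f` treats each line of list2.txt as a regular expression and `-w` matches whole words rather than whole lines. The sketch below is only an illustration under assumptions: the temporary directory and sample files are hypothetical stand-ins for the lists in the question. It uses `-F` (fixed strings) and `-x` (whole-line match), adds `-c` to count the shared entries with duplicates included, and shows an awk equivalent.

```shell
# Hypothetical sample files reproducing the lists from the question.
tmpdir=$(mktemp -d)
printf '%s\n' AARC AARC AARC TNGT TNGT TNGT CAAC CAAC > "$tmpdir/list1.txt"
printf '%s\n' AARC TNGT YUGT BATR > "$tmpdir/list2.txt"

# -F: treat each line of list2.txt as a fixed string, not a regex
# -x: match only whole lines
# -f: read the patterns from a file
shared=$(grep -Fx -f "$tmpdir/list2.txt" "$tmpdir/list1.txt")
printf '%s\n' "$shared"

# Count the shared entries, duplicates included.
count=$(grep -Fxc -f "$tmpdir/list2.txt" "$tmpdir/list1.txt")
echo "$count"

# awk alternative: remember every line of list2, then print the
# lines of list1 that were seen there.
awk 'NR == FNR { seen[$0]; next } $0 in seen' "$tmpdir/list2.txt" "$tmpdir/list1.txt"

# Remove the temporary sample files.
rm -rf "$tmpdir"
```

With the sample data, both the grep and the awk command print the three AARC lines followed by the three TNGT lines, and the count is 6. For the tab-separated form from the question, the awk action could print `$0 "\t" $0` instead of relying on the default print.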

Check also this thread.

BlueManCZ