36

I have an embedded Linux system using BusyBox (OpenWRT), so the available commands are limited. I have two files that look like this:

first file

aaaaaa
bbbbbb
cccccc
mmmmmm
nnnnnn

second file

mmmmmm
nnnnnn
yyyyyy
zzzzzz

I need to merge these two lists into one file and remove the duplicates. I don't have diff (space is limited), so I am limited to awk, sed, and grep (or other tools that might be included in a standard BusyBox instance). Going through an intermediate merge file like this:

command1 > mylist.merge 
command2 mylist.merge > originallist

is totally ok. It doesn't have to be a single-line command.

Currently defined functions in the instance of Busybox that I am using (default OpenWRT): [, [[, arping, ash, awk, basename, brctl, bunzip2, bzcat, cat, chgrp, chmod, chown, chroot, clear, cmp, cp, crond, crontab, cut, date, dd, df, dirname, dmesg, du, echo, egrep, env, expr, false, fgrep, find, free, fsync, grep, gunzip, gzip, halt, head, hexdump, hostid, hwclock, id, ifconfig, init, insmod, kill, killall, klogd, less, ln, lock, logger, logread, ls, lsmod, md5sum, mkdir, mkfifo, mknod, mktemp, mount, mv, nc, netmsg, netstat, nice, nslookup, ntpd, passwd, pgrep, pidof, ping, ping6, pivot_root, pkill, poweroff, printf, ps, pwd, reboot, reset, rm, rmdir, rmmod, route, sed, seq, sh, sleep, sort, start-stop-daemon, strings, switch_root, sync, sysctl, syslogd, tail, tar, tee, telnet, telnetd, test, time, top, touch, tr, traceroute, true, udhcpc, umount, uname, uniq, uptime, vconfig, vi, watchdog, wc, wget, which, xargs, yes, zcat

7 Answers

46

I think

sort file1 file2 | uniq
aaaaaa
bbbbbb
cccccc
mmmmmm
nnnnnn
yyyyyy
zzzzzz

will do what you want.

Additional Documentation: uniq sort
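A hedged sketch of why the sort comes first: uniq only collapses adjacent duplicate lines, so the combined input has to be sorted before uniq sees it. The sample lists from the question are recreated in a scratch directory; the names tmpdir and merged_count are illustrative assumptions, not part of the original answer.

```shell
# Recreate the two sample lists from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' aaaaaa bbbbbb cccccc mmmmmm nnnnnn > "$tmpdir/file1"
printf '%s\n' mmmmmm nnnnnn yyyyyy zzzzzz > "$tmpdir/file2"

# Sorting places identical lines next to each other; uniq then drops the repeats.
sort "$tmpdir/file1" "$tmpdir/file2" | uniq > "$tmpdir/mylist.merge"

# Count the merged lines and show the result.
merged_count=$(wc -l < "$tmpdir/mylist.merge")
echo "merged lines: $merged_count"
cat "$tmpdir/mylist.merge"
```

Writing to a separate merge file and only then moving it over the original list avoids reading and truncating the same file in one pipeline.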

Jon
20

In just one command, without any pipe:

sort -u FILE1 FILE2

Search for "Suppress duplicate lines" in the BusyBox documentation:

http://www.busybox.net/downloads/BusyBox.html

Gilles Quénot
  • Which one is better for very large files? `sort file1 file2 file3 file4 | uniq` or `sort -u file1 file2 file3 file4`? – 0x90 Jan 27 '19 at 05:52
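This sketch does not measure speed, but it offers a quick way to confirm that both forms give the same result on your own data before choosing one. It assumes plain ASCII lines like the sample and that cmp is available, as it is in the question's applet list. The file names piped.out and single.out are made up for illustration.

```shell
# Recreate the sample lists from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' aaaaaa bbbbbb cccccc mmmmmm nnnnnn > "$tmpdir/file1"
printf '%s\n' mmmmmm nnnnnn yyyyyy zzzzzz > "$tmpdir/file2"

# Produce the merged list both ways.
sort "$tmpdir/file1" "$tmpdir/file2" | uniq > "$tmpdir/piped.out"
sort -u "$tmpdir/file1" "$tmpdir/file2" > "$tmpdir/single.out"

# cmp -s prints nothing and only sets the exit status (0 when identical).
if cmp -s "$tmpdir/piped.out" "$tmpdir/single.out"; then
    same_output=yes
else
    same_output=no
fi
echo "identical output: $same_output"
```

If the outputs match, `sort -u` has the practical advantage of starting one process instead of two.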
4

Another solution:

awk '!a[$0]++' file_1 file_2
nowy1
  • I saw that it made a difference which argument came first. Otherwise great solution, thanks. – dza Jan 08 '17 at 15:21
  • It boggles my mind that if someone posts e.g. a Python solution to a problem, people often request an explanation of the code, but whenever it's awk, which relatively few people understand, it's fine to just leave it as is. :) – damd Jan 22 '21 at 09:11
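Since the comments ask for an explanation and note that argument order matters, here is a small sketch of both points. The array name seen and the sample contents are illustrative assumptions. The sample files are deliberately unsorted to show that awk keeps input order rather than sorting.

```shell
# Two small unsorted lists that share some lines.
tmpdir=$(mktemp -d)
printf '%s\n' mmmmmm nnnnnn aaaaaa > "$tmpdir/file_1"
printf '%s\n' zzzzzz mmmmmm aaaaaa > "$tmpdir/file_2"

# seen[$0]++ evaluates to 0 (false) the first time a line appears, so
# !seen[$0]++ is true only on first sight, and awk's default action prints it.
first_order=$(awk '!seen[$0]++' "$tmpdir/file_1" "$tmpdir/file_2")
second_order=$(awk '!seen[$0]++' "$tmpdir/file_2" "$tmpdir/file_1")

printf 'file_1 first:\n%s\n' "$first_order"
printf 'file_2 first:\n%s\n' "$second_order"
```

Both runs contain the same set of lines. Only the order differs, because each line is emitted where it first occurs in the concatenated input.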
3

The files in your question are sorted.
If the source files are indeed sorted, you can merge them and remove duplicates in one step:

sort -um file1 file2 > mylist.merge

For numeric sort (not alphanumeric), use:

sort -num file1 file2 > mylist.merge

This cannot be done in place by redirecting the output to one of the source files, because the shell truncates that file before sort reads it.

If the files are not sorted, sort them first. This can be done in place with sort's -o option, although the whole file needs to be loaded into memory:

sort -uo file1 file1
sort -uo file2 file2
sort -um file1 file2 > mylist.merge
mv mylist.merge originallist

That would be faster than the simpler one-command approach that sorts everything at once:

cat file1 file2 | sort -u >mylist.merge

However, this line could be useful for small files.
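One way to choose between the two approaches automatically is to test whether the inputs are already sorted. The following is a minimal sketch, assuming the local sort supports both -c (check order) and -m (merge), which may not hold for every BusyBox build. The variable name strategy is made up for illustration.

```shell
# Recreate the sample lists from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' aaaaaa bbbbbb cccccc mmmmmm nnnnnn > "$tmpdir/file1"
printf '%s\n' mmmmmm nnnnnn yyyyyy zzzzzz > "$tmpdir/file2"

# sort -c exits 0 when a file is already in order; its diagnostics are discarded.
if sort -c "$tmpdir/file1" 2>/dev/null && sort -c "$tmpdir/file2" 2>/dev/null; then
    # Both inputs are sorted, so a merge pass is enough.
    strategy=merge
    sort -um "$tmpdir/file1" "$tmpdir/file2" > "$tmpdir/mylist.merge"
else
    # At least one input is unsorted, so fall back to a full sort.
    strategy=full-sort
    sort -u "$tmpdir/file1" "$tmpdir/file2" > "$tmpdir/mylist.merge"
fi
echo "strategy: $strategy"
```

With the sample data, both files pass the check, so the merge branch runs.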

2

To remove duplicates according to key columns (without sorting), use the following:

awk '!duplicate[$1,$2,$3]++' file_1 file_2

Here the first, second, and third columns are treated as the primary key.
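A small illustration of the key-column approach, using made-up sample rows (the data and the name deduped are assumptions for the example). Rows whose first three whitespace-separated fields match are treated as duplicates even when the rest of the line differs, and the first occurrence is kept.

```shell
# Sample rows: the first three columns form the key, the fourth is extra data.
tmpdir=$(mktemp -d)
printf '%s\n' 'a 1 x first' 'b 2 y second' > "$tmpdir/file_1"
printf '%s\n' 'a 1 x repeat' 'c 3 z third' > "$tmpdir/file_2"

# Keep only the first row seen for each ($1, $2, $3) combination.
deduped=$(awk '!duplicate[$1,$2,$3]++' "$tmpdir/file_1" "$tmpdir/file_2")
printf '%s\n' "$deduped"
```

The row 'a 1 x repeat' is dropped because its key already appeared in file_1.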

Prem Joshi
0

If using variables instead of files (this relies on process substitution, so it needs bash or a similar shell):

sort -u <(printf '%s\n' "${list1}") <(printf '%s\n' "${list2}")
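For a shell without process substitution, a plain pipe should give the same result. This is a hedged sketch that uses only printf and sort. The contents of list1 and list2 are filled in from the question's sample data for illustration.

```shell
# Hold each list in a variable, one entry per line.
list1=$(printf '%s\n' aaaaaa bbbbbb cccccc mmmmmm nnnnnn)
list2=$(printf '%s\n' mmmmmm nnnnnn yyyyyy zzzzzz)

# printf emits both variables, each followed by a newline, into one stream for sort -u.
merged=$(printf '%s\n' "$list1" "$list2" | sort -u)
printf '%s\n' "$merged"
```

Because printf reuses its format for every argument, each list ends with a newline and the two lists cannot run together on one line.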
0
awk '{if(!seen[$0]++)print $0}' file1 file2

output

aaaaaa
bbbbbb
cccccc
mmmmmm
nnnnnn
yyyyyy
zzzzzz

Python

#!/usr/bin/python
# Merge file1 and file2, keeping the first occurrence of each line.
final_list = []
for name in ('file1', 'file2'):
    with open(name, 'r') as handle:
        for line in handle:
            entry = line.strip()
            # Append only lines that have not been seen yet.
            if entry not in final_list:
                final_list.append(entry)

print("\n".join(final_list))

output

aaaaaa
bbbbbb
cccccc
mmmmmm
nnnnnn
yyyyyy
zzzzzz
Praveen Kumar BS