I have multiple files, let's say file1, file2 etc. Each file has one word in each line, like:

file1 file2 file3
one   four  six
two   five
three

What I want is to combine them into a new file4 containing every possible permutation of pairs (without repetition), like:

onetwo
onethree
onefour
onefive
...
twothree
...
onefour
...
fourone
...

How is this possible using Linux commands?

agc
mpla_mpla
  • Is this homework? – agc May 30 '16 at 14:06
  • No, I am trying to attack a hash with John the Ripper, and I need to combine different files accordingly. – mpla_mpla May 30 '16 at 14:32
  • The files sizes are relevant. If you replace "file*" with the actual file names, what does `wc file* | tail -n1` output? – agc May 30 '16 at 14:42
  • The description says _combination_, but the "want" list also includes a _permutation_, namely: "fourone". At present, the question is unclear. See [combinations and permutations](https://www.mathsisfun.com/combinatorics/combinations-permutations.html). – agc May 30 '16 at 15:43
  • I understand, my fault; it is a permutation. – mpla_mpla May 31 '16 at 12:07
  • @agc the output is `3362 3362 19820 total` – mpla_mpla May 31 '16 at 12:16
  • See also: [Command line tool to “cat” pairwise expansion of all rows in a file](http://unix.stackexchange.com/q/169625) – don_crissti May 31 '16 at 13:09
  • Now we have enough data. Based on the `wc` output, we're not dealing with huge files, so execution speed and array size limits won't much matter in _this_ instance. Assuming the sample output above is correct, it's a _permutation without repetition_, with "n!/(n-r)!" items. – agc May 31 '16 at 15:54
  • @don_crissti, in my answer I used the shell's `set` command to hold an array of items; `set` is command line based, and is limited to a hair less than `getconf ARG_MAX` bytes, (on my system, that's about 2 megs). Since the OP's data is only 20K, (i.e. 1% of 2M), `set` is good enough. – agc May 31 '16 at 16:36
  • @agc - I saw your answer but max no. of args is one thing and _array size limit_ is another thing. – don_crissti May 31 '16 at 16:38
  • @don_crissti, thanks for the distinction, perhaps _array buffer size limit_ might have been a better description. – agc May 31 '16 at 16:51
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/40551/discussion-between-agc-and-don-crissti). – agc May 31 '16 at 17:03
  • "Command line tool to “cat” pairwise expansion of all rows in a file" is for _permutations with repetition_; (n^r) items. – agc May 31 '16 at 17:12
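
The two counts mentioned in the comments, n!/(n-r)! without repetition and n^r with repetition, can be checked with a small sketch. It assumes Python 3.8+ (for `math.perm`), takes r = 2 for pairs, and assumes the 3362 lines reported by `wc` are all distinct words; the function names are made up for illustration.

```python
import math


def count_without_repetition(n: int, r: int = 2) -> int:
    """Number of ordered selections of r distinct items out of n: n! / (n - r)!."""
    return math.perm(n, r)


def count_with_repetition(n: int, r: int = 2) -> int:
    """Number of ordered selections of r items out of n when reuse is allowed: n ** r."""
    return n ** r


if __name__ == "__main__":
    n = 3362  # line count reported by `wc` in the comments
    print("without repetition:", count_without_repetition(n))
    print("with repetition:   ", count_with_repetition(n))
```

The two results differ by exactly n, the number of self-pairs such as "oneone".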

5 Answers

Ruby is a nice, concise language for this kind of thing:

ruby -e '
  words = ARGV.collect {|fname| File.readlines(fname)}.flatten.map(&:chomp)
  words.combination(2).each {|pair| puts pair.join("")}
' file[123] > file4
onetwo
onethree
onefour
onefive
onesix
twothree
twofour
twofive
twosix
threefour
threefive
threesix
fourfive
foursix
fivesix

You're quite right: combination provides "onetwo" but misses "twoone". Good thing there's permutation:

ruby -e '
  words = ARGV.collect {|fname| File.readlines(fname)}.flatten.map(&:chomp)
  words.permutation(2).each {|pair| puts pair.join("")}
' file{1,2,3}
onetwo
onethree
onefour
onefive
onesix
twoone
twothree
twofour
twofive
twosix
threeone
threetwo
threefour
threefive
threesix
fourone
fourtwo
fourthree
fourfive
foursix
fiveone
fivetwo
fivethree
fivefour
fivesix
sixone
sixtwo
sixthree
sixfour
sixfive
glenn jackman

Assuming the input files are small enough for all of their words to fit in the shell's positional parameters (the OP's roughly 20K of data easily is), this should work:

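# Load every word from the input files into the positional parameters
set -- $( cat file[123] )
for f in "$@" ; do
    for g in "$@" ; do
        # Skip pairing a word with itself
        [ "$f" != "$g" ] && printf '%s\n' "$f$g"
    done
done > file4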

cat file4 outputs:

onetwo
onethree
onefour
onefive
onesix
twoone
twothree
twofour
twofive
twosix
threeone
threetwo
threefour
threefive
threesix
fourone
fourtwo
fourthree
fourfive
foursix
fiveone
fivetwo
fivethree
fivefour
fivesix
sixone
sixtwo
sixthree
sixfour
sixfive

(As per OP clarification, the above is a revision for permutations without repetition. See previous draft for combinations without repetition.)
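
One detail of the nested loop is worth spelling out: it skips a pair when the two words are equal as strings, not when they come from the same line. The sketch below contrasts that with a position-based approach such as `itertools.permutations`, which would still emit "oneone" if "one" occurred on two different lines. The function names and the sample list with a repeated word are hypothetical, chosen only to show the difference; with all-distinct words, as in the question, both approaches give the same output.

```python
from itertools import permutations


def pairs_skipping_equal_words(words):
    """Mimic the nested loop: join every ordered pair whose two words differ."""
    return [f + g for f in words for g in words if f != g]


def pairs_by_position(words):
    """Join every ordered pair of distinct positions, even if the words match."""
    return [f + g for f, g in permutations(words, 2)]


if __name__ == "__main__":
    words = ["one", "two", "one"]  # hypothetical input containing a repeated word
    print(pairs_skipping_equal_words(words))
    print(pairs_by_position(words))
```

If the word lists might overlap, removing duplicate words first (for example with `sort -u`) makes the two approaches agree again.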

agc

A Python solution:

import fileinput
from itertools import permutations
from contextlib import closing

# Read the lines of all three files as one stream of words
with closing(fileinput.input(['file1', 'file2', 'file3'])) as f:
    # permutations() yields every ordered pair of distinct lines
    for x, y in permutations(f, 2):
        print('{}{}'.format(x.rstrip('\n'), y.rstrip('\n')))

onetwo
onethree
onefour
onefive
onesix
twoone
twothree
twofour
twofive
twosix
threeone
threetwo
threefour
threefive
threesix
fourone
fourtwo
fourthree
fourfive
foursix
fiveone
fivetwo
fivethree
fivefour
fivesix
sixone
sixtwo
sixthree
sixfour
sixfive
iruvar
  • @iruvar, this is much faster than the bash solution (similar to @agc's) that I was using. – badner Jul 24 '17 at 15:29
  • @badner - nice - and the speed doesn't surprise me at all given that `python` file I/O and `itertools` are implemented in the C layer – iruvar Jul 24 '17 at 17:20
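Use this. Note that it prints every ordered pair, including each word paired with itself (e.g. "oneone"); add a test such as `unless $x eq $y` to the print to leave those out: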

cat FILE1 FILE2 FILE3 | \
    perl -lne 'BEGIN{@a}{push @a,$_}END{foreach $x(@a){foreach $y(@a){print $x.$y}}}'

Output:

oneone
onetwo
onethree
onefour
onefive
onesix
oneseven
twoone
twotwo
twothree
twofour
twofive
twosix
twoseven
threeone
threetwo
threethree
threefour
threefive
threesix
threeseven
fourone
fourtwo
fourthree
fourfour
fourfive
foursix
fourseven
fiveone
fivetwo
fivethree
fivefour
fivefive
fivesix
fiveseven
sixone
sixtwo
sixthree
sixfour
sixfive
sixsix
sixseven
sevenone
seventwo
seventhree
sevenfour
sevenfive
sevensix
sevenseven
agc
Baba

TXR Lisp:

Warmup: just get the data structure first:

$ txr -p '(comb (get-lines (open-files *args*)) 2)' file1 file2 file3
(("one" "two") ("one" "three") ("one" "four") ("one" "five") ("one" "six")
 ("two" "three") ("two" "four") ("two" "five") ("two" "six") ("three" "four")
 ("three" "five") ("three" "six") ("four" "five") ("four" "six")
 ("five" "six"))

Now it is just a matter of getting the right output format. If we catenate the pairs together and then use tprint (implicitly via the -t option), we are there.

First, the catenation via mapping through cat-str:

$ txr -p '[mapcar cat-str (comb (get-lines (open-files *args*)) 2)]' file1 file2 file3
("onetwo" "onethree" "onefour" "onefive" "onesix" "twothree" "twofour"
 "twofive" "twosix" "threefour" "threefive" "threesix" "fourfive"
 "foursix" "fivesix")

OK, we have the right data. Now just use the tprint function (-t) instead of prinl (-p):

$ txr -t '[mapcar cat-str (comb (get-lines (open-files *args*)) 2)]' file1 file2 file3
onetwo
onethree
onefour
onefive
onesix
twothree
twofour
twofive
twosix
threefour
threefive
threesix
fourfive
foursix
fivesix

Finally, we read the question again and do permutations instead of combinations with perm rather than comb, as required:

$ txr -t '[mapcar cat-str (perm (get-lines (open-files *args*)) 2)]' file1 file2 file3
onetwo
onethree
onefour
onefive
onesix
twoone
twothree
twofour
twofive
twosix
threeone
threetwo
threefour
threefive
threesix
fourone
fourtwo
fourthree
fourfive
foursix
fiveone
fivetwo
fivethree
fivefour
fivesix
sixone
sixtwo
sixthree
sixfour
sixfive
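
For readers without TXR, the relationship between the two listings above (15 combinations versus 30 permutations) can be sketched in Python with `itertools`. This is only an illustration of the comb/perm distinction, not a translation of the TXR code; the word list is copied from the question and the helper names are made up.

```python
from itertools import combinations, permutations

# Sample words from the question's file1, file2 and file3
WORDS = ["one", "two", "three", "four", "five", "six"]


def joined_combinations(words):
    """Unordered pairs: 'onetwo' appears, 'twoone' does not."""
    return ["".join(pair) for pair in combinations(words, 2)]


def joined_permutations(words):
    """Ordered pairs: both 'onetwo' and 'twoone' appear."""
    return ["".join(pair) for pair in permutations(words, 2)]


if __name__ == "__main__":
    combos = joined_combinations(WORDS)
    perms = joined_permutations(WORDS)
    print(len(combos), "combinations,", len(perms), "permutations")
    # Every combination (a, b) accounts for two permutations: ab and ba
    both_orders = [a + b for a, b in combinations(WORDS, 2)] + \
                  [b + a for a, b in combinations(WORDS, 2)]
    print(sorted(both_orders) == sorted(perms))
```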
Kaz