
I have 20 tab-delimited files with the same number of rows. I want to select the 4th column of each file and paste these columns together into a new file. In the end, the new file will have 20 columns, each coming from a different one of the 20 files.

How can I do this with Unix/Linux command(s)?

Input: 20 files in this same format. I want the 4th column, denoted here as A1 for file 1:

chr1    1734966 1735009 A1       0       0       0       0       0       1       0
chr1    2074087 2083457 A1       0       1       0       0       0       0       0
chr1    2788495 2788535 A1       0       0       0       0       0       0       0
chr1    2821745 2822495 A1       0       0       0       0       0       1       0
chr1    2821939 2822679 A1       1       0       0       0       0       0       0
...

Output file, with 20 columns, each column coming from one of the 20 files' 4th column:

A1       A2       A3       ...       A20
A1       A2       A3       ...       A20
A1       A2       A3       ...       A20
A1       A2       A3       ...       A20
A1       A2       A3       ...       A20
...
Jun Cheng
  • `cut` is the command that extracts columns from a file, and `paste` is another command that pastes columns side by side. Check: man cut, man paste – Vineeth Chowdhary Sep 30 '14 at 12:49
  • Please [edit] your question and give us an example of your input files and your desired output. How are columns defined? Spaces? Commas? Tabs? Something else? – terdon Sep 30 '14 at 12:54
  • I changed your question to make it more direct, as others (and maybe you) might want to know **how** to do what you are asking, not just whether people exist who are capable of solving such a problem. – Anthon Sep 30 '14 at 15:24
  • Thanks for the comments. I edited my question. I hope it is clear now. – Jun Cheng Oct 01 '14 at 08:01
  • @JunCheng `paste <(cut -f 4 1.txt) <(cut -f 4 2.txt) .... <(cut -f 4 20.txt)`. That works because `cut` by default cuts on TAB delimited fields. If the question gets reopened I will post this as an answer as well. – Anthon Oct 01 '14 at 08:31
  • @Anthon, thanks a lot. Is there a way to avoid spelling out `<(cut -f 4 1.txt) <(cut -f 4 2.txt) .... <(cut -f 4 20.txt)`, in case there are 100+ files or an unknown number of files? – Jun Cheng Oct 01 '14 at 13:05
  • @JunCheng You can paste the first two files into `out.txt`, then incrementally paste `out.txt` and each following file into an `out2.txt`, move that `out2.txt` to `out.txt`, and do the next. But by then I personally would write a Python script that builds a list for each row, appends to it, and dumps the result when all files are parsed. I don't think you can parametrize `<(cut ...)` – Anthon Oct 01 '14 at 13:13
  • It is kind of unfortunate it takes so long to get the five reopen votes (only one more to go) – Anthon Oct 01 '14 at 13:56

2 Answers


With paste and process substitution under bash you can do:

paste <(cut -f 4 1.txt) <(cut -f 4 2.txt) .... <(cut -f 4 20.txt)

With a python script and any number of files (python scriptname.py column_nr file1 file2 ... filen):

#!/usr/bin/env python3

# Invoke with the column number to extract as the first parameter, followed
# by the file names. The files should all have the same number of rows.

import sys

col = int(sys.argv[1])
res = {}  # line number -> list of extracted fields, one per file

for file_name in sys.argv[2:]:
    with open(file_name) as f:
        for line_nr, line in enumerate(f):
            # Strip only the newline so empty leading fields keep their position
            res.setdefault(line_nr, []).append(line.rstrip('\n').split('\t')[col-1])

for line_nr in sorted(res):
    print('\t'.join(res[line_nr]))
Anthon
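A variation on the Python script above, offered only as a sketch and not as part of the original answer: instead of collecting every row in a dictionary, it opens all files at once and walks them in lockstep with zip(), so only one row per file is held in memory. The function name paste_column and its out parameter are made up for this illustration, and the approach assumes the number of files stays below the system's open-file limit.

```python
#!/usr/bin/env python3
"""Paste one column from many tab-delimited files, row by row."""
import sys
from contextlib import ExitStack


def paste_column(col, file_names, out=sys.stdout):
    """Write column `col` (1-based) of every file side by side to `out`."""
    with ExitStack() as stack:
        # Open every input at once; ExitStack closes them all on exit.
        handles = [stack.enter_context(open(name)) for name in file_names]
        # zip() takes one line from each file per step and stops at the
        # shortest file, so the inputs should have equal row counts.
        for lines in zip(*handles):
            fields = [line.rstrip("\n").split("\t")[col - 1] for line in lines]
            out.write("\t".join(fields) + "\n")


# Hypothetical command-line use: python3 scriptname.py 4 1.txt 2.txt ...
if __name__ == "__main__" and len(sys.argv) > 2:
    paste_column(int(sys.argv[1]), sys.argv[2:])
```

Called from a shell as python3 scriptname.py 4 *.txt, it would paste the 4th column of every matching file in the order the shell lists them.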

The following script does this using awk. For convenience I have added a row-number column; r is the number of rows in your files, and c is the number of files (and hence columns) you'd like to paste.

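directory=/your-directory
r=4    # number of rows in each input file
c=20   # number of input files

# Row numbers 1..r, one per line (overwrites any earlier rownumber.txt)
seq 1 "$r" > "$directory/rownumber.txt"

# Extract the 4th tab-separated column of each file and remember the
# output names in numeric order (a glob would sort output-10 before output-2)
outputs=()
for n in $(seq 1 "$c"); do
    awk -F'\t' '{ print $4 }' "$directory/file-$n.txt" > "$directory/output-$n.txt"
    outputs+=("$directory/output-$n.txt")
done

paste "$directory/rownumber.txt" "${outputs[@]}" > "$directory/newfile.txt"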
Ruthger Righart
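For comparison, the same idea (a row-number column followed by one column per file) can be sketched in Python without intermediate files. Shell globs expand in lexical order, so file-10.txt would come before file-2.txt; the sketch therefore sorts on the number in the file name. The names numeric_key and paste_with_row_numbers are invented for this example, and the file-N.txt naming is assumed from the script above.

```python
import re
import sys
from pathlib import Path


def numeric_key(path):
    """Sort key: the first integer in the file name (file-10.txt -> 10)."""
    match = re.search(r"\d+", Path(path).name)
    return int(match.group()) if match else 0


def paste_with_row_numbers(col, file_names, out=sys.stdout):
    """Write a row number, then column `col` (1-based) of each file."""
    columns = []
    # Sort numerically so file-2.txt comes before file-10.txt.
    for name in sorted(file_names, key=numeric_key):
        with open(name) as handle:
            columns.append(
                [line.rstrip("\n").split("\t")[col - 1] for line in handle]
            )
    # zip(*columns) turns the per-file columns into per-row tuples.
    for row_nr, fields in enumerate(zip(*columns), start=1):
        out.write("\t".join([str(row_nr), *fields]) + "\n")
```

Unlike the shell version, the row count does not have to be supplied; it follows from the files themselves.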