
I have a data file with 10,000 columns and 117,000 rows. My original data has a lot of repetition within each column. It is like:

inputfile.txt :

    123 124 111 
    321 124 111 
    123 000 111 
    123 111 222

I want to keep one copy of each value within each column like:

    123 124 111
    321 000 222
        111 

I need a program that processes all columns together, since I have 10,000 columns.

zara
  • I am confused by your sample command and the input & output files. That command on that input file does not produce the given output. Can you explain how you translate the given input to the given output, or does the sample command do what you want to do (just stopping short of all 10,000 columns) ? – Jeff Schaller Sep 09 '15 at 00:39
  • If the given command does what you want, then the answer may be as simple as: `sort input.txt | uniq > output.txt` assuming you don't care if the input is re-ordered; there are ways around that requirement, if needed. – Jeff Schaller Sep 09 '15 at 00:40
  • no, I tried your command. It does not do what I want: it just sorts the rows based on the first column, and it does not even remove duplicates – zara Sep 09 '15 at 01:09
  • I understand, now. Are you opposed to a perl script? – Jeff Schaller Sep 09 '15 at 01:20
  • Does each column have a fixed width in characters? – yaegashi Sep 09 '15 at 01:22
  • no. Actually I am a beginner in Linux and in writing programs, so I will follow any script that gives me what I need. I would appreciate your guidance. – zara Sep 09 '15 at 01:23
  • yes all columns and rows have 5 digits. – zara Sep 09 '15 at 01:24
  • So are the columns all independent i.e. the value in column 3 of a given row does not need to stay connected to the values in columns 1 or 2 of the same row? This seems to follow from your example, and means you need to process each column separately. And how many columns - exactly 3 or could be more or less on some rows? – gogoud Sep 09 '15 at 09:27
  • that is true, they are independent. The number of columns is about 10,000 – zara Sep 09 '15 at 14:15
  • @don_crissti - yes I can share a part of my data with you. do you have dropbox or s gmail? I can then share a small part of my real data for you. this is my gmail: [email protected] – zara Sep 10 '15 at 16:27
  • @don_crissti: Ah, sorry, my mis-reading. – cuonglm Sep 10 '15 at 16:44

1 Answer


This should do what you require in 5 lines of code (2 of which are just tidying):

#!/bin/bash
# run this, specifying input file as $1 (parameter 1)

# delete any pre-existing column files from /tmp
find /tmp -maxdepth 1 -name "column*" -delete

# create /tmp/columnNNNNN files - each file holds one column of $1
# (names are zero-padded so that the shell glob in the paste step below
# lists the columns in numerical order, not lexicographic order, where
# column10 would otherwise sort before column2)
awk '{for (f=1; f<=NF; f++) {print $f >> sprintf("/tmp/column%05d", f)}}' "$1"

# iterate through column files, sorting and removing duplicates
find /tmp -maxdepth 1 -name "column*" -execdir sort -o \{\} -u \{\} \;

# re-combine columns and output to stdout
paste /tmp/column*

# delete column files from /tmp
find /tmp -maxdepth 1 -name "column*" -delete

It is possible that with a very large number of columns (as you have) the paste command will fail, because it has to open all 10,000 column files at once and will run into the per-process open-file limit (often 1024; check it with `ulimit -n`).

One difference from your example output is that each column here comes out sorted, whereas in your original the 2nd column was unsorted.
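If the sorting is undesirable, a one-pass awk alternative is possible (a sketch, not tested at your data's scale, and assuming enough memory to hold the de-duplicated values): keep the first occurrence of each value per column in memory, then print the surviving values side by side. This avoids the temp files and the open-file limit entirely, and it preserves each column's original order, matching the question's example output:

```shell
awk '{
  if (NF > maxf) maxf = NF          # track the widest row
  for (f = 1; f <= NF; f++)
    if (!seen[f, $f]++) {           # first time this value appears in column f
      n[f]++                        # count of distinct values in column f
      val[f, n[f]] = $f             # remember it in order of first appearance
      if (n[f] > rows) rows = n[f]  # longest de-duplicated column so far
    }
}
END {
  # print the de-duplicated columns row by row, space-separated;
  # shorter columns are padded with empty fields
  for (r = 1; r <= rows; r++) {
    line = ""
    for (f = 1; f <= maxf; f++)
      line = line (f > 1 ? " " : "") (r <= n[f] ? val[f, r] : "")
    print line
  }
}' inputfile.txt
```

On the sample input above this prints `123 124 111`, then `321 000 222`, then a row containing only `111` in the 2nd column.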

gogoud
  • @gogoud - how should I execute this code? My input file name is hap.txt. Where should I put my input and output files? Should I put my input file name in place of -name in your code? – zara Sep 10 '15 at 16:32
  • @ramin: save my code as a file let's say as 'sorter.sh', then make it executable `chmod +x ./sorter.sh` and then run it supplying your file as the parameter: `./sorter.sh /path/to/hap.txt >/tmp/output.txt` - this will save the output to a new file /tmp/output.txt. – gogoud Sep 11 '15 at 17:06
  • @gogoud: I tried to run it but I got this error; could you run it once? `./sorter.sh hap.txt > sorteroutput.txt` gives `-bash: ./sorter.sh: Permission denied` – zara Sep 11 '15 at 22:54
  • @gogoud: I get this error now: running `./sorter.sh hap.txt > /tmp/sorteroutput.txt` prints `find: The current directory is included in the PATH environment variable, which is insecure in combination with the -execdir action of find. Please remove the current directory from your $PATH (that is, remove "." or leading or trailing colons)` and then `paste: /tmp/column4683: Too many open files`. Why is that? – zara Sep 12 '15 at 00:16
  • @ramin: Re the first message, to remove it (though it doesn't seem to be preventing the script from running) I guess you need to put sorter.sh in a different directory; the usual place would be /opt. Re the second error, this is the problem I thought might arise because you have so many columns. It could be coded around, but I am away for a week, so maybe someone else can help? @don_crissti? – gogoud Sep 12 '15 at 05:27
  • There are ways to get around the last error, see [here](http://unix.stackexchange.com/q/205642) and [here](https://rtcamp.com/tutorials/linux/increase-open-files-limit/) – don_crissti Sep 12 '15 at 11:01