
I have a bunch of .csv files with N columns and different numbers of rows (lines). I would like to append enough empty lines ;...; (N semicolons) to the shorter files to make them all the same length. I can find the length of the longest file manually, but it would be better to determine it automatically.

For example:

I have,

file1.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
171; pep; 73; 22:26:10; 3; 72

file2.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
121; fng; 96; 09:42:10; 3; 52
141; gep; 53; 21:22:10; 3; 62
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892

file3.csv

121; fng; 96; 09:42:10; 3; 52
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892
141; gep; 53; 21:22:10; 3; 62

I need,

file1.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
171; pep; 73; 22:26:10; 3; 72
;;;;;
;;;;;
;;;;;

file2.csv

128; pep; 93; 22:22:10; 3; 11
127; qep; 93; 12:52:10; 3; 15
121; fng; 96; 09:42:10; 3; 52
141; gep; 53; 21:22:10; 3; 62
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892

file3.csv

121; fng; 96; 09:42:10; 3; 52
171; pep; 73; 22:26:10; 3; 72
221; ahp; 93; 23:52:10; 3; 892
141; gep; 53; 21:22:10; 3; 62
;;;;;
;;;;;
Jeff Schaller
myradio
    A simple (but probably not optimal) way to do it would be to use `wc` to count the line count of each file to find the max. You can then `echo ";;;;" >> file` in each file until the line count reach the max. – Bear'sBeard Dec 04 '18 at 10:45
    Why do you want the files to have the same number of lines? Maybe there is a good method, where you can use the files as they are (with their different number of lines). – sudodus Dec 04 '18 at 11:12
  • @Bear'sBeard Yep, something like that did it, I was looking for a more compact way. – myradio Dec 04 '18 at 11:50
  • @sudodus Well, there're people before and after me in the pipeline, things must match certain formats... – myradio Dec 04 '18 at 11:51

3 Answers


Thanks to @Sparhawk for the suggestions in the comments; I have updated based on them:

#!/bin/bash

emptyLine=";;;;;;;"   # must be quoted, or bash chokes on the bare semicolons
rr=($(wc -l files*pattern.txt | awk '{print $1}' | sed '$ d'))
max=$(printf '%s\n' "${rr[@]}" | sort -nr | head -n1)
for name in files*pattern.txt; do
    lineNumber=$(wc -l < "$name")
    let missing=max-lineNumber
    for ((i=0; i<missing; i++)); do
        echo "$emptyLine" >> "$name"
    done
done

My original version below is neither elegant nor efficient. It takes a couple of seconds, which feels like an eternity given the small size of the data. Nevertheless, it works:

#!/bin/bash

emptyLine=";;;;;;;"   # quoted; an unquoted value here is a syntax error
rr=($(wc -l files*pattern.txt |  awk '{print $1}' | sed '$ d'))
max=$(printf '%s\n' "${rr[@]}" | sort -nr | head -n1)
for name in $(ls files*pattern.txt);do
    lineNumber=$(cat $name | wc -l )
    let missing=max-lineNumber
    for((i=0;i<$missing;i++));do
        echo $emptyLine >> $name
    done
done

I just put this script in the directory containing the files, provided there is a pattern I can use to list them, such as files*pattern.txt.

myradio
    Nice one (+1)! Thanks for posting up the solution. One small note, [don't parse ls](https://unix.stackexchange.com/questions/128985/why-not-parse-ls-and-what-do-to-instead), just use `for name in files*pattern.txt; do` instead. – Sparhawk Dec 04 '18 at 12:28
    And while I'm nit-picking, there's a "useless use of cat" there too. Just do `wc -l $name` – Sparhawk Dec 04 '18 at 12:30
    @Sparhawk: I think you meant `wc -l < $name` – Thor Dec 04 '18 at 12:37
  • @Thor No? `wc [OPTION]... [FILE]...` works too, as per the `man`. In fact, this script uses this construction in an earlier line. – Sparhawk Dec 04 '18 at 20:54
  • @Sparhawk: Sure, but if you wanted the equivalent output of `cat file | wc -l` redirection is the way to go. – Thor Dec 05 '18 at 08:07
  • @Thor Ahhhh, yes okay got it! – Sparhawk Dec 05 '18 at 08:38
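The difference the last few comments turn on can be checked directly; `sample.csv` is just a throwaway name:

```shell
# A throwaway three-line file (the name is only for illustration).
printf '%s\n' 'a' 'b' 'c' > sample.csv

wc -l sample.csv     # prints the count followed by the file name
wc -l < sample.csv   # prints the bare count, same as `cat sample.csv | wc -l`
```

The redirection form is the one you want when only the number matters, e.g. for arithmetic in a script.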

An improvement on @myradio's answer: the part inside the loop is written in awk, which should be much faster.

max=$(wc -l file*.csv | sed '$ d' | sort -n | tail -n1 | awk '{print $1}' )
for f in file*.csv; do
    awk -F';' -v max="$max" \
      'END{
         s=sprintf("%*s",NF-1,"");  # NF-1 spaces, one per separator
         gsub(/ /,FS,s);            # turn each space into the separator
         for(i=NR;i<max;i++)
           print s;
       }' "$f" >> "$f"
done

With `-F` you set the correct field separator for your files (here `-F';'`).

The `s=sprintf(); gsub();` part dynamically builds a line with the right number of field separators (via).
You could simply replace that with `print ";;;;;"` or other static content if you like.
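The separator-building idiom can be tried in isolation. The input line below is one record from the question; the `%*s` dynamic-width trick plus `gsub` is what produces the padding string:

```shell
# For a record with NF fields, build a string of NF-1 separators:
# sprintf("%*s", NF-1, "") yields NF-1 spaces, and gsub turns each into FS.
echo '128; pep; 93; 22:22:10; 3; 11' |
awk -F';' '{ s = sprintf("%*s", NF-1, ""); gsub(/ /, FS, s); print s }'
# -> ;;;;;
```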

pLumo
  • I like this solution. It's certainly harder to read but is good that had a dynamic `FS`. Nevertheless, 2 things: 1. About the efficiency, I don't know what your time results mean because this depends on (number and size of) the files. 2. I actually wanted to try this to compare the time results with my version, but I got a problem, `awk` complains about `gensub` being undefined. Is `gensub` maybe on gawk instead? – myradio Dec 05 '18 at 09:22
  • yeah that seems to be GNU Awk. I replaced it with the `gsub` solution from the linked answer. – pLumo Dec 05 '18 at 09:52
  • About the time results, I think I misunderstood the statement "it takes a couple of seconds" from your answer to be the time needed to process the example files from your question ... I removed that. – pLumo Dec 05 '18 at 09:54

In order to count the lines in each file only once:

wc -l *csv |sort -nr| sed 1d | {
    read max file
    pad=$(sed q "$file"|tr -cd ";")  # extract separators from first record
    while read lines file ; do
        while [ $((lines+=1)) -le $max ] ; do
                echo "$pad" >> "$file"
        done
    done
}

Note that any newlines in the filenames will cause problems for both `sort` and the `while read` loop, but they can handle filenames containing normal spaces.
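As a sanity check, here is the same loop run against two throwaway files (`t1.csv` and `t2.csv` are illustrative names, each with three columns):

```shell
# Two sample files: one data row vs. three data rows.
printf '1; a; 2\n' > t1.csv
printf '1; a; 2\n3; b; 4\n5; c; 6\n' > t2.csv

wc -l t?.csv | sort -nr | sed 1d | {
    read max file                       # longest file and its line count
    pad=$(sed q "$file" | tr -cd ";")   # ";;" for these 3-column files
    while read lines file; do
        while [ $((lines+=1)) -le "$max" ]; do
            echo "$pad" >> "$file"
        done
    done
}

wc -l < t1.csv   # now 3, matching t2.csv
```

The shorter file gets two lines of `;;` appended; the longest file is never revisited, since `read max file` consumes its entry before the inner loop starts.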

JigglyNaga