
I need to parse a tab-separated CSV file using bash, examine the contents of the record, and, if the record meets certain criteria, add it to an array. Basically, I want to filter records out of a CSV file before doing something with them.

My thought was to take each row in the file and put each field into an array. I could then look at the array to see if the record meets certain conditions (e.g. field3="value", etc.). If yes, I would "reconstruct" the tab-separated line and append it to a new array.

Where this seems to fail is the line where I create `record`. It appears to be appending a space rather than a tab, because later on the size of `details` is the same as if the record were space-delimited instead of tab-delimited.

```bash
datafile=path/to/data.csv
records=()
header=$(head -n 1 $datafile)
IFS=$'\t' read -r -a fields <<< "$header"

while IFS=$'\t' read -r -a documents; do

    # processing to determine if current row in csv file matches certain criteria
    # if it does, the following will happen

    for r in ${documents[@]}; do record+="$r"$'\t'; done #appending space instead?
    records+="$record"
done < $datafile

for r in "${records[@]}"; do
    IFS=$'\t' read -r -a details <<< "$r"

    # size of details here is as if record is separated by spaces instead of tabs

    for i in "${!fields[@]}" ; do
        echo "${fields[i]}: ${details[i]}"
    done
done
```

Example: if this record is processed:

Hello World  [TAB]  nice weather we are having today  [TAB]  do you agree?

The size of `details` should be 3, but I'm getting 11 instead. Why?

  • While you *can* do this in `bash`, you're probably better off doing it (or most of it) in a language that is better suited to the task, such as `awk` or `perl`. The code will be shorter, simpler, easier to read and understand, and run much faster than using bash arrays and `read` in multiple loops. – cas Apr 13 '16 at 23:02
  • @cas I'm not too good with `awk`, and this script does some things that I wasn't really sure how to get working in awk. You are right; it could certainly be more efficient. – Scribblemacher Apr 14 '16 at 12:02
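For comparison, the kind of filtering cas describes is a one-liner in `awk`. This is only an illustrative sketch: the criterion (`$3 == "keep"`) and the sample input are made-up assumptions, not taken from the question.

```bash
#!/usr/bin/env bash
# Keep only the tab-separated rows whose third field equals "keep".
# Sample data and criterion are illustrative assumptions.
printf 'a\tb\tkeep\nc\td\tdrop\n' |
awk -F '\t' -v OFS='\t' '$3 == "keep" { print }'
```

`-F '\t'` sets the input field separator to a tab, so fields with embedded spaces stay intact.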

2 Answers


Your problem is covered in Why does my shell script choke on whitespace or other special characters?. I'll just briefly explain what's going on here.

The culprit is `for r in ${documents[@]}`. Since the variable expansion is left unquoted, you're using the “split+glob” operation: the value of each array element is split into words according to the value of `IFS`, and each word is treated as a wildcard pattern. Since you only ever set `IFS` for the duration of the `read` (see Why is `while IFS= read` used so often, instead of `IFS=; while read..`?), the value of `IFS` at this point is the default one, which includes spaces. In addition, if you had a field containing something like `foo *`, you'd see file names in the current directory appear. The solution is `for r in "${documents[@]}"`, which is the standard way of iterating over an array: the double quotes turn this into a straight variable dereference with no splitting or globbing, and the `[@]` causes each array element to be placed in a separate word.
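A standalone demo makes the splitting visible; the sample line mirrors the record from the question:

```bash
#!/usr/bin/env bash
# Quoted vs. unquoted array expansion on a 3-field tab-separated record.
line=$'Hello World\tnice weather we are having today\tdo you agree?'
IFS=$'\t' read -r -a documents <<< "$line"   # documents has 3 elements

unquoted=0
for r in ${documents[@]}; do unquoted=$((unquoted + 1)); done    # re-split on default IFS
quoted=0
for r in "${documents[@]}"; do quoted=$((quoted + 1)); done      # one word per element

echo "unquoted: $unquoted"   # 11 -- the count from the question
echo "quoted: $quoted"       # 3
```

The unquoted form re-splits each element on spaces (2 + 6 + 3 words), reproducing the mysterious 11.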

While setting `IFS=$'\t'` for the whole script appears to solve the problem, it in fact only solves half of it: it doesn't prevent globbing from happening with `${documents[@]}`. While you can turn off globbing with `set -f`, using double quotes is clearer.
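Putting the quoting fix together with two other latent issues in the question's loop (`record` is never reset between rows, and `records+="$record"` appends to a string rather than to the array), a corrected sketch might look like this. The sample data and the field-3 criterion here are made up for illustration:

```bash
#!/usr/bin/env bash
# Corrected sketch of the question's loop; sample data and the
# filter criterion (field 3 == "yes") are illustrative assumptions.
datafile=$(mktemp)
printf 'h1\th2\th3\n'                  >  "$datafile"
printf 'Hello World\tnice weather\tyes\n' >> "$datafile"
printf 'skip me\tbad row\tno\n'           >> "$datafile"

records=()
IFS=$'\t' read -r -a fields < "$datafile"        # header row only

while IFS=$'\t' read -r -a documents; do
    [[ ${documents[2]} == yes ]] || continue     # example filter criterion
    record=""                                    # reset per row (the original never did)
    for r in "${documents[@]}"; do record+="$r"$'\t'; done
    records+=("$record")                         # array append, not string concatenation
done < <(tail -n +2 "$datafile")                 # skip the header this time around

for r in "${records[@]}"; do
    IFS=$'\t' read -r -a details <<< "$r"
    echo "${#details[@]} fields"                 # 3, as expected
done
rm -f "$datafile"
```

Note `records+=("$record")` with parentheses: without them, bash concatenates `$record` onto the string value of `records[0]` instead of adding a new element.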

Gilles 'SO- stop being evil'

The issue was apparently with the multiple declarations of `IFS=$'\t'`. Removing them and just having one declaration for `IFS` seems to have solved the problem.

(Although for the life of me, I don't see why this was an issue. There must have been a subtle typo.)