
EDIT: the files were changed to TSV to deal better with spaces in the text fields

I have 2 CSV files in the following form:

File 1: availableText.csv (can be very big)

"id1" , "text1-1"
"id1" , "text1-2"
"id1" , "text1-3"
"id1" , "text1-4"
"id2" , "text2-1"
"id2" , "text2-2"
"id2" , "text2-3"
"id2" , "text2-4"
...

File 2: wrongText.csv

"id1" , "texta"
"id2" , "textb"
"id3" , "textc"
"id4" , "textd"
...

For every line in wrongText.csv, I want to filter the available text entries with the same id and suggest the best available option using tre-agrep (a grep-like tool that allows errors in the pattern; with -B it returns the best match).

For example, for id1:

tre-agrep -B 'texta' (from text1-1:4) | tr "\n" "$"
(will produce something like 'text1-2$text1-4')

The desired output file would be like this:

"id1" , "texta" , "text1-2$text1-4"
"id2" , "textb" , "text2-1$text2-3$text2-4"

Note:

  1. The CSV can be converted to any format, but text may contain spaces (but not special characters)
  2. IDs do contain both special characters and UTF-8
  3. Speed does not matter (for now at least)
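Because the IDs can contain regex special characters (note 2), the per-id filtering step is safer as an exact string comparison than as a regex grep. A minimal sketch of that step, using the file layout from the question (the tre-agrep call is omitted so the snippet runs with standard tools only; beware that awk's -v processes backslash escapes in the id value):

```shell
#!/bin/sh
# Build a tiny availableText.tsv like the one in the question
printf 'id1\ttext1-1\nid1\ttext1-2\nid2\ttext2-1\n' > availableText.tsv

# Exact-match on field 1, so regex metacharacters in the id are harmless;
# print only the text field (field 2) — this is the input for tre-agrep.
awk -F '\t' -v id='id1' '$1 == id { print $2 }' availableText.tsv
```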
Gilles 'SO- stop being evil'
jimkont

2 Answers


As a one-liner, with its result:

for pattern in $(awk '{print $3}' wrong.csv) ; do tre-agrep -B "$pattern" available.csv | tr "\n" "$"; echo ; done
"id1" , "text1-1"$"id1" , "text1-2"$"id1" , "text1-3"$"id1" , "text1-4"$"id2" , "text2-1"$"id2" , "text2-2"$"id2" , "text2-3"$"id2" , "text2-4"$
"id1" , "text1-1"$"id1" , "text1-2"$"id1" , "text1-3"$"id1" , "text1-4"$"id2" , "text2-1"$"id2" , "text2-2"$"id2" , "text2-3"$"id2" , "text2-4"$
"id1" , "text1-1"$"id1" , "text1-2"$"id1" , "text1-3"$"id1" , "text1-4"$"id2" , "text2-1"$"id2" , "text2-2"$"id2" , "text2-3"$"id2" , "text2-4"$
"id1" , "text1-1"$"id1" , "text1-2"$"id1" , "text1-3"$"id1" , "text1-4"$"id2" , "text2-1"$"id2" , "text2-2"$"id2" , "text2-3"$"id2" , "text2-4"$

More readably:

for pattern in $(awk '{print $3}' wrong.csv)
do
  tre-agrep -B "$pattern" available.csv | tr "\n" "$"
  echo
done

Something like that?
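Note that `for pattern in $(...)` word-splits on whitespace, so a text field containing spaces (note 1 in the question) would be broken into several patterns. A sketch of a variant that reads one whole field per line, assuming a tab-separated wrong.tsv (the tre-agrep call is replaced by a printf so the snippet is self-contained):

```shell
#!/bin/sh
printf 'id1\ttext a\nid2\ttext b\n' > wrong.tsv

# Split on tabs only, so the text field survives intact even with spaces.
while IFS="$(printf '\t')" read -r id pattern
do
    # tre-agrep -B "$pattern" available.tsv | tr "\n" "$" would go here;
    # print the pattern to show it arrives in one piece:
    printf '%s\n' "$pattern"
done < wrong.tsv
```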

user unknown
  • This didn't work, but it inspired me to work it out this way. I'll post the solution later. I'd vote for your answer but I don't have enough reputation ;) – jimkont Apr 14 '11 at 06:38

I changed the input files to TSV and used the following solution (inspired by the 1st answer):

: > wrong_variables.tmp   # truncate; echo "" would add a blank first line
while IFS= read -r line
do
    var_template=$(printf '%s\n' "$line" | cut -f2)
    var_parameter=$(printf '%s\n' "$line" | cut -f3)

    #TODO order by template and cache grep output
    grep -F "${var_template}" templ2.tmp | cut -f2 > tmpfile
    var_suggest=$(tre-agrep -B "$var_parameter" tmpfile | tr "\n" "$")

    printf '%s\t%s\n' "$line" "$var_suggest" >> wrong_variables.tmp
done < "$OUTPUT_RAW"
jimkont
    I haven't read your scripts in detail, but here are a few notes on coping with special characters. Set `IFS=''` and use `read -r line` to avoid skipping initial whitespace and interpreting backslashes. **Always double quote variable substitutions** (`"$line"`) to keep whitespace and `\[?*` unchanged; and use `printf %s "$line"` rather than `echo`. But in fact you can extract tab-separated fields in the shell: `tmp=${line#*␉}; var_template=${tmp%%␉*}; line=${tmp#*␉}; var_parameter=${line%%␉*}` where ␉ is a tab. The last line in the loop should be `printf '%s\t%s\n' "$line" "$var_suggest"`. – Gilles 'SO- stop being evil' Apr 14 '11 at 21:04
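Spelled out, Gilles' parameter-expansion extraction could look like the sketch below. A literal tab is obtained portably with `printf '\t'`, and a separate `rest` variable is used instead of reusing `line`; the sample record is made up for illustration:

```shell
#!/bin/sh
tab=$(printf '\t')
# Hypothetical wrongText.tsv record: id, template, parameter
line="id1${tab}template-a${tab}texta"

rest=${line#*"$tab"}              # drop field 1 (the id)
var_template=${rest%%"$tab"*}     # field 2
rest=${rest#*"$tab"}              # drop field 2
var_parameter=${rest%%"$tab"*}    # field 3 (whole remainder if no more tabs)

printf '%s\n' "$var_template" "$var_parameter"
```

No external commands run per field, so this is also much faster inside a loop than piping each line through `cut`.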