EDIT: the files were changed to TSV to deal better with spaces in the text fields.
I have two CSV files of the following form:
File 1: availableText.csv (can be very big)
"id1" , "text1-1"
"id1" , "text1-2"
"id1" , "text1-3"
"id1" , "text1-4"
"id2" , "text2-1"
"id2" , "text2-2"
"id2" , "text2-3"
"id2" , "text2-4"
...
File 2: wrongText.csv
"id1" , "texta"
"id2" , "textb"
"id3" , "textc"
"id4" , "textd"
...
For every line in wrongText.csv, I want to filter the available text entries with the same id and suggest the best available option using tre-agrep (a grep-like tool that allows errors in the pattern; its -B option returns the best match).
For example, for id1:
tre-agrep -B 'texta' (from text1-1:4) | tr "\n" "$"
(this would produce something like 'text1-2$text1-4')
The desired output file would be like this:
"id1" , "texta" , "text1-2$text1-4"
"id2" , "textb" , "text2-1$text2-3$text2-4"
Note:
- The files can be converted to any format; the text may contain spaces (but no special characters)
- The IDs do contain both special characters and UTF-8
- Speed does not matter (for now at least)
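If the shell pipeline around tre-agrep proves awkward to glue together, the per-id grouping and best-match selection can be sketched in Python. This is only a sketch of the logic, not your exact tool: difflib's SequenceMatcher stands in for tre-agrep's approximate matching, and the `threshold` cutoff is an assumption you would tune.

```python
import csv
from collections import defaultdict
from difflib import SequenceMatcher

def load_tsv(path):
    """Read (id, text) rows from a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]

def best_matches(available_rows, wrong_rows, threshold=0.5):
    """For each (id, wrong_text), rank that id's candidate texts by
    similarity and keep the best-scoring ones, joined with '$'
    (mirroring the tre-agrep -B | tr "\\n" "$" pipeline)."""
    by_id = defaultdict(list)
    for id_, text in available_rows:
        by_id[id_].append(text)

    out = []
    for id_, wrong in wrong_rows:
        scored = [(SequenceMatcher(None, wrong, cand).ratio(), cand)
                  for cand in by_id.get(id_, [])]
        if not scored:
            out.append((id_, wrong, ""))  # no candidates for this id
            continue
        best = max(score for score, _ in scored)
        winners = [cand for score, cand in scored
                   if score == best and score >= threshold]
        out.append((id_, wrong, "$".join(winners)))
    return out
```

Writing the result back out is then just `csv.writer(..., delimiter="\t")` over the returned triples; since SequenceMatcher works on Unicode strings, the UTF-8 ids need no special handling.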