EDIT: the files were changed to TSV to deal better with spaces in the text fields.
I have two CSV files of the following form:
File 1: availableText.csv (can be very big)
"id1" , "text1-1"
"id1" , "text1-2"
"id1" , "text1-3"
"id1" , "text1-4"
"id2" , "text2-1"
"id2" , "text2-2"
"id2" , "text2-3"
"id2" , "text2-4"
...
File 2: wrongText.csv
"id1" , "texta"
"id2" , "textb"
"id3" , "textc"
"id4" , "textd"
...
For every line in wrongText.csv, I want to filter the available text entries with the same id and suggest the best available option using tre-agrep (a grep-like tool that allows errors in the pattern; its -B option returns the best match).
For example, for id1:
tre-agrep -B 'texta' (from text1-1:4) | tr "\n" "$"
(this would produce something like 'text1-2$text1-4')
The desired output file would be like this:
"id1" , "texta" , "text1-2$text1-4"
"id2" , "textb" , "text2-1$text2-3$text2-4"
Note:
- The files can be converted to any format; the text may contain spaces (but no special characters)
- The IDs do contain both special characters and UTF-8
- Speed does not matter (for now at least)
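If the shell pipeline around tre-agrep proves awkward to glue together, the per-id grouping and best-match selection can be sketched in Python. This is only a sketch of the logic, not your exact tool: difflib's SequenceMatcher stands in for tre-agrep's approximate matching, and the `threshold` cutoff is an assumption you would tune.

```python
import csv
from collections import defaultdict
from difflib import SequenceMatcher

def load_tsv(path):
    """Read (id, text) rows from a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]

def best_matches(available_rows, wrong_rows, threshold=0.5):
    """For each (id, wrong_text), rank that id's candidate texts by
    similarity and keep the best-scoring ones, joined with '$'
    (mirroring the tre-agrep -B | tr "\\n" "$" pipeline)."""
    by_id = defaultdict(list)
    for id_, text in available_rows:
        by_id[id_].append(text)

    out = []
    for id_, wrong in wrong_rows:
        scored = [(SequenceMatcher(None, wrong, cand).ratio(), cand)
                  for cand in by_id.get(id_, [])]
        if not scored:
            out.append((id_, wrong, ""))  # no candidates for this id
            continue
        best = max(score for score, _ in scored)
        winners = [cand for score, cand in scored
                   if score == best and score >= threshold]
        out.append((id_, wrong, "$".join(winners)))
    return out
```

Writing the result back out is then just `csv.writer(..., delimiter="\t")` over the returned triples; since SequenceMatcher works on Unicode strings, the UTF-8 ids need no special handling.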