I have a file full of long SQL queries, one per line. I need to create a list of unique queries, but most of the queries include parameter values that make using an exact matching tool like uniq impossible. Is there a way to find unique lines "fuzzily", like agrep?
Asked
Active
Viewed 178 times
1
Matt Alexander
- 725
- 1
- 7
- 20
-
4It would be handy if you can show some sample input data, and what result you expect. Did you try anything that came close? – Bernhard Apr 29 '14 at 11:55
-
aaa, aab, abb, bbb have only one difference between one and the next. Should they be considered all the same? Which one to pick? – Stéphane Chazelas Apr 29 '14 at 12:08
-
It should be more like a percentage. For example, those would all be different if the percentage was set to 5% difference. – Matt Alexander Apr 29 '14 at 12:22
-
3That would still be the same. _approximative comparison_ is not a transitive order (a = b and b = c doesn't necessarily mean a = c if the condition to be equal is that there be less than 5% difference, so if you have a = b and b = c (but possibly a != c), which one(s) should that _`uniq`_ return?). It would definitely help to have a sample input and expected output. – Stéphane Chazelas Apr 29 '14 at 12:41
-
1You might have better luck on http://dba.stackexchange.com/ for an actual DB solution to this problem. I would recast your Q asking it in DB terms and leave this one here asking for a solution using `uniq`, `agrep`, etc. since there are 2 paths to take for you, there seems to be 2 potential Q's. Just don't copy/paste this verbatim there or you'll likely get them both closed as cross-posts which isn't discouraged on SE sites. Also you might want to show some of these queries so we can better assist you. – slm Apr 30 '14 at 13:23
2 Answers
1
If the queries are predictable enough, maybe you could simply sed out the parameter values--e.g. if many queries contain equality comparison with numbers, sed 's/=[[:digit:]]+//g' would remove all the actual numbers, leaving only the column names.
Otherwise, the only really general solutions I can think of are pattern recognition techniques like k-nearest neighbors, which can classify arbitrary lists of strings into clusters based on similarity.
Aaron Davies
- 274
- 1
- 5
1
You might have luck running every query through EXPLAIN ANALYZE and finding unique results in the query plans.
Dan Stoner
- 94
- 2