A combination of uniq and agrep?

Question

I have a file full of long SQL queries, one per line. I need to create a list of unique queries, but most of the queries include parameter values that make using an exact matching tool like uniq impossible. Is there a way to find unique lines "fuzzily", like agrep?

It would be handy if you can show some sample input data, and what result you expect. Did you try anything that came close? — Bernhard, Apr 29 '14 at 11:55
aaa, aab, abb, bbb have only one difference between one and the next. Should they be considered all the same? Which one to pick? — Stéphane Chazelas, Apr 29 '14 at 12:08
It should be more like a percentage. For example, those would all be different if the percentage was set to 5% difference. — Matt Alexander, Apr 29 '14 at 12:22
That would still be the same. _approximative comparison_ is not a transitive order (a = b and b = c doesn't necessarily mean a = c if the condition to be equal is that there be less than 5% difference, so if you have a = b and b = c (but possibly a != c), which one(s) should that _`uniq`_ return?). It would definitely help to have a sample input and expected output. — Stéphane Chazelas, Apr 29 '14 at 12:41
You might have better luck on http://dba.stackexchange.com/ for an actual DB solution to this problem. I would recast your Q asking it in DB terms and leave this one here asking for a solution using `uniq`, `agrep`, etc. since there are 2 paths to take for you, there seems to be 2 potential Q's. Just don't copy/paste this verbatim there or you'll likely get them both closed as cross-posts which isn't discouraged on SE sites. Also you might want to show some of these queries so we can better assist you. — slm, Apr 30 '14 at 13:23

score 1 · Accepted Answer · answered Apr 29 '14 at 15:49

If the queries are predictable enough, maybe you could simply sed out the parameter values--e.g. if many queries contain equality comparison with numbers, sed 's/=[[:digit:]]+//g' would remove all the actual numbers, leaving only the column names.

Otherwise, the only really general solutions I can think of are pattern recognition techniques like k-nearest neighbors, which can classify arbitrary lists of strings into clusters based on similarity.

This is exactly what I ended up doing. – Matt Alexander Apr 30 '14 at 14:48 — Matt Alexander, Apr 30 '14 at 14:48

score 1 · Answer 2 · answered Apr 30 '14 at 12:33

1

You might have luck running every query through EXPLAIN ANALYZE and finding unique results in the query plans.

answered Apr 30 '14 at 12:33

Dan Stoner

94
2

A combination of uniq and agrep?

2 Answers2