1

I have a file full of long SQL queries, one per line. I need to create a list of unique queries, but most of the queries include parameter values that make using an exact matching tool like uniq impossible. Is there a way to find unique lines "fuzzily", like agrep?

Matt Alexander
  • 725
  • 1
  • 7
  • 20
  • 4
    It would be handy if you can show some sample input data, and what result you expect. Did you try anything that came close? – Bernhard Apr 29 '14 at 11:55
  • aaa, aab, abb, bbb have only one difference between one and the next. Should they be considered all the same? Which one to pick? – Stéphane Chazelas Apr 29 '14 at 12:08
  • It should be more like a percentage. For example, those would all be different if the percentage was set to 5% difference. – Matt Alexander Apr 29 '14 at 12:22
  • 3
    That would still be the same. _approximative comparison_ is not a transitive order (a = b and b = c doesn't necessarily mean a = c if the condition to be equal is that there be less than 5% difference, so if you have a = b and b = c (but possibly a != c), which one(s) should that _`uniq`_ return?). It would definitely help to have a sample input and expected output. – Stéphane Chazelas Apr 29 '14 at 12:41
  • 1
    You might have better luck on http://dba.stackexchange.com/ for an actual DB solution to this problem. I would recast your Q asking it in DB terms and leave this one here asking for a solution using `uniq`, `agrep`, etc. since there are 2 paths to take for you, there seems to be 2 potential Q's. Just don't copy/paste this verbatim there or you'll likely get them both closed as cross-posts which isn't discouraged on SE sites. Also you might want to show some of these queries so we can better assist you. – slm Apr 30 '14 at 13:23

2 Answers2

1

If the queries are predictable enough, maybe you could simply sed out the parameter values--e.g. if many queries contain equality comparison with numbers, sed 's/=[[:digit:]]+//g' would remove all the actual numbers, leaving only the column names.

Otherwise, the only really general solutions I can think of are pattern recognition techniques like k-nearest neighbors, which can classify arbitrary lists of strings into clusters based on similarity.

Aaron Davies
  • 274
  • 1
  • 5
1

You might have luck running every query through EXPLAIN ANALYZE and finding unique results in the query plans.

Dan Stoner
  • 94
  • 2