
I am using tre-agrep (manpage), an implementation of agrep (manpage), to perform approximate pattern matching. This utility searches for matches based on Levenshtein distance, and the user can configure the penalty applied for substitution, insertion, or deletion edits.

I would like, however, to apply weights differentially across the length of the query, namely with a lower weight for deletions at the beginning (left end) of the query than at the right end. The man page for this utility does not indicate that such a level of control is possible.

Are there other command line tools where approximate matching with finer control over the mismatch penalties is possible?

Peter Gerhat
user001
  • AFAIK, agrep is the only one. I'm surprised you even know about it, given its relative obscurity in the UNIX world (which is too bad). In theory, you can adjust these weights in the source code, but whether or not that is practical, I don't know. Have you tried contacting the authors of the tools or even the original papers on which they are based? Mind you, they're probably old farts now :) – Otheus Nov 09 '15 at 16:25
  • @Otheus Old farts are still able to write code ;-) – Kusalananda Jul 23 '16 at 15:14
  • It would not be difficult to write a Levenshtein-matching utility with insert/delete/replace costs defined as expressions in Python or Awk. The tedious part, really, is all the possible command-line options. If the OP is willing to show a typical command line, and tell which options of `agrep` they actually need, I could probably whip up something. Calculating the Levenshtein distance of two strings is very easy, really. I'd suggest a shell script wrapped around GNU awk invocation. – Nominal Animal Sep 26 '16 at 23:08

1 Answer


No. That kind of customization falls outside the scope of a standard Linux tool and into the territory of writing your own code. A popular high-level language (Java, JavaScript, Python, Perl) will use a bit more memory than C, and the scripting languages will run a bit slower, but that is likely negligible for your use case. Re-ask on Stack Overflow with the exact details you need and someone might offer you a one-liner.
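To give an idea of how little code this takes, here is a minimal Python sketch of approximate substring matching where the deletion penalty depends on the position in the query. It is not how tre-agrep works internally. The names `best_match_cost` and `ramp_deletion_cost` are made up for illustration, and I am assuming that "deletion" means a query character that is missing from the text.

```python
from typing import Callable


def ramp_deletion_cost(query_len: int, low: float = 0.25, high: float = 1.0) -> Callable[[int], float]:
    """Hypothetical weighting: deletions are cheap at the left end of the
    query (low) and rise linearly to the full penalty at the right end (high)."""
    def cost(i: int) -> float:
        if query_len <= 1:
            return high
        return low + (high - low) * i / (query_len - 1)
    return cost


def best_match_cost(
    query: str,
    text: str,
    del_cost: Callable[[int], float],
    ins_cost: float = 1.0,
    sub_cost: float = 1.0,
) -> float:
    """Lowest weighted edit cost of matching `query` against any substring of `text`.

    del_cost(i) is the penalty for dropping query[i] (assumed meaning of
    "deletion"); ins_cost is charged for an extra character in the text;
    sub_cost for a mismatched character.
    """
    m = len(query)
    # prev[i]: cost of aligning query[:i] with an empty stretch of text,
    # i.e. deleting the first i query characters.
    prev = [0.0] * (m + 1)
    for i in range(1, m + 1):
        prev[i] = prev[i - 1] + del_cost(i - 1)
    best = prev[m]
    for ch in text:
        # cur[0] stays 0 so that a match may start at any text position.
        cur = [0.0] * (m + 1)
        for i in range(1, m + 1):
            match = prev[i - 1] + (0.0 if query[i - 1] == ch else sub_cost)
            delete = cur[i - 1] + del_cost(i - 1)  # query[i-1] missing from the text
            insert = prev[i] + ins_cost            # extra character in the text
            cur[i] = min(match, delete, insert)
        best = min(best, cur[m])  # a match may end at any text position
        prev = cur
    return best
```

With `ramp_deletion_cost(6)` and the query `abcdef`, the text `xxbcdefxx` (first query character missing) scores 0.25, while `xxabcdexx` (last query character missing) scores 1.0. A grep-like filter is then a loop over the input lines that prints each line whose `best_match_cost` is at or below a chosen threshold.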

user1133275