I have a directory that contains a large number of files, and I often forget a file's exact name, so when I look for a file I can't find it.
Is there a tool that uses the Soundex algorithm for searching? That would help in my case.
This is an answer written for my own curiosity. You should probably build something out of the suggestions in the answers to "Is there a Unix command that searches for similar strings, based mostly on how they sound when spoken?" (the Perl Text::Soundex module) instead of using this.
The following shell script and accompanying sed script do a Soundex filename search in the directory tree rooted at the current directory, given a search string on the command line.
$ sh soundex.sh fissbux
./fizzbuzz
./fizzbuzz.c
./fizzbuzz2
./fizzbuzz2.c
$ sh soundex.sh sharlok
./HackerRank/Algorithms/02-Implementation/17-sherlock_and_squares.c
$ sh soundex.sh sundek
./soundex.sh
./soundex.sed
The shell script (soundex.sh):
#!/bin/sh
soundex=$( printf '%s\n' "$1" | tr 'a-z' 'A-Z' | sed -f soundex.sed )
find . -exec bash -c '
paste <( printf "%s\n" "${@##*/}" | tr "a-z" "A-Z" | sed -f soundex.sed ) \
<( printf "%s\n" "$@" ) |
awk -vs="$0" "\$1 == s" | cut -f 2-' "$soundex" {} +
The script calculates the soundex value for the search term using the sed script (below). It then uses find to find all names in the current directory or below and calculates the soundex value for each in the same way as for the search term. If a soundex value for a filename matches that of the search term, the full path to that file is printed.
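The matching step at the end of that pipeline can be tried in isolation. The sketch below feeds a few code/path pairs through the same awk and cut filter. The function name filter_by_code is made up for this example and is not part of soundex.sh.

```shell
#!/bin/sh
# Hypothetical helper (not in soundex.sh): read "CODE<TAB>path" lines,
# keep those whose first field equals the wanted code, and print only
# the path part (everything after the first tab).
filter_by_code () {
    awk -v s="$1" '$1 == s' | cut -f 2-
}

# Three code/path pairs; only the two F212 entries should come out.
printf '%s\t%s\n' F212 ./fizzbuzz S532 ./soundex.sh F212 ./fizzbuzz.c |
    filter_by_code F212
```

Because cut splits on tabs only, a path containing spaces is printed whole.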
I admit that the shell script is a bit basic. For example, it may be improved by adding the absolute path to the soundex.sed script. As it is written now, it requires that the sed script is in the current directory. It also does not support filenames containing newlines.
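One way to drop the current-directory requirement is sketched below. It assumes that soundex.sed is stored next to soundex.sh and that $0 holds a usable path to the script (as it does when the script is run as sh /path/to/soundex.sh). The helper name dir_of and the variable name sedscript are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the absolute directory containing the file
# named by $1. The subshell keeps the cd from affecting the caller.
dir_of () {
    ( CDPATH= cd -- "$( dirname -- "$1" )" && pwd )
}

# Assumption: soundex.sed lives in the same directory as this script.
sedscript=$( dir_of "$0" )/soundex.sed

# Exported so that the inline bash script started by find can see it too.
export sedscript
```

Both sed -f soundex.sed calls in the script would then become sed -f "$sedscript".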
The sed script (soundex.sed):
s/[^[:alpha:]]//g
h
s/^\(.\).*$/\1/
x
y/bfpvBFPVcgjkqsxzCGJKQSXZdtDTlLmnMNrR/111111112222222222222222333344555566/
s/\([1-6]\)[hwHW]\1/\1/g
s/\([1-6]\)\1\1*/\1/g
s/[aeiouyhwAEIOUYHW]/!/g
s/^.//
H
x
s/\n//
s/!//g
s/^\(....\).*$/\1/
s/^\(...\)$/\10/
s/^\(..\)$/\100/
s/^\(.\)$/\1000/
This implements "American Soundex" as described in Wikipedia. It does not modify the initial character (apart from deleting it if it's not alphabetic), which is why I uppercase the strings with tr in the shell script.
This has not been thoroughly tested, but seems to correctly handle the names mentioned in the Wikipedia article.
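A quick way to repeat that check is sketched here. The sed commands are the ones from soundex.sed, inlined so the snippet runs without the file. The function name soundex is made up for this example, and the expected codes are the ones given for these names in the Wikipedia article.

```shell
#!/bin/sh
# Print the Soundex code of each argument, one per line.
# Input is uppercased first, as in soundex.sh.
soundex () {
    printf '%s\n' "$@" | tr 'a-z' 'A-Z' | sed '
        s/[^[:alpha:]]//g
        h
        s/^\(.\).*$/\1/
        x
        y/bfpvBFPVcgjkqsxzCGJKQSXZdtDTlLmnMNrR/111111112222222222222222333344555566/
        s/\([1-6]\)[hwHW]\1/\1/g
        s/\([1-6]\)\1\1*/\1/g
        s/[aeiouyhwAEIOUYHW]/!/g
        s/^.//
        H
        x
        s/\n//
        s/!//g
        s/^\(....\).*$/\1/
        s/^\(...\)$/\10/
        s/^\(..\)$/\100/
        s/^\(.\)$/\1000/
    '
}

# name:expected pairs; report whether each computed code matches.
for pair in Robert:R163 Rupert:R163 Rubin:R150 Ashcraft:A261 \
            Ashcroft:A261 Tymczak:T522 Pfister:P236 Honeyman:H555
do
    name=${pair%%:*}
    want=${pair#*:}
    got=$( soundex "$name" )
    if [ "$got" = "$want" ]; then status=ok; else status=FAIL; fi
    printf '%s %s %s\n' "$status" "$name" "$got"
done
```

This only covers a handful of names, so it is a sanity check rather than a test suite.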
Annotated version (the "steps" refer to the steps in the above-mentioned Wikipedia article):
# Remove non-alphabetic characters
s/[^[:alpha:]]//g
# STEP 1 (part 1: retain first character)
# Save whole line in hold-space
h
# Delete everything but the first character and swap with hold-space
s/^\(.\).*$/\1/
x
# The hold-space now contains only the first character
# STEP 2
y/bfpvBFPVcgjkqsxzCGJKQSXZdtDTlLmnMNrR/111111112222222222222222333344555566/
# STEP 3
s/\([1-6]\)[hwHW]\1/\1/g
s/\([1-6]\)\1\1*/\1/g
# STEP 1 (part 2: remove vowels etc.)
# We don't actually remove them but "mask" them with "!"
# This avoids accidentally deleting the first character later
s/[aeiouyhwAEIOUYHW]/!/g
# Replace first character with the one saved in the hold-space
# Delete first character
s/^.//
# Append pattern-space to hold-space and swap
H
x
# Remove newline inserted by "H" above and all "!" (old vowels etc.)
s/\n//
s/!//g
# STEP 4
s/^\(....\).*$/\1/
s/^\(...\)$/\10/
s/^\(..\)$/\100/
s/^\(.\)$/\1000/
Searching with soundex values mostly comes down to luck.
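Also, this prints the soundex value next to each name in the current directory (without the tr step, so the initial character keeps its case):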
$ paste <( printf '%s\n' * | sed -f soundex.sed ) <( printf '%s\n' * )
F236 Factorio
F230 Fasta
G500 Game
H265 HackerRank
K200 KEYS
L210 Lisp
P625 Parsing
P315 Pathfinder
P315 Pathfinder.tar.xz
Q000 QA
R165 Reformat
R123 Repositories
R564 RimWorld
S613 Scripts
U523 UNIX.dot
U521 UNIX.png
U523 UNIX.txt
W620 Work
a526 answers.txt
c313 cat-food-schedule.txt
f212 fizzbuzz
f212 fizzbuzz.c
f212 fizzbuzz2
f212 fizzbuzz2.c
p363 poetry.txt
q235 questions.txt
r200 rc
s532 soundex.sed
s532 soundex.sh
u313 utp-1.0.tar.gz