
I need to produce a clean text document, and my first approach is to use aspell. The problem is that I need it to run in batch, not in interactive mode. Every txt file is piped to aspell, which must return a new document with the non-dictionary words deleted.

I've found just the inverse behaviour: list the non-dictionary words using

aspell list < "$file" | sort -u -f

Is aspell the right tool for producing a folder of cleaned documents? And what about automatically substituting misspelled words (using a predefined list file)?

Jeff Schaller
jomaweb

1 Answer

sed -E -e "s/$(aspell list <file | sort -u | paste -s -d'|' |
               sed -e 's/^/\\b(/; s/$/)\\b/' )//g" \
    file > newfile

This uses command substitution $(...) to insert the output of aspell list <file into a sed search and replace operation.

aspell's output is also sorted with duplicates removed (sort -u), and paste is used to join the lines with |. Finally, the result is piped through sed to add \b word-boundary anchors as well as opening and closing parentheses. Together this constructs a valid extended regular expression like \b(word1|word2|word3|...)\b to use as the search regexp in the sed search and replace command.
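
To see the regexp construction in isolation, here is a minimal sketch that feeds a fixed word list through the same sort, paste and sed steps, so it runs without aspell. The function name build_regex and the sample words are made up for illustration, and \b is a GNU sed extension, so this assumes GNU sed.

```shell
# build_regex (hypothetical name): read one word per line on stdin and
# print an extended regexp of the form \b(word1|word2|...)\b
build_regex() {
    sort -u | paste -s -d'|' - | sed -e 's/^/\\b(/; s/$/)\\b/'
}

# Sample words standing in for the output of `aspell list`.
regex=$(printf '%s\n' teh ship | build_regex)
printf '%s\n' "$regex"

# "citizenship" survives because \b only matches at a word boundary.
printf '%s\n' 'teh citizenship of a ship' | sed -E -e "s/${regex}//g"
```

Deleted words leave their surrounding spaces behind, so the output can contain doubled, leading or trailing blanks.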

You can test the result of the entire command with, e.g., diff -u file newfile
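
Since the question asks for batch use on a whole folder, the one-liner could be wrapped roughly as follows. This is only a sketch: the function names clean_file and clean_dir and the folder arguments are assumptions, it relies on GNU sed for \b, and it assumes the flagged words contain no regexp metacharacters or slashes.

```shell
# clean_file SRC DST (hypothetical helper): write SRC to DST with every
# word reported by `aspell list` removed.
clean_file() {
    src=$1 dst=$2
    pattern=$(aspell list < "$src" | sort -u | paste -s -d'|' -)
    if [ -z "$pattern" ]; then
        cp -- "$src" "$dst"      # nothing flagged: copy the file unchanged
    else
        sed -E -e "s/\\b(${pattern})\\b//g" "$src" > "$dst"
    fi
}

# clean_dir IN OUT (hypothetical helper): clean every .txt file found
# directly in IN and write the result under the same name in OUT.
clean_dir() {
    in=$1 out=$2
    mkdir -p -- "$out"
    for f in "$in"/*.txt; do
        [ -e "$f" ] || continue  # skip the literal pattern if no file matches
        clean_file "$f" "$out/$(basename -- "$f")"
    done
}
```

It would be called as, e.g., clean_dir ./docs ./docs-clean (both folder names are placeholders). Writing to a separate folder keeps the originals around for the diff check mentioned above.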

AFAIK, aspell doesn't have an auto-correct mode. This is probably a Good Thing.
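
For the predefined-list substitution the question mentions, one possible approach that does not involve aspell at all is to turn the list into a sed script. The sketch below assumes a hypothetical tab-separated file with one wrong<TAB>right pair per line, words free of regexp metacharacters, slashes and ampersands, and GNU sed for \b; the function name apply_corrections is made up.

```shell
# apply_corrections LIST SRC (hypothetical helper): print SRC with every
# "wrong<TAB>right" pair from LIST applied as a whole-word substitution.
apply_corrections() {
    list=$1 src=$2
    # Turn each pair into one sed command of the form s/\bwrong\b/right/g
    script=$(awk -F '\t' 'NF == 2 { printf "s/\\b%s\\b/%s/g\n", $1, $2 }' "$list")
    sed -E -e "$script" "$src"
}
```

For example, apply_corrections fixes.tsv file > newfile, where fixes.tsv is a placeholder name for the list file. Only words you listed explicitly are touched, which avoids the guesswork an automatic corrector would involve.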

cas
  • Hi cas, tested your code but the file comes out untouched – jomaweb May 12 '16 at 15:26
  • Try the updated version. The first had two problems - 1. `aspell` reads from stdin, not a file 2. `grep -v` would never have done what you want, it would have removed the entire line on any match, not just the matching word. – cas May 13 '16 at 00:01
  • Updated version just strips words, but it also rips apart words that contain a flagged word: e.g. citizenship would be converted to citizen if ship is not in the dictionary. That is too bad – jomaweb May 13 '16 at 11:48
  • ok, that just means the regexp needs to be further modified to have word boundary anchors....i really should have thought of that earlier. i'll update my answer. – cas May 13 '16 at 14:24