17

Is there a command-line tool to text-search a docx file? I tried grep, but it doesn't work with docx even though it works fine with txt and xml files. I could convert the docx to txt first, but I'd prefer a tool that operates directly on docx files. I need the tool to work under Cygwin.

OP edit: Later I found that the easiest approach is actually to convert the docx files to txt and then grep over them.

Gilles 'SO- stop being evil'
RoundPi

5 Answers

4

I know of several indexing tools that support Word documents. Such tools let you index documents, then efficiently search for words in the index. They don't permit full-text searches.

Gilles 'SO- stop being evil'
4

My grep solution, as a function you can paste into your .bashrc:

docx_search(){ local arg terms=() root=${root:-/}; for arg; do terms+=(-e "$arg"); done; find 2>/dev/null "${root%/}/" -iname '*.docx' -exec bash -c "$(declare -p terms)"'; for arg; do unzip -p "$arg" 2>/dev/null | grep --quiet --ignore-case --fixed-strings "${terms[@]}" && printf %s\\n "$arg"; done' _ {} +; }

It looks for any (case-insensitive) occurrence of any of its arguments and prints the location of each matching docx file.


Examples:

$ docx_search 'my example sentence'
/cygdrive/d/example sentences.docx
/cygdrive/c/Users/my user/Documents/example sentences.docx
$ root='/cygdrive/c/Users/my user/' docx_search 'seldom' 'full sentence'
/cygdrive/c/Users/my user/Documents/example sentences.docx
$ 

Readable version:

docx_search(){
  local arg terms=() root=${root:-/}
  # this 'root' assignment allows you to search in a specific location like /cygdrive/c/ instead of everywhere on the machine
  for arg; do terms+=(-e "$arg"); done
  # We inject the search terms into the bash -c script string with declare -p
  find 2>/dev/null "${root%/}/" -iname '*.docx' -exec \
    bash -c "$(declare -p terms)"';
      for arg; do
        unzip -p "$arg" 2>/dev/null |
          grep --quiet --ignore-case --fixed-strings "${terms[@]}" &&
          printf %s\\n "$arg"
      done' _ {} +
}
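
To see what the declare -p trick does in isolation, here is a minimal sketch (the function name demo_declare_p is made up for this illustration and is not part of the answer's function). declare -p prints a shell-quoted declaration of the array; prepending that text to the bash -c script recreates the array inside the child shell, which is how the search terms survive the trip through find -exec.

```shell
# Hypothetical demo: rebuild the caller's "terms" array inside a child bash.
demo_declare_p() {
  local arg terms=()
  # Same construction as in docx_search: one -e option per search term.
  for arg; do terms+=(-e "$arg"); done
  # "$(declare -p terms)" expands to a "declare -a terms=(...)" statement;
  # the child shell runs it first, then prints the element count and elements.
  bash -c "$(declare -p terms)"'; printf "%s\n" "${#terms[@]}" "${terms[@]}"'
}
```

Calling demo_declare_p 'full sentence' "it's" prints 4, then -e, full sentence, -e and it's on separate lines, so spaces and quotes in the terms arrive intact.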
Camusensei
2

DOCX is a compressed format, not a text format, so what you need first is a converter. After that you can use the find command on the converted file(s).
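
As a rough sketch of the convert-first approach (not part of the original answer): a .docx file is a zip archive whose main body text lives in the member word/document.xml, so a crude converter can extract that member and strip the markup. The names docx_xml_to_text and docx2txt_sketch are made up for illustration, the \n in the sed replacement assumes GNU sed, and XML entities such as &amp; are left undecoded.

```shell
# Turn WordprocessingML read from stdin into rough plain text:
# each closing paragraph tag becomes a newline, then every remaining tag is dropped.
# Assumes GNU sed (for \n in the replacement text).
docx_xml_to_text() {
  sed -e 's#</w:p>#\n#g' -e 's#<[^>]*>##g'
}

# Hypothetical wrapper: print the body text of the .docx file given as $1.
docx2txt_sketch() {
  unzip -p "$1" word/document.xml | docx_xml_to_text
}
```

The output can then be searched like any text stream, for example with docx2txt_sketch report.docx | grep -i 'some phrase' (report.docx being a placeholder file name). Unlike searching the raw XML, this also finds phrases whose words were split across formatting runs.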

Nils
  • Or you can use a search tool that can read inside compressed files. In your last sentence, I suppose you meant `grep`? – Gilles 'SO- stop being evil' Jan 06 '12 at 23:32
  • @Gilles - look at the original title of the question before Michael edited it. This seemed to be a question about DOS (and I flagged it off-topic). – Nils Jan 07 '12 at 20:14
0

Have you looked at OpenOffice Ninja?
(I don't know about Cygwin support.)

bsd
0

Here's an updated version optimized for performance.

It requires ripgrep and fd-find. Here's how to install them if you do not have them.

fd-find:

sudo apt install fd-find

ripgrep:

curl -LO https://github.com/BurntSushi/ripgrep/releases/download/13.0.0/ripgrep_13.0.0_amd64.deb
sudo apt install ./ripgrep_13.0.0_amd64.deb

Paste this in your .bashrc:


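docxgrep() {
    # keep the variables local so they do not leak into the interactive shell
    local keyword="$1" arg

    # fdfind is the name of the fd binary installed by the fd-find package
    /usr/bin/fdfind -t f -e docx . | while IFS= read -r arg; do
        # IFS= and -r keep leading spaces and backslashes in file names intact
        if unzip -p "$arg" 2>/dev/null | rg -q --ignore-case --fixed-strings -- "$keyword"; then
            printf '%s\n' "$arg"
        fi
    done
}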

Run source ~/.bashrc. Now we can search:

$ docxgrep 'hello'
./Document.docx
Cyrill