I am trying to convert a .docx file that I received by email into a proper PDF using pandoc (I am on GNU/Linux).

I get a character-encoding error:

$ pandoc file.docx -o file.pdf
pandoc: Cannot decode byte '\x87': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream

I tried to identify the encoding:

$ file -i file.docx
file.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary

I am a little surprised by charset=binary (I was expecting charset=iso8859-15). However, I tried converting the .docx to UTF-8 anyway, and it does not work:

$ iconv -t utf-8 file.docx
P!      $iconv: illegal input sequence at position 16

I get the same error with the command line from the pandoc documentation:

iconv -t utf-8 file.docx | pandoc | iconv -f utf-8

How can I convert this .docx to pdf with pandoc?

Gilles 'SO- stop being evil'
ppr
  • Why don't you use [Zamzar](http://www.zamzar.com/) - for a one off... I have to use [Kingsoft](http://wps-community.org/) to edit my work, though it is probably illegal to use in North America... – Wilf Dec 17 '13 at 16:54
  • I suggest providing `iconv` a source character set, using the `-f` flag. For example, `iconv -f ISO-8859-15 -t utf-8 file.docx` might work. No idea what the format of a .docx file is, though. – derobert Dec 17 '13 at 17:00
  • @wilf I tried. The output is not correct (normally, Zamzar does its job very well, but not for this file). – ppr Dec 17 '13 at 17:01
  • [here](http://johnmacfarlane.net/pandoc/README.html) docx is not listed as a compatible *input* - so you might have to use something else - libreoffice can do an OK job, but can mess up the formatting sometimes. – Wilf Dec 17 '13 at 17:03
  • @wilf thanks (pandoc is so powerful that sometimes I forget it has limitations). – ppr Dec 17 '13 at 17:07
  • Not everything can do everything ;-) – Wilf Dec 17 '13 at 17:11
  • @wilf please post that as an answer... – derobert Dec 17 '13 at 17:16
  • @wilf and I will accept it. – ppr Dec 17 '13 at 17:16
  • Done it :-).... – Wilf Dec 17 '13 at 17:24
  • @derobert: Running `iconv` directly on a `.docx` file is unlikely to work. `iconv` assumes that its input is a *text* file in some specified or inferred format. A `.docx` file is actually a zip file (a compressed archive) containing (mostly) xml files. You might conceivably have some luck unzipping the `.docx` file, running `iconv` on the constituent files, and then re-zipping everything back into a new `.docx`, but I wouldn't bet on it working. For one thing, the xml file containing the actual content of the document specifies its encoding: `encoding="UTF-8"`, for example. – Keith Thompson Dec 17 '13 at 20:55
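
The point made in the last comment can be checked from the shell. This is only a sketch: the function name `is_zip_container` is made up for illustration, and `file.docx` stands for the document from the question. It tests whether a file starts with the ZIP local-file-header signature (`PK` followed by bytes 0x03 0x04), which is why `file` reports `charset=binary` and why `iconv` cannot treat the file as text.

```shell
# Succeed if the file starts with the ZIP signature "PK\003\004".
# is_zip_container is a hypothetical helper name, not a standard tool.
is_zip_container() {
    # Read the first four bytes and render them as hex for a portable comparison.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "504b0304" ]
}

# Assumed usage: file.docx is the document from the question.
if [ -f file.docx ] && is_zip_container file.docx; then
    # List the archive members; the main text normally lives in word/document.xml.
    unzip -l file.docx
fi
```

If the listing shows `word/document.xml`, something like `unzip -p file.docx word/document.xml | head -c 200` should display the XML declaration with its `encoding` attribute.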

2 Answers

In the documentation here, .docx is not listed as a compatible input:

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can read markdown and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, Haddock markup, OPML, and DocBook; and it can write plain text, markdown, reStructuredText, XHTML, HTML 5, LaTeX (including beamer slide shows), ConTeXt, RTF, OPML, DocBook, OpenDocument, ODT, Word docx, GNU Texinfo, MediaWiki markup, EPUB (v2 or v3), FictionBook2, Textile, groff man pages, Emacs Org-Mode, AsciiDoc, and Slidy, Slideous, DZSlides, reveal.js or S5 HTML slide shows. It can also produce PDF output on systems where LaTeX is installed.

Try something else, like LibreOffice, which can read docx, as long as you don't mind a few formatting errors.

EDIT:

The description now lists Word docx among the formats Pandoc can read (along with EPUB and a few other new input formats):

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can read markdown and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, Haddock markup, OPML, Emacs Org-mode, DocBook, txt2tags, EPUB and Word docx; and it can write plain text, markdown, reStructuredText, XHTML, HTML 5, LaTeX (including beamer slide shows), ConTeXt, RTF, OPML, DocBook, OpenDocument, ODT, Word docx, GNU Texinfo, MediaWiki markup, DokuWiki markup, Haddock markup, EPUB (v2 or v3), FictionBook2, Textile, groff man pages, Emacs Org-Mode, AsciiDoc, InDesign ICML, and Slidy, Slideous, DZSlides, reveal.js or S5 HTML slide shows. It can also produce PDF output on systems where LaTeX is installed.


As @evilsoup suggested, this might work:

cd /DIRECTORY/WITH/FILE/IN && libreoffice --headless --convert-to html 'FILE.docx' && pandoc 'FILE.html' -o 'FILE.pdf'

Yes, you can use the libreoffice command with --outdir, but the html output does not always work that way...

I gave this a quick test, and it seemed to work, apart from Pandoc crashing due to a gif image in the document :-)

Wilf
  • Um.... **Word docx** is right there in your quoted text (right after OpenDocument and ODT). That said, docx is still not a well documented format and so, actual compatibility in the open world is.... spotty, shall we say, and your suggestion for LibreOffice (along with the formatting *differences* ) is good. – SuperMagic Dec 17 '13 at 19:27
  • @SuperMagic - it is, in the bit it can **write** to... Highlighted it to make it easier. – Wilf Dec 17 '13 at 20:21
  • If you *really* want a pandoc-style (actually LaTeX-made) PDF, you can also use LibreOffice to convert the docx to html, and then use that as input for pandoc (depending on the competence of the person who made the original document, you may need to remove a bunch of `
    `s from the html).
    – evilsoup Dec 17 '13 at 22:47
  • LibreOffice can do docx as long as you don't mind butchered formatting. – Gilles 'SO- stop being evil' Dec 18 '13 at 00:14
  • Butchered - or non-existent... @Gilles :-) – Wilf Dec 18 '13 at 09:59
  • One can use LibreOffice's direct PDF export: `libreoffice --headless --convert-to pdf inputfile.docx` – andrej May 22 '14 at 07:52
  • On OSX, the executable is called soffice and can be found in /Applications/LibreOffice.app/contents/MacOS/bin. Further details can be found here: http://ask.libreoffice.org/en/question/12084/how-to-convert-documents-to-pdf-on-osx/ – Tim Saylor Jan 21 '15 at 17:41
  • Pandoc now lists Word docx as a supported format in the documentation. – cledoux Apr 13 '15 at 14:43
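
The LibreOffice route from the answer and the comments above could be wrapped roughly as follows. This is a sketch, not a tested recipe: `first_available` is a hypothetical helper, `file.docx` and the `out` directory are placeholders, and it assumes the binary (`libreoffice`, or `soffice` as on OSX) is on `PATH` or is passed as a full path.

```shell
# Print the first command from the argument list that can be executed.
# first_available is a hypothetical helper name used for illustration.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Assumed usage: convert file.docx to PDF in ./out without opening a GUI.
if [ -f file.docx ] && office=$(first_available libreoffice soffice); then
    "$office" --headless --convert-to pdf --outdir out file.docx
fi
```

Replacing `pdf` with `html` gives the intermediate file for the pandoc pipeline in the answer, with the caveat noted there that HTML output combined with `--outdir` does not always behave.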

This still comes up in Google searches, so I wanted to put this on the record: pandoc could not read docx when this question was asked (the error comes from trying to read a binary file as text), but since version 1.13 it can, and it does a pretty good job of it.
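
A minimal sketch of checking for that version before converting. The helper `version_at_least` is made up for illustration and only compares the major and minor components; it assumes the first line of `pandoc --version` has the form `pandoc X.Y...`, and `file.docx` is the document from the question. PDF output still needs a LaTeX installation, as the pandoc description says.

```shell
# Succeed if dotted version $1 is at least $2, comparing major then minor.
# version_at_least is a hypothetical helper; patch levels are ignored.
version_at_least() {
    have_major=${1%%.*}
    want_major=${2%%.*}
    # cut -s prints nothing when there is no dot, so a bare "3" gets minor 0.
    have_minor=$(printf '%s\n' "$1" | cut -s -d. -f2)
    want_minor=$(printf '%s\n' "$2" | cut -s -d. -f2)
    if [ "$have_major" -gt "$want_major" ]; then return 0; fi
    if [ "$have_major" -lt "$want_major" ]; then return 1; fi
    [ "${have_minor:-0}" -ge "${want_minor:-0}" ]
}

# Assumed usage: only attempt the conversion when pandoc can read docx.
if command -v pandoc >/dev/null 2>&1 && [ -f file.docx ]; then
    installed=$(pandoc --version | head -n 1 | awk '{print $2}')
    if version_at_least "$installed" 1.13; then
        pandoc file.docx -o file.pdf
    else
        echo "pandoc $installed is older than 1.13 and cannot read docx" >&2
    fi
fi
```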

jkr
  • Pandoc does not preserve the original design formatting, however. See this post: https://github.com/jgm/pandoc/issues/2206#issuecomment-107994587 – orschiro Jun 02 '15 at 15:51