40

I need to convert PDF files to text so I can search them in bulk from the command line. Is there a converter for Ubuntu, OpenBSD, or a similar system?

A possibly related post: OCR with Ubuntu.

Garrett
otto

4 Answers

43

You have a lot of options!

pdftotext from poppler has already been mentioned.
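For the search use case in the question, here is a minimal sketch, assuming pdftotext from poppler-utils is installed. The function name `pdf_search` and the file name in the example call are made up for illustration. Passing `-` as the output file sends the text to standard output, so it can be piped straight into grep without creating a .txt file.

```shell
# Hypothetical helper: search one PDF for a pattern without writing a .txt file.
# Assumes pdftotext (poppler-utils) is on PATH.
pdf_search() {
    pattern=$1
    file=$2
    # "-" as the output name makes pdftotext write to stdout;
    # -layout asks it to keep the physical layout of the page where it can.
    pdftotext -layout "$file" - | grep -n -- "$pattern"
}

# Example call (report.pdf is a placeholder):
# pdf_search 'invoice' report.pdf
```

The line numbers printed by `grep -n` refer to the extracted text, not to pages of the PDF.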

There's a Haskell program called pdf2line which works well.

calibre's ebook-convert command-line program (or calibre itself) is another option; it can convert PDF to plain text or to other ebook formats (RTF, ePub). In my opinion it generates better results than pdftotext, although it is considerably slower.

ebook-convert file.pdf file.txt

AbiWord can convert between any formats it knows from the command-line, and at least optionally has a PDF import plugin:

abiword --to=txt file.pdf

Yet another option is podofotxtextract from the PoDoFo PDF tools library. I haven't really tried that.

If you combine the two Ghostscript tools, pdf2ps and ps2ascii, you have yet another option.
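A rough sketch of that two-step route, assuming your Ghostscript installation ships both the pdf2ps and ps2ascii scripts (some versions may not). The function name `gs_pdf_to_text` is hypothetical; an intermediate PostScript file is written to a temporary location and removed afterwards.

```shell
# Hypothetical helper: PDF -> PostScript -> plain text via Ghostscript's scripts.
# Assumes pdf2ps and ps2ascii are on PATH.
gs_pdf_to_text() {
    pdf=$1
    txt=$2
    tmp_ps=$(mktemp) || return 1
    # Step 1: PDF to PostScript; step 2: PostScript to ASCII text.
    pdf2ps "$pdf" "$tmp_ps" && ps2ascii "$tmp_ps" "$txt"
    status=$?
    # Clean up the intermediate file whether or not the conversion worked.
    rm -f "$tmp_ps"
    return "$status"
}

# Example call (file names are placeholders):
# gs_pdf_to_text file.pdf file.txt
```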

I can actually think of a few more methods, but I'll leave it at that for now. ;)

frabjous
  • 1
    calibre's ebook-convert... have you *seen* what it does to ligatures? bleargh. let's put it this way: it's not a very e ective program. pdftotext is much more faithful. i have never discovered any errors in its output. – ixtmixilix Feb 12 '12 at 01:23
  • 2
    You can use [less](http://unixhelp.ed.ac.uk/CGI/man-cgi?less) for viewing pdf-files as text. It invokes a preprocessor, i.e. lesspipe, for invoking pdftotext or similar tools. – Daniel Näslund Mar 13 '12 at 11:37
  • `pdftotext` gives more accurate results than `ebook-convert` and it is very fast. `ebook-convert` is sluggish. – Amit Patel May 26 '15 at 09:56
  • `pdftotext` with `-layout` option rocks! `calibre` requires more than 600mb to install! That's crazy ) – Stalinko Nov 15 '18 at 06:14
10

You can convert PDFs to text on the command line with pdftotext (Ubuntu: poppler-utils; OpenBSD: xpdf-utils package).
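For bulk searching, one possible approach is to convert every PDF once and then search the resulting text files. The sketch below assumes pdftotext is on PATH; the function name `pdfs_to_text` and the directory in the example are placeholders, and file names containing newlines are not handled.

```shell
# Hypothetical helper: write foo.txt next to every foo.pdf under a directory.
# Assumes pdftotext is on PATH.
pdfs_to_text() {
    dir=${1:-.}
    find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # ${pdf%.pdf} strips the extension so the output lands beside the input.
        pdftotext "$pdf" "${pdf%.pdf}.txt"
    done
}

# Example: convert, then list the text files that mention a term
# (~/papers and the search term are placeholders):
# pdfs_to_text ~/papers
# find ~/papers -name '*.txt' -exec grep -l 'search term' {} +
```

Converting up front costs disk space but means later searches do not have to re-run the extraction.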

You can use Recoll (Ubuntu: recoll; OpenBSD: no port, but there's one for FreeBSD) to search inside various formatted document types, including PDF. It has a GUI and builds an index automatically under the hood. It uses pdftotext to convert PDF to text.

Acrobat Reader (at least version 9 under Linux) has a limited multiple-file search capability (you can search in all the files in a directory).

Gilles 'SO- stop being evil'
4

pdftotext is likely what you are looking for: http://en.wikipedia.org/wiki/Pdftotext, unless the text you want to extract is actually stored as an image, which is not that common in PDF documents.

jlliagre
  • Find pdftotext examples at [PDF to TEXT open source command line tool](http://superuser.com/questions/294725/pdf-to-text-open-source-command-line-tool) & [How to convert all pdf files to text (within a folder) with one command?](http://askubuntu.com/questions/211870/how-to-convert-all-pdf-files-to-text-within-a-folder-with-one-command). – kenorb Aug 01 '14 at 09:24
-1

gPDFText converts ebook PDF content into ASCII text, reformatted into long-line paragraphs. It works for me, and it has a graphical interface.

  • 3
    Hi and welcome to the site. We like answers to be a bit more comprehensive here. For example, you could add where `gPDFText` can be obtained, how it can be installed and how it would be used to answer the OP's question. – terdon Aug 07 '14 at 16:45