Is there some sort of PDF-to-text converter?

Question

I need to convert PDF files to text so I can search them in bulk from the command line. Is there a converter for Ubuntu, OpenBSD, or a similar system?

A possibly related post: OCR with Ubuntu.

[Similar question at Super User](http://superuser.com/questions/163182/command-line-tool-to-search-phrases-in-large-number-of-pdf-files) — Gilles 'SO- stop being evil', Dec 11 '10 at 17:25
If it is a "real" PDF (made from text, etc) pdftotext is your best bet. If it is an image, your best bet is some OCR stuff. — vonbrand, Jan 16 '13 at 01:22
[similar question at askubuntu](https://askubuntu.com/q/52040/78103) — Trevor Boyd Smith, May 01 '18 at 12:18
You can uncompress them; see https://unix.stackexchange.com/a/17713/8337 — rogerdpack, Mar 14 '21 at 06:36

score 43 · Answer 1 · answered Dec 11 '10 at 16:26

You have a lot of options!

pdftotext from poppler has already been mentioned.

There's a Haskell program called pdf2line which works well.

calibre's ebook-convert command-line program (or calibre itself) is another option; it can convert PDF to plain text or to other ebook formats (RTF, ePub). In my opinion it generates better results than pdftotext, although it is considerably slower.

ebook-convert file.pdf file.txt

AbiWord can convert between any formats it knows from the command line, and it has a PDF import plugin, at least as an optional extra:

abiword --to=txt file.pdf

Yet another option is podofotxtextract from the podofo PDF tools library. I haven't really tried that.

If you combine the two Ghostscript tools, pdf2ps and ps2ascii, you have yet another option.
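The Ghostscript route could be wrapped up roughly as follows. This is only a sketch: the function name `gs_pdf_to_text` is made up for illustration, and it assumes both `pdf2ps` and `ps2ascii` from a Ghostscript install are on your PATH. It goes through an intermediate PostScript file and removes it afterwards.

```shell
# Hypothetical helper: PDF -> PostScript -> plain text via Ghostscript's scripts.
# Assumes pdf2ps and ps2ascii are installed; the function name is illustrative.
gs_pdf_to_text() {
    pdf=$1
    base=${pdf%.pdf}                    # strip the .pdf suffix
    pdf2ps "$pdf" "$base.ps" &&         # step 1: PDF to PostScript
    ps2ascii "$base.ps" "$base.txt" &&  # step 2: PostScript to ASCII text
    rm -f "$base.ps"                    # drop the intermediate file on success
}

# Usage (file name is a placeholder):
#   gs_pdf_to_text file.pdf    # writes file.txt next to file.pdf
```

If either step fails, the intermediate `.ps` file is left in place so you can inspect it.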

I can actually think of a few more methods, but I'll leave it at that for now. ;)

frabjous

calibre's ebook-convert... have you *seen* what it does to ligatures? bleargh. let's put it this way: it's not a very e ective program. pdftotext is much more faithful. i have never discovered any errors in its output. – ixtmixilix Feb 12 '12 at 01:23
You can use [less](http://unixhelp.ed.ac.uk/CGI/man-cgi?less) for viewing pdf-files as text. It invokes a preprocessor, i.e. lesspipe, for invoking pdftotext or similar tools. – Daniel Näslund Mar 13 '12 at 11:37
`pdftotext` gives more accurate results than `ebook-convert` and it is very fast. `ebook-convert` is sluggish. – Amit Patel May 26 '15 at 09:56
`pdftotext` with `-layout` option rocks! `calibre` requires more than 600mb to install! That's crazy ) – Stalinko Nov 15 '18 at 06:14

Gilles 'SO- stop being evil' · Answer 2 · 2015-10-17T18:13:56.733

You can convert PDFs to text on the command line with pdftotext (Ubuntu: poppler-utils; OpenBSD: xpdf-utils package).
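For the bulk case in the question, a loop along these lines might do. It is a minimal sketch, assuming `pdftotext` is on your PATH; the function name `pdfs_to_text` and the paths in the usage lines are placeholders, not anything from the packages themselves.

```shell
# Hypothetical helper: convert every *.pdf in a directory to a .txt beside it.
# Assumes pdftotext (poppler-utils or xpdf-utils) is installed.
pdfs_to_text() {
    dir=${1:-.}                        # default to the current directory
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # glob matched nothing
        txt="${pdf%.pdf}.txt"
        # -layout tries to keep the page's physical layout in the text
        pdftotext -layout "$pdf" "$txt" || echo "failed: $pdf" >&2
    done
}

# Usage (directory and phrase are placeholders):
#   pdfs_to_text ~/papers
#   grep -il 'search phrase' ~/papers/*.txt
```

Keeping the `.txt` files around means later searches are plain `grep` runs and do not need to re-read the PDFs.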

You can use Recoll (Ubuntu: recoll; OpenBSD: no port, though there is one for FreeBSD) to search inside many document formats, including PDF. It has a GUI and builds an index automatically under the hood. It uses pdftotext to convert PDF to text.

Acrobat Reader (at least version 9 under Linux) has a limited multiple-file search capability (you can search in all the files in a directory).

score 4 · Answer 3 · answered Dec 11 '10 at 14:57

pdftotext is likely what you are looking for: http://en.wikipedia.org/wiki/Pdftotext unless the text you want to extract is actually stored as an image, which is not that common in PDF documents.
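If you would rather not keep converted copies around, pdftotext can write to standard output when the output file is given as `-`, so a search can be done on the fly. A rough sketch, assuming `pdftotext` is installed; the function name `pdf_grep` is invented for this example:

```shell
# Hypothetical helper: print the names of PDFs whose text matches a pattern.
# Assumes pdftotext is on PATH; "-" as the output file means stdout.
pdf_grep() {
    pattern=$1
    shift
    for pdf in "$@"; do
        # -q: only the exit status matters; -i: ignore case
        if pdftotext "$pdf" - 2>/dev/null | grep -qi -- "$pattern"; then
            printf '%s\n' "$pdf"
        fi
    done
}

# Usage (pattern and paths are placeholders):
#   pdf_grep 'search phrase' ~/papers/*.pdf
```

This re-extracts the text on every search, so for a large collection that you search often, converting once is probably faster.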

jlliagre

Find pdftotext examples at [PDF to TEXT open source command line tool](http://superuser.com/questions/294725/pdf-to-text-open-source-command-line-tool) & [How to convert all pdf files to text (within a folder) with one command?](http://askubuntu.com/questions/211870/how-to-convert-all-pdf-files-to-text-within-a-folder-with-one-command). – kenorb Aug 01 '14 at 09:24

score -1 · Answer 4 · answered Aug 07 '14 at 16:32

gPDFText converts ebook PDF content into ASCII text, reformatted into long-line paragraphs. It works for me, and it has a graphical interface.

Charles

Hi and welcome to the site. We like answers to be a bit more comprehensive here. For example, you could add where `gPDFText` can be obtained, how it can be installed and how it would be used to answer the OP's question. – terdon Aug 07 '14 at 16:45
