I need to convert PDF files to text so I can search them in bulk from the command line. Is there a converter for Ubuntu, OpenBSD, or a similar system?
A possibly related post covers OCR with Ubuntu.
You have a lot of options!
pdftotext from poppler has already been mentioned.
There's a Haskell program called pdf2line which works well.
calibre's ebook-convert command-line program (or calibre itself) is another option. It can convert PDF to plain text or to other e-book formats (RTF, ePub). In my opinion it generates better results than pdftotext, although it is considerably slower.
ebook-convert file.pdf file.txt
AbiWord can convert between any formats it knows from the command line, and it has a PDF import plugin, at least as an optional component:
abiword --to=txt file.pdf
Yet another option is podofotxtextract from the PoDoFo PDF tools library. I haven't really tried it.
If you combine the two Ghostscript tools, pdf2ps and ps2ascii, you have yet another option.
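As a rough sketch of how the two Ghostscript tools could be chained, the function below goes through an intermediate PostScript file. The function name pdf_to_text_gs is made up for illustration, and it assumes pdf2ps, ps2ascii and mktemp are on your PATH.

```shell
# Sketch: PDF -> PostScript -> plain text with the Ghostscript helper scripts.
# pdf_to_text_gs is a hypothetical helper name, not part of Ghostscript.
pdf_to_text_gs() {
    pdf=$1
    # Default output name: same as the input, with .pdf replaced by .txt.
    txt=${2:-${pdf%.pdf}.txt}
    # Temporary file for the intermediate PostScript.
    ps=$(mktemp) || return 1
    pdf2ps "$pdf" "$ps" && ps2ascii "$ps" "$txt"
    status=$?
    rm -f "$ps"
    return "$status"
}
```

With that defined, running pdf_to_text_gs file.pdf should leave file.txt next to the original. Expect the layout of the output to differ from what pdftotext produces.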
I can actually think of a few more methods, but I'll leave it at that for now. ;)
You can convert PDFs to text on the command line with pdftotext (Ubuntu: poppler-utils; OpenBSD: xpdf-utils package).
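For the bulk-search use case in the question, something along these lines might do: convert every PDF under a directory once, then search the text copies. This is only a sketch. The function names and the PDF_CONVERTER variable are invented for illustration (the variable just lets you swap in another converter), and file names containing newlines are not handled.

```shell
# Sketch: keep a .txt copy next to every PDF, then search the copies.
# pdfs_to_text, search_pdf_text and PDF_CONVERTER are hypothetical names.
pdfs_to_text() {
    dir=${1:-.}
    find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        txt="${pdf%.pdf}.txt"
        # Convert only when the text copy is missing or older than the PDF.
        # (-nt is a widely supported extension to test, not strict POSIX.)
        if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            "${PDF_CONVERTER:-pdftotext}" "$pdf" "$txt" ||
                echo "conversion failed: $pdf" >&2
        fi
    done
}

# Print the names of the text copies that contain the pattern
# (case-insensitive).
search_pdf_text() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -name '*.txt' -exec grep -l -i -e "$pattern" {} +
}
```

After loading the functions into your shell, pdfs_to_text ~/docs followed by search_pdf_text 'some phrase' ~/docs lists the matching .txt files, each of which sits next to its PDF. Converting once and searching the text copies avoids re-running the converter on every search.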
You can use Recoll (Ubuntu: recoll; OpenBSD: no port, but there's one for FreeBSD) to search inside various formatted text document types, including PDF. There's a GUI, and it builds an index automatically under the hood. It uses pdftotext to convert PDF to text.
Acrobat Reader (at least version 9 under Linux) has a limited multiple-file search capability (you can search in all the files in a directory).
pdftotext is likely what you are looking for (http://en.wikipedia.org/wiki/Pdftotext), unless the text you want to extract is actually stored as images, such as scanned pages, which is not that common in PDF documents.
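One crude way to spot the image-only case is to check whether pdftotext produces any non-blank output at all. The sketch below is a heuristic, not a reliable test, and the names has_text_layer and PDF_CONVERTER are made up for illustration.

```shell
# Heuristic sketch: succeed if the converter emits any non-whitespace text.
# has_text_layer and PDF_CONVERTER are hypothetical names.
has_text_layer() {
    # pdftotext sends the text to stdout when the output file is "-".
    "${PDF_CONVERTER:-pdftotext}" "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

A file for which has_text_layer fails is a candidate for OCR instead. A scanned PDF with a poor-quality embedded text layer would still pass this check.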
gPDFText converts ebook PDF content into ASCII text, reformatted into long-line paragraphs. It works for me, and it has a graphical interface.