OS: Ubuntu.
I need to extract links (or other data, for example the binding layer from a QuarkXPress application) from a PDF as text, in the terminal.
I tried pdftotext, but links do not seem to be exported; pdfgrep behaves the same.
Is there any solution?
Thanks.
You could try extracting the /URI(...) PDF directives by hand, after removing compression (if any) with pdftk:
pdftk file.pdf output - uncompress | grep -aPo '/URI *\(\K[^)]*'
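The grep pattern can be exercised on its own with a minimal, made-up uncompressed PDF fragment (the URL and filename here are hypothetical, just to show what the `\K` lookbehind-style reset captures):

```shell
# Fake fragment of an uncompressed PDF link annotation (hypothetical URL)
printf '<< /Type /Annot /Subtype /Link /A << /S /URI /URI (https://example.com/page) >> >>\n' > sample.txt

# -a treats the file as text, -P enables \K (discard the "/URI (" prefix),
# -o prints only the captured URL up to the closing parenthesis
grep -aPo '/URI *\(\K[^)]*' sample.txt
# prints: https://example.com/page
```

On a real compressed PDF the annotation dictionaries sit inside compressed streams, which is why the `pdftk ... uncompress` step comes first.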
Try this:
pdftotext -raw "filename.pdf" - | grep -E 'https?://'
Writing the text to stdout with - avoids the temporary .txt file entirely. Note that this only finds URLs that appear in the visible page text, not in link annotations.
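A quick sanity check of the URL-grep step, run on plain text standing in for pdftotext output (the filenames and URLs are made up; `-o` is used here to print just the matches rather than whole lines):

```shell
# Stand-in for the text that pdftotext would produce (hypothetical content)
printf 'See https://example.com/docs and http://example.org for details.\n' > page.txt

# Extract each URL up to the next space
grep -Eo 'https?://[^ ]+' page.txt
# prints:
# https://example.com/docs
# http://example.org
```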
First, check whether your PDF is compressed; see:
How to know if a PDF file is compressed or not and to (un)compress it
If it is compressed, uncompress it first.
Then you can extract the links using grep and sed:
strings uncompressed.pdf | grep -Eo '/URI \(.*\)' | sed 's/^\/URI (//g; s/)$//g'
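The strings/grep/sed pipeline can be tried against a fake uncompressed fragment as well (hypothetical URLs; a real uncompressed PDF would contain the same `/URI (...)` dictionaries among its objects):

```shell
# Two fake link-annotation dictionaries (hypothetical URLs)
printf '<< /S /URI /URI (https://askubuntu.com) >>\n<< /S /URI /URI (https://example.net/a) >>\n' > uncompressed.pdf

# strings keeps the printable lines; grep -Eo isolates "/URI (...)";
# sed strips the "/URI (" prefix and the trailing ")"
strings uncompressed.pdf | grep -Eo '/URI \(.*\)' | sed 's/^\/URI (//g; s/)$//g'
# prints:
# https://askubuntu.com
# https://example.net/a
```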