I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and the file came with an automatically generated OCR text layer. How can I remove the text layer in order to get a lighter file (and to get rid of the unnecessary OCR)?
-
It really depends on how the OCR was integrated in the PDF file. Manually: install `mupdf`, use `mutool clean -d -i -f input.pdf output.pdf` to decompress page streams, load the result into a text editor, figure out the structure (read the PDF specification), remove pages (or write a script to remove them), then use `mutool clean -z` to compress again. Needs some practice. The file won't get much lighter, since the images take the most space, so it's probably too difficult and too much effort to be worth doing. – dirkt Jun 12 '17 at 07:59
-
@dirkt, thanks for the comment. Indeed, the file didn't get much lighter: it shrank from 8MB to 7.7MB. I also tried extracting every image from the original file and merging the images again to remove the metadata and text layer, and the size reduction was the same. But at least my ereader stopped showing that annoying spaghettish OCRed text. – Seninha Jun 13 '17 at 14:56
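The "write a script" part of the manual approach in the comments can be sketched roughly as follows. This is a naive illustration, not what `mutool` or any library does: it assumes you have already isolated one decompressed page content stream as a string, and it simply drops every `BT ... ET` text object while skipping string literals, names and comments so that a stray `ET` inside them is not mistaken for the operator. The class and method names (`StripTextObjects`, `stripTextObjects`) are made up for this example. It does not handle inline images, marked-content operators that straddle a text object, or graphics state set inside a text object, and it must not be run over a whole PDF file, since that would invalidate stream lengths and offsets.

```java
// Naive sketch: drop BT ... ET text objects from ONE decompressed content stream.
// All names here are hypothetical; this is not part of mutool or PDFBox.
class StripTextObjects {

    // True for PDF whitespace and delimiter characters, which end a bare token.
    static boolean endsToken(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' '
                || "()<>[]{}/%".indexOf(c) >= 0;
    }

    // Index just past the literal string starting at '(' (handles nesting and escapes).
    static int skipLiteralString(String s, int start) {
        int depth = 0;
        for (int j = start; j < s.length(); j++) {
            char c = s.charAt(j);
            if (c == '\\') {
                j++;                        // skip the escaped character
            } else if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return j + 1;
            }
        }
        return s.length();                  // unterminated string: consume the rest
    }

    // Index just past a run of regular (non-delimiter, non-space) characters.
    static int skipToken(String s, int start) {
        int j = start;
        while (j < s.length() && !endsToken(s.charAt(j))) {
            j++;
        }
        return j;
    }

    // Returns the stream with everything from each BT up to its ET removed.
    static String stripTextObjects(String stream) {
        StringBuilder out = new StringBuilder();
        boolean inText = false;             // true while between BT and ET
        int i = 0;
        while (i < stream.length()) {
            char c = stream.charAt(i);
            int end;
            if (c == '(') {
                end = skipLiteralString(stream, i);
            } else if (c == '%') {          // comment runs to end of line
                end = i;
                while (end < stream.length()
                        && stream.charAt(end) != '\n' && stream.charAt(end) != '\r') {
                    end++;
                }
            } else if (c == '/') {          // name object, e.g. /F1 (never an operator)
                end = skipToken(stream, i + 1);
            } else if (endsToken(c)) {      // single whitespace or delimiter character
                end = i + 1;
            } else {                        // bare token: number or operator
                end = skipToken(stream, i);
                String word = stream.substring(i, end);
                if (!inText && word.equals("BT")) {
                    inText = true;
                    i = end;
                    continue;
                }
                if (inText && word.equals("ET")) {
                    inText = false;
                    i = end;
                    continue;
                }
            }
            if (!inText) {
                out.append(stream, i, end);
            }
            i = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: an image draw, then an invisible (3 Tr) text object.
        String sample = "q 100 0 0 100 0 0 cm /Im0 Do Q\n"
                + "BT /F1 12 Tf 3 Tr (hello ET world) Tj ET\n"
                + "q Q";
        System.out.println(stripTextObjects(sample));
    }
}
```

The sample uses text rendering mode 3 (`3 Tr`, invisible text), which is how OCR layers are commonly drawn over a scanned image; the image-drawing operators outside the text object are left untouched.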
1 Answer
The command given by @dirkt didn't work for me. It did in fact reduce the file size from 560 MB to 300-something MB, but I didn't compare the files with diffpdf, so I don't know what changed between them.
What worked for me is Apache PDFBox. The PDFBox developers provide a nice little example program that removes text (along with examples for other tasks). Since I don't have any experience with Java (or with anything except bash, for that matter), what I did was install `openjdk-11-jdk-headless` and `libpdfbox-java`.
Steps:
- Copy `pdfbox2.jar`, `fontbox2.jar` and `commons-logging.jar` (needed by some class in pdfbox2) to a folder.
- Extract the jar files, e.g. `jar xf pdfbox2.jar`.
- Get the PDFBox source for the same version as the one installed.
- Copy `RemoveAllText.java` to the folder `org/apache/pdfbox/examples/util`.
- Compile `RemoveAllText.java`: `javac org/apache/pdfbox/examples/util/RemoveAllText.java`.
- Now you can run it; this will show the usage: `java org.apache.pdfbox.examples.util.RemoveAllText`.
If someone comes across this answer and knows a better way to do this, please comment.
harshit