4

I have a PDF file containing corrupted OCR text. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and the file came with an automatically generated OCR layer. How can I remove the text layer to get a lighter file (and get rid of the unnecessary OCR)?

Seninha
  • 2
    It really depends on how the OCR was integrated into the PDF file. Manually: install `mupdf`, use `mutool clean -d -i -f input.pdf output.pdf` to decompress the page streams, load the result into a text editor, figure out the structure (read the PDF specification), remove pages (or write a script to remove them), then use `mutool clean -z` to compress again. Needs some practice. The file won't get much lighter, since the images take up most of the space, so it's probably too much effort to be worth doing. – dirkt Jun 12 '17 at 07:59
  • @dirkt, thanks for the comment. Indeed, the file didn't get much lighter: it shrank from 8 MB to 7.7 MB. I also tried extracting every image from the original file and merging the images again to drop the metadata and text layer, and the size reduction was the same. But at least my ereader stopped showing that annoying spaghettish OCRed text. – Seninha Jun 13 '17 at 14:56

1 Answer

1

The command given by @dirkt didn't work for me. It did reduce the file size from 560 MB to a little over 300 MB, but I didn't compare the files with diffpdf, so I don't know what changed between them.

What worked for me is Apache PDFBox. The PDFBox developers provide a nice little program among their examples, RemoveAllText, that removes text (there are examples for other tasks too). Since I have no experience with Java (or anything except bash, for that matter), what I did was install `openjdk-11-jdk-headless` and `libpdfbox-java`.

Steps:

  1. Copy `pdfbox2.jar`, `fontbox2.jar`, and `commons-logging.jar` (needed by some class in pdfbox2) to a folder.
  2. Extract the jar files in that folder, e.g. `jar xf pdfbox2.jar`.
  3. Get the PDFBox source for the same version as the one installed.
  4. Copy `RemoveAllText.java` from the source into the folder `org/apache/pdfbox/examples/util`.
  5. Compile it: `javac org/apache/pdfbox/examples/util/RemoveAllText.java`.
  6. Now you can run it. This prints the usage: `java org.apache.pdfbox.examples.util.RemoveAllText`.
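
For anyone who would rather paste than type, here is a rough sketch of the steps above as a shell function. The locations are assumptions, not something I have checked on every system: `JAR_DIR` assumes the packaged jars sit in `/usr/share/java`, `PDFBOX_SRC` is a hypothetical folder where you unpacked the PDFBox source, and the `examples/src/main/java` path inside that source tree is also an assumption. Adjust them to whatever your installation actually uses.

```shell
#!/bin/sh
# Sketch of steps 1-6. Every path here is an assumption; adjust to your system.
JAR_DIR="${JAR_DIR:-/usr/share/java}"         # assumed location of the packaged jars
PDFBOX_SRC="${PDFBOX_SRC:-$HOME/pdfbox-src}"  # hypothetical unpacked PDFBox source tree
EXAMPLE=org/apache/pdfbox/examples/util/RemoveAllText.java

# Runs in a subshell, so the cd and set -e do not leak into the calling shell.
build_remove_all_text() (
    set -e
    workdir="$1"
    mkdir -p "$workdir"
    cd "$workdir"

    # Step 1: copy the jars into the working folder.
    cp "$JAR_DIR/pdfbox2.jar" "$JAR_DIR/fontbox2.jar" "$JAR_DIR/commons-logging.jar" .

    # Step 2: extract each jar so the classes end up under ./org/...
    for j in pdfbox2.jar fontbox2.jar commons-logging.jar; do
        jar xf "$j"
    done

    # Steps 3-4: copy the example out of the source tree (assumed layout).
    mkdir -p "$(dirname "$EXAMPLE")"
    cp "$PDFBOX_SRC/examples/src/main/java/$EXAMPLE" "$EXAMPLE"

    # Step 5: compile; the extracted classes in the current folder are on the default classpath.
    javac "$EXAMPLE"

    # Step 6: run without arguments to see the usage message.
    java org.apache.pdfbox.examples.util.RemoveAllText
)
```

Call it with a scratch folder, e.g. `build_remove_all_text "$HOME/pdfbox-work"`. After that, from inside that folder, running the same `java` command with whatever arguments the usage message asks for (I would expect an input and an output PDF path) should do the actual text removal.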

If someone comes across this answer and knows a better way to do this, please comment.

harshit