I want to process PDFs with Scan Tailor (to remove the background of photographed PDF pages, or to split PDF pages). Since Scan Tailor only accepts images as input, not PDFs, I converted a low-quality PDF with a command like pdftoppm MY_PDF NAME_OF_IMAGE -png, and the resulting images looked worse than the original PDF.

But with the pdfimages tool from poppler-utils, the results are as good as the original.

This stays true if an output option other than -png is used (or if no option is given and the output is PPM).

I concluded that pdfimages was the better solution for my purpose, but then I noticed that for many other PDF files it does not work well at all: it produces fragments of images or text, where pdftoppm produces normal pages of text as expected.

Bad images extracted from a PDF with pdfimages, viewed in Dolphin: [screenshot]

Correct images extracted from the same PDF with pdftoppm, viewed in Dolphin: [screenshot]

Why these differences?

cipricus

1 Answer

The difference arises from the purpose of the tools. It becomes apparent once you realize that PDF is a flexible file format. It can contain text, vector graphics and raster images (this list is not exhaustive). You may think of it as "zip with layout information" (gross simplification).

  • pdftoppm will "render" or "rasterize" the entire PDF. All text and graphics will become one rasterized output image.
    Since the pixels of the embedded raster images rarely line up with the pixels of the output "canvas", interpolation happens and quality decreases. This can be counteracted by increasing the output resolution (option -r; the default is 150 DPI) significantly. Of course, this means the file size will grow, too.
  • pdfimages will extract the raster images from the PDF file. Text or vector graphics are disregarded.
    Since the raster images are extracted as they are, the original quality is preserved, but the information regarding the layout is lost.
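
The two approaches above can be sketched as shell helpers. This is only an illustration: it uses the -r and -png options of pdftoppm and the -png option of pdfimages from poppler-utils, while the function names, the DRY_RUN switch and the file names in the usage note are hypothetical. native_dpi assumes you already know the pixel width of the embedded image and the page width in PDF points (72 points per inch), and that the image fills the page without rotation; rendering at that resolution should reduce the resampling described above, though it may not remove it entirely.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names and the DRY_RUN
# switch are assumptions, not part of poppler-utils.

# native_dpi PIXELS POINTS
# Resolution (rounded to the nearest integer) at which an image PIXELS
# wide covers a page POINTS wide at one image pixel per output pixel.
# 72 PDF points correspond to 1 inch.
native_dpi() {
    echo $(( ($1 * 72 + $2 / 2) / $2 ))
}

# render_pages PDF PREFIX DPI
# Rasterize whole pages (text, vector graphics and images together)
# into PNG files whose names start with PREFIX.
render_pages() {
    # With DRY_RUN set to a non-empty value the command is only printed.
    ${DRY_RUN:+echo} pdftoppm -r "$3" -png "$1" "$2"
}

# extract_images PDF PREFIX
# Write the embedded raster images as PNG files whose names start with
# PREFIX; text, vector graphics and page layout are not included.
extract_images() {
    ${DRY_RUN:+echo} pdfimages -png "$1" "$2"
}
```

For example, render_pages scan.pdf page "$(native_dpi 2550 612)" would render a US Letter page (612 points wide) that holds a 2550-pixel-wide scan at 300 DPI. Running it with DRY_RUN=1 only prints the pdftoppm command instead of executing it.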

The output may look similar if your input PDF contains exactly one raster image and nothing else.
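
One way to check whether a PDF comes close to that case is to count the embedded images on each page. The sketch below relies on pdfimages -list and assumes its output consists of two header lines followed by one row per image, with the page number in the first column. The function names are hypothetical.

```shell
#!/bin/sh
# count_images_per_page
# Reads `pdfimages -list` output on stdin and prints "PAGE COUNT" lines.
# Assumes two header lines, then one row per embedded image whose first
# column is the page number.
count_images_per_page() {
    awk 'NR > 2 { n[$1]++ } END { for (p in n) print p, n[p] }' | sort -n
}

# list_images_per_page PDF
# Hypothetical wrapper around the counter; needs poppler-utils installed.
list_images_per_page() {
    pdfimages -list "$1" | count_images_per_page
}
```

A page with a count of 1 is a candidate for pdfimages, provided it carries no text or vector content. Pages with many images are probably better rendered with pdftoppm, and pages that do not appear in the output have no embedded images at all.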

In your examples, the photocopier's scanning function tried to identify blocks of text to store them with high quality. The rest of the document (e.g. white background) is stored in low quality to save storage space. As you found out, this may or may not work in one's favour.

Hermann
  • Upvoted, but your explanation at the end sounds wrong. There is no need to store the white background in low quality: practically all of the size of a monochrome picture of text comes from the text, and whitespace compresses very well. Cutting it into pieces probably increases the size rather than decreasing it. – Nobody Oct 23 '22 at 17:13
  • Canon scanners refer to this mode as "PDF (Compact)". Xerox machines apply similar techniques. To be fair, I never verified the claim. I blindly assume they know what they are doing. – Hermann Oct 23 '22 at 20:14