Extracting Images using pdfimages: Getting 3 Images Per Page: .jp2, .png, .jb2e

Question

Using pdfimages -all on a .pdf file, each page of which is text, I'm getting 3 images for each page in the pdf:

Foo-001-000.jp2
Foo-001-002.png
Foo-001-002.jb2e

The first file is mostly blank, but contains some ghostly background plus an occasional piece of text. The second file is black and white and appears to be some kind of mask, perhaps identifying where the text in the third file is located (?). The third file I cannot open in Ubuntu's image viewer or in GIMP.

If I use -png I similarly get three images, but all are .png files. Most (almost all) of the pdf's text is in the third image.

pdfimages -list looks like this:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     829  1254  rgb     3   8  jpx    yes     3659  0   150   150 76.2K 2.5%
   1     1 image     829  1254  rgb     3   8  image  yes     3663  0   150   150 5250B 0.2%
   1     2 mask     1658  2508  -       1   1  image  yes     3663  0   300   300 5250B 1.0%
   2     3 image     934  1254  rgb     3   8  jpx    yes       11  0   150   150 85.6K 2.5%
   2     4 image     934  1254  rgb     3   8  image  yes       15  0   150   150 14.1K 0.4%
   2     5 mask     1868  2508  -       1   1  image  yes       15  0   300   300 14.1K 2.5%
   3     6 image     858  1243  rgb     3   8  jpx    yes       47  0   150   150 78.0K 2.5%
   3     7 image     858  1243  rgb     3   8  image  yes       51  0   150   150 7681B 0.2%

Could someone help me understand what I've got here, and how I might combine these three images to get a single image for each page? Or equivalently, just extract a single image per page. The key issue for me is to keep as much information as is available in these images. I want to avoid degradation in quality.

What's your goal? I think pdfimages is for extracting images already in the document. If your goal is to convert the pages of the pdf to images preserving the look of the pdf, I'd use something like pdftoppm, pdftocairo or mutool instead. — frabjous, Oct 22 '22 at 15:04
@frabjous - The goal is to get images that look like what I see in the pdf, while preserving as much digital information as possible. The images are scans, and not very high quality as it is. I don't want to degrade that. My use of `pdfimages` was an attempt to preserve as much as possible. (Subsidiary goal is to understand what I'm looking at with these 3 images per page. What? Why?) — Diagon, Oct 22 '22 at 15:17
For those of us who cannot see the original PDF, that's still a difficult description of your goal to understand. Are the images you're hoping to extract just parts of each page, so there can be more than one? Or are you trying to get output images that look exactly like the full pdf pages, without losing any quality? If the latter, I'd still use one of the other tools I mentioned; they let you set the output resolution and compression level. Just make sure you choose something no worse than what you already have and you shouldn't lose quality. — frabjous, Oct 22 '22 at 18:11
Without access to the PDF or a full description of how it was made, it's hard for us to say why you're seeing the current output you are with pdfimages. (Or at least it's hard for me.) — frabjous, Oct 22 '22 at 18:12
@frabjous - You're not seeing the images? I posted 3 examples of the mostly white/ mask/ main image sequence. The first and 3rd are meant to go on top of each other. The second is some sort of mask. — Diagon, Oct 22 '22 at 22:37
@frabjous - ok, I'll try one of the other tools. I gather with pdftoppm/ pdftocairo or mutool, I can set the output resolution, so I'll have a look there. I'd prefer png if I won't be losing data, so perhaps pdftocairo might be the right choice ... Haven't looked at mutool yet. — Diagon, Oct 22 '22 at 22:41
@frabjous - I added `pdfimages -list` output, in case that helps give some idea what these multiple output pages might be. This part is a matter of curiosity. — Diagon, Oct 22 '22 at 23:05
Some similarities to [this question](https://unix.stackexchange.com/questions/722061/); what you want to achieve is sometimes referred to as "flattening" a PDF and rasterizing it without interpolation. The answers in [this question](https://unix.stackexchange.com/questions/162922/) all seem to use interpolation. We need a program which determines the highest resolution that any of the embedded images has and uses that as the document's resolution, re-scales the lower-resolution images by a power of two, and turns all offsets into integers. Then rendering can happen accurately down to the pixel. — Hermann, Oct 23 '22 at 10:23
@Hermann - that was very helpful, particularly that first link, thanks. Do you know if that conversion, using `pdfimages -png` will lose me any image quality? Also, tangential question, but in my case it's just the 1st and 3rd image that need to be merged on top of each other. When I try with composite, they are not recognized as having the same size. Do you have any suggestions? — Diagon, Oct 24 '22 at 00:38
@Geremia / you see the first layer has some text in there, right? So I need to merge the 1st and 3rd. — Diagon, Jul 27 '23 at 02:10
@Diagon I'd just use ImageMagick to [`-negate`](https://imagemagick.org/script/command-line-options.php#negate) (invert) the 3rd, e.g.: `convert 3rd.jp2 -negate inverted.jp2` — Geremia, Jul 27 '23 at 04:31
@Geremia / great, but now I'm still left with the original issue. "When I try with composite, they are not recognized as having the same size." Any suggestions as to what to do with that? — Diagon, Jul 28 '23 at 10:13
@Diagon If you want all the layers superimposed, then [RawBits's answer is best](https://unix.stackexchange.com/a/742638/37153). — Geremia, Jul 28 '23 at 17:09
@Geremia / Thanks, ya. That's what I did, but you see what Hermann said above in this thread, right? "We need a program which ..." — Diagon, Jul 28 '23 at 21:21
@Diagon Can't `pdfimages -list` help you with that? It gives PPI and width×height. — Geremia, Jul 28 '23 at 21:24
@Geremia / indeed you're right that in this case it's just a factor of 2 in each coordinate. Hmmmm. You pointed out that I can `-negate` one. How might I scale the other up by a factor of 2 in each direction? Then I can use `composite` and be on my way ... (Thanks for your help!) — Diagon, Jul 31 '23 at 13:01
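
A minimal sketch for the scaling step asked about above, assuming ImageMagick is installed. The widths come from the `pdfimages -list` output in the question (829 vs. 1658 on page 1); `scale_percent`, `upscale_to_match` and the file names are hypothetical. `-sample` replicates pixels rather than interpolating, so an exact 2x enlargement adds no new pixel values.

```shell
# scale_percent SRC_WIDTH DST_WIDTH
# Print the integer percentage that brings SRC_WIDTH up to DST_WIDTH.
scale_percent() {
    echo $(( $2 * 100 / $1 ))
}

# upscale_to_match SRC_FILE SRC_WIDTH DST_WIDTH OUT_FILE
# Enlarge SRC_FILE by pixel replication (no interpolation).
# Hypothetical helper; assumes ImageMagick's `convert` is on PATH.
upscale_to_match() {
    pct=$(scale_percent "$2" "$3")
    convert "$1" -sample "${pct}%" "$4"
}

# Usage (file names are placeholders):
#   upscale_to_match Foo-001-000.png 829 1658 Foo-001-000-300ppi.png
```

After this the two layers have the same pixel dimensions, so `composite` should no longer complain about a size mismatch.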

score 1 · Accepted Answer · answered Apr 12 '23 at 02:08

I guess you expected each page to be a single image and are surprised that it's actually a composite of several layers. This is a widespread method for archiving magazines, as these are graphically more complex than a simple book with some images. It preserves the quality and gives a very small pdf file in the end, but one that is unusably slow to render.

Now for the solution. You don't actually want to extract anything from the pdf. You want to render it the same way your pdf reader does. I would suggest using Ghostscript. Something like this will work:

gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r600 -dGraphicsAlphaBits=4 -sOutputFile=./img/img-%03d.png "$pdffilename"

Adjust as needed.
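
One thing worth adjusting is `-r`. The `pdfimages -list` output in the question reports 150 ppi for the colour layers and 300 ppi for the mask, so rendering at the highest listed value (or an integer multiple of it, such as the 600 used above) should avoid downsampling any layer. A rough sketch for reading that value off automatically, assuming the column layout shown in the question, where x-ppi is the 13th whitespace-separated field; `max_ppi` is a hypothetical helper name:

```shell
# max_ppi: read `pdfimages -list` output on stdin and print the largest x-ppi.
# Assumes the layout shown in the question: x-ppi is the 13th field.
# Header and separator lines are skipped because that field is not numeric there.
max_ppi() {
    awk '$13 ~ /^[0-9]+$/ && $13 + 0 > max { max = $13 + 0 }
         END { print max + 0 }'
}

# Usage (the ./img directory is assumed to exist already):
#   res=$(pdfimages -list "$pdffilename" | max_ppi)
#   gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r"$res" \
#      -sOutputFile=./img/img-%03d.png "$pdffilename"
```

For the file in the question this would give 300.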
