
I used pdfimages and convert, as recommended by Anthon, to remove the OCRed text from a PDF file, but the file grew from 29MB to 373MB.

My first step is to split the PDF file into one pbm file per page:

mkdir tmp1
pdfimages ull.pdf tmp1/ull

The total size of the generated pbm files is 788M.

In my next step, I convert the generated pbm files and combine them into a single PDF file:

cd tmp1
convert ull*.pbm all.pdf

This fails, however, because it requires more than 1 GB of space on /tmp, and my /tmp doesn't have that much free space. So my second step is actually:

mkdir tmp2
for i in ull-*.pbm; do convert "$i" "tmp2/$i.pdf"; done
cd tmp2
pdftk ull-???.pbm.pdf ull-????.pbm.pdf cat output ../../all.pdf
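The two glob patterns in the pdftk call matter for page order. The following is a small sketch using empty placeholder files in a throwaway directory (the file names mirror the ones above but are created here only for illustration) that shows how a single glob would sort page 1000 ahead of page 998:

```shell
#!/bin/sh
# Throwaway directory with empty stand-in files; no real PDFs are involved.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch ull-998.pbm.pdf ull-999.pbm.pdf ull-1000.pbm.pdf

# A single glob is expanded in lexical order, so "1000" sorts before "998".
single=$(echo ull-*.pbm.pdf)

# Listing the three-digit names first and the four-digit names second
# keeps the pages in numeric order.
paired=$(echo ull-???.pbm.pdf ull-????.pbm.pdf)

echo "single glob: $single"
echo "two globs:   $paired"
```
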

The generated file all.pdf is 373MB, much larger than the original 29MB. I ran pdftk all.pdf output new.pdf compress, but that doesn't reduce the file size.

Since all I want is to remove OCRed text from the pdf file, how can I avoid the file size bloating?

Tim
  • Have you tried setting TMPDIR to somewhere with enough space? TMPDIR is the canonical Unix environment variable that should be used to specify a temporary directory for scratch space. – fpmurphy Dec 07 '14 at 17:46

1 Answer

If the original images are JPEG files, you could use the pdfimages option -j. From man pdfimages:

-j     Normally, all images are written as PBM (for monochrome  images)
       or  PPM  (for  non-monochrome  images) files.  With this option,
       images in DCT format are  saved  as  JPEG  files.   All  non-DCT
       images are saved in PBM/PPM format as usual.
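As a rough sketch of how that option could be applied to the file from the question: the helper name extract_images is hypothetical, and the guards around the call are additions for illustration. If the pages are stored as monochrome, non-DCT images, as the .pbm output in the question suggests, -j will not change what is written.

```shell
#!/bin/sh
# Hypothetical helper: extract images from a PDF, keeping DCT (JPEG)
# streams as .jpg files instead of converting them to PPM.
extract_images() {
    pdf=$1    # input PDF, e.g. ull.pdf
    root=$2   # output prefix, e.g. tmp1/ull
    command -v pdfimages >/dev/null 2>&1 || { echo "pdfimages not found" >&2; return 127; }
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 1; }
    mkdir -p "$(dirname "$root")" && pdfimages -j "$pdf" "$root"
}

# Usage with the names from the question (commented out so the sketch
# has no side effects when sourced):
# extract_images ull.pdf tmp1/ull
```
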

I am not sure how to control the way convert stores the images in the PDF file, but you can use -quality to alter the compression quality and -resize to reduce the image dimensions.

By calling convert in one of the following ways

TMPDIR=/home/tim/tmp  convert ...
MAGICK_TMPDIR=/home/tim/tmp convert ...

you can have convert use /home/tim/tmp as the temporary directory and work around the space problem (which probably has no influence on the resulting file size).
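If several convert calls are needed, the variables can also be exported once for the whole shell session. A minimal sketch, assuming the current working directory sits on a filesystem with enough free space; the directory name convert-scratch is made up for this example:

```shell
#!/bin/sh
# Hypothetical scratch directory next to the extracted images.
scratch="$PWD/convert-scratch"
mkdir -p "$scratch" || exit 1

# Report the free space on the filesystem that holds the scratch directory.
df -P "$scratch"

# Export both variables so every later convert call in this shell inherits them.
TMPDIR=$scratch
MAGICK_TMPDIR=$scratch
export TMPDIR MAGICK_TMPDIR

# The conversion itself is left commented out, since it needs ImageMagick
# and the extracted images:
# convert ull*.pbm all.pdf
```
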

Anthon