
I have a gzip archive with trailing data. If I unpack it using `gzip -d`, it tells me: "decompression OK, trailing garbage ignored". The same goes for `gzip -t`, which can therefore be used to detect that such data is present.

Now I would like to take a look at this garbage, but strangely enough I couldn't find any way to extract it. `gzip -l --verbose` reports the "compressed" size of the archive as the size of the whole file (i.e. including the trailing data), which is wrong and not helpful. `file` is of no help either, so what can I do?
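To make the situation reproducible, here is a minimal sketch that builds such an archive and classifies it by the exit status of `gzip -t`. It assumes GNU gzip, whose manual documents exit status 1 for errors and 2 for warnings; the helper names `make_demo_archive` and `classify_gzip` are made up for illustration.

```shell
#!/bin/bash
# Sketch only: make_demo_archive and classify_gzip are hypothetical helper names.

# Create a small gzip file and append bytes that are not part of the stream.
make_demo_archive() {
        printf 'hello\n' | gzip > "$1"
        printf 'TRAILING' >> "$1"
}

# Print "clean", "trailing" or "error" based on the exit status of gzip -t.
# Assumes GNU gzip: 0 = OK, 1 = error, 2 = warning. Trailing garbage is the
# warning of interest here, but other warnings would be reported the same way.
classify_gzip() {
        local status=0
        gzip -t "$1" 2>/dev/null || status=$?
        case "$status" in
                0) echo clean ;;
                2) echo trailing ;;
                *) echo error ;;
        esac
}
```

Calling `make_demo_archive demo.gz` followed by `classify_gzip demo.gz` shows how the archive in question is detected. It does not yet say where the trailing data starts.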

phk

2 Answers


I have now figured out how to get the trailing data.

I created a Perl script that writes the trailing data to a file or to stdout. It is heavily based on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=604617#10:

#!/usr/bin/perl
use strict;
use warnings;

use IO::Uncompress::Gunzip qw(:all);
use IO::File;

unshift(@ARGV, '-') unless -t STDIN;

my $input_file_name = shift;
my $output_file_name = shift;

if (! defined $input_file_name) {
  die <<END;
Usage:

  $0 ( GZIP_FILE | - ) [OUTPUT_FILE]

  ... | $0 [OUTPUT_FILE]

Extracts the trailing data of a gzip archive.
Outputs to stdout if no OUTPUT_FILE is given.
- as the input file causes it to read from stdin.

Examples:

  $0 archive.tgz trailing.bin

  cat archive.tgz | $0

END
}

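# Open the input ("-" means stdin) and decompress it, discarding the output.
# Whatever follows the gzip stream is stored in $trailing.
my $in = IO::File->new("<$input_file_name") or die "Couldn't open gzip file.\n";
my $trailing;
gunzip($in => "/dev/null", TrailingData => $trailing)
  or die "gunzip failed: $GunzipError\n";
undef $in;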

if (! defined $output_file_name) {
  print $trailing;
} else {
  open(my $fh, ">", $output_file_name) or die "Couldn't open output file.\n";
  print $fh $trailing;
  close $fh;
  print "Output file written.\n";
}
phk
  • +1 but IMO, printing to stdout as in the original (but without appending a newline) is better than writing to a hard-coded filename. You can redirect to a file, or pipe to `less` or `hd` or `hd | less` or whatever. – cas Jul 14 '16 at 06:13
  • @cas: Thank you for the input. Added a bit of parameter handling now. My first perl script BTW, I knew the time would come one day. – phk Jul 14 '16 at 10:44
  • nice improvement. i'd upvote it again if i could :) one more idea - a program like this doesn't really need an input file, it works just as well processing stdin. and a `while (<>)` loop in `perl` will read stdin and any file(s) listed in @ARGV....that makes it easy to write scripts that work equally well as a filter (i.e. read stdin, write to stdout) and with named file(s). and stdout, of course, can always be redirected to a file. most of my perl scripts are written as filters to take advantage of this. – cas Jul 14 '16 at 13:57
  • btw, you can do the same thing in sh scripts by putting something like `cat "$@" | ( ....your.main.script.here... )`. After any option/arg processing of course, so that all that's left in "$@" is either file names or nothing. – cas Jul 14 '16 at 13:59
  • @cas: Already possible, just use `-` as the input file name. Perl does the rest for you. – phk Jul 14 '16 at 14:34
  • True, but OTOH some programs understand `-` and others don't so I prefer to avoid it - a habit/practice that only works sometimes is too annoying to bother with. Giving `/dev/stdin` as filename is at least as annoying but more likely to work with any program. Anyway, a filter shouldn't need a `-` arg, it should just read from stdin by default and write to stdout by default....you don't need to give `-` as arg to a `sed` or `awk` script, for example. – cas Jul 14 '16 at 21:00
  • @cas You mean it should check per default whether the stdin is not a tty? I always wondered why some programs do this and some don't? Perhaps it's not reliable? Anyway, I agree with you that the "-" special file name isn't the best idea, it's another thing you need to keep in mind if you want to support all possible file names or make your scripts more secure. – phk Jul 14 '16 at 23:04
  • `push @ARGV,'-' if (!@ARGV);` before `my $input_file_name = shift;` is all that's needed here. i.e. a default arg of `-` (the help message could be printed if `$ARGV[0] eq '-h'` or `'--help'`). For a `while(<>)` loop you wouldn't even need to do that, but it's probably more trouble than it's worth to write it like that for `IO::Uncompress::Gunzip`. – cas Jul 15 '16 at 00:36
  • @cas I personally prefer it if the help text is the default when called directly, so how about the current solution where I check whether stdin is a tty? `cat archive.gz | ./script.pl` still works, but so does `./script.pl archive.gz`, oh and BTW also `cat archive.gz | ./script.pl output.bin`. – phk Jul 15 '16 at 21:26
  • It's your code and your answer, i'm just making suggestions. I hope it doesn't sound like i'm trying to give orders or say that my way is the only way it can be done. Everyone has their own coding style and preferences (if your code was actually wrong or misleading, though - i'd certainly point that out - but it's not, it's good). The -h/--help suggestion was just a way of keeping the help text you'd already written. – cas Jul 15 '16 at 23:01
  • @cas Yes I value your suggestions, that's why I partly implemented your suggestion with `"-"` combined with a tty check and wanted to know what you thought about it. Maybe it's unusual or there are unforeseen corner cases or what do I know. – phk Jul 15 '16 at 23:21
  • it's fine. and unshift instead of push makes sense for how you want to use it, still allows an output filename to be specified as the only arg. I'm personally averse to having files being overwritten without some explicit order from the user - redirection or a `-o` option or something. having a script automagically switch from first arg of two being input to first and only arg being output seems risky and accident-prone to me (tempting murphy). – cas Jul 16 '16 at 00:07
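
The filter idiom from the comments above (`cat "$@" | ( ... )`) could look roughly like this in shell. This is only a sketch: `main` and `run_filter` are made-up names, and `main` merely counts bytes as a stand-in for the real work.

```shell
#!/bin/bash
# Sketch of the 'cat "$@" | main' idiom: process the named files, or stdin
# if no file names are given.

# Hypothetical main routine: reports how many bytes arrived on stdin.
main() {
        wc -c | tr -d '[:space:]'
        echo
}

# With no arguments cat reads stdin; otherwise it concatenates the named files.
run_filter() {
        cat -- "$@" | main
}
```

Both `run_filter archive.gz` and `cat archive.gz | run_filter` then go through the same code path, so no special `-` argument is needed.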

I created a small script that uses a binary search to find the size of the gzip stream, i.e. the offset at which the trailing data starts:

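#!/bin/bash
# Binary search for the length of the gzip stream at the start of a file.

set -e
gzip=${1:?Specify a gzip file}
size=$(stat -c%s "$gzip")
min=0
max=$size
while true; do
        # gzip -t exits with 0 if the input is OK, 1 on an error
        # (e.g. truncated input) and 2 on a warning (e.g. trailing garbage).
        if head -c "$size" "$gzip" | gzip -v -t - &>/dev/null; then
                echo "$size"
                break
        else
                case "$?" in
                        1) min=$size ;;
                        2) max=$size ;;
                esac
                # Give up instead of looping forever if the range cannot shrink.
                if ((max - min <= 1)); then
                        echo "Could not determine the gzip size" >&2
                        exit 1
                fi
                size=$(((max-min)/2 + min))
        fi
done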

You can then use it to split the file into the gzip part and the trailing part:

file=gzip_with_trailing.gz
gzip_size=$(./find_gzip_size "$file")
head -c "$gzip_size" "$file" > data.gz
tail -c +$((1+gzip_size)) "$file" > trailing.raw

head and tail are not the fastest solution, but they work.
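
As a sanity check after the split, the two parts can be concatenated again and compared with the original file. The following is a minimal sketch; `verify_split` is a hypothetical helper name, not part of the scripts above.

```shell
#!/bin/bash
# Sketch only: verify_split is a hypothetical helper name.

# Succeeds if the gzip part followed by the trailing part is byte-identical
# to the original file.
verify_split() {
        local original=$1 gzip_part=$2 trailing_part=$3
        cat -- "$gzip_part" "$trailing_part" | cmp -s - "$original"
}
```

With the file names used above, this would be called as `verify_split "$file" data.gz trailing.raw && echo "split is lossless"`. Running `gzip -t data.gz` afterwards should then report no trailing garbage.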