Bug 789

Summary: gunzip/bunzip2/zcat output corrupted
Product: Busybox
Reporter: Michael McTernan <mmcternan>
Component: Other
Assignee: unassigned
Status: RESOLVED INVALID
Severity: major
CC: busybox-cvs
Priority: P5
Version: 1.15.x
Target Milestone: ---
Hardware: Other
OS: Other
Host: ARM Cortex A8
Target: arm-none-linux-gnueabi
Build: 1.15.2

Description Michael McTernan 2009-12-10 15:44:03 UTC
Hi,

I'm seeing that sometimes the data returned from gunzip or zcat is corrupted.  Here is my test script:

  #!/bin/sh

  dd if=$1 of=data count=$2
  cat data | md5sum
  rm -f data.gz
  ls -l data
  gzip data
  ls -lh data.gz

  zcat data.gz | md5sum
  zcat data.gz | md5sum
  zcat data.gz | md5sum
  zcat data.gz | md5sum
  ...
  zcat data.gz | md5sum

The expected result is that the output shows the same MD5 sum repeated:

  # ./gziptest /dev/urandom  1024
  1024+0 records in
  1024+0 records out
  6f662731c4e80f5d825131fa0cc894ff  -
  -rw-r--r--    1 0        0          524288 data
  -rw-r--r--    1 0        0          512.3K data.gz
  6f662731c4e80f5d825131fa0cc894ff  -
  6f662731c4e80f5d825131fa0cc894ff  -
  6f662731c4e80f5d825131fa0cc894ff  -
  6f662731c4e80f5d825131fa0cc894ff  -
  ...
  6f662731c4e80f5d825131fa0cc894ff  -

In this case all output hashes match, so all is well.  Taking input from /dev/zero also produces correct results.

However, when I try one of my own files which contains sparse data, the output *sometimes* does not match:

  # ./gziptest mybits   1024
  1024+0 records in
  1024+0 records out
  b542db59f26251cf7899484bc384dd3d  -
  -rw-r--r--    1 0        0          524288 data
  -rw-r--r--    1 0        0           43.7K data.gz
  b542db59f26251cf7899484bc384dd3d  -
  b542db59f26251cf7899484bc384dd3d  -
  b542db59f26251cf7899484bc384dd3d  -
  b542db59f26251cf7899484bc384dd3d  -
  b542db59f26251cf7899484bc384dd3d  -
  ....
  b542db59f26251cf7899484bc384dd3d  -
  58feaeb3349aeb1b97cef52ed454bc2c  -
  b542db59f26251cf7899484bc384dd3d  -

The failure rate is somewhere between 1 and 2 bad hashes in 20 attempts.  If I just md5sum the mybits file without going through any compression, I get the same hash repeatedly.  So I think this is a problem with gzip/zcat.  Inspecting a badly expanded file shows that only 1 or 2 bits somewhere in the file are wrong, although their location sometimes changes.
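[Editorial note: the byte-level inspection described above can be done with cmp -l; a minimal stand-alone sketch, where good.out and bad.out are made-up stand-ins for a good and a bad decompression result, not files from the report:]

```shell
# Sketch: cmp -l lists every differing byte between two files as
# "offset octal-value-1 octal-value-2" (offset is 1-based).
# good.out/bad.out are stand-ins for two decompression results.
printf 'abcXdef' > good.out
printf 'abcYdef' > bad.out
cmp -l good.out bad.out || true
```

On the real files this pinpoints the corrupted offsets directly, which is handy when only a byte or two differs.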

I wondered if the compiler was maybe doing something bad, so I set the following:

  #
  # Debugging Options
  #
  CONFIG_DEBUG=y
  CONFIG_DEBUG_PESSIMIZE=y
  # CONFIG_WERROR is not set
  CONFIG_NO_DEBUG_LIB=y
  # CONFIG_DMALLOC is not set
  # CONFIG_EFENCE is not set
  # CONFIG_INCLUDE_SUSv2 is not set

I verified that this disabled optimisation using "make V=1", but the test still fails.  This was with gcc version 4.2.3 (Sourcery G++ Lite 2008q1-126), but it also shows the same problem with gcc version 4.3.2 (Sourcery G++ Lite 2008q3-41).

I also tried the same tests on busybox 1.13.3 and got the same problem.

I decided that maybe I could live with bzip2 instead, so I modified the test script to use it.  Sadly this shows the same failures as well.

Finally I wondered if the corruption is consistent.  Given a file with a known sum, the following test can be run:

  while [ true ] ; do
    gunzip -c data.gz | md5sum | grep -v b542db59f26251cf7899484bc384dd3d
  done

The output shows the following:

  58feaeb3349aeb1b97cef52ed454bc2c  -
  58feaeb3349aeb1b97cef52ed454bc2c  -
  58feaeb3349aeb1b97cef52ed454bc2c  -
  58feaeb3349aeb1b97cef52ed454bc2c  -
  58feaeb3349aeb1b97cef52ed454bc2c  -
  ...
  58feaeb3349aeb1b97cef52ed454bc2c  -

i.e. the corruption is always the same.

I think this could be a hard bug, so let me know if you need more info.

Regards,

Mike
Comment 1 Denys Vlasenko 2009-12-10 22:04:43 UTC
(In reply to comment #0)
> However, when I try one of my own files which contains sparse data, the output
> *sometimes* does not match:

What is "sparse data"? File with holes? Can you attach data.gz which is producing errors when decompressed?

> Inspecting
> a badly expanded file shows that only 1 or 2 bits somewhere in the file are
> wrong, although their location sometimes changes.
...
> Finally I wondered if the corruption is typical.  Given a file with a known
> sum, the following test can be ran:
> 
>   while [ true ] ; do
>     gunzip -c data.gz | md5sum | grep -v b542db59f26251cf7899484bc384dd3d
>   done
> 
> The output shows the following:
> 
>   58feaeb3349aeb1b97cef52ed454bc2c  -
>   58feaeb3349aeb1b97cef52ed454bc2c  -
>   58feaeb3349aeb1b97cef52ed454bc2c  -
>   58feaeb3349aeb1b97cef52ed454bc2c  -
>   58feaeb3349aeb1b97cef52ed454bc2c  -
>   ...
>   58feaeb3349aeb1b97cef52ed454bc2c  -
> 
> i.e. the corruption is always the same.

You contradict yourself. Is corruption always the same, or not?
You effectively say "both" - ??!
> 
> I think this could be a hard bug, so let me know if you need more info.
> 
> Regards,
> 
> Mike
> 

Comment 2 Michael McTernan 2009-12-11 09:51:28 UTC
(In reply to comment #1)
> (In reply to comment #0)
> What is "sparse data"? File with holes? Can you attach data.gz which is
> producing errors when decompressed?

Apologies.  I tried to upload in the original report, but 400k is too large an attachment for this Bugzilla.  I'm trying to find somewhere else to put the data...  

The file itself is regular, i.e. no holes.  The data has a lot of zeros in it with occasional runs of data sequences.  It's 4MB uncompressed, and it compresses to 400kB - nice!

> > The output shows the following:
> > 
> >   58feaeb3349aeb1b97cef52ed454bc2c  -
> >   58feaeb3349aeb1b97cef52ed454bc2c  -
> >   58feaeb3349aeb1b97cef52ed454bc2c  -
> >   58feaeb3349aeb1b97cef52ed454bc2c  -
> >   58feaeb3349aeb1b97cef52ed454bc2c  -
> >   ...
> >   58feaeb3349aeb1b97cef52ed454bc2c  -
> > 
> > i.e. the corruption is always the same.
> 
> You contradict yourself. Is corruption always the same, or not?
> You effectively say "both" - ??!

The wrong hashes tend to take one of only a small set of values.  In the output above the erroneous decompressions have always given the wrong hash of 58feaeb3349aeb1b97cef52ed454bc2c, but rerunning the test at a different time may give a different wrong hash.

So the corruption is not always the same, but is far from random.  Examining the good and bad decompression results also showed them to differ in only one byte of the whole file.  

As an aside, I tried building busybox on i686 Linux and running gunzip on the troublesome file under Valgrind with the memchecker.  It was completely clean of warnings/errors in that scenario, and it also correctly compressed/decompressed the data repeatedly.

Comment 3 Denys Vlasenko 2009-12-11 10:50:10 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > (In reply to comment #0)
> > What is "sparse data"? File with holes? Can you attach data.gz which is
> > producing errors when decompressed?
> 
> Apologies.  I tried to upload in the original report, but 400k is too large an
> attachment for this Bugzilla.  I'm trying to find somewhere else to put the
> data...

Well, your 1st comment said "43.7K data.gz", can you attach that file?
If it is still too big, just email it to me.

> So the corruption is not always the same, but is far from random.  Examining
> the good and bad decompression results also showed them to differ in only one
> byte of the whole file.  
> 
> Aside, I tried building busybox on i686 Linux and running gunzip on the
> troublesome file under Valgrind with the memchecker.  It was completely clean
> or warnings/errors in that scenario, and also correctly compressed/decompressed
> the data repeatedly.

What is the CPU and architecture where it fails? Is it big endian?
If it is not reproducible on x86, we may end up in a situation where only you can debug it.

In case it is just an uninitialized variable being used somewhere, one quick & dirty method to test this theory is to go to decompress_unzip.c, add "= 0" to every uninitialized local variable, and memset to 0 every uninitialized local array, like this:

static int huft_build(const unsigned *b, const unsigned n,
                      const unsigned s, const unsigned short *d,
                      const unsigned char *e, huft_t **t, unsigned *m)
{
        unsigned a = 0;
        unsigned c[BMAX + 1];
        unsigned eob_len = 0;
        unsigned f = 0;
        int g = 0;
        int htl = 0;
        ...

        memset(c, 0, sizeof(c));
        ...


If this makes bug go away, we can hunt it down further.
Comment 4 Michael McTernan 2009-12-11 11:41:36 UTC
(In reply to comment #3)
> 
> What is the CPU and architecture where it fails? Is it big endian?

ARM Cortex A8, so armv7-a.  It's little endian.

> If it is not reproducible on x86, we may end up in situation when only you can
> debug it.

I did wonder if this would be the case.  

> In case it is just a case where we end up using an uninitialized variable
> somewhere, one quick & dirty method to test this theory is to go to
> decompress_unzip.c and add "= 0" / to every uninitialized local variable and
> memset to 0 every local uninitialized array, like this:

Yeah - this is why I tried Valgrind on x86 - it's good at reporting errors on use of uninitialised data.  

I'll try gcc with -Wuninitialized and -Winit-self first, then try poking around the code as you say.  It might be a while before I come back...
Comment 5 Michael McTernan 2009-12-14 11:56:16 UTC
I've hacked on this for a while now, and found that it looks to be nothing specific to gunzip/bunzip2/zcat or BusyBox.  The following small program, after enough runs on my target, shows different values being produced:

#include <stdio.h>

int main(int argc, char *argv[])
{
  FILE *f = fopen(argv[1], "rb");
  int   c;

  int sum1 = 0, sum2 = 0;

  if (f == NULL)
    return 1;

  while (1)
  {
      c = fgetc(f);
      if (c == EOF)
        break;
      printf("%c", c);

      sum1 += c;   /* additive checksum */
      sum2 ^= c;   /* XOR checksum */
  }

  fclose(f);

  fprintf(stderr, "%x %x\n", sum1, sum2);

  return 0;
}


Something is wrong somewhere, but this shows it's not a BusyBox thing, so I'm resolving the bug as INVALID.

Apologies for any false alarm, and thanks for the suggestions on this issue.