Bug 4550 - Segfault in Busybox while installing Ubuntu 11.10
Summary: Segfault in Busybox while installing Ubuntu 11.10
Status: NEW
Alias: None
Product: Busybox
Classification: Unclassified
Component: Other
Version: 1.19.x
Hardware: PC Linux
Importance: P5 normal
Target Milestone: ---
Assignee: unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-11-27 21:42 UTC by Franz A.
Modified: 2012-03-18 23:35 UTC
CC List: 1 user

See Also:
Host:
Target:
Build:


Attachments
Dump of assembler code from 0x8073f00 to 0x8073fff (2.69 KB, text/plain)
2011-11-29 20:46 UTC, Franz A.
The "init" script, that starts busy box. Original location: /init in /install/netboot/ubuntu-installer/i386/initrd.gz of ubuntu-11.10-alternate-i386.iso (456 bytes, text/plain)
2011-12-08 19:04 UTC, Franz A.
"/etc/inittab" from the same initrd.gz (492 bytes, text/plain)
2011-12-08 19:07 UTC, Franz A.
The complete console output of the Ubuntu installation / crash with the new busybox 1.19.3 (compressed with bzip2). (26.99 KB, application/octet-stream)
2011-12-18 23:09 UTC, Franz A.
".config" for building my busybox_unstripped (23.57 KB, text/plain)
2011-12-18 23:11 UTC, Franz A.
Debugging to be added to init.c (1.19 KB, patch)
2011-12-20 01:31 UTC, Denys Vlasenko
The complete console output of the Ubuntu installation / crash with the new busybox 1.19.3, with debug info from init.c, static (compressed with bzip2). (21.85 KB, application/octet-stream)
2012-01-24 20:17 UTC, Franz A.
The same console output, but now with shared_libs and electric fence. (8.03 KB, application/octet-stream)
2012-01-24 20:20 UTC, Franz A.
The same console output, but now with shared_libs and electric fence (and all libs replaced). (8.37 KB, application/octet-stream)
2012-01-29 21:05 UTC, Franz A.
Debugging patch (2.07 KB, patch)
2012-01-30 11:10 UTC, Denys Vlasenko
Complete console output of the Ubuntu installation with busybox 1.19.3 and the debug version of init.c (bzip2) (22.89 KB, application/octet-stream)
2012-02-10 22:32 UTC, Franz A.
init.c with debug patch (39.98 KB, text/x-csrc)
2012-02-10 22:35 UTC, Franz A.
Console output of the installation with the free-and-nullify test version of busybox (bzip2) (25.95 KB, application/octet-stream)
2012-02-14 19:43 UTC, Franz A.
Console output of the installation with the SEGV in ash.c, still with free-and-nullify (bzip2) (21.82 KB, application/octet-stream)
2012-03-18 23:35 UTC, Franz A.

Description Franz A. 2011-11-27 21:42:45 UTC
First of all, let me apologize for posting this problem a second time. The original report is at https://bugs.launchpad.net/ubuntu/+source/busybox/+bug/881131 (I did not know that Busybox has a separate place to report problems.)

Ubuntu 11.10 Oneiric Ocelot freezes during installation on
CPU: Athlon XP (0.13)
Chipset: VIA KT400(A)/600.
while working on the apparmor package. This is repeatable. 

I verified that memtest reports no errors.

The last line of the syslog is: "in-target: Setting up apparmor (2.7.0~beta1+bzr1774-1ubuntu2) ..."

The console output after that is:
[ 574.367268] busybox[1]: segfault at 95f7504 ip 08073fc9 sp bf9ec98c error 4 in busybox[8048000+4a000]
[ 574.478310] Kernel panic - not syncing: Attempted to kill init!
[ 574.549267] Pid: 1, comm: busybox Not tainted 3.0.0-12-generic #20-Ubuntu
[ 574.630520] Call Trace:
[ 574.659831] [<c1518c98>] ? printk+0x2d/0x2f
[ 574.710876] [<c1518b76>] panic+0x5c/0x151
[ 574.759850] [<c104b244>] forget_original_parent+0x1e4/0x1f0
[ 574.827553] [<c152c6fd>] ? _raw_spin_lock_irqsave+0x2d/0x40
[ 574.895233] [<c104b263>] exit_notify+0x13/0x140
[ 574.950452] [<c104ba6d>] do_exit+0x1ad/0x3a0
[ 575.002545] [<c104bdb8>] do_group_exit+0x38/0xa0
[ 575.058807] [<c105a6b5>] get_signal_to_deliver+0x215/0x3c0
[ 575.125497] [<c10e5324>] ? free_pages+0x34/0x40
[ 575.180773] [<c152fa00>] ? vmalloc_fault+0xee/0xee
[ 575.239104] [<c1002801>] do_signal+0x61/0x110
[ 575.292244] [<c152a4d1>] ? schedule+0x2f1/0x680
[ 575.347458] [<c152fe9b>] ? do_page_fault+0x49b/0x4a0
[ 575.407868] [<c1049ad0>] ? wait_task_continued+0x140/0x140
[ 575.474522] [<c1002ad5>] do_notify_resume+0x75/0x90
[ 575.533898] [<c152c9b0>] work_notifysig+0x13/0x1b
Comment 1 Denys Vlasenko 2011-11-28 03:13:00 UTC
Which version of busybox is it?

Apparently, busybox is running as the process with PID 1 here. It segfaults, and Linux does not like it when PID 1 exits. But as *which applet* is busybox being used here? init? linuxrc? runsvdir?

This line:

"segfault at 95f7504 ip 08073fc9 sp bf9ec98c error 4 in busybox[8048000+4a000]"

can be used to find exactly where the segfault happens. Load /bin/busybox into gdb and run

disas 0x08073f00,0x08073fff

If you have busybox_unstripped from the build, running "disas 0x08073fc9" on it is likely to be much more informative...
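For reference, the fields of that line can be pulled apart mechanically. The line is a kernel log message (it carries a kernel timestamp in the console output), and on x86 the "error" value is the page-fault error code. The sketch below is only an illustration: the struct and function names are made up for this example and are not part of busybox or the kernel.

```c
#include <stdio.h>

/* Fields of one kernel "segfault at ..." line (x86). Hypothetical helper type. */
struct segv_report {
    unsigned long fault_addr; /* address whose access faulted */
    unsigned long ip;         /* instruction pointer at the fault */
    unsigned long sp;         /* stack pointer at the fault */
    unsigned long error;      /* x86 page-fault error code */
    unsigned long base;       /* start of the mapping that contains ip */
    unsigned long size;       /* size of that mapping */
};

/* Parse the part of the line that starts at "segfault at".
 * Returns 0 on success, -1 if any field is missing (the trailing
 * "in name[base+size]" part is required here). */
int parse_segv_line(const char *line, struct segv_report *r)
{
    char name[64];
    int n = sscanf(line,
                   "segfault at %lx ip %lx sp %lx error %lx in %63[^[][%lx+%lx]",
                   &r->fault_addr, &r->ip, &r->sp, &r->error,
                   name, &r->base, &r->size);
    return n == 7 ? 0 : -1;
}

/* Offset of the faulting instruction inside its mapping, or -1. */
long segv_ip_offset(const char *line)
{
    struct segv_report r;
    if (parse_segv_line(line, &r) != 0 || r.ip < r.base || r.ip - r.base >= r.size)
        return -1;
    return (long)(r.ip - r.base);
}

/* Low three bits of the x86 page-fault error code. */
int pf_page_present(unsigned long err) { return (err & 1) != 0; } /* 0: page not mapped */
int pf_is_write(unsigned long err)     { return (err & 2) != 0; } /* 0: read access */
int pf_from_user(unsigned long err)    { return (err & 4) != 0; } /* 0: kernel mode */
```

For the line quoted above, "error 4" decodes to a user-mode read of a page that is not mapped, and the ip lies 0x2bfc9 bytes into the busybox mapping.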
Comment 2 Franz A. 2011-11-29 20:41:47 UTC
a) The busybox version is:
BusyBox v1.18.4 (Ubuntu 1:1.18.4-2ubuntu2) multi-call binary.

b) I am not sure how to find the answer to your applet question.
In the initrd.gz I found a shell script named init that execs '/bin/busybox init'.
Is that what you wanted to know?
Comment 3 Franz A. 2011-11-29 20:46:29 UTC
Created attachment 3836 [details]
Dump of assembler code from 0x8073f00 to 0x8073fff

Unfortunately I have no unstripped version of busybox. Sorry.
Comment 4 Franz A. 2011-11-29 21:34:32 UTC
I cannot find an unstripped version of busybox in http://at.archive.ubuntu.com/ubuntu/pool/main/b/busybox/ either.

What else can I do to help find this problem?

Idea: Build busybox from source, do not strip it, make a new initrd.gz with this busybox, boot the install CD with this initrd, and see where it crashes. Or would you recommend an easier way?
Comment 5 Denys Vlasenko 2011-12-05 02:12:04 UTC
(In reply to comment #2)
> a) The busybox version is:
> BusyBox v1.18.4 (Ubuntu 1:1.18.4-2ubuntu2) multi-call binary.
> 
> b) I am not sure how to find the answer to your applet question.
> In the initrd.gz I found a Shell script named init, that execs '/bin/busybox
> init'.

Looks like it uses init applet. Can you show the entire script verbatim?

Can you post the /etc/inittab file?
Comment 6 Denys Vlasenko 2011-12-05 02:46:28 UTC
(In reply to comment #3)
> Created attachment 3836 [details]
> Dump of assembler code from 0x8073f00 to 0x8073fff
> 
> Unfortunately I have no unstripped version of busybox. Sorry.

The function where we SEGV is this one:

   0x08073fbb:  xor    %edx,%edx
   0x08073fbd:  test   %eax,%eax
   0x08073fbf:  jle    0x8073fdd
   0x08073fc1:  mov    0x80936b4,%edx
   0x08073fc7:  jmp    0x8073fd9
   0x08073fc9:  cmp    %eax,0x4(%edx)   <==== SEGVs here
   0x08073fcc:  jne    0x8073fd7
   0x08073fce:  movl   $0x0,0x4(%edx)
   0x08073fd5:  jmp    0x8073fdd
   0x08073fd7:  mov    (%edx),%edx
   0x08073fd9:  test   %edx,%edx
   0x08073fdb:  jne    0x8073fc9
   0x08073fdd:  mov    %edx,%eax
   0x08073fdf:  ret

I think it's the mark_terminated() function, since I see very similar code in my local build of busybox:

00000000 <mark_terminated>:
   0:   53                      push   %ebx
   1:   83 ec 08                sub    $0x8,%esp
   4:   31 db                   xor    %ebx,%ebx
   6:   85 c0                   test   %eax,%eax
   8:   7e 30                   jle    3a <mark_terminated+0x3a>
   a:   8b 1d 00 00 00 00       mov    0x0,%ebx
                        c: R_386_32     .bss.init_action_list
  10:   eb 10                   jmp    22 <mark_terminated+0x22>
  12:   39 43 04                cmp    %eax,0x4(%ebx)
  15:   75 09                   jne    20 <mark_terminated+0x20>
  17:   c7 43 04 00 00 00 00    movl   $0x0,0x4(%ebx)
  1e:   eb 1a                   jmp    3a <mark_terminated+0x3a>
  20:   8b 1b                   mov    (%ebx),%ebx
  22:   85 db                   test   %ebx,%ebx
  24:   75 ec                   jne    12 <mark_terminated+0x12>
  26:   51                      push   %ecx
  27:   51                      push   %ecx
  28:   6a 00                   push   $0x0
  2a:   6a 00                   push   $0x0
  2c:   31 c9                   xor    %ecx,%ecx
  2e:   ba 08 00 00 00          mov    $0x8,%edx
  33:   e8 fc ff ff ff          call   34 <mark_terminated+0x34>
                        34: R_386_PC32  update_utmp
  38:   58                      pop    %eax
  39:   5a                      pop    %edx
  3a:   89 d8                   mov    %ebx,%eax
  3c:   59                      pop    %ecx
  3d:   5b                      pop    %ebx
  3e:   5b                      pop    %ebx
  3f:   c3                      ret

C code is:

static struct init_action *mark_terminated(pid_t pid)
{
        struct init_action *a;

        if (pid > 0) {
                for (a = init_action_list; a; a = a->next) {
                        if (a->pid == pid) {
                                a->pid = 0;
                                return a;
                        }
                }
                update_utmp(pid, DEAD_PROCESS, /*tty_name:*/ NULL,
                                /*username:*/ NULL,
                                /*hostname:*/ NULL);
        }
        return NULL;
}

Hmm. Did init_action_list get corrupted, so that a->next contains garbage? How can this happen? We only manipulate the list in new_init_action(), but I don't see any obvious bugs there which could mangle the ->next value...
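To make the link between the faulting instruction and the C code explicit: "cmp %eax,0x4(%edx)" is the a->pid comparison, so the node pointer is the fault address minus the offset of pid. The sketch below uses a cut-down stand-in for struct init_action (only the two leading fields, as implied by the disassembly; the real struct has more members) and omits the update_utmp() call. node_from_fault() and demo_mark_terminated() are names invented for this illustration.

```c
#include <stddef.h>
#include <sys/types.h>

/* Simplified stand-in: only the two leading fields matter here. */
struct init_action {
    struct init_action *next; /* offset 0: loaded by "mov (%edx),%edx" */
    pid_t pid;                /* offset 4 on i386: read by "cmp %eax,0x4(%edx)" */
};

static struct init_action *init_action_list;

/* Same list walk as in the C code quoted above, without update_utmp(). */
static struct init_action *mark_terminated(pid_t pid)
{
    struct init_action *a;

    if (pid > 0) {
        for (a = init_action_list; a; a = a->next) {
            if (a->pid == pid) { /* faults here if 'a' is a garbage pointer */
                a->pid = 0;
                return a;
            }
        }
    }
    return NULL;
}

/* Node pointer implied by a fault on the a->pid access. */
unsigned long node_from_fault(unsigned long fault_addr)
{
    return fault_addr - offsetof(struct init_action, pid);
}

/* Self-check on a healthy two-node list: returns 1 if pid 42 is found
 * and cleared while the other node is left alone. */
int demo_mark_terminated(void)
{
    static struct init_action second = { NULL, 42 };
    static struct init_action first  = { &second, 7 };
    struct init_action *hit;

    first.pid = 7;
    second.pid = 42;
    init_action_list = &first;
    hit = mark_terminated(42);
    return hit == &second && second.pid == 0 && first.pid == 7;
}
```

On i386 this gives a node pointer of 0x95f7500 for the reported fault address 0x95f7504, i.e. the list walk followed a ->next value that pointed at unmapped memory.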
Comment 7 Franz A. 2011-12-08 19:04:28 UTC
Created attachment 3866 [details]
The "init" script, that starts busy box. Original location: /init in /install/netboot/ubuntu-installer/i386/initrd.gz of ubuntu-11.10-alternate-i386.iso
Comment 8 Franz A. 2011-12-08 19:07:37 UTC
Created attachment 3872 [details]
"/etc/inittab" from the same initrd.gz
Comment 9 Franz A. 2011-12-18 23:03:57 UTC
I downloaded the busybox-1.19.3 source, built it (static), then put busybox_unstripped into the initrd as busybox.
I reproduced the crash and got "busybox[1]: segfault at 1a08 ip 00001a08 sp bfe033b4 error 4"
and then again
Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: busybox Not tainted 3.0.0-12-generic #20-Ubuntu

But:
gdb busybox-1.19.3/busybox_unstripped
(gdb) disas 0x00001a08
No function contains specified address.
(gdb) disas 0x00001a00,0x00002000
Dump of assembler code from 0x1a00 to 0x2000:
   0x00001a00:	Cannot access memory at address 0x1a00

Please advise.
Comment 10 Franz A. 2011-12-18 23:09:23 UTC
Created attachment 3902 [details]
The complete console output of the Ubuntu installation / crash with the new busybox 1.19.3  (compressed with bzip2).
Comment 11 Franz A. 2011-12-18 23:11:58 UTC
Created attachment 3908 [details]
".config" for building my busybox_unstripped
Comment 12 Denys Vlasenko 2011-12-20 01:31:11 UTC
>> In the initrd.gz I found a Shell script named init, that execs '/bin/busybox
>> init'.
>
>Looks like it uses init applet. Can you show the entire script verbatim?
>
>Can you post the /etc/inittab file?

I still need the script and /etc/inittab file.



(In reply to comment #9)
> I downloaded busybox-1.19.3 source, built it (static), then put
> busybox_unstripped into the initrd as busybox.
> I re-produced the crash and got "busybox[1]: segfault at 1a08 ip 00001a08 sp
> bfe033b4 error 4"
> and then again 
> Kernel panic - not syncing: Attempted to kill init!
> Pid: 1, comm: busybox Not tainted 3.0.0-12-generic #20-Ubuntu
> 
> But:
> gdb busybox-1.19.3/busybox_unstripped
> (gdb) disas 0x00001a08
> No function contains specified address.
> (gdb) disas 0x00001a00,0x00002000
> Dump of assembler code from 0x1a00 to 0x2000:
>    0x00001a00:    Cannot access memory at address 0x1a00
> 
> Please advise.

I can only suggest that you patch init.c (add debug printouts) to help diagnose the problem.

See attached 6.patch
Please apply it, recompile, repeat the boot, and let me know what messages you see.

-- 
vda
Comment 13 Denys Vlasenko 2011-12-20 01:31:54 UTC
Created attachment 3932 [details]
Debugging to be added to init.c
Comment 14 Denys Vlasenko 2011-12-20 01:38:11 UTC
(In reply to comment #12)
> >> In the initrd.gz I found a Shell script named init, that execs '/bin/busybox
> >> init'.
> >
> >Looks like it uses init applet. Can you show the entire script verbatim?
> >
> >Can you post the /etc/inittab file?
> 
> I still need the script and /etc/inittab file.

Oh, sorry, I see you attached them. I looked through them... they don't seem to do anything unusual...
Comment 15 Franz A. 2012-01-24 20:17:05 UTC
Created attachment 3986 [details]
The complete console output of the Ubuntu installation / crash with the new busybox 1.19.3, with debug info from init.c, static (compressed with bzip2).
Comment 16 Franz A. 2012-01-24 20:20:56 UTC
Created attachment 3992 [details]
The same console output, but now with shared_libs and electric fence.
Comment 17 Franz A. 2012-01-24 20:32:04 UTC
Sorry for the delay.
I applied your patch to init.c, built a new static busybox, and ran the installation again. And again the expected segfault happened. I did not spot anything useful myself, but I hope that you can find helpful information in the console output.

I also tried Electric Fence. I had a problem with the libraries, so maybe that second log is not useful. At least it reported several mallocs with size 0.
(On my build PC I have a newer kernel than on the install CD image, so busybox complained about a version mismatch in glibc. After replacing the glibc in the initrd, it started up, and the result is in the second log. I will run another test with all libraries replaced and hope for better results.)
Comment 18 Franz A. 2012-01-29 21:05:41 UTC
Created attachment 4004 [details]
The same console output, but now with shared_libs and electric fence (and all libs replaced).
Comment 19 Denys Vlasenko 2012-01-30 11:10:21 UTC
Created attachment 4010 [details]
Debugging patch

I looked at the data but, unfortunately, my crude debug additions didn't help to pin down the location of the crash.

It's great that you are able to build and use your own busybox - the matching busybox_unstripped is available, and we can map addresses to functions.

Please find attached a patch which adds a SIGSEGV handler to init. It will print something like this on SEGV:

signal:11 address:0x123 ip:0x80dba39
./busybox[0x80db94a]
[0xb786240c]
/bin/busybox[0x80dba39]
/bin/busybox[0x804dcaa]
/bin/busybox[0x804dccd]
/bin/busybox[0x804df7d]
/bin/busybox[0x804e018]
/lib/libc.so.6(__libc_start_main+0xf3)[0xb76916b3]
/bin/busybox[0x804d66d]

and then sleep forever. ip:ADDR and the trace would be very useful to see!

Can you apply the patch to 1.19.3 and try to reproduce the crash? If yes, send me the resulting messages and the busybox_unstripped binary (exactly that binary which was used to obtain the messages!)

Note: if you aren't using glibc and the build therefore fails, remove "#include <execinfo.h>" and the "glibc extension" block in handle_sigsegv(). We won't have a backtrace, but the ip:ADDR will still be printed, and we'll know the location where it SEGVed.
Comment 20 Franz A. 2012-02-10 22:28:25 UTC
Finally I was able to reproduce the problem again.
busybox[1]: segfault at c5 ip 08251bd8 sp bfb036d4 error 6
I could not find the "signal:11 ..." line in the log. But at least the ip now points to something that gdb can disassemble:

(gdb) disas 0x08251bd8
Dump of assembler code for function __EH_FRAME_BEGIN__:
   0x082416a8 <+0>:	adc    $0x0,%al
   0x082416aa <+2>:	add    %al,(%eax)
   0x082416ac <+4>:	add    %al,(%eax)
   0x082416ae <+6>:	add    %al,(%eax)
   0x082416b0 <+8>:	add    %edi,0x52(%edx)
   0x082416b3 <+11>:	add    %al,(%ecx)
   0x082416b5 <+13>:	jl     0x82416bf <__EH_FRAME_BEGIN__+23>
   0x082416b7 <+15>:	add    %ebx,(%ebx)
   0x082416b9 <+17>:	or     $0x4,%al
   0x082416bb <+19>:	add    $0x88,%al
   0x082416bd <+21>:	add    %eax,(%eax)
   0x082416bf <+23>:	add    %dl,(%eax,%eax,1)
   0x082416c2 <+26>:	add    %al,(%eax)
   0x082416c4 <+28>:	sbb    $0x0,%al
   0x082416c6 <+30>:	add    %al,(%eax)
   0x082416c8 <+32>:	nop
   0x082416c9 <+33>:	inc    %esp
   0x082416ca <+34>:	jmp    0x82416cb <__EH_FRAME_BEGIN__+35>
   0x082416cc <+36>:	imul   $0x0,(%eax),%eax
   0x082416cf <+39>:	add    %al,(%eax)
Comment 21 Franz A. 2012-02-10 22:32:57 UTC
Created attachment 4022 [details]
Complete console output of the Ubuntu installation with busybox 1.19.3 and the debug version of init.c (bzip2)
Comment 22 Franz A. 2012-02-10 22:35:41 UTC
Created attachment 4028 [details]
init.c with debug patch
Comment 23 Franz A. 2012-02-10 22:46:28 UTC
The unstripped busybox that I used for running the above test install is too large to be uploaded here.
Please download it from http://members.aon.at/afp/tmp/busybox.bz2
Comment 24 Franz A. 2012-02-10 22:56:33 UTC
I am not sure if this information is helpful for you:
In a previous test run (different build) I got the results below.
The possibly interesting thing is that, again, none of the ips point to a valid address. But I noticed that all of them end with "...f13", so probably the same pattern was used when overwriting the stack or whatever.

Feb  3 20:58:40 in-target: install-info: warning: no info dir entry in `/usr/sh$
Feb  3 20:58:40 kernel: [  988.968953] sh[5630]: segfault at 4 ip 00f3bf13 sp b$
Feb  3 20:58:40 kernel: [  988.972801] sh[5632]: segfault at 4 ip 00234f13 sp b$
Feb  3 20:58:40 kernel: [  988.976872] sh[5634]: segfault at 4 ip 00268f13 sp b$
Feb  3 20:58:40 kernel: [  988.980672] sh[5636]: segfault at 4 ip 006f2f13 sp b$
Feb  3 20:58:40 in-target: install-info: warning: no info dir entry in `/usr/sh$
Feb  3 20:58:40 in-target: install-info: warning: no info dir entry in `/usr/sh$
Feb  3 20:58:40 kernel: [  988.985048] sh[5638]: segfault at 4 ip 00465f13 sp b$
Feb  3 20:58:40 in-target: install-info: warning: no info dir entry in `/usr/sh$
Feb  3 20:58:40 kernel: [  988.988930] sh[5640]: segfault at 4 ip 00a0cf13 sp b$
Feb  3 20:58:40 in-target: install-info: warning: no info dir entry in `/usr/sh$
Feb  3 20:58:40 kernel: [  988.992841] sh[5642]: segfault at 4 ip 00bd0f13 sp b$


Feb  3 20:58:40 kernel: [  988.996741] sh[5644]: segfault at 4 ip 00f13f13 sp b$
Feb  3 20:58:40 in-target: install-info: warning: no info dir entry in `/usr/sh$
Feb  3 20:58:40 kernel: [  989.000677] sh[5646]: segfault at 4 ip 0044ff13 sp b$
Feb  3 20:58:40 in-target: install-info: warning: no info dir entry in `/usr/sh$
Feb  3 20:58:40 kernel: [  989.004558] sh[5648]: segfault at 4 ip 00deff13 sp b$
Feb  3 20:58:40 in-target: install-info: warning: no info dir entry in `/usr/sh$

Reading symbols from /tmp/initrd_tmp/bin/busybox...done.
(gdb) disas 0x00f3bf13
No function contains specified address.
(gdb) disas 0x00234f13
No function contains specified address.
(gdb) disas 0x00465f13
No function contains specified address.
(gdb) disas 0x00268f13
No function contains specified address.
(gdb) disas 0x00f13f13
No function contains specified address.
(gdb) disas 0x00deff13
No function contains specified address.
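One other way to read the common "...f13" suffix, offered only as a hypothesis: the low 12 bits of an address are its offset within a 4 KiB page, so identical low bits with differing upper bits is also what one would see if the same instruction faulted in a mapping that sits at a different page-aligned base in each process. The sketch below just checks that property for the ip values quoted above; the helper names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* ip values of the sh[...] segfaults quoted above */
static const uint32_t sh_fault_ips[] = {
    0x00f3bf13, 0x00234f13, 0x00268f13, 0x006f2f13, 0x00465f13,
    0x00a0cf13, 0x00bd0f13, 0x00f13f13, 0x0044ff13, 0x00deff13,
};

/* Offset of an address within its 4 KiB page. */
uint32_t page_offset(uint32_t addr)
{
    return addr & 0xfffu;
}

/* Returns 1 if every address has the same page offset, else 0. */
int same_page_offset(const uint32_t *addrs, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (page_offset(addrs[i]) != page_offset(addrs[0]))
            return 0;
    return 1;
}
```

All ten values share the page offset 0xf13. If the mapping in question is not part of the static busybox image, that would also explain why gdb finds no function at these addresses.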
Comment 25 Franz A. 2012-02-10 23:13:12 UTC
(In reply to comment #21)
> Created attachment 4022 [details]
> Complete console output of the Ubuntu installation with busybox 1.19.3 and the
> debug version of init.c (bzip2)

I found a second segfault in this log.
Although the ip is different, the disassembly looks just like the one in comment #20.
Comment 26 Franz A. 2012-02-14 19:38:39 UTC
I had an idea for making the program fail closer to the point where the real problem is: I replaced every "free(ptr)" with free(ptr) followed by ptr=NULL, just in case an already freed memory block is used again. This time I get:

busybox[1]: segfault at 6e6f6944 ip 080705ad sp bfc34930 error 4  busybox[8048000+22b000]
busybox[1]: segfault at 6e6f6940 ip 08069485 sp bfc34318 error 4  busybox[8048000+22b000]
Kernel panic - not syncing: Attempted to kill init!

To me the gdb output looks promising:
(gdb) disas 0x08069485
Dump of assembler code for function _IO_new_file_attach:
   0x08069440 <+0>:	sub    $0x24,%esp
   0x08069443 <+3>:	mov    %ebx,0x14(%esp)
   0x08069447 <+7>:	mov    0x28(%esp),%ebx
   0x0806944b <+11>:	mov    %esi,0x18(%esp)
   0x0806944f <+15>:	mov    %edi,0x1c(%esp)
   0x08069453 <+19>:	mov    %ebp,0x20(%esp)
   0x08069457 <+23>:	cmpl   $0xffffffff,0x38(%ebx)
   0x0806945b <+27>:	jne    0x80694e8 <_IO_new_file_attach+168>
   0x08069461 <+33>:	mov    0x2c(%esp),%eax
   0x08069465 <+37>:	mov    $0xffffffcc,%esi
   0x0806946b <+43>:	mov    %gs:0x0,%edi
   0x08069472 <+50>:	movl   $0xffffffff,0x4c(%ebx)
   0x08069479 <+57>:	mov    %eax,0x38(%ebx)
   0x0806947c <+60>:	mov    (%ebx),%eax
   0x0806947e <+62>:	movl   $0xffffffff,0x50(%ebx)
   0x08069485 <+69>:	mov    (%edi,%esi,1),%ebp
   0x08069488 <+72>:	and    $0xfffffff3,%eax
 
And:
(gdb) disas 0x080705ad
Dump of assembler code for function malloc:
   0x08070580 <+0>:	sub    $0x14,%esp
   0x08070583 <+3>:	mov    0x82748d4,%eax
   0x08070588 <+8>:	test   %eax,%eax
   0x0807058a <+10>:	mov    %ebx,0x8(%esp)
   0x0807058e <+14>:	mov    0x18(%esp),%ebx
   0x08070592 <+18>:	mov    %esi,0xc(%esp)
   0x08070596 <+22>:	mov    %edi,0x10(%esp)
   0x0807059a <+26>:	jne    0x8070716 <malloc+406>
   0x080705a0 <+32>:	mov    $0xffffffd0,%edx
   0x080705a6 <+38>:	mov    %gs:0x0,%ecx
   0x080705ad <+45>:	mov    (%ecx,%edx,1),%ecx
 
You can download this busybox from 
   http://members.aon.at/afp/tmp/busybox_free_and_null.bz2
Comment 27 Franz A. 2012-02-14 19:43:03 UTC
Created attachment 4034 [details]
Console output of the installation with the free-and-nullify test version of busybox (bzip2)
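For readers unfamiliar with the free-and-nullify technique used for this test build, a minimal sketch follows. The macro name FREE_AND_NULL and the demo function are assumptions made for this example; the actual test build replaced the free() calls by hand.

```c
#include <stdlib.h>

/* Free the block and clear the pointer variable in one step, so that a
 * later use through this variable dereferences NULL and faults at once
 * instead of silently touching recycled memory. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Returns 1 if the pointer is NULL after release, -1 if malloc failed. */
int demo_free_and_null(void)
{
    char *buf = malloc(16);

    if (!buf)
        return -1;
    FREE_AND_NULL(buf);
    return buf == NULL;
}
```

The limitation is that only the named variable is cleared. Any other copy of the pointer (in a list node, a struct field, or a caller's variable) still points at the freed block, so this catches some use-after-free bugs but not all of them.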
Comment 28 Denys Vlasenko 2012-02-23 01:48:35 UTC
Franz, if you are using the modified busybox per comment 22 (where init.c has the handle_sigsegv() function), then I don't understand why you don't see the _result_ of that function triggering.

The messages you show:

segfault at NNNNNN ip NNNNNNN sp NNNNNNNN error N

should not appear - handle_sigsegv() should print its output instead, and then loop forever, sleeping. Something is seriously wrong...
The functions you are disassembling in the last few comments appear random... another bad sign.

Can you start by verifying that you can trigger handle_sigsegv() by sending a SEGV to init by hand:

kill -SEGV 1

should do it.
Comment 29 Franz A. 2012-02-24 23:05:35 UTC
Hi Denys,

Yes, I was also wondering where the "segfault ..." text came from. I could not find it in your source code.

So I tried several things in the past few days, with little success:
- Added a printf right after the sigaction that activates the segfault handler. And indeed, it proved the activation. So it also proved that "my" busybox was in use during the installation. But I still got the "segfault ..." without a backtrace.

- I guessed that the segfault might happen in the shell code. So I also added your segfault handler to ash.c. Almost the same result: the activation printf was seen many times. Again no backtrace, only the strange "segfault ..." message.

- Added code to force a SEGV in a place that I could reach at will, in the output of the set command: { char *ptr=NULL; *ptr='a'; }. And indeed, I got a backtrace.

- So maybe it came from another place within busybox. I also added the segfault handler to the main in appletlib.c. Same result: no backtrace.

- Found that you set signal handlers in various places. Not for SEGV, but for other signals. In case I had overlooked something, I commented out all the other sigactions. Again no backtrace. Slowly running out of ideas ...

- Maybe the "segfault ..." came from a default handler in some standard library. So I wrote a 3-line program:

#include <stdio.h>
int main (int argc, char **argv)
{ char *ptr=NULL; *ptr='a'; }

$ gcc x.c
$ ./a.out
Segmentation fault
So it was not the standard message either. Maybe it comes from a different library.

- Searched for the text in busybox: strings busybox | grep segf ... nothing, even though it was statically linked.

- Last idea (not tried yet, because your answer came first): I saw that you set the default SIG_DFL for several signals. Maybe commenting these out would get me to the backtrace.

- I searched for the text "segfault" in libraries and found it in several of them. But I tried this (in a shell on my dev PC, not during installation):
$ export LD_DEBUG=files
$ ./busybox sh
/tmp/initrd_tmp/bin $ exit
$ ./busybox init
init: must be run as PID 1
In other words: no additional library was loaded. No big surprise, since busybox was statically linked.

So now I know that the text is not IN busybox and that there is no additional library. But it is still printed. Very strange.


A possible explanation:
The segfault happens very late during the installation, after installing more than 100 packages, including the busybox package. This is just a hypothesis; I do not know whether it really happens during the installation, or whether it would make any sense to do it. But if the installation procedure's init were to 'exec' the busybox from the newly installed busybox package, then it would still have PID=1 but it would be a different executable, possibly with the segfault message in it. What do you think about this (weird) idea?

I will run your "kill -SEGV 1" test when I can access my test PC again in a few days.

Do you have any other idea where the message might come from?

Best regards
Franz
Comment 30 Franz A. 2012-03-08 23:18:28 UTC
Hi Denys,

The 'kill -SEGV 1' test during the Ubuntu installation produced the expected message from the signal handler and a backtrace.

I will add a handler for e.g. USR2 that prints a message. Then I will send this signal periodically to find out when it stops working.

Best regards
Franz
Comment 31 Franz A. 2012-03-18 23:21:14 UTC
Hi Denys,
Even though I tried to enter exactly the same answers to all the questions during the 12 minutes of installation,
I got different results each time. As you said: strange.

After many tries I finally got it to output a (short) backtrace:
signal in ash.c or friends: 11 address: 0x0 ip: 0x8164005
[0x816ad35]
[0x39e40c]
[0x8164005]

(gdb) disas 0x8164005
Dump of assembler code for function evalfor:
   0x08163ede <+0>:	sub    $0x2c,%esp
   0x08163ee1 <+3>:	lea    0xc(%esp),%eax
   0x08163ee5 <+7>:	mov    %eax,(%esp)
   0x08163ee8 <+10>:	call   0x815a517 <setstackmark>
   0x08163eed <+15>:	movl   $0x0,0x1c(%esp)
   0x08163ef5 <+23>:	lea    0x1c(%esp),%eax
   0x08163ef9 <+27>:	mov    %eax,0x20(%esp)
   0x08163efd <+31>:	mov    0x30(%esp),%eax
   0x08163f01 <+35>:	mov    0x4(%eax),%eax
   0x08163f04 <+38>:	mov    %eax,0x24(%esp)
   0x08163f08 <+42>:	jmp    0x8163f3e <evalfor+96>
   0x08163f0a <+44>:	movl   $0x23,0x8(%esp)
   0x08163f12 <+52>:	lea    0x1c(%esp),%eax
   0x08163f16 <+56>:	mov    %eax,0x4(%esp)
   0x08163f1a <+60>:	mov    0x24(%esp),%eax
   0x08163f1e <+64>:	mov    %eax,(%esp)
   0x08163f21 <+67>:	call   0x81622cd <expandarg>
   0x08163f26 <+72>:	mov    0x8276f42,%al
   0x08163f2b <+77>:	test   %al,%al
   0x08163f2d <+79>:	jne    0x8164026 <evalfor+328>
   0x08163f33 <+85>:	mov    0x24(%esp),%eax
   0x08163f37 <+89>:	mov    0x4(%eax),%eax
   0x08163f3a <+92>:	mov    %eax,0x24(%esp)
   0x08163f3e <+96>:	cmpl   $0x0,0x24(%esp)
   0x08163f43 <+101>:	jne    0x8163f0a <evalfor+44>
   0x08163f45 <+103>:	mov    0x20(%esp),%eax
   0x08163f49 <+107>:	movl   $0x0,(%eax)
   0x08163f4f <+113>:	movb   $0x0,0x8276f3d
   0x08163f56 <+120>:	mov    0x8276aa0,%eax
   0x08163f5b <+125>:	inc    %eax
   0x08163f5c <+126>:	mov    %eax,0x8276aa0
   0x08163f61 <+131>:	andl   $0x2,0x34(%esp)
   0x08163f66 <+136>:	mov    0x1c(%esp),%eax
   0x08163f6a <+140>:	mov    %eax,0x28(%esp)
   0x08163f6e <+144>:	jmp    0x816400b <evalfor+301>
   0x08163f73 <+149>:	mov    0x28(%esp),%eax
   0x08163f77 <+153>:	mov    0x4(%eax),%edx
   0x08163f7a <+156>:	mov    0x30(%esp),%eax
   0x08163f7e <+160>:	mov    0xc(%eax),%eax
   0x08163f81 <+163>:	movl   $0x0,0x8(%esp)
   0x08163f89 <+171>:	mov    %edx,0x4(%esp)
   0x08163f8d <+175>:	mov    %eax,(%esp)
   0x08163f90 <+178>:	call   0x815b1a4 <setvar>
   0x08163f95 <+183>:	mov    0x30(%esp),%eax                 n->  
   0x08163f99 <+187>:	mov    0x8(%eax),%eax                  n->nfor.body  
   0x08163f9c <+190>:	mov    0x34(%esp),%edx                 flags 
   0x08163fa0 <+194>:	mov    %edx,0x4(%esp)
   0x08163fa4 <+198>:	mov    %eax,(%esp)
   0x08163fa7 <+201>:	call   0x81639e5 <evaltree>            evaltree(n->nfor.body, flags);
   0x08163fac <+206>:	mov    0x8276f42,%al                   if  (evalskip) 
   0x08163fb1 <+211>:	test   %al,%al
   0x08163fb3 <+213>:	je     0x8164001 <evalfor+291>
   0x08163fb5 <+215>:	mov    0x8276f42,%al                   {
   0x08163fba <+220>:	cmp    $0x2,%al                            evalskip == SKIPCONT 
   0x08163fbc <+222>:	jne    0x8163fdb <evalfor+253>
   0x08163fbe <+224>:	mov    0x8276a98,%eax
   0x08163fc3 <+229>:	dec    %eax
   0x08163fc4 <+230>:	mov    %eax,0x8276a98
   0x08163fc9 <+235>:	mov    0x8276a98,%eax
   0x08163fce <+240>:	test   %eax,%eax
   0x08163fd0 <+242>:	jg     0x8163fdb <evalfor+253>
   0x08163fd2 <+244>:	movb   $0x0,0x8276f42                 evalskip = 0;
   0x08163fd9 <+251>:	jmp    0x8164001 <evalfor+291>        continue;
   0x08163fdb <+253>:	mov    0x8276f42,%al                  evalskip == SKIPBREAK
   0x08163fe0 <+258>:	cmp    $0x1,%al                          ...
   0x08163fe2 <+260>:	jne    0x8164018 <evalfor+314>
   0x08163fe4 <+262>:	mov    0x8276a98,%eax
   0x08163fe9 <+267>:	dec    %eax
   0x08163fea <+268>:	mov    %eax,0x8276a98
   0x08163fef <+273>:	mov    0x8276a98,%eax
   0x08163ff4 <+278>:	test   %eax,%eax
   0x08163ff6 <+280>:	jg     0x8164018 <evalfor+314>
   0x08163ff8 <+282>:	movb   $0x0,0x8276f42                  evalskip = 0; 
   0x08163fff <+289>:	jmp    0x8164018 <evalfor+314>          break; }
   0x08164001 <+291>:	mov    0x28(%esp),%eax
   0x08164005 <+295>:	mov    (%eax),%eax           <----SEGV----  sp = sp->next
   0x08164007 <+297>:	mov    %eax,0x28(%esp)
   0x0816400b <+301>:	cmpl   $0x0,0x28(%esp)                  ; sp ;
   0x08164010 <+306>:	jne    0x8163f73 <evalfor+149>
   0x08164016 <+312>:	jmp    0x8164019 <evalfor+315>
   0x08164018 <+314>:	nop
   0x08164019 <+315>:	mov    0x8276aa0,%eax
   0x0816401e <+320>:	dec    %eax
   0x0816401f <+321>:	mov    %eax,0x8276aa0
   0x08164024 <+326>:	jmp    0x8164027 <evalfor+329>
   0x08164026 <+328>:	nop
   0x08164027 <+329>:	lea    0xc(%esp),%eax
   0x0816402b <+333>:	mov    %eax,(%esp)
   0x0816402e <+336>:	call   0x815a55e <popstackmark>
   0x08164033 <+341>:	add    $0x2c,%esp
   0x08164036 <+344>:	ret    
End of assembler dump.
static void
evalfor(union node *n, int flags)
{
	struct arglist arglist;
	union node *argp;
	struct strlist *sp;
	struct stackmark smark;

	setstackmark(&smark);
	arglist.list = NULL;
	arglist.lastp = &arglist.list;
	for (argp = n->nfor.args; argp; argp = argp->narg.next) {
		expandarg(argp, &arglist, EXP_FULL | EXP_TILDE | EXP_RECORD);
		/* XXX */
		if (evalskip)
			goto out;
	}
	*arglist.lastp = NULL;

	exitstatus = 0;
	loopnest++;
	flags &= EV_TESTED;
	for (sp = arglist.list; sp; sp = sp->next) {
		setvar(n->nfor.var, sp->text, 0);
		evaltree(n->nfor.body, flags);
		if (evalskip) {
			if (evalskip == SKIPCONT && --skipcount <= 0) {
				evalskip = 0;
				continue;
			}
			if (evalskip == SKIPBREAK && --skipcount <= 0)
				evalskip = 0;
			break;
		}
	}
	loopnest--;
 out:
	popstackmark(&smark);
}
---------------------
(gdb) disas 0x816ad35
Dump of assembler code for function handle_sigsegv:
   0x0816acc4 <+0>:	sub    $0xe8,%esp
   0x0816acca <+6>:	mov    0xf4(%esp),%eax
   0x0816acd1 <+13>:	mov    %eax,0xdc(%esp)
   0x0816acd8 <+20>:	mov    0xdc(%esp),%eax
   0x0816acdf <+27>:	mov    0x4c(%eax),%eax
   0x0816ace2 <+30>:	mov    %eax,0xe0(%esp)
   0x0816ace9 <+37>:	mov    0xf0(%esp),%eax
   0x0816acf0 <+44>:	mov    0xc(%eax),%eax
   0x0816acf3 <+47>:	mov    0xe0(%esp),%edx
   0x0816acfa <+54>:	mov    %edx,0x10(%esp)
   0x0816acfe <+58>:	mov    %eax,0xc(%esp)
   0x0816ad02 <+62>:	mov    0xec(%esp),%eax
   0x0816ad09 <+69>:	mov    %eax,0x8(%esp)
   0x0816ad0d <+73>:	movl   $0x8220460,0x4(%esp)
   0x0816ad15 <+81>:	movl   $0x2,(%esp)
   0x0816ad1c <+88>:	call   0x8062370 <dprintf>
   0x0816ad21 <+93>:	movl   $0x32,0x4(%esp)
   0x0816ad29 <+101>:	lea    0x14(%esp),%eax
   0x0816ad2d <+105>:	mov    %eax,(%esp)
   0x0816ad30 <+108>:	call   0x80a8e60 <backtrace>
   0x0816ad35 <+113>:	mov    %eax,0xe4(%esp)
   0x0816ad3c <+120>:	movl   $0x2,0x8(%esp)
   0x0816ad44 <+128>:	mov    0xe4(%esp),%eax
   0x0816ad4b <+135>:	mov    %eax,0x4(%esp)
   0x0816ad4f <+139>:	lea    0x14(%esp),%eax
   0x0816ad53 <+143>:	mov    %eax,(%esp)
   0x0816ad56 <+146>:	call   0x80a8f20 <backtrace_symbols_fd>
   0x0816ad5b <+151>:	movl   $0x270f,(%esp)
   0x0816ad62 <+158>:	call   0x8085b60 <sleep>
   0x0816ad67 <+163>:	jmp    0x816ad5b <handle_sigsegv+151>
End of assembler dump.
---------------------
(gdb) disas 0x39e40c
No function contains specified address.

I don't know if the above backtrace is really helpful, because I am not sure that I can reproduce it.
Maybe I should learn how to do a scripted installation and then run it a thousand times :-)

You can download today's test busybox here:
   http://members.aon.at/afp/tmp/busybox_3.bz2

Finally, there was a little surprise and a possible success right at the end of my tests.
When I tried to shut down via 'kill -USR1 1' after the above SEGV in ash.c, I got some additional information:

*** glibc detected *** /bin/busybox: malloc(): memory corruption: 0x095bb1d8 ***
======= Backtrace: =========
[0x806d3bf]  ... malloc_printerr
[0x806ea4e]  ... _int_malloc
[0x80705d6]  ... malloc ... etc., just like below ...
[0x80d4a8d]
[0x80a2fb6]
[0x80a34d7]
[0x81e5aa2]
[0x81e65d7]
[0x81e665e]
[0x7a9400]
[0x7a9414]
======= Memory map: ========
007a9000-007aa000 r-xp 00000000 00:00 0          [vdso]
08048000-08273000 r-xp 00000000 00:01 5432       /bin/busybox
08273000-08275000 rw-p 0022a000 00:01 5432       /bin/busybox
08275000-0827a000 rw-p 00000000 00:00 0
095b9000-095db000 rw-p 00000000 00:00 0          [heap]
b7700000-b7721000 rw-p 00000000 00:00 0
b7721000-b7800000 ---p 00000000 00:00 0
bf958000-bf979000 rw-p 00000000 00:00 0          [stack]



signal in init.c or friends: 11 address: 0x0 ip: 0x8052f17

[0x81e69ae]  ... handle_sigsegv
[0x7a940c]   ... No function contains specified address.
[0x8052f17]  ... abort
[0x8065975]  ... __libc_message
[0x806d3bf]  ... malloc_printerr
[0x806ea4e]  ... _int_malloc
[0x80705d6]  ... malloc
[0x80d4a8d]  ... open_memstream
[0x80a2fb6]  ... __vsyslog_chk
[0x80a34d7]  ... syslog
[0x81e5aa2]  ... message
[0x81e65d7]  ... run_shutdown_and_kill_processes
[0x81e665e]  ... halt_reboot_pwoff
[0x7a9400]   ... No function contains specified address.
[0x7a9414]   ... No function contains specified address.


Best regards
Franz
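As a cross-check of the addresses above against the memory map, here is a small lookup sketch. The table is copied from the "Memory map" section of the glibc report; the struct and function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

struct region {
    uint32_t start; /* first address of the mapping */
    uint32_t end;   /* one past the last address */
    const char *name;
};

/* Copied from the "Memory map" section of the glibc report above. */
static const struct region memory_map[] = {
    { 0x007a9000, 0x007aa000, "[vdso]" },
    { 0x08048000, 0x08273000, "/bin/busybox (r-xp)" },
    { 0x08273000, 0x08275000, "/bin/busybox (rw-p)" },
    { 0x08275000, 0x0827a000, "(anonymous)" },
    { 0x095b9000, 0x095db000, "[heap]" },
    { 0xb7700000, 0xb7721000, "(anonymous)" },
    { 0xb7721000, 0xb7800000, "(anonymous, no access)" },
    { 0xbf958000, 0xbf979000, "[stack]" },
};

/* Name of the mapping that contains addr, or "(unmapped)". */
const char *region_for(uint32_t addr)
{
    for (size_t i = 0; i < sizeof(memory_map) / sizeof(memory_map[0]); i++)
        if (addr >= memory_map[i].start && addr < memory_map[i].end)
            return memory_map[i].name;
    return "(unmapped)";
}
```

The three frames for which gdb reports "No function contains specified address" (0x7a9400, 0x7a940c, 0x7a9414) all fall inside the [vdso] page, which is supplied by the kernel and has no symbols in the busybox binary. The block that malloc() complains about, 0x095bb1d8, lies in [heap].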
Comment 32 Franz A. 2012-03-18 23:35:04 UTC
Created attachment 4166 [details]
Console output of the installation with the SEGV in ash.c, still with free-and-nullify (bzip2)

P.S.: I know that this is a boring, stuck situation. So I would understand if you wished to stop working on this problem. But if you are willing to continue, I will try to help in any way I can.