| Summary: | httpd does not reap zombies | ||
|---|---|---|---|
| Product: | Busybox | Reporter: | nandhp <nandhp> |
| Component: | Networking | Assignee: | unassigned |
| Status: | RESOLVED WORKSFORME | ||
| Severity: | normal | CC: | busybox-cvs, shmulik.h |
| Priority: | P5 | ||
| Version: | 1.14.x | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Linux | ||
| Host: | Target: | ||
| Build: | |||
| Attachments: | Strace log which shows that zombie is not created | ||
|
Description
nandhp
2009-06-04 22:20:49 UTC
I cannot reproduce it.
Which version of busybox do you use? What is the command line?
Zombies should be prevented by this line in httpd.c:
	xchdir(home_httpd);
	if (!(opt & OPT_INETD)) {
		signal(SIGCHLD, SIG_IGN); <==== THIS ONE
and in my testing, it does prevent them.
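The effect of that line can be checked in isolation. The following is a minimal sketch, assuming Linux (and POSIX.1-2001) semantics for an ignored SIGCHLD; child_needs_reaping() is a hypothetical helper written for illustration and is not part of httpd.c:

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only (not busybox code).
 * Forks a child that exits at once and reports whether the parent
 * still has to collect it:
 *   1  = the child was waitable (it stays a zombie until collected),
 *   0  = the kernel discarded it on exit (nothing to reap),
 *  -1  = error. */
static int child_needs_reaping(void (*disposition)(int))
{
	void (*old)(int);
	int status, result;
	pid_t pid, wpid;

	old = signal(SIGCHLD, disposition);
	if (old == SIG_ERR)
		return -1;

	pid = fork();
	if (pid < 0) {
		signal(SIGCHLD, old);
		return -1;
	}
	if (pid == 0)
		_exit(0);	/* child: exit immediately */

	/* With SIGCHLD ignored, waitpid() blocks until the child is gone
	 * and then fails with ECHILD, because no zombie was kept. */
	do {
		wpid = waitpid(pid, &status, 0);
	} while (wpid == -1 && errno == EINTR);

	if (wpid == pid)
		result = 1;
	else if (errno == ECHILD)
		result = 0;
	else
		result = -1;

	signal(SIGCHLD, old);	/* restore the caller's disposition */
	return result;
}
```

Under these assumptions, child_needs_reaping(SIG_DFL) should report that the child had to be collected, while child_needs_reaping(SIG_IGN) should report that nothing was left to reap. Note that this only covers direct children of the process that set SIG_IGN.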
Unfortunately, I can no longer reproduce it either. The zombies may have been due to some issues with my CGI scripts using httpd -d, that I've since fixed. Sorry for wasting your time.

I have reproduced a similar issue in my environment.

I am using busybox's httpd in an embedded Linux environment provided by ST Microelectronics (STLinux-2.3). Since it doesn't support mime type text/xml by default, I'm adding such support using the CGI script external interpreter mechanism.

My /etc/httpd.conf looks like this:

*.xml:/sbin/xmlUtil

httpd is being launched as a daemon from my init scripts (rcSBB) like this:

start-stop-daemon --start --quiet --pidfile /var/run/httpd.pid --exec /usr/sbin/httpd -- -h /home/httpd/html -c /etc/httpd.conf

/sbin/xmlUtil is based on the httpd_indexcgi.c sample code provided in busybox under networking. It accepts GET requests for XML files (e.g. 'status.xml'), formats a response in XML format, sends the output back to httpd, and exits.

I could see that in the long run, if I access my unit via web several times, the total amount of free memory decreases, and that when I run 'ps', I can see many zombie processes named xmlUtil.

I traced the problem in the httpd.c code, and I believe I found the root cause. It seems that when httpd runs the external interpreter using fork() and execv(), it does not wait for the child process to terminate by calling waitpid() as expected. I think that this is causing the new process to remain in memory as a zombie process, and not release its stack and memory (~10K each time).

Looking at the history of the file in the repository log, I could see that on 2007-09-23, Denis Vlasenko made a change to the code under the comment:

"httpd: simplify CGI i/o loop. -200 bytes"

What this change does is move the last part of the code in send_cgi_and_exit() to a different function called cgi_io_loop_and_exit(), but while the old code did call waitpid(), this particular code segment is under an #if 0 condition in the new location (lines 1152-1162). The same code remains until today.

Is there any reason not to call waitpid()? Is it possible to restore the code under #if 0 without significant impact?

I tried to remove the #if 0 condition, but the code does not compile this way, since variables that were defined in the old code are missing in the new code. It might be possible to change cgi_io_loop_and_exit() to accept the required parameters, but since it is called from several places in the code, this might not be applicable in all of them.

(In reply to comment #3)
> I have reproduced a similar issue in my environment.

What is your environment? Most importantly, what version of busybox?

> I am using busybox's httpd in an embedded Linux environment provided by ST
> Microelectronics (STLinux-2.3). Since it doesn't support mime type text/xml by
> default, I'm adding such support using the CGI script external interpreter
> mechanism.
>
> My /etc/httpd.conf looks like this:
>
> *.xml:/sbin/xmlUtil
>
> httpd is being launched as a daemon from my init scripts (rcSBB) like this:
>
> start-stop-daemon --start --quiet --pidfile /var/run/httpd.pid --exec
> /usr/sbin/httpd -- -h /home/httpd/html -c /etc/httpd.conf
>
> /sbin/xmlUtil is based on the httpd_indexcgi.c sample code provided in busybox
> under networking. It accepts GET requests for XML files (e.g. 'status.xml'),
> formats a response in XML format, sends the output back to httpd, and exits.
>
> I could see that in the long run, if I access my unit via web several times,
> the total amount of free memory decreases, and that when I run 'ps', I can see
> many zombie processes named xmlUtil.
>
> I traced the problem in the httpd.c code, and I believe I found the root cause.
> It seems that when httpd runs the external interpreter using fork() and
> execv(), it does not wait for the child process to terminate by calling
> waitpid() as expected. I think that this is causing the new process to remain
> in memory as a zombie process, and not release its stack and memory (~10K each
> time).

Yes, this would cause exited children to remain as zombies, _unless_ SIGCHLD is set to SIG_IGN.

But busybox does set it to SIG_IGN - see comment #1.

In order to demonstrate it, I ran busybox as follows, and viewed http://127.0.0.1:88/ in my browser:

# strace -oLOG -tt -f -s99 ./busybox httpd -f -p88 -vvv -h /.1/video
[::ffff:127.0.0.1]:47252: connected
[::ffff:127.0.0.1]:47252: url:/
[::ffff:127.0.0.1]:47252: closed

Here we see how SIGCHLD is set to SIG_IGN:

15969 21:36:59.217432 execve("./busybox", ["./busybox", "httpd", "-f", "-p88", "-vvv", "-h", "/.1/video"], [/* 32 vars */]) = 0
...
15969 21:36:59.218715 chdir("/.1/video") = 0
15969 21:36:59.218826 rt_sigaction(SIGCHLD, {SIG_IGN}, {SIG_DFL}, 8) = 0
15969 21:36:59.218954 socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
15969 21:36:59.219078 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
...

And here CGI writes out last part of the result, and exits:

...
15971 21:37:02.378285 write(1, "vi</a><td class=sz>631650920<td class=dt>...
15971 21:37:02.378477 poll( <unfinished ...>
15972 21:37:02.378552 <... write resumed> ) = 1654
15972 21:37:02.378622 _exit(0) = ?
15971 21:37:02.378746 <... poll resumed> [{fd=0, events=0}, {fd=3, events=POLLIN, revents=POLLHUP}, {fd=0, events=0}], 3, -1) = 1
15971 21:37:02.378833 read(3, "", 4096) = 0
15971 21:37:02.378937 shutdown(1, 1 /* send */) = 0
15971 21:37:02.379201 write(2, "[::ffff:127.0.0.1]:47252: closed\n", 33) = 33
15971 21:37:02.379546 _exit(0) = ?
15969 21:37:05.224802 <... accept resumed> 0xffcbcaf4, [28]) = ? ERESTARTSYS (To be restarted)

See? We did not get SIGCHLD, since kernel knows we aren't interested. Now I Ctrl-C it:

15969 21:37:05.224937 --- SIGINT (Interrupt) @ 0 (0) ---
15969 21:37:05.225069 +++ killed by SIGINT +++

I will attach complete LOG.

I also ran "./busybox httpd -f -p88 -vvv -h /.1/video", then refreshed http://127.0.0.1:88/ a dozen times in the browser, then ran ps -A and I definitely do not see any zombies.

Created attachment 473
Strace log which shows that zombie is not created

(In reply to comment #4)
> What is your environment? Most importantly, what version of busybox?

It is an STLinux-2.3 environment, running kernel 2.6.23.17_stm23_0117 on a STMicro based board (Hitachi SuperH 4 processor).
The supplied busybox version is 1.8.2 in source RPM format, patched by STM for compatibility with their init scripts.

> Yes, this would cause exited children to remain as zombies, _unless_ SIGCHLD is
> set to SIG_IGN.
>
> But busybox does set it to SIG_IGN - see comment #1.
>
> In order to demonstrate it, I ran busybox as follows, and viewed
> http://127.0.0.1:88/ in my browser:
>
> # strace -oLOG -tt -f -s99 ./busybox httpd -f -p88 -vvv -h /.1/video
> [::ffff:127.0.0.1]:47252: connected
> [::ffff:127.0.0.1]:47252: url:/
> [::ffff:127.0.0.1]:47252: closed
>
> Here we see how SIGCHLD is set to SIG_IGN:
>
> 15969 21:36:59.217432 execve("./busybox", ["./busybox", "httpd", "-f", "-p88",
> "-vvv", "-h", "/.1/video"], [/* 32 vars */]) = 0
> ...
> 15969 21:36:59.218715 chdir("/.1/video") = 0
> 15969 21:36:59.218826 rt_sigaction(SIGCHLD, {SIG_IGN}, {SIG_DFL}, 8) = 0
> 15969 21:36:59.218954 socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
> 15969 21:36:59.219078 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> ...
>
> And here CGI writes out last part of the result, and exits:
>
> ...
> 15971 21:37:02.378285 write(1, "vi</a><td class=sz>631650920<td class=dt>...
> 15971 21:37:02.378477 poll( <unfinished ...>
> 15972 21:37:02.378552 <... write resumed> ) = 1654
> 15972 21:37:02.378622 _exit(0) = ?
> 15971 21:37:02.378746 <... poll resumed> [{fd=0, events=0}, {fd=3,
> events=POLLIN, revents=POLLHUP}, {fd=0, events=0}], 3, -1) = 1
> 15971 21:37:02.378833 read(3, "", 4096) = 0
> 15971 21:37:02.378937 shutdown(1, 1 /* send */) = 0
> 15971 21:37:02.379201 write(2, "[::ffff:127.0.0.1]:47252: closed\n", 33) = 33
> 15971 21:37:02.379546 _exit(0) = ?
> 15969 21:37:05.224802 <... accept resumed> 0xffcbcaf4, [28]) = ? ERESTARTSYS
> (To be restarted)
>
> See? We did not get SIGCHLD, since kernel knows we aren't interested. Now I
> Ctrl-C it:
>
> 15969 21:37:05.224937 --- SIGINT (Interrupt) @ 0 (0) ---
> 15969 21:37:05.225069 +++ killed by SIGINT +++
>
> I will attach complete LOG.
>
>
> I also ran "./busybox httpd -f -p88 -vvv -h /.1/video", then refreshed
> http://127.0.0.1:88/ a dozen times in the browser, then ran ps -A and I
> definitely do not see any zombies.
>

But what did the web browser request from httpd in busybox?

I could see that if the browser requests regular files that it can serve locally from /home/httpd/html/, or if it requests a cgi-bin script that resides under /home/httpd/cgi-bin/ - everything works OK.

But, if it requires an execution of an external interpreter, using the PHP external interpreter method, I get the described problem.

Is it possible that by the phrase "external interpreter" it is meant to be a shell (or other) script, and not a C application so the shell can take care of such signals?

Could it be that the C application should be implemented differently than the httpd_indexcgi.c sample code, and maybe handle such signals on its own? If so, could you point me to an example?

(In reply to comment #6)
> (In reply to comment #4)
> > What is your environment? Most importantly, what version of busybox?
>
> It is an STLinux-2.3 environment, running kernel 2.6.23.17_stm23_0117 on a
> STMicro based board (Hitachi SuperH 4 processor).
> The supplied busybox version is 1.8.2 in source RPM format, patched by STM for
> compatibility with their init scripts.

Can you send the RPM to vda.linux@googlemail.com?

> > See? We did not get SIGCHLD, since kernel knows we aren't interested. Now I
> > Ctrl-C it:
> >
> > 15969 21:37:05.224937 --- SIGINT (Interrupt) @ 0 (0) ---
> > 15969 21:37:05.225069 +++ killed by SIGINT +++
> >
> > I will attach complete LOG.
> >
> >
> > I also ran "./busybox httpd -f -p88 -vvv -h /.1/video", then refreshed
> > http://127.0.0.1:88/ a dozen times in the browser, then ran ps -A and I
> > definitely do not see any zombies.
> >
>
> But what did the web browser request from httpd in busybox?
>
> I could see that if the browser requests regular files that it can serve
> locally from /home/httpd/html/, or if it requests a cgi-bin script that resides
> under /home/httpd/cgi-bin/ - everything works OK.

You can see it in strace log I attached. Please read it. It requests /, and httpd runs cgi-bin/index.cgi:

15971 21:37:02.334556 read(0, "GET / HTTP/1.1\r\nHost: ..."..., 1025) = 374
15971 21:37:02.334719 write(2, "[::ffff:127.0.0.1]:47252: url:/\n", 32) = 32
...
15971 21:37:02.335707 alarm(60) = 60
15971 21:37:02.335803 alarm(0) = 60
15971 21:37:02.335899 stat64("index.html", 0xffcbc998) = -1 ENOENT
15971 21:37:02.336006 access("cgi-bin/index.cgi", X_OK) = 0
15971 21:37:02.336119 stat64("cgi-bin", {st_mode=S_IFDIR|0755, st_size=80, ...}) = 0
15971 21:37:02.336288 pipe([3, 5]) = 0
15971 21:37:02.336395 pipe([6, 7]) = 0
15971 21:37:02.336503 vfork( <unfinished ...>
...pid 15969 syscalls skipped (they are irrelevant)...
15971 21:37:02.337061 <... vfork resumed> ) = 15972
15972 21:37:02.337148 close(7) = 0
15972 21:37:02.337245 close(3) = 0
15972 21:37:02.337345 dup2(6, 0) = 0
15972 21:37:02.337442 close(6) = 0
15972 21:37:02.337538 dup2(5, 1) = 1
15972 21:37:02.337633 close(5) = 0
15972 21:37:02.337746 chdir("cgi-bin") = 0
15972 21:37:02.337861 rt_sigaction... {skipped a few of these}
15972 21:37:02.338209 execve("index.cgi", ["index.cgi"], [/* 47 vars */]) = 0

> But, if it requires an execution of an external interpreter, using the PHP
> external interpreter method, I get the described problem.
>
> Is it possible that by the phrase "external interpreter" it is meant to be a
> shell (or other) script, and not a C application so the shell can take care of
> such signals?

httpd uses the same method for all CGIs.

> Could it be that the C application should be implemented differently than the
> httpd_indexcgi.c sample code, and maybe handle such signals on its own?

I think all CGIs should just dump result to stdout and exit, regardless of the language they use (C, shell, PHP etc). No special handling of exit signal is needed.

Can you produce a strace log and show "ps -AH" output fragment which shows zombies being created?

(In reply to comment #7)
> Can you send the RPM to vda.linux@googlemail.com?

Sent.

> httpd uses the same method for all CGIs.

OK

> I think all CGIs should just dump result to stdout and exit, regardless of the
> language they use (C, shell, PHP etc). No special handling of exit signal is
> needed.

Great - this is exactly what my program does :-)

> Can you produce a strace log and show "ps -AH" output fragment which shows
> zombies being created?

2 problems here:
1) busybox on my unit does not support strace - I'll need to rebuild
2) busybox's ps command does not support AH options

I have the output from 'ps -w' - this is after refreshing the web browser 4 times:

  PID Uid     VSZ Stat Command
    1 root   2568 S    init
    2 root        SW<  [kthreadd]
    3 root        SW<  [ksoftirqd/0]
    4 root        SW<  [events/0]
    5 root        SW<  [khelper]
   31 root        SW<  [kblockd/0]
   36 root        SW<  [kseriod]
   98 root        SW   [pdflush]
   99 root        SW   [pdflush]
  100 root        SW<  [kswapd0]
  101 root        SW<  [aio/0]
  181 root        SW<  [mtdblockd]
  182 root        SW<  [ftld]
  214 root        SW<  [rpciod/0]
  216 root   2568 S    init
  217 root   2572 S    /bin/sh /etc/init.d/rcSBB
  270 root   2568 S    /sbin/klogd
  275 root   2568 S    /sbin/syslogd
  280 root   4068 S    /usr/sbin/sshd
  307 root   2568 S    /usr/sbin/telnetd -f /etc/issue.net
  332 root   2568 S    /usr/sbin/httpd -h /home/httpd/html -c /etc/httpd.conf
  415 root        SWN  [jffs2_gcd_mtd1]
  433 root        Z    [tar]
  495 root        SW<  [EMBXSHM-NewPort]
  496 root        SW<  [EMBXSHM-PortClo]
  497 root        SW<  [EMBXSHM-NewPort]
  498 root        SW<  [EMBXSHM-PortClo]
  501 root        SW<  [STFDMA_ClbckMgr]
  749 root 179272 S    /usr/bin/irdapp
  752 root        SW<  [stpti4_IntTask]
  753 root        SW<  [stpti4_EvtTask]
  754 root        SW<  [STCLKRV_Recover]
  755 root        DW<  [STVOUT_STATE_MA]
  756 root        SW<  [STVOUT_INFOFRAM]
  757 root        SW<  [kblit_interrupt]
  758 root        DW<  [STLAYER-GFX/CUR]
  759 root        DW<  [STLAYER-GFX/CUR]
  761 root        SW<  [PESES0]
  762 root        SW<  [DEC0]
  763 root        SW<  [PP0]
  764 root        SW<  [PP1]
  765 root        SW<  [PCMPLAYER0]
  766 root        SW<  [PCMPLAYER1]
  767 root        SW<  [SPDIFPLAYER]
  768 root        SW<  [STSUBT_FILTER_T]
  769 root        SW<  [STSUBT_PROCESSO]
  770 root        SW<  [STSUBT_ENGINE_T]
  771 root        SW<  [STSUBT_TIMER_TA]
  798 root        Z    [xmlUtil]
  800 root        Z    [xmlUtil]
  806 root        Z    [xmlUtil]
  834 root        Z    [xmlUtil]
  835 root   2568 S    sh -c /bin/ps -w
  836 root   2572 R    /bin/ps -w

Besides, I noticed that you are running httpd with -f option that prevents it from daemonizing - could this be the cause of the difference?

(In reply to comment #8)
> (In reply to comment #7)
> > Can you send the RPM to vda.linux@googlemail.com?
>
> Sent.

I don't see it, but nevermind. I see where the problem is.

> > Can you produce a strace log and show "ps -AH" output fragment which shows
> > zombies being created?
>
> 2 problems here:
> 1) busybox on my unit does not support strace - I'll need to rebuild
> 2) busybox's ps command does not support AH options
>
> I have the output from 'ps -w' - this is after refreshing the web browser 4
> times:
>
>   PID Uid     VSZ Stat Command
>     1 root   2568 S    init
...
>   332 root   2568 S    /usr/sbin/httpd -h /home/httpd/html -c
> /etc/httpd.conf
>   415 root        SWN  [jffs2_gcd_mtd1]
>   433 root        Z    [tar]

Hmm... what is *this*?

>   798 root        Z    [xmlUtil]
>   800 root        Z    [xmlUtil]
>   806 root        Z    [xmlUtil]
>   834 root        Z    [xmlUtil]

Yep. Here they are. But there are no corresponding zombies of *httpd slaves*.

There are three processes running when you run a single CGI session:

1. httpd "master", which spawns new httpd "slaves" - in your listing, master has PID 332;
2. httpd "slave", which sends files, or spawns CGIs and then pumps data to/from CGI<->network; and
3. CGI process.

Note that you do not see any of the slaves in your ps -w. That is because they successfully detected EOF from CGI and exited. _Before CGI itself managed to exit_!

What happens in this case? CGI (and any other program) gets reparented to init if its parent dies. When CGI dies, it will become a zombie until _init_ "waits for it" (runs wait[pid] syscall).

And I think your init is broken. It does not do that. Why do I think so? Because I see another unreaped zombie - [tar]. I don't think it is related to httpd! :)

You need to fix your init.

How to test whether your init is really broken:
Run a few times:

sh -c 'sleep 1 & kill -9 $$'

(a shell which starts a child, and then kills itself)

then look at ps output. If you see zombies of "sleep", then init is buggy.

(In reply to comment #10)
> How to test whether your init is really broken:
> Run a few times:
>
> sh -c 'sleep 1 & kill -9 $$'
>
> (a shell which starts a child, and then kills itself)
>
> then look at ps output. If you see zombies of "sleep", then init is buggy.
>

I believe you are correct - I can see several zombie processes called [sleep]

Could this be a bug in busybox 1.8.2 that was later fixed ? If so I can ask STLinux to upgrade their supplied package to 1.14.2.

This is what I'm seeing on the serial console:

...
init started: BusyBox v1.8.2 (2009-07-15 14:10:39 IDT)
starting pid 217, tty '': '/etc/init.d/rcSBB'
Activating swap.
Checking all file systems...
fsck (busybox 1.8.2, 2009-07-15 14:10:39 IDT)
Setting up networking...done.
Setting up IP spoofing protection: rp_filter.
Disable TCP/IP Explicit Congestion Notification: done.
Configuring network interfaces: done.
Initializing random number generator...done.
Starting kernel log daemon: klogd.
Starting system log daemon: syslogd.
Starting sshd:ok
Starting telnetd: ok
Starting httpd: ok
...

/etc/initabBB looks like this (also provided by ST):

# Example Busybox inittab
::sysinit:/etc/init.d/rcSBB
ttyAS0::askfirst:/bin/sh
#ttyAS1::askfirst:/bin/sh
# Put a getty on the serial line (for a terminal)
#::respawn:/sbin/getty -L ttyAS0 115200 vt102
::ctrlaltdel:/sbin/reboot
::shutdown:/sbin/swapoff -a
::shutdown:/bin/umount -a -r
::restart:/sbin/init

I can see that serial console is not started normally with getty (to avoid the need to login) - could this be part of the problem?

One more problem I could see which may be related:
If I interrupt the boot process and don't let the rcSBB script load the loadable modules and run our application (a debug option we are using), I can run 'ps' (either from the serial console or from telnet).
If I let the boot go all the way and then connect from telnet, I can't run 'ps' and I get this:

# ps
/bin/ps: invalid option -- T
BusyBox v1.8.2 (2009-07-15 14:10:39 IDT) multi-call binary

Usage: ps

Report process status

Options:
w Wide output

> Could this be a bug in busybox 1.8.2 that was later fixed ?
Yes, the bug was fixed in 1.10.x branch

(In reply to comment #12)
> > Could this be a bug in busybox 1.8.2 that was later fixed ?
>
> Yes, the bug was fixed in 1.10.x branch
>

Thanks. With this information, I'm moving forward to get the STLinux maintainers to integrate the latest busybox code into their environment. I believe this bug can be closed.

In case they do not want to switch to a more recent version, they need to at least patch this bug in init.c.
The bug in 1.8.x was here:
static int waitfor(const struct init_action *a, pid_t pid)
{
	int runpid;
	int status, wpid;

	runpid = (NULL == a) ? pid : run(a);
	while (1) {
		wpid = waitpid(runpid, &status, 0);
		if (wpid == runpid)
			break;
		if (wpid == -1 && errno == ECHILD) {
			/* we missed its termination */
			break;
		}
		/* FIXME other errors should maybe trigger an error, but allow
		 * the program to continue */
	}
	return wpid;
}
The problem is, waitpid(runpid, &status, 0) would not wait for OTHER processes (with pids != runpid).
The fix is to use waitpid(-1, &status, 0), which waits for any process.
I would like to thank you guys for the support. As requested, STLinux maintainers picked up the sources of the latest stable release, incorporated it into their repository (after patching necessary parts) and released it. I can see that indeed the problem has now disappeared. Please consider this bug as closed. Thanks again, Shmulik. |