| Summary: | ash behaves strangely on CTRL-C (does not exit but fills process table) | ||
|---|---|---|---|
| Product: | Busybox | Reporter: | Ronald Wahl <ronald.wahl> |
| Component: | Other | Assignee: | unassigned |
| Status: | RESOLVED FIXED | ||
| Severity: | major | CC: | busybox-cvs |
| Priority: | P3 | ||
| Version: | 1.13.x | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Linux | ||
| Host: | Target: | ||
| Build: | |||
| Attachments: |
working .config
Proposed fix trace of executing script.sh (with manually inserted CTRL-C marker) see shell continiously spawning/finishing subshells after SIGINT+SIGUSR1 with SIGINT+SIGUSR1 & expanded tracing |
||
Please attach your .config, it does not happen with mine (which I will attach too). Let's compare them. Created attachment 153 [details]
working .config
Your .config shows the same problem (had to remove the cross prefix - i just built it native against glibc). My .config is the default config (make menuconfig and just save). But I have just noticed that this bug is a race condition because in some cases it works. So probably you need to try just harder... Meanwhile I know even a bit more: The working case looks like this: - SIGINT is received - onsig() is called - raise_interrupt() is called - we kill ourself there with SIGINT - shell finishes The broken case looks like this: - SIGINT received - onsig() is called - dotrap() is called (dunno why) which resets gotsig[SIGINT - 1] - raise_interrupt() is called - SIGINT check fails because it was reset in dotrap() - shell does NOT finish So the question is why dotrap() is called in the broken case. Maybe interesting for further inspections: "suppressint" was always 1 in onsig() during my tests. So raise_interrupt is not called from onsig() directly. Created attachment 155 [details]
Proposed fix
Please review my proposed patch. It works for me but I'm not sure if its correct.
Probably we also need to move the "pendingsig = 0;" outside of the inner if statement, to catch the case that pendingsig is 1 already and a SIGINT occurs when INTs are supposed to be suppressed. Semantically we need to achieve that when intpending is 1 then pendingsig must be 0 otherwise the shell might not exit on SIGINT. Alternatively we could call dotrap() only when pendingsig is 1 and intpending is 0 but this might be racy as well because here the check is not signal safe because it is outside of the signal handler. Oh well, signal handling is an awful topic ... ;-) I think it needs to be made like this:
gotsig[signo - 1] = 1;
// pendingsig = signo;
if (/* exsig || */ (signo == SIGINT && !trap[SIGINT])) {
if (!suppressint) {
pendingsig = 0;
raise_interrupt(); /* does not return */
}
intpending = 1;
} else
pendingsig = signo;
IOW, pendingsig is 1 if signal OTHER than SIGINT is pending, and intpending is for SIGINT. Currently, it looks like pendingsig will be set for SIGINT too, and then SIGINT handling sees that and handles SIGINT again as if it is an "other" signal.
Your proposal "if (!intpending) pendingsig = signo;" will make pending SIGINT ignore any other signal.
> need to add "pendingsig = 0;" outside of the inner if statement, to catch the case that pendingsig is 1 already and a SIGINT occurs when INTs are supposed to be suppressed.
But that is a valid scenario. Other signal occurred, then user pressed ^C. Shall we forget about "other signal"? I don't think so.
Please let me know how the above fix works for you. It seems to work for me (and yes, I managed to reproduce the bug)
Your fix (like mine) does fix the problem if only one signal is pending (SIGINT). I understand that there might be scenarios where the signal handlers for other signals should be handled before we are killed with SIGINT. The problem with intpending=1 and pendingsig!=0 is that in this case the CTRL-C is lost because dotrap() might be called before raise_interrupt() is called which ends in the same broken scenario described. You can easily see this when you follow these steps: (1) Modify the script to have an handler for USR1 which does just a "return 0". (2) Start the script. (3) Stop the script with CTRL-Z. (4) Send SIGUSR1 to the script. (5) Send SIGINT to the script _and_ the sleep command. (6) Continue the script with "fg". Now you'll see the broken behavior again. Same happens if (4) and (5) are interchanged. I tried to skip resetting gotsig[SIGINT-1] in dotrap() when sigpending is 1 but this ends in a loop. But maybe you know how to do it better... (In reply to comment #8) > The problem with intpending=1 and pendingsig!=0 is that in this case the CTRL-C > is lost because dotrap() might be called before raise_interrupt() is called > which ends in the same broken scenario described. You can easily see this when > you follow these steps: > > (1) Modify the script to have an handler for USR1 which does just a "return 0". > (2) Start the script. > (3) Stop the script with CTRL-Z. > (4) Send SIGUSR1 to the script. > (5) Send SIGINT to the script _and_ the sleep command. > (6) Continue the script with "fg". > > Now you'll see the broken behavior again. Same happens if (4) and (5) are > interchanged. Unfortunately, it does not happen to me. I will rely on you to do the testing. > I tried to skip resetting gotsig[SIGINT-1] in dotrap() when sigpending is 1 > but this ends in a loop. But maybe you know how to do it better... Earlier you said: The broken case looks like this: - SIGINT received - onsig() is called - dotrap() is called (dunno why) which resets gotsig[SIGINT - 1] - raise_interrupt() is called - SIGINT check fails because it was reset in dotrap() - shell does NOT finish In the new case (SIGUSR1 + SIGINT + ash patched to not set pendingsig on SIGINT), calling dotrap() is no longer a mystery or wrong, it does need to be called because of SIGUSR1. One thing we can do is to not clear gotsig[SIGINT - 1] in dotrap() if there is no handler for SIGINT, thus allowing raise_interrupt() to see it. But before you try this fix - can you investigate further how exactly ash manages to duplicate itself?! Ok, raise_interrupt() gets confused and it falls through into raise_exception(EXSIG); which longjmp()s somewhere. It is still not clear how this leads to fork() and duplicate ash process... worth looking into because it may be a separate bug. After you found out how that happens, try the fix a-la this in dotrap(): ... for (i = 1, q = gotsig; i < NSIG; i++, q++) { if (*q == 0) continue; p = trap[i]; /* non-trapped SIGINT is handled separately by raise_interrupt, * don't upset it by resetting gotsig[SIGINT-1] */ if (i == SIGINT && !p) continue; *q = 0; if (!p) continue; skip = evalstring(p, SKIPEVAL); ... You can find it in hot-fix directory (diffed against 1.13.3): http://busybox.net/downloads/fixes-1.13.3/busybox-1.13.3-ash.patch Created attachment 165 [details]
trace of executing script.sh (with manually inserted CTRL-C marker)
The attachment shows the trace of the script below. I inserted the point were I pressed CTRL-C once. The fork after CTRL-C is simply the restart of the sleep command after the first sleep finished because CTRL-C.
--------- script.sh --------------
func () {
sleep 9999
}
handler () {
return 0
}
trap handler USR1
while (true); do
func
done
----------------------------------
Created attachment 167 [details]
see shell continiously spawning/finishing subshells after SIGINT+SIGUSR1
Your fix I tried already and this "loops" now. I mean the original shell forks continously new sub shells and finishes them until "killall ash". See the attached trace.
dowait(0x0) called
-- SIGUSR1 & SIGINT --
wait returns pid=-1, status=0x85f1420
token word handler
pipeline: entered
reread token word handler
reread token word handler
reread token word handler
token end of file
reread token end of file
reread token end of file
reread token end of file
pid 1232, evaltree(0x85f0afc: 0, 0) called
evalcommand(0x85f0afc, 0) called
evalcommand arg: handler
dowait(0x1) called
wait returns pid=0, status=0x0
pid 1232, evaltree(0x85f15ec: 0, 0) called
evalcommand(0x85f15ec, 0) called
evalcommand arg: return
evalcommand arg: 0
dowait(0x1) called
wait returns pid=0, status=0x0
token end of file
up to here it looks more or less ok, SIGUSR1 handler was executed
when dowait() reported EINTR. I guess here:
blocking_wait_with_raise_on_sig(struct job *job)
{
pid_t pid = dowait(DOWAIT_BLOCK, job);
if (pid <= 0 && pendingsig)
raise_exception(EXSIG);
return pid;
}
But now I'm confused:
pid 1232, evaltree(0x85f0a2c: 4, 2) called
makejob(1) returns %2
forkshell(%2, 0x85f0a2c, 0) called
In parent shell: child = 1239
It does not seem to notice that intpending is set!
Can you expand the tracing? Add " s:%d i:%d\n", ..., pendingsig, intpending
to these TRACE statements so that we can see whetehr those are set:
TRACE(("dowait(0x%x) called\n", wait_flags));
TRACE(("wait returns pid=%d, status=0x%x\n", pid, status));
ALso, can you test we can trigger this bug on simpler script? Running (true) in subshell is complication trace needlessly.
Try this one:
trap "echo USR1" USR1
while true; do
echo Sleeping
sleep 1
done
Thanks for helping with debugging.
Created attachment 171 [details]
with SIGINT+SIGUSR1 & expanded tracing
See the attached trace. I used your stripped down script.
Ok, I took deeper look into this mess. I went back to original bug, and used this script, let's call it bug1.sh: trap "echo USR1" USR1 while true; do echo Sleeping sleep 5 done Was running it under ash from 1.13.3 (reproduced ^C bug easily) and was adding TRACE(()) machinery until I saw where the bug is. There are two of them. First one is that in evaltree, when we check for EXSIG exception: int err = setjmp(jmploc.loc); if (err) { /* if it was a signal, check for trap handlers */ if (exception == EXSIG) goto out; we do go to dotrap() via "goto out", BUT WE DONT RESTORE suppressint counter! It easily may be left too big. New, more verbose trace, looks like this: 1237485605 [4865] pending s:2 i:1(supp:1) wait returns pid=-1, status=0x0, errno=4(Interrupted system call) 1237485605 [4865] pending s:2 i:1(supp:1) raising exception 5 1237485605 [4865] pending s:2 i:1(supp:2) exception 5 in evaltree 1237485605 [4865] pending s:0 i:1(supp:2) dotrap entered 1237485605 [4865] pending s:0 i:1(supp:2) sig 2 is active, will run handler '(null)' 1237485605 [4865] pending s:0 i:1(supp:2) dotrap returns 0 1237485605 [4865] pending s:0 i:1(supp:2) pid 4865, evaltree(0x12e3350: 0, 2) called 1237485605 [4865] pending s:0 i:1(supp:2) evalcommand(0x12e3350, 2) called 1237485605 [4865] pending s:0 i:1(supp:2) evalcommand arg: true 1237485605 [4865] pending s:0 i:1(supp:2) dowait(0x1) called See "supp:2"? That's suppressint which is stuck one too big, and ^C won't be handled. I added SAVE_INT/RESTORE_INT around entire evaltree, and then I see (with enev more debug): 1237486866 [7240] pending s:2 i:1(supp:1) wait returns pid=-1, status=0x0, errno=4(Interrupted system call) 1237486866 [7240] pending s:2 i:1(supp:1) raising exception 5 on line 3821 1237486866 [7240] pending s:2 i:1(supp:2) exception 5 (EXSIG) in evaltree, err=1 1237486866 [7240] pending s:0 i:1(supp:2) dotrap entered 1237486866 [7240] pending s:0 i:1(supp:2) sig 2 is active, will run handler '(null)' 1237486866 [7240] pending s:0 i:1(supp:2) dotrap returns 0 1237486866 [7240] pending s:0 i:1(supp:0) raising interrupt on line 8082 1237486866 [7240] pending s:0 i:0(supp:0) raising exception 5 on line 336 1237486866 [7240] pending s:0 i:0(supp:1) exception 5 (EXSIG) in evaltree, err=1 1237486866 [7240] pending s:0 i:0(supp:0) leaving evaltree 1237486866 [7240] pending s:0 i:0(supp:0) evaltree(0x8a0350: 0, 2) called 1237486866 [7240] pending s:0 i:0(supp:0) evalcommand(0x8a0350, 2) called 1237486866 [7240] pending s:0 i:0(supp:0) evalcommand arg: true 1237486866 [7240] pending s:0 i:0(supp:0) dowait(0x1) called "raising interrupt on line 8082" is ok, but it doesn't work. Well, *this* is our old bug, already fixed - gotsig[SIGINT - 1] was erroneously cleared by dotrap() I am adding SAVE_INT/RESTORE_INT fix to ash hotfix: http://busybox.net/downloads/fixes-1.13.3/busybox-1.13.3-ash.patch If it still fails for you, can you produce a DEBUG 2 trace of bug1.sh under SIGUSR1+SIGINT test? Incorporated the fix in rev 25762. Need confirmation that it fixes the problem. I tested http://busybox.net/downloads/fixes-1.13.3/busybox-1.13.3-ash.patch and everything is fine now. Nice work. Thanks for fixing this, Denys! |
Hi, the below script behaves strangely when CTRL-C is pressed. Each CTRL-C doubles the running subshells(?). Pressing CTRL-C longer (1-2 secs) will quickly make small devices unusable (fork failed etc.). When registering a trap handler everything is fine but I think the default handler for SIGINT should just abort the whole script. See: $ /home/rwa/busybox-1.13.3/ash script.sh (other term) $ ps ax|grep script.sh 3896 pts/7 S+ 0:00 /home/rwa/busybox-1.13.3/ash script.sh $ /home/rwa/busybox-1.13.3/ash script.sh ^C (other term) $ ps ax|grep script.sh 3896 pts/7 S+ 0:00 /home/rwa/busybox-1.13.3/ash script.sh 3930 pts/7 S+ 0:00 /home/rwa/busybox-1.13.3/ash script.sh $ /home/rwa/busybox-1.13.3/ash script.sh ^C^C (other term) $ ps ax|grep test 3896 pts/7 S+ 0:00 /home/rwa/busybox-1.13.3/ash script.sh 3930 pts/7 S+ 0:00 /home/rwa/busybox-1.13.3/ash script.sh 4627 pts/7 S+ 0:00 /home/rwa/busybox-1.13.3/ash script.sh 4628 pts/7 S+ 0:00 /home/rwa/busybox-1.13.3/ash script.sh The script: func () { sleep 1 } while (true); do func done