Bug 13856 - Wrong PID written by start-stop-daemon -S -b -m -p
Summary: Wrong PID written by start-stop-daemon -S -b -m -p
Status: NEW
Alias: None
Product: Busybox
Classification: Unclassified
Component: Other (show other bugs)
Version: 1.33.x
Hardware: Other Linux
: P5 normal
Target Milestone: ---
Assignee: unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-06-15 16:32 UTC by mwadsten
Modified: 2021-06-20 11:20 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:


Attachments
.config file used in build (31.47 KB, application/octet-stream)
2021-06-15 16:32 UTC, mwadsten
Details

Note You need to log in before you can comment on or make changes to this bug.
Description mwadsten 2021-06-15 16:32:38 UTC
Created attachment 9001 [details]
.config file used in build

After updating an embedded Linux system (using a Linux kernel 2.6.35) from busybox 1.20.2 to 1.33.1, I found that certain init scripts using start-stop-daemon were causing messages like "start-stop-daemon: warning: killing process XXX: No such process" to be printed to the console.

My investigation found that the PID file created when using start-stop-daemon -S -b -m -p PIDFILE has the wrong PID. The PID inside the file would be 2 less than the correct PID.

I am able to demonstrate the issue with the following:

    /tmp # cat ssd-test.sh
    #!/bin/ash

    exec >> /tmp/ssd-test.txt
    echo ssd-test: PID is $$

    sleep 5

    exit 0
    /tmp # start-stop-daemon -S -q -b -m -p /tmp/test.pid -x /tmp/ssd-test.sh ; ps ax|grep ash ; sleep 1; tail /tmp/ssd-test.txt ; cat /tmp/test.pid
     2478 root      0:00 {ssd-test.sh} /bin/ash /tmp/ssd-test.sh
     2480 root      0:00 grep ash
    ssd-test: PID is 2478
    2476
    /tmp #

As you can see, the process has PID 2478, but the PID file says 2476.

I determined that the changes in this commit (specifically switching to xvfork on all systems) is the cause: https://git.busybox.net/busybox/commit/?id=088fec36fedff2cd50437c95b7fb430abf8d303c . Changing the uses of xvfork in debianutils/start-stop-daemon.c to xfork resolves the issue.

This might be a compatibility issue with the 2.6.35 kernel, or whatever version of glibc is present on this platform. I am unable to reproduce this with a host build of the same code, running under WSL2 (Ubuntu 20.04; Linux 5.4.72). I had a coworker try my reproduction on another, newer Linux-based system (kernel 5.x?), and we were unable to reproduce it there either. (That system also uses musl instead of glibc.)

In any case, it is technically incorrect to use vfork the way that debianutils/start-stop-daemon.c does, because there is no defined guarantee that the result of getpid() in the grandchild process is correct.
Comment 1 Denys Vlasenko 2021-06-16 09:54:13 UTC
(In reply to mwadsten from comment #0)
> In any case, it is technically incorrect to use vfork the way
> that debianutils/start-stop-daemon.c does, because there is no
> defined guarantee that the result of getpid() in the grandchild process is correct.

Can you expand on this? Where does it say that getpid() after vfork() is not guaranteed to return correct value?
Comment 2 Denys Vlasenko 2021-06-16 10:19:40 UTC
(In reply to mwadsten from comment #0)
> This might be a compatibility issue with the 2.6.35 kernel, or whatever
> version of glibc is present on this platform. I am unable to reproduce
> this with a host build of the same code, running under WSL2
> (Ubuntu 20.04; Linux 5.4.72). I had a coworker try my reproduction on another,
> newer Linux-based system (kernel 5.x?), and we were unable to reproduce
> it there either. (That system also uses musl instead of glibc.)

I vaguely remember that glibc used to have a pid caching (mis)feature: to save a few microseconds, they were remembering result of first getpid() call and avoided a syscall if getpid() was called again.

Try
strace busybox lsof 2>&1 | grep getpid
to see whether "pid caching" is happening. If yes, you'll see only one getpid() call.
Comment 3 mwadsten 2021-06-16 14:38:07 UTC
Looks like you remembered correctly - glibc 2.25 (released 2017-02-05) removed the getpid caching behavior. https://sourceware.org/glibc/wiki/Release/2.25#pid_cache_removal

Do you know of any other recent busybox updates which rely on newer glibc versions (or other parts of the system) that we should be aware of? I don't see any callouts like "busybox now requires glibc 1.x+" on busybox.net.
Comment 4 Denys Vlasenko 2021-06-16 20:55:38 UTC
(In reply to mwadsten from comment #3)
> Do you know of any other recent busybox updates which rely on newer glibc versions (or other parts of the system) that we should be aware of?

None. Busybox should work okay with pid cache as well - it does not use direct clone syscall or something like that. I assume you see a glibc bug wrt pid cache (maybe it's not invalidated on vfork).
Comment 5 mwadsten 2021-06-17 15:09:22 UTC
strace busybox lsof 2>&1 | grep getpid
is actually giving an empty response on this system. I did it without grep and I definitely don't see any getpid calls.

I also don't see getpid if I do strace -F start-stop-daemon ... (not even from the resolution of $$). Maybe that IS a sign that this glibc is caching the getpid result (and not reaching a syscall). If I'm reading the features.h header correctly, it's glibc 2.12.

Given that the glibc 2.25 release notes explicitly say "it was deemed safer to remove the cache than to potentially return a wrong answer" in regards to removing the PID cache, I assume that is the root cause here.

In that case, would you agree that we can characterize this as a busybox compatibility issue with glibc < 2.25?
Comment 6 Denys Vlasenko 2021-06-20 11:20:49 UTC
This is rather weird. I suggest you dig deeper, like trying

int main() { return getpid(); }

under strace and gdb.