| Summary: | ntpd: fails to sync if bad servers first in list | ||
|---|---|---|---|
| Product: | Busybox | Reporter: | Karl Palsson <karlp> |
| Component: | Networking | Assignee: | unassigned |
| Status: | RESOLVED FIXED | ||
| Severity: | normal | CC: | busybox-cvs |
| Priority: | P5 | ||
| Version: | 1.27.x | ||
| Target Milestone: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Host: | Target: | ||
| Build: | |||
| Attachments: |
failing case, good server last in list
good case, good server at start of list
busybox config
still failing after patching
with timestamps and extra debug
post-patch, only dnsnamed hosts
comparison log from 1.23.2
final log, all fixed and working :) |
||
Created attachment 7336 [details]
good case, good server at start of list
Created attachment 7341 [details]
busybox config
Indeed, DNS resolution delay...
Fixed in git this way:
--- a/networking/ntpd.c
+++ b/networking/ntpd.c
@@ -866,10 +866,8 @@ do_sendto(int fd,
 static void
 send_query_to_peer(peer_t *p)
 {
-	if (!p->p_lsa) {
-		if (!resolve_peer_hostname(p))
-			return;
-	}
+	if (!p->p_lsa)
+		return;
 	/* Why do we need to bind()?
 	 * See what happens when we don't bind:
@@ -2360,6 +2358,14 @@ int ntpd_main(int argc UNUSED_PARAM, char **argv)
 		int nfds, timeout;
 		double nextaction;
+		/* Resolve peer names to IPs, if not resolved yet */
+		for (item = G.ntp_peers; item != NULL; item = item->link) {
+			peer_t *p = (peer_t *) item->data;
+
+			if (p->next_action_time <= G.cur_time && !p->p_lsa)
+				resolve_peer_hostname(p);
+		}
+
 		/* Nothing between here and poll() blocks for any significant time */
 		nextaction = G.cur_time + 3600;
This seems to improve things, but it doesn't fix them for me. New log from patched version attached.
Created attachment 7351 [details]
still failing after patching
Seems to be "better", but it still gets very blocked up and never sends out stratum/step events. It still behaves quite badly with more bad servers, and also with bad servers specified by IP that are unreachable.
Can you run ntpd with more verbosity (-ddd) and timestamp the output? Say, via
CMD 2>&1 | while read line; do echo "`date +%H:%M:%S.%N` $line"; done
Created attachment 7356 [details]
with timestamps and extra debug
ntpd is correctly stepping time at least, but that's all we get. We still never consider ourselves in sync (it neither responds to ntpdate -q queries, nor sends out stratum change events).
Sorry I didn't have more debug enabled earlier, it wasn't obvious (to me) from the help text that I could provide multiple -d flags.
Created attachment 7361 [details]
post-patch, only dnsnamed hosts
Similar to https://bugs.busybox.net/attachment.cgi?id=7356&action=edit, just for comparison. This log doesn't have any bad ip addresses, only names. In this run, it didn't even manage to step the clock.
Created attachment 7366 [details]
comparison log from 1.23.2
Just a comparison log from an older busybox. This is from before the DNS resolution changes and their fixes came in. Step and sync achieved in ~60 seconds.
I can't reproduce it here (my DNS server caches failures to resolve, so 10-second delay does not happen). Thanks for your debugging. I reworked the code a bit more. Please try current git.
Created attachment 7371 [details]
final log, all fixed and working :)
Lovely, all smooth now.
Steps and syncs nice and fast. I also added in more bad names, and some bad IPs and it handles it all nicely now.
|
Created attachment 7331 [details]
failing case, good server last in list
Running 1.27.2 on LEDE (OpenWrt), specifically to pick up all the recent ntpd fixes.
If there is a good server at the _start_ of the list, all bad servers are successfully ignored. If the first server in the list is bad, however, ntpd will send queries and receive replies, but never step time, nor issue a stratum event indicating that sync has been achieved.
Logs (with annotations) from both cases are attached separately.
It's important to note that this cannot be tested with typical "bad" hosts like bad.example.org. Those are explicitly recognised and return a DNS failure "immediately". You need a good, proper DNS timeout.
It seems that it sends the query to the good server, then, instead of processing the reply, it goes and tries DNS resolution again for the bad servers, before finally processing the (now old) reply, and discarding? it.
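Since the report depends on names that take a real DNS timeout rather than failing at once, it may help to measure how long a lookup blocks before using a name as a "bad" server. The following is a minimal sketch, not busybox code: `time_resolve` is a hypothetical helper that wraps the standard blocking `getaddrinfo()` call and reports the elapsed wall time. The port string "123" (NTP) and the UDP socket type are assumptions chosen to match the ntpd use case.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper (not part of busybox): resolve `host` with the
 * blocking getaddrinfo() call and store how long the call took, in
 * seconds, in *elapsed_sec.
 * Returns getaddrinfo()'s result: 0 on success, an EAI_* code on failure. */
static int time_resolve(const char *host, double *elapsed_sec)
{
	struct addrinfo hints = {0};
	struct addrinfo *res = NULL;
	struct timespec t0, t1;
	int rc;

	assert(host != NULL && elapsed_sec != NULL);

	hints.ai_family = AF_UNSPEC;      /* accept IPv4 or IPv6 */
	hints.ai_socktype = SOCK_DGRAM;   /* NTP runs over UDP */

	/* A monotonic clock is used so that a clock step made by ntpd
	 * itself cannot distort the measurement. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	rc = getaddrinfo(host, "123", &hints, &res);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (rc == 0)
		freeaddrinfo(res);

	*elapsed_sec = (double)(t1.tv_sec - t0.tv_sec)
	             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
	return rc;
}
```

A name that returns a non-zero code within a few milliseconds is of the "immediately failing" kind and would not be expected to reproduce the problem; one that fails only after several seconds is the kind of server the failing case needs.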