Bug 10466 - ntpd: fails to sync if bad servers first in list
Summary: ntpd: fails to sync if bad servers first in list
Status: RESOLVED FIXED
Alias: None
Product: Busybox
Classification: Unclassified
Component: Networking (show other bugs)
Version: 1.27.x
Hardware: All Linux
: P5 normal
Target Milestone: ---
Assignee: unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-27 16:48 UTC by Karl Palsson
Modified: 2017-10-31 13:34 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:


Attachments
failing case, good server last in list (2.57 KB, text/x-log)
2017-10-27 16:48 UTC, Karl Palsson
Details
good case, good server at start of list (2.83 KB, text/x-log)
2017-10-27 16:49 UTC, Karl Palsson
Details
busybox config (32.00 KB, text/plain)
2017-10-27 16:49 UTC, Karl Palsson
Details
still failing after patching (6.38 KB, text/x-log)
2017-10-30 17:29 UTC, Karl Palsson
Details
with timestamps and extra debug (8.32 KB, text/x-log)
2017-10-31 10:11 UTC, Karl Palsson
Details
post-patch, only dnsnamed hosts (10.78 KB, text/x-log)
2017-10-31 10:23 UTC, Karl Palsson
Details
comparison log from 1.23.2 (5.72 KB, text/x-log)
2017-10-31 10:38 UTC, Karl Palsson
Details
final log, all fixed and working :) (4.62 KB, text/x-log)
2017-10-31 13:34 UTC, Karl Palsson
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Karl Palsson 2017-10-27 16:48:33 UTC
Created attachment 7331 [details]
failing case, good server last in list

Running 1.27.2 on LEDE(OpenWrt) specifically to pick up all the recent ntpd fixes.

If there is a good server at the _start_ of the list, all bad servers are successfully ignored.  If the first server in the list is bad however, ntpd will send and receive replies, but never step time, nor issue a stratum event indicated sync has been achieved.

Logs (with annotations) from both cases attached separately.

It's important to note that this cannot be tested with typical "bad" hosts like bad.example.org.  Those are explicitly recognised and return a dns failure "immediately"  You need a good proper dns timeout.

It seems that it sends the query to the good server, then, instead of processing the reply, it goes and tries dns resolution again for the bad servers, before finally processing the (now old) reply, and discarding? it.
Comment 1 Karl Palsson 2017-10-27 16:49:01 UTC
Created attachment 7336 [details]
good case, good server at start of list
Comment 2 Karl Palsson 2017-10-27 16:49:56 UTC
Created attachment 7341 [details]
busybox config
Comment 3 Denys Vlasenko 2017-10-28 16:04:47 UTC
Indeed, DNS resolution delay...
Fixed in git this way:

--- a/networking/ntpd.c
+++ b/networking/ntpd.c
@@ -866,10 +866,8 @@ do_sendto(int fd,
 static void
 send_query_to_peer(peer_t *p)
 {
-	if (!p->p_lsa) {
-		if (!resolve_peer_hostname(p))
-			return;
-	}
+	if (!p->p_lsa)
+		return;
 
 	/* Why do we need to bind()?
 	 * See what happens when we don't bind:
@@ -2360,6 +2358,14 @@ int ntpd_main(int argc UNUSED_PARAM, char **argv)
 		int nfds, timeout;
 		double nextaction;
 
+		/* Resolve peer names to IPs, if not resolved yet */
+		for (item = G.ntp_peers; item != NULL; item = item->link) {
+			peer_t *p = (peer_t *) item->data;
+
+			if (p->next_action_time <= G.cur_time && !p->p_lsa)
+				resolve_peer_hostname(p);
+		}
+
 		/* Nothing between here and poll() blocks for any significant time */
 
 		nextaction = G.cur_time + 3600;
Comment 4 Karl Palsson 2017-10-30 16:17:28 UTC
This seems to improve things, but it doesn't fix them for me.

New log from patched version attached
Comment 5 Karl Palsson 2017-10-30 17:29:29 UTC
Created attachment 7351 [details]
still failing after patching

Seems to be "better" but still gets very blocked up and never sends out stratum/step events.  Still behaves quite badly with more bad servers, and also bad servers specified by IP that are unreachable
Comment 6 Denys Vlasenko 2017-10-30 20:02:00 UTC
Can you run ntpd with more verbosity (-ddd) and timestamp the output?
Say, via
CMD 2>&1 | while read line; do echo "`date +%H:%M:%S.%N` $line"; done
Comment 7 Karl Palsson 2017-10-31 10:11:22 UTC
Created attachment 7356 [details]
with timestamps and extra debug

ntpd is correctly stepping time at least, but that's all we get.  We still never consider ourselves in sync. (responds to ntpdate -q queries, nor sends out stratum change events)

Sorry I didn't have more debug enabled earlier, it wasn't obvious (to me) from the help text that I could provide multiple -d flags.
Comment 8 Karl Palsson 2017-10-31 10:23:07 UTC
Created attachment 7361 [details]
post-patch, only dnsnamed hosts

Similar to https://bugs.busybox.net/attachment.cgi?id=7356&action=edit
, just for comparison.  This log doesn't have any bad ip addresses, only names.  In this run, it didn't even manage to step the clock.
Comment 9 Karl Palsson 2017-10-31 10:38:28 UTC
Created attachment 7366 [details]
comparison log from 1.23.2

Just comparison log from older busybox.  This is from before the dns resolution changes and their fixes came in.  step and sync achived in ~60 seconds
Comment 10 Denys Vlasenko 2017-10-31 11:49:26 UTC
I can't reproduce it here (my DNS server caches failures to resolve, so 10-second delay does not happen). Thanks for your debugging.

I reworked the code a bit more. Please try current git.
Comment 11 Karl Palsson 2017-10-31 13:34:44 UTC
Created attachment 7371 [details]
final log, all fixed and working :)

Lovely, all smooth now.

Steps and syncs nice and fast.   I also added in more bad names, and some bad IPs and it handles it all nicely now.