Created attachment 9171 [details] dot config file from my build on Fedora VM sha1sum in BusyBox is over twice as slow as decently optimized C implementation. All tests were done on x64-86 (VMs and real OSes, Windws and Linux, two laptops). I don't know if this is a performance problem on other architectures (I'd guess it is). Test file (1 GiB): dd if=/dev/urandom of=gig.gig bs=1024 count=$((1024 * 1024)) GNU coreutils sha1sum (Git for Windows, no libcrypto use) and my own implementation take (on my personal laptop) 2.8 seconds. BusyBox takes 6.3 seconds. It's around 2x slower on my work laptop as well. This is present in at least versions: 1.34.1 (my own build on Fedora 34, .config attached), 1.34.1 in latest Alpine (3.15.0), 1.34.1 from Fedora 34 repos, 1.34.0 on Windows. I also remember it being present on Ubuntu 18.04 LTS and 20.04 LTS as well (busybox from the repos). My optimized plain C sha1 implementation (that I'm happy to contribute) is here: https://github.com/FRex/blasha1 The only downside I see from using optimized C version is potential increase in binary size, since optimized code is heavily unrolled (but I didn't investigate this increase). I've searched for "performance", "sha1", "sha1sum", and "speed" on this Bugzilla and found nothing about this. I understand it's an old semi-obsolete algorithm but if BusyBox provides this util, I assume it's better that it's faster rather than slower.
Unrolling outer loop only does not help noticeably. Full unrolling of all 80 iterations increases size by ~3600 bytes. If you want, you can submit a patch where unrolling is done depending on .config option, following this example in libbb/Config.src: config MD5_SMALL int "MD5: Trade bytes for speed (0:fast, 3:slow)" default 1 # all "fast or small" options default to small range 0 3 help Trade binary size versus speed for the md5sum algorithm. Approximate values running uClibc and hashing linux-2.4.4.tar.bz2 were: value user times (sec) text size (386) 0 (fastest) 1.1 6144 1 1.4 5392 2 3.0 5088 3 (smallest) 5.1 4912
I'm not sure which is 'outer' loop. In busybox code the outermost loop is for (i = 0; i < 4; i++). In my opinion unrolling it is also worth it (but unrolling full 80 stages is more drastic performance and size jump). I unrolled it in busybox messily (there still is one if with goto in stage 1) by copy pasting entire body of the loop 4 times and deleting parts not relevant to given i value for that copy. Time for 1 GB went from 5.3 to 4.7 (I'm eyeballing but it's 100% noticeable even with random fluctuations of 0.1-0.3 between runs) sha1_process_block64 size went from 361 to 672. Executable size stayed at 976568. I'm guessing it's because code is denser when generated per stage (constant inlined, no ifs and jumps). Here's a commit I also made to test rolling my 80 steps back into 4 for loops of 20: https://github.com/FRex/blasha1/commit/c5a3e5d5d6d0e85f73934e2446fa56fcbc95adeb 1 GB file on Fedora VM takes (in order: my code, my rolled code, busybox, busybox unrolled 4 for loops): 2.4, 4, 5.3, 4.7 On Windows (I didn't test unrolled 4 for loop busybox): 2.4, 3.6, 5.5
Fixed in git, please test.
Yes, CONFIG_SHA1_SMALL 0 and 1 improve performance as expected. Thank you.