| Summary: | sha1sum slow on x64 and possibly others | ||
|---|---|---|---|
| Product: | Busybox | Reporter: | Błażej Roszkowski <blazejroszkowski> |
| Component: | Other | Assignee: | unassigned |
| Status: | RESOLVED FIXED | ||
| Severity: | enhancement | CC: | blazejroszkowski, busybox-cvs |
| Priority: | P5 | ||
| Version: | unspecified | ||
| Target Milestone: | --- | ||
| Hardware: | All | ||
| OS: | All | ||
| Host: | Target: | ||
| Build: | |||
| Attachments: | dot config file from my build on Fedora VM | ||
|
Description
Błażej Roszkowski
2021-11-25 21:28:30 UTC
Unrolling only the outer loop does not help noticeably.
Fully unrolling all 80 iterations increases size by ~3600 bytes.
If you want, you can submit a patch where unrolling depends on a .config option, following this example in libbb/Config.src:
    config MD5_SMALL
        int "MD5: Trade bytes for speed (0:fast, 3:slow)"
        default 1  # all "fast or small" options default to small
        range 0 3
        help
        Trade binary size versus speed for the md5sum algorithm.
        Approximate values running uClibc and hashing
        linux-2.4.4.tar.bz2 were:
        value           user times (sec)   text size (386)
        0 (fastest)     1.1                6144
        1               1.4                5392
        2               3.0                5088
        3 (smallest)    5.1                4912
I'm not sure which loop is the 'outer' one. In the busybox code the outermost loop is for (i = 0; i < 4; i++). In my opinion unrolling it is also worth it (although fully unrolling all 80 stages gives a more drastic jump in both performance and size). I unrolled it in busybox messily (there is still one if with a goto in stage 1) by copy-pasting the entire loop body 4 times and deleting the parts not relevant to the given value of i in each copy.

Time for 1 GB went from 5.3 to 4.7 (I'm eyeballing it, but it's 100% noticeable even with random fluctuations of 0.1-0.3 between runs). The size of sha1_process_block64 went from 361 to 672. Executable size stayed at 976568. I'm guessing it's because the code is denser when generated per stage (constants inlined, no ifs and jumps).

Here's a commit I also made to test rolling my 80 steps back into 4 for loops of 20: https://github.com/FRex/blasha1/commit/c5a3e5d5d6d0e85f73934e2446fa56fcbc95adeb

A 1 GB file on the Fedora VM takes (in order: my code, my rolled code, busybox, busybox with the for (i = 0; i < 4; i++) loop unrolled): 2.4, 4, 5.3, 4.7

On Windows (I didn't test busybox with that loop unrolled): 2.4, 3.6, 5.5

Fixed in git, please test.

Yes, CONFIG_SHA1_SMALL 0 and 1 improve performance as expected. Thank you. |
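For readers who want to see the trade-off discussed in this thread, here is a minimal C sketch of a SHA-1 block function with two build-time variants: a compact single loop that picks the stage at run time, and four loops of 20 steps with each stage's constant and function inlined. This is not the busybox code or the fix that was committed. The names SHA1_SMALL_SKETCH, sha1_block_sketch and sha1_sketch are hypothetical, and the macro only imitates the idea of an MD5_SMALL-style option.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical knob modeled on MD5_SMALL: 0 = four loops of 20, 1 = one compact loop. */
#ifndef SHA1_SMALL_SKETCH
#define SHA1_SMALL_SKETCH 0
#endif

static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Process one 64-byte block, updating the five-word state h[]. */
static void sha1_block_sketch(uint32_t h[5], const uint8_t blk[64])
{
    uint32_t w[80], a, b, c, d, e, t;
    int i;

    /* Load 16 big-endian words, then expand them to 80. */
    for (i = 0; i < 16; i++)
        w[i] = (uint32_t)blk[4 * i] << 24 | (uint32_t)blk[4 * i + 1] << 16
             | (uint32_t)blk[4 * i + 2] << 8 | (uint32_t)blk[4 * i + 3];
    for (i = 16; i < 80; i++)
        w[i] = rotl32(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1);

    a = h[0]; b = h[1]; c = h[2]; d = h[3]; e = h[4];

/* One SHA-1 step; F is the stage function, K the stage constant. */
#define STEP(F, K) do { \
        t = rotl32(a, 5) + (F) + e + (K) + w[i]; \
        e = d; d = c; c = rotl32(b, 30); b = a; a = t; \
    } while (0)

#if SHA1_SMALL_SKETCH
    /* Compact variant: the stage is selected with branches on every step. */
    for (i = 0; i < 80; i++) {
        uint32_t f, k;
        if (i < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
        STEP(f, k);
    }
#else
    /* Four loops of 20: no per-step branches, constants inlined per stage. */
    for (i = 0; i < 20; i++) STEP((b & c) | (~b & d), 0x5A827999u);
    for (; i < 40; i++)      STEP(b ^ c ^ d, 0x6ED9EBA1u);
    for (; i < 60; i++)      STEP((b & c) | (b & d) | (c & d), 0x8F1BBCDCu);
    for (; i < 80; i++)      STEP(b ^ c ^ d, 0xCA62C1D6u);
#endif
#undef STEP

    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* Hash a whole in-memory message (msg must be a valid pointer) into out[20]. */
static void sha1_sketch(const uint8_t *msg, size_t len, uint8_t out[20])
{
    uint32_t h[5] = { 0x67452301u, 0xEFCDAB89u, 0x98BADCFEu,
                      0x10325476u, 0xC3D2E1F0u };
    uint8_t tail[128];
    uint64_t bits = (uint64_t)len * 8;
    size_t rem = len % 64;
    size_t tail_len = (rem < 56) ? 64 : 128; /* room for 0x80 and the length */
    size_t i;

    for (i = 0; i + 64 <= len; i += 64)
        sha1_block_sketch(h, msg + i);

    /* Padding: leftover bytes, 0x80, zeros, 64-bit big-endian bit count. */
    memset(tail, 0, sizeof(tail));
    memcpy(tail, msg + i, rem);
    tail[rem] = 0x80;
    for (i = 0; i < 8; i++)
        tail[tail_len - 1 - i] = (uint8_t)(bits >> (8 * i));
    for (i = 0; i < tail_len; i += 64)
        sha1_block_sketch(h, tail + i);

    for (i = 0; i < 20; i++)
        out[i] = (uint8_t)(h[i / 4] >> (24 - 8 * (i % 4)));
}
```

Building once with -DSHA1_SMALL_SKETCH=0 and once with -DSHA1_SMALL_SKETCH=1 and comparing both the timing and the size of sha1_block_sketch is one way to reproduce the kind of measurement described above. The actual numbers will depend on the compiler and its flags.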