correct me or redirect me if i'm wrong :
tested with BusyBox 1.24.1 under linux from http://tinycorelinux.net
and BusyBox v1.25.0-FRP-311-g23db916 on windows from https://frippery.org/busybox

Example with :
$ cat myfile
a
b
c
$ sed "s/\(.*\)/\1\1/" myfile
a
b
c

it should be :
$ sed "s/\(.*\)/\1\1/" myfile
aa
bb
cc

used once \1 worked as expected but when repeated it does not.
thanks for your attention
clu 20160312
This looks like a problem with line endings. On Linux the only way I can get the observed behaviour is to convert the input file to have CRLF line endings. The match then includes the CR. The output includes the first backreference, you just can't see it because of the CR. Piping the output through od shows what's going on. GNU sed on Linux exhibits the same behaviour. Since CRLF is the normal line ending on Windows one could argue that including the CR as part of the match there is incorrect. I've raised an issue on GitHub: https://github.com/rmyorston/busybox-w32/issues/52
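A minimal sketch of the od check mentioned above, assuming a POSIX shell with printf, sed and od on a Unix-like system. The sample data mirrors the reporter's myfile but with CRLF line endings, and the variable name is only for illustration.

```shell
# Feed three CRLF-terminated lines to sed and double each line.
# On a Unix sed, .* also matches the trailing CR, so the CR is doubled too.
doubled=$(printf 'a\r\nb\r\nc\r\n' | sed 's/\(.*\)/\1\1/' | od -An -c)

# od -c prints the CR as \r, making the hidden first copy visible.
printf '%s\n' "$doubled"
```

On a Unix sed each line should show up as the letter, a \r, the letter again, and another \r before the \n. On a terminal the first copy is overwritten after the carriage return, which is why only a single letter appears.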
I've applied a patch to busybox-w32. I don't think any change to BusyBox itself is required.
have you thought about merging busybox-w32 into mainline busybox ?
Sometimes, but I soon convince myself it wouldn't be any fun.
I wrongly replied to the email notice and thus got no reply until the notices about your answers arrived. Here is what I wrote there, without quoting previous comments:
============================================================
I hope this is the correct way to reply :
1. replying to the email I received (without going back to the bugzilla page; from here I can only access emails).
2. writing my answer after yours, in chronological order.
Tell me which is correct if anything is wrong.

Now, about an issue on https://github.com/rmyorston/busybox-w32 :
1. isn't it restricting the pb to windows when streams with CRLF line endings can also be fed to sed under unix?
2. as my example was simple it actually wasn't generic at all since i only used the \(.*\) pattern as reference ( which alone stands for ^\(.*\)$ ) and, indeed, goes till the end of line. But the problem also occurs with patterns referencing inner parts of the streamed line :
say, something like xy\(.*\)zt which is closer to the actual cases i noticed the pb with.
3. does that mean it will impact how the buffers are managed (via N, H & co.), in case I want to handle multiline patterns?

thank you for your attention
clu 20160315
>Now, about an issue on https://github.com/rmyorston/busybox-w32 :
> 1. isn't it restricting the pb to windows when streams with CRLF line
> endings can also be fed to sed under unix?

Microsoft Windows and Unix treat DOS-format files differently. In the former the CRLF is a line terminator; in the latter the LF is a line terminator and CR is part of the line. busybox-w32 should have used the platform convention for line terminators and now it does.

Processing DOS-format files with Unix tools may require the file to be converted to Unix conventions first. GNU sed and BusyBox sed are consistent in their handling of DOS-format files: they both treat the CR as part of the line. I think this is the correct way for Unix tools to behave.

>2. as my example was simple it actually wasn't generic at all since i only
>used the \(.*\) pattern as reference ( which alone stands for ^\(.*\)$ )
>and, indeed, goes till the end of line. But the problem also occurs with
>patterns referencing inner parts of the streamed line :
> say, something like xy\(.*\)zt which is closer to the actual cases i
> noticed the pb with.

Can you provide an example of the problem? I tried the following with BusyBox sed, busybox-w32 sed and GNU sed with DOS-format and Unix-format files. In each case the result is the same and consistent with what I'd expect.

$ cat myfile
xyazt
xybzt
xyczt
$ sed 's/xy\(.*\)zt/\1\1/' myfile
aa
bb
cc
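One way to see why the inner pattern behaves the same on both formats is to dump the bytes. This is a sketch under the assumption of a Unix sed and a POSIX shell; the variable names are illustrative. With DOS-format input the CR sits after zt, outside the match, so it is left alone and appears only once per line.

```shell
# Same substitution on Unix-format and DOS-format input, dumped with od -c.
unix_out=$(printf 'xyazt\nxybzt\n' | sed 's/xy\(.*\)zt/\1\1/' | od -An -c)
dos_out=$(printf 'xyazt\r\nxybzt\r\n' | sed 's/xy\(.*\)zt/\1\1/' | od -An -c)

# In the DOS case a single \r should remain before each \n,
# after the doubled letter rather than between the two copies.
printf 'unix:%s\ndos: %s\n' "$unix_out" "$dos_out"
```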
indeed, your working example shows that there is no impact within the line, contrary to my first memory of it.
So, when dealing with files originating from a Microsoft platform, dos2unix it is! (sorry for the trouble here). And I had better do it at the source to avoid messing around within my scripts.
In the present case I dealt with 7zip to rename files within archives, and 7z was delivering those CRLF EOLs.
I'll have to, for instance, do :
$ 7z l arc.zip | dos2unix | sed -e 'my scripts' ...
before going any further.

BTW, since I now did :
$ cat myfile | dos2unix | sed 's/\(.*\)/\1\1/'
aa
bb
cc
giving the expected result, what did the patch consist of? and do I need it?
If so... as a selfish request from an incompetent coder, could you please provide the resulting binary in the repository?

Danke schön so much :-)
Antonio 20160317
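Where dos2unix is not installed, a rough stand-in is to delete the CRs with tr before sed sees the data. This is only a sketch: strip_cr is a hypothetical helper name, and unlike dos2unix it removes every CR in the stream, not just those at the end of a line.

```shell
# strip_cr: hypothetical helper; deletes all CR bytes from stdin.
strip_cr() {
    tr -d '\r'
}

# CRLF input, cleaned first, then doubled by sed.
result=$(printf 'a\r\nb\r\nc\r\n' | strip_cr | sed 's/\(.*\)/\1\1/')
printf '%s\n' "$result"
```

Applied to input that is already in Unix format the filter changes nothing, so it can sit in a pipeline regardless of where the data came from.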
The patch just removes the CR as well as the LF from the end of line before the sed patterns are applied. I've built a new binary of busybox-w32. It's available from my website now. With this version it shouldn't be necessary to use dos2unix.
thank you, Ron, for the binary : it does as advertised... but i will stick with dos2unix whenever possible to make the scripts more robust across platforms (however no harm when applied to unix style files so...). Once more, thanks for the lesson (hoping i won't fall again in the trap of CR in the M$ style of EOL) reminding that there is the tool and its characteristics but also the data themselves, inheriting their source's features. Antonio 20160317