Bug 8791 - sed : in substitution string \1 to \9 or & don't behave as per GNU & POSIX specification
Status: NEW
Alias: None
Product: Busybox
Classification: Unclassified
Component: Standard Compliance
Version: unspecified
Hardware: PC All
Importance: P5 normal
Target Milestone: ---
Assignee: unassigned
Reported: 2016-03-12 22:55 UTC by clu
Modified: 2016-04-21 17:52 UTC
Description clu 2016-03-12 22:55:52 UTC
correct me or redirect me if i'm wrong :
tested with BusyBox 1.24.1 under linux from http://tinycorelinux.net and
BusyBox v1.25.0-FRP-311-g23db916 on windows from https://frippery.org/busybox
Example with :
  $ cat myfile
  a
  b
  c
  $ sed "s/\(.*\)/\1\1/" myfile
  a
  b
  c
it should be :
  $ sed "s/\(.*\)/\1\1/" myfile
  aa
  bb
  cc
used once, \1 works as expected, but when repeated it does not.

thanks for your attention
clu
20160312
Comment 1 Ron Yorston 2016-03-14 17:28:56 UTC
This looks like a problem with line endings.  On Linux the only way I can get the observed behaviour is to convert the input file to have CRLF line endings.  The match then includes the CR. The output includes the first backreference, you just can't see it because of the CR.  Piping the output through od shows what's going on.  GNU sed on Linux exhibits the same behaviour.
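
A minimal sketch of the check described above, assuming a POSIX shell with printf, sed and od available. The file name myfile_crlf is hypothetical and used only for this demonstration:

```shell
# Hypothetical file name, used only for this demonstration.
printf 'a\r\nb\r\nc\r\n' > myfile_crlf

# .* also matches the trailing CR, so each output line is "a<CR>a<CR>".
doubled=$(sed 's/\(.*\)/\1\1/' myfile_crlf)

# On a terminal this looks like "a", "b", "c"; od -c makes the CRs visible.
printf '%s\n' "$doubled" | od -c

rm -f myfile_crlf
```

In the od output each CR appears as \r, once after each copy of the letter, which shows that both backreferences were written.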

Since CRLF is the normal line ending on Windows one could argue that including the CR as part of the match there is incorrect.  I've raised an issue on GitHub:

   https://github.com/rmyorston/busybox-w32/issues/52
Comment 2 Ron Yorston 2016-03-16 17:51:27 UTC
I've applied a patch to busybox-w32.  I don't think any change to BusyBox itself is required.
Comment 3 Mike Frysinger 2016-03-16 18:25:39 UTC
have you thought about merging busybox-w32 into mainline busybox ?
Comment 4 Ron Yorston 2016-03-16 21:09:03 UTC
Sometimes, but I soon convince myself it wouldn't be any fun.
Comment 5 clu 2016-03-17 03:25:28 UTC
I wrongly replied to the email notice and thus got no reply until the notices about your answers arrived. What i wrote there follows, without quoting the previous comments :
============================================================
I hope this is the correct way to reply :
 1. replying to the email i received, w/o going back to the bugzilla page (from here i can only access emails).
 2. writing my answer after yours in chronological order.
 tell me which is correct if anything is wrong.
Now, about an issue on https://github.com/rmyorston/busybox-w32 :
 1. isn't it restricting the pb to windows when streams with CRLF line endings can also be fed to sed under unix?
 2. as my example was simple it actually wasn't generic at all since i only used the \(.*\) pattern as reference ( which alone stands for ^\(.*\)$ ) and, indeed, goes till the end of line. But the problem also occurs with patterns referencing inner parts of the streamed line :
 say, something like xy\(.*\)zt which is closer to the actual cases i noticed the pb with.
 3. does that mean it will impact how the buffers are managed (via N, H & co.), in case i want to handle multiline patterns ?
 
thank you for your attention
clu
20160315
Comment 6 Ron Yorston 2016-03-17 09:45:53 UTC
>Now, about an issue on https://github.com/rmyorston/busybox-w32 :
> 1. isn't it restricting the pb to windows when streams with CRLF line
> endings can also be fed to sed under unix?

Microsoft Windows and Unix treat DOS-format files differently.  In the former the CRLF is a line terminator; in the latter the LF is a line terminator and CR is part of the line.  busybox-w32 should have used the platform convention for line terminators and now it does.  Processing DOS-format files with Unix tools may require the file to be converted to Unix conventions first.  GNU sed and BusyBox sed are consistent in their handling of DOS-format files:  they both treat the CR as part of the line.  I think this is the correct way for Unix tools to behave.
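
As a rough illustration of converting to Unix conventions first, here is a sketch that strips the CRs before sed sees the data. It assumes tr as a stand-in for a dedicated converter; note that tr -d '\r' deletes every CR in the stream, not only those directly before a LF:

```shell
# Assumption: tr -d '\r' stands in for a DOS-to-Unix converter here.
# It removes every CR, not only the ones that precede a LF.
converted=$(printf 'a\r\nb\r\nc\r\n' | tr -d '\r' | sed 's/\(.*\)/\1\1/')

# Each line is now doubled visibly because no CR is left in the match.
printf '%s\n' "$converted"
```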

>2. as my example was simple it actually wasn't generic at all since i only
>used the \(.*\) pattern as reference ( which alone stands for ^\(.*\)$ )
>and, indeed, goes till the end of line. But the problem also occurs with
>patterns referencing inner parts of the streamed line :
> say, something like xy\(.*\)zt which is closer to the actual cases i
> noticed the pb with.

Can you provide an example of the problem?  I tried the following with BusyBox sed, busybox-w32 sed and GNU sed with DOS-format and Unix-format files.  In each case the result is the same and consistent with what I'd expect.

$ cat myfile
xyazt
xybzt
xyczt
$ sed 's/xy\(.*\)zt/\1\1/' myfile
aa
bb
cc
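
A small sketch of why the line ending does not matter for this pattern, assuming a Unix sed that treats the CR as part of the line: the CR sits after "zt", outside the matched text, so it is carried over unchanged rather than duplicated. The variable names are illustrative only.

```shell
# Unix-format input: the whole line is matched and replaced.
unix_out=$(printf 'xyazt\n' | sed 's/xy\(.*\)zt/\1\1/')

# DOS-format input: the CR follows "zt", so it is outside the match
# and is simply left at the end of the line.
dos_out=$(printf 'xyazt\r\n' | sed 's/xy\(.*\)zt/\1\1/')

printf '%s\n' "$unix_out" | od -c
printf '%s\n' "$dos_out" | od -c
```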
Comment 7 clu 2016-03-17 17:14:21 UTC
indeed, your working example shows that there is no impact within the line, contrary to my first memory of it. So, when dealing with files originating from a microsoft platform, dos2unix it is! (sorry for the trouble here).
And i'd better do it at the source to avoid messing within my scripts.
In the present case i used 7zip to rename files within archives, and 7z was delivering those CRLF EOLs, so before going any further i'll have to do, for instance :
$ 7z l arc.zip | dos2unix | sed -e 'my scripts' ...

BTW, since i did now :
$ cat myfile | dos2unix | sed 's/\(.*\)/\1\1/'
aa
bb
cc
giving the expected result, what did the patch consist of? and, do i need it?
If so... as a selfish request from an incompetent coder, could you, please,
provide the resulting binary in the repository? 
Danke schön so much :-)

Antonio
20160317
Comment 8 Ron Yorston 2016-03-17 17:42:16 UTC
The patch just removes the CR as well as the LF from the end of line before the sed patterns are applied.

I've built a new binary of busybox-w32.  It's available from my website now.  With this version it shouldn't be necessary to use dos2unix.
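
For scripts that must also run where sed keeps the CR, the effect described above can be approximated inside the sed script itself. This is only a sketch of the idea, not the busybox-w32 patch; the variable name cr is an assumption made for illustration:

```shell
# Hold a literal CR in a variable so the script does not rely on
# a \r escape being understood inside the sed regex.
cr=$(printf '\r')

# First expression drops one trailing CR, second is the original substitution.
fixed=$(printf 'a\r\nb\r\nc\r\n' | sed -e "s/$cr\$//" -e 's/\(.*\)/\1\1/')

printf '%s\n' "$fixed"
```

Unlike deleting every CR in the stream, this only touches a CR at the very end of a line.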
Comment 9 clu 2016-03-17 20:20:15 UTC
thank you, Ron, for the binary : it does as advertised...
but i will stick with dos2unix whenever possible to make the scripts more robust across platforms (and it does no harm when applied to unix style files, so...).
Once more, thanks for the lesson (hoping i won't fall again into the trap of the CR in the M$ style of EOL): a reminder that there is the tool and its characteristics, but also the data themselves, which inherit their source's features.

Antonio
20160317