Bug 14541 - sed: s-command with "semi-special" delimiters get wrong behaviour
Summary: sed: s-command with "semi-special" delimiters get wrong behaviour
Status: RESOLVED FIXED
Alias: None
Product: Busybox
Classification: Unclassified
Component: Standard Compliance (show other bugs)
Version: 1.30.x
Hardware: All All
: P5 major
Target Milestone: ---
Assignee: unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-01-21 16:06 UTC by Christoph Anton Mitterer
Modified: 2022-02-01 20:16 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Christoph Anton Mitterer 2022-01-21 16:06:09 UTC
Hey.

I recently looked into behaviour of sed implementations when unusual delimiters were used in:
- context addresses
- s-command


The outcome was, that in my opinion POSIX itself is pretty ambiguous with respect to how it defines these and their parsing and semantics.
See more about that here:
- https://www.austingroupbugs.net/view.php?id=1550
  (not so interesting, just minor clarification proposals)
- https://www.austingroupbugs.net/view.php?id=1551
- https://www.austingroupbugs.net/view.php?id=1556

Especially issue 1551 gives several example where BusyBox' sed behaves considerably different from GNU's sed (which by itself is however not necessarily a bug, especially when even POSIX seems ambiguous).

But there is one case at least, where I think BusyBox' sed is definitely wrong

It's described in detail in:
- https://austingroupbugs.net/view.php?id=1551        (the "main" report and there point (3) )
  https://austingroupbugs.net/view.php?id=1551#c5611  (some additions to point "(3)")

and a tabular overview given in:
- https://austingroupbugs.net/view.php?id=1551#c5612




In short (using BREs as example):

Consider a delimiter is used, that is by itself not a special character (again BREs as example) like  (the non-standard) '+' which is however special when preceded by a '\' (that is itself not quoted).
Same example would work with the (standard) '(', though with different effects.

In BusyBox sed, the following happens:
   $ printf '%s\n' '9+' | busybox sed 's+9\++X+'
   X+
   $ printf '%s\n' '99+' | busybox sed 's+9\++X+'
   X+
   $ printf '%s\n' '999+' | busybox sed 's+9\++X+'
   X+

In GNU sed:
   $ printf '%s\n' '9+' | sed 's+9\++X+'
   X
   $ printf '%s\n' '99+' | sed 's+9\++X+'
   9X
   $ printf '%s\n' '999+' | sed 's+9\++X+'
   99X


In BREs, '+' alone is never special, and the '\+' here is clearly the escape of the delimiter.
Regardless of what POSIX actually means with "literal" (see the discussion in the tickets above), ... there is IMO no way to interpret it like what BusyBox seems to do, which is:

- "un-delimiter" the '+' (because of it's preceding '\' )
- still keep the '\' with respect to the RE and give special meaning to the '+'

That's like "doubling" the effect of the '\'.




The https://austingroupbugs.net/view.php?id=1551#c5611 mentioned above, describes how dangerous this actually is.
Cause even if one uses delimiters, which don't seem "special" in any way, like with:
   $ printf '%s\n' '9' | busybox sed 'sw9\wwXw'
   9
   $ printf '%s\n' '99' | busybox sed 'sw9\wwXw'
   X
   $ printf '%s\n' '999' | busybox sed 'sw9\wwXw'
   X9

BusyBox behaviour get's odd, when that normal character has some special meaning, when preceded by a '\' and when that sequence has than some (non-standard) special meaning (like '\w').

GNU sed handles these like:
   $ printf '%s\n' '99' | sed 'sw9\wwXw'
   99
   $ printf '%s\n' '9w' | sed 'sw9\wwXw'
   X
so it effectively takes the \w ... makes it a non-delimiter (and removes the escaping), being just left with the character 'w' which is by itself literal.



I'd guess that all this might also apply to context addresses in BusyBox sed. Not sure whether EREs are also affect in some weird way.



Thanks,
Chris.
Comment 1 Denys Vlasenko 2022-01-23 17:52:15 UTC
Fixed in git.
Comment 2 Christoph Anton Mitterer 2022-01-24 14:49:14 UTC
Just for confirmation:

- Since the tests are only about s-commands, does your commit also fix the context addresses?

- The test you added for \& handling was just about adding the missing test, right? The behaviour itself already worked before?!

- I assume these functions were just for BRE, right?!



- Does that commit also "fix" (well POSIX is ambiguous, IMO, so its actually more a "align with GNU sed") the following two:

a) \1 in replacement:
Consider:
s1\(x\)1\11

which, depending on what "literal" means, could be either effectively:
s/\(x\)/\1/
(BusyBox sed seems to do this:
   $ printf '%s\n' 'oxo' | busybox sed 's1\(x\)1\11'
   oxo
   $ printf '%s\n' 'owo' | busybox sed 's1\(x\)1\11'
   owo
)

or:

s/\(x\)/1/
(GNU sed seems to do this:
   $ printf '%s\n' 'oxo' | sed 's1\(x\)1\11'
   o1o
   $ printf '%s\n' 'owo' | sed 's1\(x\)1\11'
   owo
) 





b) \1 in BRE (but not ERE, where \1 isn't defined for the RE part)
Consider:
s1\(xo\)\11X1

which, depending on what "literal" means, could be either effectively:
s/\(xo\)\1/X/
(BusyBox sed seems to do this:
   $ printf '%s\n' 'xoxo' | busybox sed 's1\(xo\)\11X1'
   X
   $ printf '%s\n' 'xo1' | busybox sed 's1\(xo\)\11X1'
   xo1
)

or:

s/\(xo\)1/X/
(GNU sed seems to do this:
   $ printf '%s\n' 'xoxo' | sed 's1\(xo\)\11X1'
   xoxo
   $ printf '%s\n' 'xo1' | sed 's1\(xo\)\11X1'
   X
)
Comment 3 Christoph Anton Mitterer 2022-01-24 15:37:28 UTC
I just went through my list of test cases in https://www.austingroupbugs.net/view.php?id=1551#c5612 :


It seems busybox now (with the patch) behaves as GNU, except or one case:

GNU:
$ printf '%s\n' 'oxo' | sed 's1\(x\)1\11'
o1o
$ printf '%s\n' 'owo' | sed 's1\(x\)1\11'
owo


BusyBox with patch:
$ printf '%s\n' 'oxo' | ./busybox sed 's1\(x\)1\11'
oxo
$ printf '%s\n' 'owo' | ./busybox sed 's1\(x\)1\11'
owo


So in the replacement, the bug doesn't seem to be fixed, yet.

Same case,.. the \1 is first "un-delimitered" but then still counted as \1, though the \ should have already been removed because of the "un-delimitering".

Thus reopneing.
Comment 4 Christoph Anton Mitterer 2022-02-01 20:16:08 UTC
I hadn't seen the 2nd commit, f12fb1e4092900f26f7f8c71cde44b1cd7d26439, when testing.

That also fixes the case from comment #3.

Now, BusyBox sed seems to behave identically to GNU sed in all the cases I had given in:
https://www.austingroupbugs.net/view.php?id=1551#c5612

Especially, it also seems to consider "un-delimitered" delimiters that are also special characters as "still special" (or at least I tried that with '.') - which is, while IMO not clearly defined by POSIX, identical to the behaviour of GNU sed, see https://www.austingroupbugs.net/view.php?id=1551#c5648 for test cases.)


Thus closing again.

Thanks.