Bug 5090 - sed and awk mishandle \b \< \B
Status: NEW
Alias: None
Product: Busybox
Classification: Unclassified
Component: Standard Compliance
Version: 1.19.x
Hardware: PC Linux
Importance: P5 minor
Target Milestone: ---
Assignee: unassigned
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-12 04:22 UTC by dubiousjim
Modified: 2014-10-21 22:50 UTC

See Also:
Host:
Target:
Build:

Description dubiousjim 2012-04-12 04:22:03 UTC
BusyBox 1.19.3, built against uClibc 0.9.32, on i686 Linux

Since this affects both sed and awk, perhaps it's an issue with uClibc. However, it does not affect BusyBox egrep.

$ printf 'abcd efgh\n' | sed -n 's/\b[a-z]/<&>/pg'
Expected result: <a>bcd <e>fgh
Actual result: <a><b><c><d> <e><f><g><h>
$ printf 'abcd efgh\n' | sed -n 's/\<[a-z]/<&>/pg'
Expected result: <a>bcd <e>fgh
Actual result: <a><b><c><d> <e><f><g><h>
$ printf 'abcd efgh\n' | sed -n 's/\B[a-z]/<&>/pg'
Expected result: a<b><c><d> e<f><g><h>
Actual result: a<b>c<d> e<f>g<h>  # misses the c and g
$ printf 'abcd efgh\n' | awk '{gsub(/\<[a-z]/,"<&>"); print $0}'
Expected result: <a>bcd <e>fgh
Actual result: <a><b><c><d> <e><f><g><h>
$ printf 'abcd efgh\n' | awk '{gsub(/\B[a-z]/,"<&>"); print $0}'
Expected result: a<b><c><d> e<f><g><h>
Actual result: a<b>c<d> e<f>g<h>  # misses the c and g
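
For anyone wanting to re-run the five failing cases above in one go, here is a minimal sketch of a check script. The SED and AWK variables and the check helper are made up for this example and are not part of BusyBox; set SED='busybox sed' and AWK='busybox awk' to test an applet build. Note that \b, \< and \B are GNU extensions rather than POSIX, so a FAIL from some other sed or awk may simply mean the escape is unsupported there.

```shell
#!/bin/sh
# Hypothetical overrides, e.g. SED='busybox sed' AWK='busybox awk'.
SED=${SED:-sed}
AWK=${AWK:-awk}

# check LABEL EXPECTED ACTUAL: print one PASS/FAIL line per case.
check() {
    if [ "$2" = "$3" ]; then
        printf 'PASS %s\n' "$1"
    else
        printf 'FAIL %s: expected "%s", got "%s"\n' "$1" "$2" "$3"
    fi
}

in='abcd efgh'

# Start-of-word cases, with the expected strings taken from the report.
check 'sed \b[a-z]' '<a>bcd <e>fgh' \
    "$(printf '%s\n' "$in" | $SED -n 's/\b[a-z]/<&>/pg')"
check 'sed \<[a-z]' '<a>bcd <e>fgh' \
    "$(printf '%s\n' "$in" | $SED -n 's/\<[a-z]/<&>/pg')"
check 'sed \B[a-z]' 'a<b><c><d> e<f><g><h>' \
    "$(printf '%s\n' "$in" | $SED -n 's/\B[a-z]/<&>/pg')"
check 'awk \<[a-z]' '<a>bcd <e>fgh' \
    "$(printf '%s\n' "$in" | $AWK '{gsub(/\<[a-z]/,"<&>"); print $0}')"
check 'awk \B[a-z]' 'a<b><c><d> e<f><g><h>' \
    "$(printf '%s\n' "$in" | $AWK '{gsub(/\B[a-z]/,"<&>"); print $0}')"
```

The script deliberately does not exit non-zero on a failure, so all five cases are always reported.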

The end-of-word elements all give the expected results:

$ printf 'abcd efgh\n' | sed -n 's/[a-z]\b/<&>/pg'
abc<d> efg<h>
$ printf 'abcd efgh\n' | sed -n 's/[a-z]\>/<&>/pg'
abc<d> efg<h>
$ printf 'abcd efgh\n' | sed -n 's/[a-z]\B/<&>/pg'
<a><b><c>d <e><f><g>h
$ printf 'abcd efgh\n' | awk '{gsub(/[a-z]\>/,"<&>"); print $0}'
abc<d> efg<h>
$ printf 'abcd efgh\n' | awk '{gsub(/[a-z]\B/,"<&>"); print $0}'
<a><b><c>d <e><f><g>h
Comment 1 Phil Carmody 2014-10-21 22:50:05 UTC
It looks like this does affect BusyBox (e)grep too, but I agree that the error seems to be inside the regex library itself. The first run below is BusyBox egrep, the second is the system egrep:

phil@geespaz:busybox$ echo 'azz bz c d' | ./busybox egrep -o '\b[a-z]'
a
z
z
b
z
c
d
phil@geespaz:busybox$ echo 'azz bz c d' | egrep -o '\b[a-z]'
a
b
c
d

My BusyBox build takes regexec from glibc, not uClibc:
phil@geespaz:busybox$ nm ./busybox_unstripped | grep regexec
         U regexec@@GLIBC_2.3.4
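
One possible explanation that fits every output above, offered purely as an assumption and not confirmed anywhere in this report: if the global-match loop hands the remainder of the line back to the matcher as though it were a fresh string, the character right after each match has no remembered predecessor and so looks like it sits at a word boundary. The pure-shell sketch below models just that; mark and its WANT/MODE arguments are hypothetical names for illustration and are not BusyBox or libc code.

```shell
#!/bin/sh
# mark STRING WANT MODE  (hypothetical helper, not BusyBox code)
#   WANT=start    emulate \<[a-z]: a letter preceded by a non-letter
#   WANT=inner    emulate \B[a-z]: a letter preceded by a letter
#   MODE=context  keep the previous character across matches
#   MODE=restart  forget it after each match, as if the rest of the
#                 line were a brand-new string (the assumed bug)
mark() {
    rest=$1 want=$2 mode=$3 prev='' out=''
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}    # first character of what is left
        rest=${rest#?}
        # Classify the position from the remembered previous character.
        case $prev in
            [a-z]) pos=inner ;;
            *)     pos=start ;;
        esac
        hit=no
        case $c in
            [a-z]) [ "$pos" = "$want" ] && hit=yes ;;
        esac
        if [ "$hit" = yes ]; then
            out="$out<$c>"
            if [ "$mode" = restart ]; then prev=''; else prev=$c; fi
        else
            out="$out$c"
            prev=$c
        fi
    done
    printf '%s\n' "$out"
}

mark 'abcd efgh' start context
mark 'abcd efgh' start restart
mark 'abcd efgh' inner context
mark 'abcd efgh' inner restart
```

In this model the two context calls print the "Expected result" strings from the description and the two restart calls print the "Actual result" strings, including the skipped c and g in the \B case. End-of-word patterns would be unaffected under the same assumption, since they depend on the character after the match rather than the one before it.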