Bug 12981

Summary: awk: seems to try to falsely expand strings
Product: Busybox Reporter: Steffen Nurpmeso <steffen>
Component: OtherAssignee: unassigned
Status: RESOLVED FIXED    
Severity: normal CC: aldot, busybox-cvs
Priority: P2    
Version: 1.31.x   
Target Milestone: ---   
Hardware: All   
OS: Linux   
Host: Target:
Build:
Attachments: Test script for reproduction

Description Steffen Nurpmeso 2020-06-04 21:48:48 UTC
Created attachment 8491 [details]
Test script for reproduction

Hello!

I have encountered a busybox 1.31.1 awk (AlpineLinux edge) problem that cannot be seen with mawk, mawk-debian nor gawk.  nawk segfaults for the script so i cannot say :-)

Please find attached a reduced script which must be run like

 # awk="/bin/busybox awk" CC=gcc TARGET=1 sh ./su-make-errors.sh

Sorry for $CC, it can be any ISO C89 compiler i would say.
The difference (compared to the other awks running on CRUX-Linux) will be

  --- .x.gawk     2020-06-04 23:24:13.000905415 +0200
  +++ .x.busybox-awk      2020-06-04 23:23:51.210905829 +0200
  @@ -2,8 +2,8 @@
   #define su__ERR_NUMBER_MAX 0xFFu
   #define su__ERR_NUMBER_ENUM_C \
      su_ERR_NONE = 0,\
  -   su_ERR_AGA = (su__ERR_NUMBER_MAX - 1),\
  -   su_ERR_AGA2 = (su__ERR_NUMBER_MAX - 2),\
  +   su_ERR_AGA = 00),\
  +   su_ERR_AGA2 = 10),\

These strings come from code which says

         while(getline dl < dat){
            ++cnt
            split(dl, ia)
            if(ia[1] < 0)
               ia[1] = "(su__ERR_NUMBER_MAX - " ++unavail ")"
            map[ia[2]] = ia[1]
            print verb "su_ERR_" ia[2] " = " ia[1] ",\134"
         }
         close(dat)

Ciao!
Comment 1 Steffen Nurpmeso 2020-06-04 22:36:11 UTC
I can workaround the bug if i use

            if(ia[1] < 0){
               ++unavail
               ia[1] = "(su__ERR_NUMBER_MAX - " unavail ")"
            }

instead of

            if(ia[1] < 0)
               ia[1] = "(su__ERR_NUMBER_MAX - " ++unavail ")"
Comment 2 Bernhard Reutner-Fischer 2020-06-05 09:47:38 UTC
Can you please provide a small, self-contained reproducer?

Also, i think that you should put parentheses around the ++unavail in your original code since there is no defined precenence around string concatenation.
Comment 3 Steffen Nurpmeso 2020-06-05 19:50:07 UTC
The nawk bug is already fixed, heh. Luckily i saved away the script, another version did not reveal the crash there.  The final version is also entirely different.  I mean hey, i came to bed at half past five in the morning, ok.
But yes i indeed can.  Here mawk, mawk-debian, nawk, gawk:

 #?0|kent:tmp$ </dev/null awk -v i=1 'BEGIN {print "hey fish " ++i "."}'
 hey fish 2.

Compared to

 #?0|sdaoden:steffen$ </dev/null awk -v i=1 'BEGIN {print "hey fish " ++i "."}'
 01.

Any why has it to be minimal, i never found a test series for busybox (i looked out for one years ago)?
Just asking..

I have asked Aharon Robbins for that "++i should be (++i)" of yours.
He kindly proved me wrong regarding my boolean use of getline() not getline()>0 already, i had to fix that in several places.  If he responds to support my prefix/postfix in/decrement has highest priority .. i'll be back :-)

A nice weekend everybody i (who hates web interfaces like this, i wished i could simply respond via email) wish from Germany,
Ciao,
Comment 4 Steffen Nurpmeso 2020-06-06 21:12:11 UTC
Actually does not seem to work out.
I paste it here.
Thanks.

---

Oh hello i see and try replying!

Good evening.

Bernhard Reutner-Fischer wrote in
<20200606021709.4f5cd8bf@nbbrfq.cc.univie.ac.at>:
 |On Fri, 05 Jun 2020 19:50:07 +0000
 |bugzilla@busybox.net wrote:
 |> https://bugs.busybox.net/show_bug.cgi?id=12981
 ...
 |> But yes i indeed can.  Here mawk, mawk-debian, nawk, gawk:

And SunOS 5.9 /usr/bin/xpg4.

 |That's great, thanks alot for this simplification!

Hm.

 |>  #?0|kent:tmp$ </dev/null awk -v i=1 'BEGIN {print "hey fish " ++i "."}'
 |>  hey fish 2.
 ...
 |> Any why has it to be minimal, i never found a test series for busybox \
 |> (i looked
 |
 |I didn't say minimal. I said "small, self-contained".
 |
 |In your bug-report, you referenced a
 |IN="${SRCDIR}"su/gen-errors.h
 |which you did not attach. Although i could probably construct an input

It worked like shown.  I gave the necessary command line.  Sorry.
But please, at least i have stripped 251 lines of perl code before
attaching the file.  (vim fold.)

 |that exposes the alleged misbehaviour (or maybe i would not be able to
 |if it was another kind of bug as we're dealing with here) i would have
 |to guess. You did not provide a readily testable reproducer for us to
 |debug. We'd have to guess or suspect the problem to be obscured by an

That is not true really.

 |overly complex, incomplete testcase. This is very discouraging a
 |situation if you just have a couple of minute to read through a report
 |while, let's say, you are in the bus on your way home after eight or
 |thirteen hours of work.

Yes, it is good now.

 |> out for one years ago)?
 |> Just asking..
 |
 |We indeed have a testsuite. See testsuite/
 |as well as
 |$ make help
 |$ make check
 |and, specifically
 |testsuite/awk.tests
 |
 |We've had a testsuite since a long time now. An internal one. We

I must have confused something.  Indeed testsuite came in Y2K1.

 |usually test some prominent cases with external testsuites. We used to
 |(and still some of us do) build our setups with busybox alone so the
 |most visible stuff supposedly works fine. In order not to regress we
 |strive to flesh out the internal testsuite. Except for network related
 |stuff which would usually be rather complicated to regtest in our
 |current testsuite due to it's simplicity.
 |
 |>
 |> I have asked Aharon Robbins for that "++i should be (++i)" of yours.
 |
 |Name dropping usually doesn't work for me for i know nothing, but sure,
 |if he knows awk land i'm all ears -- whoever that may be.. Please
 |excuse my ignorance.

You know i wondered whether i should response at all because of
that pissing stuff.  But i did.

 |> He kindly proved me wrong regarding my boolean use of getline() not \
 |> getline()>0
 |> already, i had to fix that in several places.  If he responds to \
 |> support my
 |> prefix/postfix in/decrement has highest priority .. i'll be back :-)
 |
 |looking forward to that.
 |>
 |> A nice weekend everybody i (who hates web interfaces like this, i \
 |> wished i
 |> could simply respond via email) wish from Germany,
 |
 |web interfaces are great to not forget stuff, but sure,
 |let's just take this to the list for a wider audience and easier
 |interaction for both of us.

These are in fact almost his words in our thread on the nawk bug,
because i do not have a github account and asked him whether
private mail would be ok.
Does not seem i get a response.
 |Folks, recap.
 |So, in this bug-reoprt what we have is:
 |- there is no awk standard (that i'm aware of at least; SuS is
 |  silent)
 |- awk has no (well) defined string-concatination precedence
 |- some (well) documented awk implementations explicitly warn
 |  script-authors to rely on precedence around string concatenation,
 |  like e.g.:
 |  https://www.gnu.org/software/gawk/manual/html_node/Concatenation.html

I have never read anything like that in any manual page of all the
awks i have or have seen.  (On CRUX-Linux there is no info system,
no info files.)  Many warn due to problematic number / string as in

 There  are  no  explicit  conversions  between  numbers  and
     strings.  To  force an expression to be treated as a number,
     add 0 to it. To force an  expression  to  be  treated  as  a
     string, concatenate the null string ("") to it.
  SunOS 5.9            Last change: 7 Jul 2000                    7

Over there there is /usr/bin/awk, which only works when passed
a script file via -f (where -f - works too), and this one really
behaves like the awk you are protecting from moving forward.

 |---8<---
 |Because string concatenation does not have an explicit operator, it \
 |is often necessary to ensure that it happens at the right time by using \
 |parentheses to enclose the items to concatenate.
 |---8<---
 |- new users of awk usually expect a well resp. clearly defined language
 |  but IMHO we do not face the luxory to deal with that in this context.
 |  Rule of least surprise for the user applies, as usual. Undefined
 |  corners of underspecified tools/behaviour is usually clarified
 |  centrally (by a standard of some sorts to back up claims or document
 |  consensus umong implementors). No such luxory here AFAIK.
 |- If you write your scripts defensively (i.e. know your stuff, to
 |  phrase it politely) all is well even with current busybox awk (see
 |  below)

I think i will in the future not respond to messages coming from
you, really.  Again, i have never ever read anything about that
behaviour in any of the awks i have seen.  mawk and debian-mawk
(manual date Dec 22 1994) both say

 Functions can be referenced before they are defined, but the
 function name and the '(' of the arguments must touch to avoid
 confusion with concatenation.

Hey.  I would suspect the rule of least surprise would in
a "modern" environment support moving busybox awk to where the
others are.

 |Bad:
 |$ ./busybox awk -v i=1 'BEGIN {print "fourty " ++i "."}'
 |01.
 |
 |Expected:
 |$ ./busybox awk -v i=1 'BEGIN {print "fourty " (++i) "."}'
 |fourty 2.
 |
 |Proposed fixes?
 |Follow up further improvements for awk? Equally robustness improvements \
 |to deal with possible similar cases you might make up while testing \
 |or debugging stuff? Or maybe you spot some areas that are written way \
 |too elaborate and you can save some bytes after having fixed this unspec\
 |ified but seemingly widely supported precedence convention?
 |
 |> Ciao,
 |
 |cheers and have phun,

That surely goes in the wrong direction man. :)

Ciao and a nice Sunday i wish.,

 --End of <20200606021709.4f5cd8bf@nbbrfq.cc.univie.ac.at>
Comment 5 Steffen Nurpmeso 2020-06-08 15:14:58 UTC
Hello again.

Just to add, i got response regarding this awk behaviour from Aharon (Arnold) Robbins (the maintainer of GNU awk and now also nawk), and he says

  >   Also, i think that you should put parentheses around the
  >   ++unavail in your original code since there is no defined
  >   precenence around string concatenation.
  >
  > This is for the snippet
  >
  >                ia[1] = "(su__ERR_NUMBER_MAX - " ++unavail ")"

  I think he's wrong.  The ++unavail would have to be done before
  converting the result to string for the concatentation.

  My opinion, of course.

Ciao from Germany (what a pity this cannot be driven with email responses, urgh)
Comment 6 Bernhard Reutner-Fischer 2020-06-08 17:40:47 UTC
(In reply to Steffen Nurpmeso from comment #5)

>  >   Also, i think that you should put parentheses around the
>  >   ++unavail in your original code since there is no defined
>  >   precenence around string concatenation.
>  >
>  > This is for the snippet
>  >
>  >                ia[1] = "(su__ERR_NUMBER_MAX - " ++unavail ")"
>
>  I think he's wrong.  The ++unavail would have to be done before
>  converting the result to string for the concatentation.

yes, which you can enforce explicitly by adding parenthesis.

I looked a bit into our awk implementation yesterday but i'm not yet sure how to fix it properly.
We indeed end up calling setvar_i on the lefthand string to set it's number to 0, which clears the string-value of the node. This is of course not what we want, and is the cause of the "01" output.
Comment 7 Steffen Nurpmeso 2020-06-08 20:34:45 UTC
Hi :).

One more addition, he responded once again.  I say Ciao already here.

> Thanks for responding.  I copied the above to the busybox bugzilla
> thread of this, i hope this is acceptable for you.  He (Bernhard
> Reutner-Fischer) referred to the gawk info manual even (it is not
> in the manual page):
>
>   - some (well) documented awk implementations explicitly warn
>     script-authors to rely on precedence around string concatenation,
>     like e.g.:
>     https://www.gnu.org/software/gawk/manual/html_node/Concatenation.html
>   ---8<---
>   Because string concatenation does not have an explicit operator,
>   it is often necessary to ensure that it happens at the right
>   time by using parentheses to enclose the items to concatenate.

This is mainly for things like

        foo bar | getline

Is this

        (foo bar) | getline

or

        foo (bar | getline)

?

I stand by my opinion that what you observed in Busybox awk is a bug.
Comment 8 Denys Vlasenko 2020-06-08 23:35:46 UTC
"str" ++ i
is interpreted as postfix increment of "str". Oops...

Fixed in git, thanks!

Reopen if not fixed completely, provide examples.