#10 [p]bosh: field splitting happens where other shells think it should not

Milestone: 1.0
Status: open
Owner: nobody
Labels: None
Updated: 2019-05-21
Created: 2019-04-25
Private: No

Using Schily-tools 2019-03-29 on Linux (Ubuntu 18.04 LTS).

The following command:

pbosh -c 'v=foo eq== IFS==; echo A=A "$v"=\$bar'

produces A=A foo $bar in [p]bosh, while all other shells I tested (bash, dash, ksh93, mksh, pdksh, OpenBSD sh, and others) produce A=A foo=$bar.

The 2018 spec says (2.6.5):
"... the shell shall scan the results of expansions and substitutions that did not occur in double-quotes for field splitting and multiple fields can result."

So it seems that all other shells interpret "results" as only the new text which came from replacing the $<thing> part in a word, while [p]bosh interprets it as words which had any expansions/substitutions in them.

I tend to think the same as other shells, as otherwise it means that IFS affects literal shell input, which I think should never happen.
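
To check which interpretation a given shell follows, a minimal sketch is to count the fields that the mixed word produces while IFS contains the literal character. The helper name count_fields is made up for this example, and the results quoted in the comment are the ones described in this report, not an independent measurement.

```shell
# count_fields is a hypothetical helper: it prints how many arguments it got.
count_fields() {
    echo "$#"
}

v=foo
# Run inside a command substitution so the IFS change does not leak out.
# The word below mixes a quoted expansion with a literal, unquoted '='.
fields=$(IFS='='; count_fields "$v"=\$bar)

# Per this report: 1 in bash, dash, ksh93, mksh and others; 2 in [p]bosh.
echo "fields: $fields"
```

A count of 1 means only the text that came from an expansion is considered for splitting; a count of 2 means the whole word was scanned.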

Discussion

  • Avi Halachmi

    Avi Halachmi - 2019-04-25

    Also, here's a slightly different command which shows another case and also the actual resulting fields (I don't know if the extra field should be considered a bug):

    <shell> -c 'v=foo eq== IFS==; printf "[%s]" A=A B=$eq "$v"=\$bar'
    

    All shells I have except pdksh and [p]bosh produce:

    [A=A][B=][foo=$bar]
    

    pdksh:

    [A=A][B=][][foo=$bar]
    

    [p]bosh:

    [A=A][B][][foo][$bar]
    
     

    Last edit: Avi Halachmi 2019-04-25
  • Jörg Schilling

    Jörg Schilling - 2019-04-26

    Hi,

    bosh produces the same output as ksh88, which was used as the reference implementation for POSIX. ksh88-based POSIX platforms such as Solaris pass the POSIX certification tests with that behavior.

    Bourne Shell:
    sh -c 'v=foo eq== IFS==; echo A=A "$v"=\$bar'
    A A foo $bar

    ksh88:
    ksh -c 'v=foo eq== IFS==; echo A=A "$v"=\$bar'
    A=A foo $bar

    bosh:
    bosh -c 'v=foo eq== IFS==; echo A=A "$v"=\$bar'
    A=A foo $bar

    For your second example you get:

    Bourne Shell:
    [A][A][B][foo][$bar]

    ksh88:
    [A=A][B][][foo][$bar]

    bosh:
    [A=A][B][][foo][$bar]

    Note that both ksh88 and bosh are based on a slightly modified Bourne Shell source.

    The behavior of ksh93 is most likely correct, but implementing the ksh93 behavior
    requires a complete rewrite of the macro expansion code.

    ksh88 and bosh modified the macro expansion code in a way that is small enough
    to avoid bigger problems that result from a complete rewrite.

    The behavior of ksh88 and bosh is to do field splitting in arguments that had a variable
    expansion, while the ksh93 behavior is to do field splitting on characters that resulted
    from a macro replacement.

    Do you have a real world usage for that difference in behavior?

    There is a plan to do a complete rewrite of the macro expansion for bosh, but
    I did not yet find the time to do so. The background for the rewrite is that it would
    make bosh the fastest shell if you use "configure" as the test scenario as I expect a
    performance win of approx. 20%.

     
  • Avi Halachmi

    Avi Halachmi - 2019-04-26

    Do you have a real world usage for that difference in behavior?

    I don't.

    Because I'm relatively new to shell programming, I prefer to quote only where required, and if I'm not sure, to research and learn. I was evaluating to what extent the (my) paradigm of eval $foo=\$bar is susceptible to weird IFS values, and realized that [p]bosh behaves differently from other shells.

    To be on the safe side one should use eval "$foo=\$bar", even though in all other current shells (and ksh93) it's enough to quote only $foo.
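
A minimal sketch of the fully quoted form, run with IFS deliberately set to '=' to mimic the hostile case. The names foo, bar and target are placeholders for this example. Because the whole eval argument is double-quoted, no field splitting is applied to it under either interpretation discussed in this ticket.

```shell
foo=target            # name of the variable to assign to
bar='some value'      # value to copy

old_ifs=$IFS
IFS='='               # an IFS that could split an unquoted foo=... word
eval "$foo=\$bar"     # eval receives the single string: target=$bar
IFS=$old_ifs

printf '%s\n' "$target"   # prints: some value
```

The backslash keeps $bar from being expanded before eval runs, so the value is expanded only once, inside the assignment, where field splitting does not apply.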

     
  • Avi Halachmi

    Avi Halachmi - 2019-04-26

    The background for the rewrite is that it would
    make bosh the fastest shell if you use "configure" as the test scenario as I expect a
    performance win of approx. 20%.

    Interesting. I'm generally interested in shell performance, and contributed some patches to the ffmpeg project which speed up its configure considerably (it's a custom script, no autotools etc).

    FWIW, in my experience, the fastest shells are dash and in general ash based shells, and ksh93 to some extent (depending if it can apply its no-fork-subshell optimizations). Bash tends to be on the slow side, though there are slower. I didn't actually try to evaluate [p]bosh in terms of performance.

     
  • Jörg Schilling

    Jörg Schilling - 2019-04-26

    If you make dash POSIX compliant by adding multi byte character support, it would be the slowest shell ever. bosh is currently approx. 5% faster than dash even though it supports multi byte characters and 30% of the CPU time used by bosh is used for multi byte character handling.

    The original ksh93 is approx. 10% faster than bosh, but the Red Hat variant has already been made slower than the original by replacing code from David Korn with what the Red Hat people believe is "standard code".

     
  • Avi Halachmi

    Avi Halachmi - 2019-04-26

    I knew that dash doesn't do multibyte, but to be honest I didn't encounter scripts which require multibyte and performance together, though I'm sure there are such, and I think it's great to support it.

    FWIW, I tested a few shells on the same Ubuntu 18.04 LTS system (all built by me with default settings and a fairly recent code base, except ksh93, which is an Ubuntu package binary), and used <sh> ./configure at the ffmpeg source tree root. Here are my results:

    18473 ms  pbosh
    16743 ms  mksh
    14138 ms  loksh
    11458 ms  bash
     9290 ms  ksh93
     8589 ms  dash
    

    I know that busybox ash is ~5-10% slower than dash, and FreeBSD sh (running on FreeBSD) is roughly similar to dash. loksh is supposed to be OpenBSD sh ported to Linux ( https://github.com/dimkr/loksh ), but I don't know how it performs on an actual OpenBSD system.

    I didn't try to analyze where the time is spent within configure with each shell.

     
  • Jörg Schilling

    Jörg Schilling - 2019-04-29

    Well, ffmpeg does not use autoconf but a hand written shell script that is not "compliant"
    as it does not quote '^' (as required by the POSIX standard) and as it calls programs like sed
    with non-standard options.

    For this reason, it is not possible to check the ffmpeg script on an arbitrary certified UNIX
    and the script seems to be an example how scripts should not be written.

    Looking closer at the script, shows that this shell script causes the shell to spend most
    of the time in macro expansion, which is untypical for average shell scripts. So this script
    can be mainly seen as a testcase for macro expansion performance. This is where the
    mentioned rewrite of bosh will happen, so a future version of bosh will be faster.

    If you compare shells with autoconf based shell scripts, like the "configure" from the
    schilytools, you get different results. Here is what I get on an Opteron based UNIX
    system from 2006 (newer CPUs typically show less differences between the various
    shells and since Linux does not implement a true vfork(), bosh and ksh93 are
    slower on Linux than they are on a typical UNIX system).

    sh      50,298552 real 15,685686 user 30,800084 sys 92% cpu 3+2389io
    obosh   52,755840 real 16,875133 user 32,183116 sys 92% cpu 5+2247io
    bosh    45,560913 real 15,564520 user 25,524520 sys 90% cpu 0+2558io
    
    bash    57,944533 real 16,676010 user 38,879194 sys 95% cpu 0+2571io
    dash    47,524207 real 14,621246 user 28,960684 sys 91% cpu 0+2943io
    ksh     48,020499 real 15,506324 user 28,981109 sys 92% cpu 0+2522io
    ksh93   39,799008 real 14,410227 user 22,248606 sys 92% cpu 0+2551io
    mksh    45,189913 real 15,413777 user 26,860620 sys 93% cpu 0+2285io
    posh    50,147204 real 16,019424 user 30,550703 sys 92% cpu 0+2888io
    yash    53,016180 real 16,272928 user 33,322202 sys 93% cpu 0+2941io
    zsh     54,444366 real 15,665331 user 35,274830 sys 93% cpu 0+2530io
    
    sh      is the SVr4 Bourne Shell 
    obosh   is the SVr4 Bourne Shell based on malloc() instead of sbrk() 
    ksh     is ksh88 
    posh    is a mostly broken pdksh variant from Debian
    

    Given that sh and obosh mainly differ in the fact that obosh uses a malloc() based string stack replacement, the performance difference is caused by that change.

    If you compare obosh and bosh, you see the performance win from a better pipe construction method in the interpreter and from using vfork().

    The performance advantage of bosh over dash does not exist on Linux, as Linux does not come with a working vfork(). The vfork emulation on Linux implements all the pitfalls of vfork without the advantages that come from the fact that vfork does not need to copy the address space description in the kernel. This makes vfork 3x faster than fork on a UNIX system.

    ksh88 and ksh93 differ in the fact that ksh93 implements virtual sub-shells and uses vfork().

    mksh is interesting since it is the only shell that spends less than 28 seconds of system CPU time even though it does not use vfork().

    As busybox is not portable, I could not test it on UNIX but on Linux it is faster than dash.

    Note that you need to call:

    CONFIG_SHELL=$shell $shell ./configure

    for a correct test that always uses the shell under test.
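
The invocation above can be wrapped in a small helper so that every shell is started the same way. This is only a sketch: run_under is a made-up name, and the shell names in the commented loop are placeholders for whatever is installed.

```shell
# Hypothetical helper: run a script under a given shell and export
# CONFIG_SHELL, so that any re-exec done by the script stays on the
# shell under test.
run_under() {
    shell_under_test=$1
    shift
    CONFIG_SHELL=$shell_under_test "$shell_under_test" "$@"
}

# Possible use (not executed here; missing shells are skipped):
#   for s in dash bash ksh93 mksh; do
#       command -v "$s" >/dev/null 2>&1 || continue
#       echo "== $s"
#       run_under "$s" ./configure >/dev/null
#   done
```

Timing can then be added around each run_under call with whatever tool is available on the platform.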

     

    Last edit: Jörg Schilling 2019-04-29
  • Avi Halachmi

    Avi Halachmi - 2019-04-29

    Thank you for the detailed information.

    is not "compliant" as it does not quote '^' (as required by the POSIX standard)

    The only reference I could find in the spec to "^" or "circumflex" is that it's unspecified behavior to use it as negation in a character pattern, but I did not see such usage in ffmpeg's configure. I can't see it specified as a special char which needs to be quoted either. I can send patches if you show me an example where it's incorrectly unquoted.

    and as it calls programs like sed with non-standard options.

    I'm not familiar enough with sed to comment, but if you could point out an example for me then I can send a patch.

    spend most of the time in macro expansion, which is untypical for average shell scripts. So this script can be mainly seen as a testcase for macro expansion performance. This is where the mentioned rewrite of bosh will happen, so a future version of bosh will be faster

    Judging by your earlier comments, I'm assuming "macro" means parameter/command substitutions? Anyway, nice to know that it will be improved.

    posh is a mostly broken pdksh variant from Debian

    Yeah, I'm familiar with it and its general not-up-to-par behavior.

    Thanks for the enlightening information about where performance goes, and the vfork issue.

     
  • Jörg Schilling

    Jörg Schilling - 2019-04-29

    '^' was an alias for pipes in the 1970s to allow pipes on upper case only terminals. bosh still supports it as Solaris did come with /bin/sh being the classical Bourne Shell for a long time and as POSIX definitely does not make any assumptions on what shell /bin/sh is. It can be disabled in bosh via set -o posix and it is disabled by default in pbosh.

    We recently added '^' to the set of characters that need quoting in the POSIX standard because of some reported issues with globbing and pattern matching.

    sed is called with -E but there is more (e.g. calling grep with wrong options).

    Note that I did not check whether the options for grep may be valid in POSIX mode, but the script does not contain the standard calling procedure to go into POSIX mode, which is to set up PATH from the results of getconf PATH followed by calling sh. For a portable script it is not easy to be compliant, as e.g. Mac OS does not mention how to get into POSIX mode and as you should not assume compliance with the most recent POSIX standard.

    Oracle Solaris 11.4 is e.g. the only UNIX certified for the most recent POSIX standard but it will not be able to run the ffmpeg script.

    The term macro expansion is defined in the POSIX standard and is related to shell variable expansion with additional features such as ${var-value}.
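
The calling procedure mentioned above (PATH from getconf PATH, followed by calling sh) might be sketched as follows. The function names and the guard variable POSIX_REEXEC_DONE are invented for this example; a real script would need to decide how to handle platforms where getconf is missing.

```shell
# Print the PATH on which the standard utilities are expected to be found.
posix_path() {
    command -p getconf PATH
}

# Re-run the current script once under the sh found on that PATH.
# POSIX_REEXEC_DONE is a made-up guard that prevents an endless loop.
reexec_posix() {
    [ -n "${POSIX_REEXEC_DONE:-}" ] && return 0
    p=$(posix_path) || return 1
    PATH=$p
    POSIX_REEXEC_DONE=1
    export PATH POSIX_REEXEC_DONE
    exec sh "$0" "$@"
}
```

A script would call reexec_posix "$@" near its top; nothing in this sketch calls it automatically.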

     
  • Avi Halachmi

    Avi Halachmi - 2019-04-29

    We recently added '^' to the set of characters that need quoting in the POSIX standard

    The term macro expansion is defined in the POSIX standard

    I'm unable to find any public reference to either of those. Neither in what I think is the latest version: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html nor otherwise with google. The only pages I get for "posix macro" are related to the C preprocessor.

    I believe you that configure won't work on Solaris. The closest I could put my hands on is OpenIndiana (SunOS Release 5.11), which appears to be using AT&T ksh93t+ (2010) as its default sh, where "^" is not interpreted as a pipe (ffmpeg is available for OpenIndiana, but I haven't looked at any patches they may be applying).

    For a portable script it is not easy to be compliant

    I guess. I don't mind spending some time helping to make configure more compliant, especially with the shell code parts (in contrast to sed/grep/etc compliant usage, which I'm less familiar with), but I probably won't do that without seeing a spec which mentions it ("^").

    Anyway, I know and understand that ffmpeg's configure is different than autoconf configure, and that their performance depends on different shell features.

     
  • Jörg Schilling

    Jörg Schilling - 2019-04-29

    Sorry, macro is the internal name in the Bourne Shell source, see macro.c. POSIX calls it parameter expansion.

    Then also look at: http://austingroupbugs.net/view.php?id=1190 and
    http://austingroupbugs.net/view.php?id=1191#c4278

    ksh is no problem on Solaris; the problem is that ffmpeg's script is non-portable:

    ksh93 configure 
    sed: illegal option -- E
    Usage:  sed [-n] script [file...]
            sed [-n] [-e script]...[-f script_file]...[file...]
    

    ... and if you removed that, it would complain about grep -q because it does not set up a PATH to use the POSIX variants of the binaries. Note that /usr/bin/grep is called instead of /usr/xpg4/bin/grep.

    Also note:

    sh configure
    Broken shell detected.  Trying alternatives.
    Trying shell bash
    sed: illegal option -- E
    Usage:  sed [-n] script [file...]
            sed [-n] [-e script]...[-f script_file]...[file...]
    

    This is not a broken shell, but just a non-POSIX Bourne Shell acting on a non-portable script.

    The problem with that script seems to be that it assumes Linux behavior on all possible platforms.

     

    Last edit: Jörg Schilling 2019-04-29
  • Avi Halachmi

    Avi Halachmi - 2019-04-29

    Then also look at: http://austingroupbugs.net/view.php?id=1190 and
    http://austingroupbugs.net/view.php?id=1191#c4278

    Thanks. It is quite recent (a few weeks ago). I guess it should make it into the next version of the spec.

    The problem with that script seems to be that it assumes Linux behavior on all possible platforms.

    As far as I can tell it does try to find a suitable shell, and I know for a fact that it also works out of the box on the different BSDs, AIX, and OSX.

    Apparently it fails in some cases where it could do better, though because I don't have access to such systems (practically only Solaris derivatives?), I can't try it out.

    Though as I mentioned, I saw that OpenIndiana does have an ffmpeg package. I don't know if they apply some patches or whether it configures/builds out of the box.

    Anyway, if you can afford to send patches to ffmpeg to make configure more compliant, I'm sure no one would object.

    Thanks again for your time, patience and good info.

     
  • Jörg Schilling

    Jörg Schilling - 2019-04-30

    The hint to quote ^ is recent, but it is just a bugfix to the POSIX standard.

    Thanks for the hint on finding a suitable shell. Sometimes it is a good idea to recheck things in a real window with more lines. I first checked the code at weekend on a laptop and did not see the check for /usr/xpg4/bin. So it seems that some time ago, the authors of ffmpeg did care about the standard and switched to the POSIX environment and my fear that grep -q could fail later is not correct.

    Now I would guess that unless you use bosh, there is only one problem with sed -E in the script. The problem with bosh could not have been known to them, since in former times Solaris always came with a true old Bourne Shell in /bin/sh, which is 100% compatible with obosh, and the check for POSIX (ksh88) enhanced parameter substitution with ${foo%%bar} did work on previous Solaris versions.

    Now the Bourne Shell has been enhanced to support POSIX but in a way that does not break old existing scripts and OpenSolaris may have bosh installed as /bin/sh. For the related problem, it would help to quote '^'.

    Since ^ is quoted in most of the cases already (except for line 4381, 4401, 4521, 6513), it would be simple to add quoting for that too after sed -E has been removed.
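
A minimal illustration of the quoting in question. It is not taken from ffmpeg's configure; it only shows the same grep call with the circumflex quoted. In a shell that still treats an unquoted ^ as a pipe symbol, the unquoted form would be parsed as two commands.

```shell
# Unquoted form, shown only as a comment because a classic Bourne shell
# (or bosh outside POSIX mode) would read it as: grep | abc
#   printf 'abc\nxbc\n' | grep ^abc

# Quoted form: ^ reaches grep as a regex anchor in every shell.
matched=$(printf 'abc\nxbc\n' | grep '^abc')
echo "$matched"   # prints: abc
```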

     
  • Avi Halachmi

    Avi Halachmi - 2019-04-30

    after sed -E has been removed.

    As far as I can tell, sed -E is used in only one line, and only if $target_os is darwin, which is OS-specific and apparently works (even if not robust, e.g. when cross-compiling to darwin). I don't think there's any urgency in changing it, and personally I don't know how to change an extended regex to a standard one without further research on my part.

    Since ^ is quoted in most of the cases already (except for line 4381, 4401, 4521, 6513), it would be simple to add quoting for that too

    So basically the only thing preventing it from being reasonably compliant (to the extent which you noticed so far), including with bosh, is quoting "^"?

    My configure file has different line numbers (I guess we look at different versions - I'm looking at ffmpeg master git repository), but I do see the 3 first cases which clearly need quoting in bosh, though the 4th is a here document which doesn't need it as far as I know.

    I can confirm that configure with bosh failed before quoting the first 3 instances, and succeeds after quoting them. As expected, configure with pbosh succeeds also with them unquoted. On both success cases I also compared the output of configure with other shells.

    There are two diffs of the same "class" (SAMPLES and TARGET_SAMPLES in ffbuild/config.mak), and I think it's a bug with [p]bosh. The following:

    x=; IFS= read -r v <<EOF; printf "[%s]" "$v"
    ${x:-\$y}
    EOF
    

    prints [$y] in all shells, except [p]bosh, which prints [\$y].

    Other than these two diffs, all the output files are identical as with dash or ksh93.

    I'll try to find the time to send a patch to ffmpeg which quotes these 3 instances of "^".

     

    Last edit: Avi Halachmi 2019-04-30
  • Jörg Schilling

    Jörg Schilling - 2019-04-30

    I get two sed -E occurrences:

    /tmp/ffmpeg-4.1.3/configure:3721:    sed -E -n "s/^extern AVFilter ff_([avfsinkrc]{2,5})_([a-zA-Z0-9_]+);/\2_filter/p" $file
    /tmp/ffmpeg-4.1.3/configure:5187:        VERSION_SCRIPT_POSTPROCESS_CMD='tr " " "\n" | sed -n /global:/,/local:/p | grep ";" | tr ";" "\n" | sed -E "s/(.+)/_\1/g" | sed -E "s/(.+[^*])$$$$/\1*/"'
    

    If you know a fix for that -E problem, I could test it on Solaris.

    Regarding your parameter expansion issue, I would need some time to investigate that.

    I am not sure whether there is a bug in bosh, since using

    ${x:-$\y}

    prints [$\y] with all shells.

    Setting R=\$y beforehand and then using:

    ${x:-$R}

    prints [$y]

     
  • Avi Halachmi

    Avi Halachmi - 2019-04-30

    If you know a fix for that -E problem, I could test it on Solaris.

    The second of those is the darwin one. The first was fixed 5 months ago here https://github.com/FFmpeg/FFmpeg/commit/2f6b1806 .

    So configure in ffmpeg git master should not try to use sed -E anywhere except on Darwin/OSX.
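
For the two remaining Darwin-only expressions, a possible rewrite into basic regular expressions, which sed accepts without -E, is sketched below. The function names are made up, the make-level $$$$ escaping is assumed to reduce to a single $, and none of this has been tested on Solaris or inside configure.

```shell
# ERE original: sed -E "s/(.+)/_\1/g"
# BRE form: \( \) for grouping, and ..* in place of .+
prefix_underscore() {
    sed 's/\(..*\)/_\1/g'
}

# ERE original: sed -E "s/(.+[^*])$/\1*/"
append_star() {
    sed 's/\(..*[^*]\)$/\1*/'
}

echo foo | prefix_underscore   # prints: _foo
echo foo | append_star         # prints: foo*
```

Lines that already end in * are left unchanged by append_star, as with the extended-regex version.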

    As for the here-document backslash, I can only say that all other shells I have agree between themselves and think differently than [p]bosh. On the face of it I think they're right, and I don't think your example demonstrates an inconsistency, but there could always be some edge cases which I didn't interpret correctly.

     

    Last edit: Avi Halachmi 2019-04-30
  • Jörg Schilling

    Jörg Schilling - 2019-04-30

    The -E fix seems to work at first glance, but causes a "Terminated" message. I checked the whole thing with truss -f and it seems this comes from an ffmpeg program

    /tmp/ffconf.XXl7aO2F/test

    whatever this is... it is not a bosh problem. Bosh just reports it, in contrast to bash.

    BTW: neither ksh88 nor ksh93 work with that on Solaris:

    ksh  configure --disable-x86asm  
    configure[92]: test: argument expected
    configure[92]: test: argument expected
    configure[2]: 18227 Terminated
    
    ksh93  configure --disable-x86asm                     
    [1] + Beendet                     ksh93 configure --disable-x86asm
    

    so bosh seems to be better ;-)

    Given that both:

    echo \$y

    and

    echo $\y

    print $y with any shell, I would expect similar orthogonal behavior in the parameter expansion from your example as well. Do you have a portion in the POSIX standard that requires non-orthogonal behavior in the case you reported?

     

    Last edit: Jörg Schilling 2019-04-30
  • Avi Halachmi

    Avi Halachmi - 2019-04-30

    BTW: neither ksh88 nor ksh93 work with that on Solaris:

    Is there a system I can install in virtualbox and is similar to yours? Would OpenIndiana suffice?

    print $y with any shell

    Yes, because your examples are outside of double quotes, so in \$y the backslash removes the special meaning of $, and in $\y it's simply removed, but then not parameter-expanded because it's not a $<valid-thing-to-expand> (quote removal happens after parameter expansion). It's the same as printf %s%s $ y. So they end up the same, but for different reasons.

    According to POSIX, in an "expanding" here-document, a backslash behaves like a backslash inside double quotes.
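
A minimal side-by-side sketch of that claim: the same expansion once inside double quotes and once in a here-document whose delimiter is not quoted. The variable names dq and hd are made up. Per the results reported in this thread, shells other than [p]bosh give $y in both cases.

```shell
x=

# Inside double quotes: the backslash escapes the dollar sign.
dq="${x:-\$y}"

# Inside an expanding here-document (the delimiter EOF is not quoted).
IFS= read -r hd <<EOF
${x:-\$y}
EOF

printf '[%s][%s]\n' "$dq" "$hd"
```

If the two bracketed values differ, the shell treats the backslash in the here-document differently from the double-quoted case.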

     

    Last edit: Avi Halachmi 2019-04-30
  • Jörg Schilling

    Jörg Schilling - 2019-04-30

    If you make sure that /usr/gnu/bin/ is not at the front of PATH, it should work on OpenIndiana. There is a problem: sed was closed source as a collaborative development with IBM, and OpenIndiana does not have the Solaris sed but rather the FreeBSD sed, which supports -E.

    So if you like to check for the possible problems on a certified POSIX issue 7 tc2 UNIX, you should try to fetch the free version of Oracle Solaris-11.4.

    SchilliX may be available in a newer variant late this year.

    Your claim with double quote like behavior for here documents looks reasonable. I'll investigate whether I could change the behavior for bosh.

     
  • Avi Halachmi

    Avi Halachmi - 2019-04-30

    For reference, in this command:

    <sh> -c 'e= x=VAL; printf [%s] 1\$x 2$\x 3"\$x" 4"$\x" 5"${e:-\$x}" 6"${e:-$\x}"'
    

    Except for the final argument, all shells agree on:

    [1$x][2$x][3$x][4$\x][5$x]
    

    While for the final argument, some shells think it should print [6$\x], while other shells print [6$x]. (EDITED, accidentally replaced bosh with posh. see actual output below).

    The here-document case of ffmpeg's configure is most similar to 5"${e:-\$x}" where all shells, including [p]bosh, agree on 5$x when inside double quotes, but [p]bosh interprets it differently than others while inside a here-document.

    Here's how they split:

    $ shcmp -c 'e= x=VAL; printf [%s] 1\$x 2$\x 3"\$x" 4"$\x" 5"${e:-\$x}" 6"${e:-$\x}"'
    
    = /bin/sh, bash, bash_posix, dash_58, dash_591, dash_5102, dash_master, ksh93, ksh_master, mksh, mksh_posix, bb_ash, bosh, pbosh:
    [1$x][2$x][3$x][4$\x][5$x][6$\x]
    
    = loksh, oksh, pdksh, yash, yash_posix, posh:
    [1$x][2$x][3$x][4$\x][5$x][6$x]
    

    shcmp is https://github.com/avih/shcmp

     

    Last edit: Avi Halachmi 2019-04-30
  • Avi Halachmi

    Avi Halachmi - 2019-05-21

    Slightly off topic:

    30% of the CPU time used by bosh is used for multi byte character handling

    Is multibyte relevant anywhere other than to calculate ${#foo}, while matching ? in a pattern outside of a bracket expression, and while matching chars in a pattern inside bracket expression (both literals and classes)?

    I can't think of other places where it would matter for anything...

     
  • Jörg Schilling

    Jörg Schilling - 2019-05-21

    If you are on a UTF-8 based locale, try:

    > dash
    > $ IFS=ö
    > $ A=Jörg
    > $ echo $A
    > J  rg
    > $ 
    

    There are three arguments for echo (it looks like two spaces between the visible words); the middle argument is empty.

    If you are in a UTF-8 terminal this works to compare:

    > dash -c 'IFS=ö; A=Jörg; echo $A'
    > J  rg
    > bosh -c 'IFS=ö; A=Jörg; echo $A'
    > J rg
    

    Try this with any shell that could get a UNIX branding and you get only one space between J and rg.
    There are many more cases where it is important that the shell knows the margins of true characters (not bytes).
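
The same comparison can be made without relying on how the terminal renders the output, by counting fields instead of looking at spaces. This is a sketch only: count_fields is a made-up helper, and ö is built from its two UTF-8 bytes (octal 303 266) so that the script itself stays ASCII. The result depends on both the shell and the locale: a multibyte-aware shell in a UTF-8 locale is expected to report 2 fields, a byte-oriented one 3, matching the dash output above.

```shell
# count_fields is a hypothetical helper: it prints how many arguments it got.
count_fields() {
    echo "$#"
}

o_umlaut=$(printf '\303\266')    # the two UTF-8 bytes of "ö"
A="J${o_umlaut}rg"

# Split the unquoted $A with IFS set to that one character, inside a
# command substitution so the IFS change does not leak out.
fields=$(IFS=$o_umlaut; count_fields $A)
echo "fields: $fields"
```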

     
  • Avi Halachmi

    Avi Halachmi - 2019-05-21

    Thanks. Right, IFS is also char-based so it's affected as well.

    So ${#foo}, IFS, ? wildcard in a pattern, and inside pattern bracket expression, and that's it?

     

    Last edit: Avi Halachmi 2019-05-21
