
#1967 Bash Lexer - Here-strings Not Recognized Properly

Group: Bug
Status: closed-fixed
Owner: nobody
Priority: 5
Updated: 2023-06-18
Created: 2017-08-06
Private: No

This may seem cosmetic (I saw the closed invalid/cosmetic bugs regarding this), yet it is evidence that the lexer is flawed and not recognizing here-strings properly.

This may be low priority yet it should get fixed - the lexer should know what a here-string is and color accordingly.

#!/bin/bash
TEMPLATEBLOCKHEADER=$(cat << EOF
    ####################################################################
   ###################  T E M P L A T E -- S T A T I C - v1.0
  ##############____ _     ___  ____    _    _     ____  
 ##############/ ___| |   / _ \| __ )  / \  | |   / ___| 
##############| |  _| |  | | | |  _ \ / _ \ | |   \___ \ 
##############| |_| | |__| |_| | |_) / ___ \| |___ ___) |
 ##############\____|_____\___/|____/_/   \_\_____|____/ 
  ####################################################################                              
   ####################################################################                                           
EOF
)


Discussion

  • Kein-Hong Man

    Kein-Hong Man - 2017-08-08

    This is a bug, but fixing it is currently not my priority. The bash lexer balances parentheses for $(), and handles single quote and double quote escaping of parentheses correctly but I missed (or ignored, ha ha) HEREDOCs.

    I can't fix this anytime soon because scanning these things is an ugly mess. One basically has to parse the whole thing; lexing breaks in all sorts of ways. No thanks to the uber geniuses who inflicted such syntax upon us.

    Historically Scintilla has implemented styling of $() as a kind of string, just like its backticks brother. Now, removing $() strings and styling the insides as normal bash code is tempting, but past bug reports indicate users are fond of using $() in "" strings. For example, a simplified past SF bug report:

    TEST="$(echo "echo )")"; echo $($TEST)

    Delimiter behaviour changes according to the latest inner string. So you need a stack to remember the changing behaviour. Adding HEREDOCs into this mess is likely non-trivial, and a simple patch may be fragile.

    Now assume we will have coders who are very fond of using the VAR="string $(command) string" style. So here is an example of bash doing what the coder wants (but is a huge pain for lexers to scan):

    echo "foo $(cat << EOF
    Try "th"is"
    EOF
    ) boo"

    It looks like the double-quotes are unbalanced, but bash indulges the coder and delimiter behaviour changes for each of the string-like delimiter levels. It's not possible to scan this correctly using normal string scanning methods.
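
The delimiter stack described in this post can be sketched in a few lines of Bash. This is only an illustration, not the LexBash.cxx implementation: `scan_nesting` is a hypothetical helper that tracks just `"` and `$(`, and it deliberately ignores single quotes, backticks, `$((` arithmetic and here-docs.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not Scintilla code: walk a snippet one character at a
# time and keep a stack of open delimiters (q = double quote, p = "$(").
# Prints "<deepest level> <levels still open at the end>".
scan_nesting() {
    local text=$1 stack="" top ch i max=0
    for ((i = 0; i < ${#text}; i++)); do
        ch=${text:i:1}
        top=${stack: -1}                 # innermost open delimiter, if any
        if [[ $ch == '\' ]]; then
            i=$((i + 1))                 # skip the escaped character
        elif [[ $ch == '$' && ${text:i+1:1} == '(' ]]; then
            stack+=p                     # "$(" opens a command level
            i=$((i + 1))
        elif [[ $ch == ')' && $top == p ]]; then
            stack=${stack%?}             # ")" closes only a command level
        elif [[ $ch == '"' ]]; then
            # A quote closes the innermost string, otherwise it opens one.
            if [[ $top == q ]]; then stack=${stack%?}; else stack+=q; fi
        fi
        if (( ${#stack} > max )); then max=${#stack}; fi
    done
    echo "$max ${#stack}"
}

scan_nesting 'TEST="$(echo "echo )")"; echo $($TEST)'
scan_nesting $'echo "foo $(cat << EOF\nTry "th"is"\nEOF\n) boo"'
```

The first call prints `3 0`: three levels deep, and balanced at the end. The second prints `3 2`: because the sketch knows nothing about here-docs, it counts the quotes in the here-doc body as string delimiters and finishes with two levels still open.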

     
  • AMDanischewski

    AMDanischewski - 2017-08-08

    I don't want to oversimplify - I know it can be complex. But, have you tried scanning first for the here-strings? After you find the here-strings, you can save the beginning line and character offset, then the end line and character offset. Then exclude that section from having the other rules applied - coloring it up as a text string.

    There are different ways to implement this. For instance, you could build an associative array of here-strings with a unique token for each. On the first pass, clip all here-strings out of the text, leaving a simple string token (e.g. "TOKEN123AB") in place of each. After all the rules are applied, a last pass puts the here-strings back in.

    (This type of logic if not already present, will probably come in handy for other situations.)

     

    Last edit: AMDanischewski 2017-08-08
  • Kein-Hong Man

    Kein-Hong Man - 2017-08-08

    Look at the following samples. Both are double-quoted strings.

    echo "foo $(cat << EOF
    Try "th"is"
    EOF
    ) boo"

    echo "<<EOF"

    HEREDOC is only recognized in the first one because $() is processed inside the "string". Bash switches string behaviour to backticks within the $() and the HEREDOC is processed. In the second, the << is just literal characters. Bash allows this kind of nesting, and shell coders will tend to use (cough abuse) $ constructs in "strings". One can even abuse this kind of string to nest to infinity, but Scintilla has a stack limit of 8 I think and I have tests up to 9 levels of nesting.

    Implementing your example alone is a quick solution, but people are using $() in "strings", so it will only be a matter of time before we get another bug report. Sure it can be fixed, but I don't want to rush the fix. There are many kinds of strings and several ways of nesting them. The chief culprit is $() which is what bash intends it for anyway, to allow code to be inserted in here and there to arbitrary levels of complexity. Elegant, they say. So given the Bash lexer arch, a stack is needed, and behaviour is set for each level change. The HEREDOC infrastructure is not tiny, it's best to limit the recognition to a common subset. Not a simple fix.
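
The two samples in this post can be checked directly in Bash. A minimal sketch, using the hypothetical variable names `nested` and `literal`; it only shows what the shell does with each string, not how a lexer should scan them.

```bash
#!/usr/bin/env bash
# Here-doc inside $() inside a double-quoted string: bash processes the <<.
nested="foo $(cat << EOF
Try "th"is"
EOF
) boo"

# Plain double-quoted string: << is just two literal characters.
literal="<<EOF"

printf '%s\n' "$nested" "$literal"
```

This prints `foo Try "th"is" boo` followed by `<<EOF`. The quotes in the here-doc body come through untouched, and the second string never starts a here-doc.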

     
  • AMDanischewski

    AMDanischewski - 2017-08-08

    Maybe you didn't see the update to my response; updates don't seem to be transmitted via email.

    In the first pass if you have any here-string it will have the form of cat << TOKEN (outside of commented sections), TOKEN ends the here-string on the first line where it is found alone (regex=/^TOKEN$/). That part can be stripped out - starting from the cat << TOKEN to /^TOKEN$/ and stored in an associative array with a unique TOKEN identifier (maybe including SOH unprintable ascii and a serial number) (e.g. "TOKEN123"). After all here-strings are replaced then the highlighting rules can be applied as normal. Then in the last pass the unique TOKEN identifiers are replaced with their respective here-strings stored in the associative array.
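
The mask-and-restore passes proposed here can be sketched as follows. This assumes Bash 4+ (associative arrays), simple unquoted limit strings, at most one here-doc per line, and no `<<` hidden in quotes or comments. All names (`HEREDOCS`, `MASKED`, `mask_heredocs`, `unmask_heredocs`) are hypothetical, and a printable `@@HEREDOCn@@` placeholder stands in for the SOH-prefixed token suggested in the post.

```bash
#!/usr/bin/env bash
declare -A HEREDOCS=()   # placeholder -> here-doc body
MASKED=""                # source text with bodies replaced by placeholders

mask_heredocs() {        # first pass: reads source on stdin, fills MASKED/HEREDOCS
    local line limit dash body cmp key found n=0
    # "<<" or "<<-" not preceded by another "<", then a simple limit string.
    local re='(^|[^<])<<(-?)[[:space:]]*([A-Za-z_][A-Za-z_0-9]*)'
    MASKED=""
    while IFS= read -r line; do
        MASKED+=$line$'\n'
        [[ $line =~ $re ]] || continue
        dash=${BASH_REMATCH[2]} limit=${BASH_REMATCH[3]} body="" found=0
        while IFS= read -r line; do
            cmp=$line
            if [[ $dash == - ]]; then          # <<- ignores leading tabs
                while [[ $cmp == $'\t'* ]]; do cmp=${cmp#?}; done
            fi
            if [[ $cmp == "$limit" ]]; then found=1; break; fi
            body+=$line$'\n'
        done
        n=$((n + 1))
        key="@@HEREDOC${n}@@"
        HEREDOCS[$key]=$body
        MASKED+=$key$'\n'
        if (( found )); then MASKED+=$line$'\n'; fi
    done
}

unmask_heredocs() {      # last pass: reads masked text on stdin, prints original
    local line
    while IFS= read -r line; do
        if [[ $line == @@HEREDOC*@@ ]] && [[ -n ${HEREDOCS[$line]+set} ]]; then
            printf '%s' "${HEREDOCS[$line]}"
        else
            printf '%s\n' "$line"
        fi
    done
}

src=$'cat << EOF | rev\n"ti"hs" On\nEOF\necho done'
mask_heredocs <<< "$src"
printf '%s' "$MASKED"                    # body replaced by @@HEREDOC1@@
printf '%s' "$MASKED" | unmask_heredocs  # original text again
```

`mask_heredocs` has to run in the current shell (fed by a redirection, not a pipe) so that its changes to `HEREDOCS` and `MASKED` survive. The sketch does not decide whether a given `<<` sits inside a string or a `$()`, which is the part the nesting examples above make difficult.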

     
  • Kein-Hong Man

    Kein-Hong Man - 2017-08-09

    I got all the postings via e-mail, no problem.

    Not everybody is going to use cat with HEREDOCs.

    Feel free to offer a patch. Currently I am not able to work on this immediately, but it's queued up.

     
    • AMDanischewski

      AMDanischewski - 2017-08-09

      I don't write much C/C++, but I had a quick look and I see that the
      comments already note that the here-doc starts after the >>.

      LexBash.cxx, lines 600-604:
      // Must check end of HereDoc state 1 before default state is handled
      // if (HereDoc.State == 1 && sc.atLineEnd) {
      // Begin of here-doc (the line after the here-doc delimiter):
      // Lexically, the here-doc starts from the next line after the >>, but
      // the first line of here-doc seem to follow the style of the last EOL
      // sequence

      That's almost true, yet in Bash there is the bit shift operator too.
      echo $((192 << 23))
      echo $((50 >> 2))

      I don't have much time myself and I am unfamiliar with this code, but I may
      be able to provide something in C that can handle here-docs and maybe then
      you or someone else can meld it into Scintilla.
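
The ambiguity raised in this post is easy to reproduce. A minimal sketch with hypothetical variable names: the same `<<` is a bit shift inside `$(( ))` and a here-doc opener inside `$( )`, so a scanner has to know which context it is in before classifying it.

```bash
#!/usr/bin/env bash
# Inside $(( )), << and >> are bit shifts.
left=$((192 << 23))
right=$((50 >> 2))

# Inside $( ), the same << introduces a here-doc; its body is plain text.
heredoc=$(cat << EOF
192 << 23
EOF
)

printf '%s\n' "$left" "$right" "$heredoc"
```

This prints `1610612736`, `12` and the literal text `192 << 23`.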


  • AMDanischewski

    AMDanischewski - 2017-08-09

    I think the logic for a here-doc may be the << followed by practically any limit string - it can have a $ in the beginning and the limit string can actually be an escaped parenthesis.
    Also, there is the tab-suppression form, <<-, which strips leading tabs from the here-doc lines so the body can be indented.

    For instance the following is a valid here-doc:

    figlet <<-  \) | rev
        hi 
        here
    )
    

    This regex seems to get them all:
    grep -P '<<[-]?[\s]*([^\s)]|([\\]\)))+.*$' <<< 'cat <<- \) '

    Update:
    grep -P '<<[-]?[\s]*([^\s)]|([\\]\)))+((?!(\)\)))|[^)"'\''])+$'
    To handle things like:

    echo "what << is this" 
    echo 'what << is this'
    echo $((192 << 23))
    echo "oh $(cat << EOF | rev
    "ti"hs" On 
    EOF
    ) here-doc"
    

    Update again:
    grep -P '<<[-]?[\s]*([^\s)]|([\\]\)))+((?!(\)\)))|([^)"'\''])|("[^"]*")|('\''[^'\'']*'\''))+$'
    To handle things like:

    cat <<-LIMITSTR | rev | sed 's/sh/sh\!/g'
        hsab 
        cod er-eh
    LIMITSTR
    
    cat <<-LIMITSTR | rev | sed "s/sh/sh\!/g"
        hsab 
        cod er-eh
    LIMITSTR
    
    cat <<-LIMITSTR | rev | sed "s/sh/sh\!/g" | sed 's/b/BBB/g'
    hsab
    cod er-eh
    LIMITSTR
    

    Then all that's left is to grab the limit string and look for it on a line by itself.
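
The `<<-` form and the escaped `\)` limit string from the start of this post can be tried out as below. A minimal sketch: `tr` is swapped in for `figlet` and `rev` so that it runs with only standard tools, and `script` and `out` are hypothetical names. The script text is built with `$'...'` so the leading tabs are written explicitly as `\t`.

```bash
#!/usr/bin/env bash
# <<- strips leading tabs (not spaces) from the body and the terminator line.
# The limit string is written as \) and the here-doc ends at a bare ")".
script=$'cat <<- \\) | tr a-z A-Z\n\thi\n\there\n)\n'
out=$(bash -c "$script")
printf '%s\n' "$out"
```

The output is `HI` and `HERE` on separate lines with the tabs removed, so a scanner looking for the terminator has to compare against the unquoted limit string, and ignore leading tabs when `<<-` is used.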

     

    Last edit: AMDanischewski 2017-08-09
  • AMDanischewski

    AMDanischewski - 2017-08-09

    Here is a little parser that returns the filename:line no: LIMITSTRING for all here-docs in *.bsh files in a directory.

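    while IFS= read -r a; do
     # Print "filename:lineno: " from grep's "filename:lineno:text" output.
     echo -n "${a} " | sed -r 's/(^[^:]*)(:[0-9]+)(.*$)/\1\2: /'
     # Print the limit string: drop everything up to << or <<-, then anything after it.
     echo "${a}" | sed -r 's/^.*<<[-]?\s*//;s/\s.*//'
    done < <(grep -nP '^[^#]*[^<]<<[^<][-]?[\s]*([^\s)]|([\\]\)))+((?!(\)\)))|([^)"'\''])|("[^"]*")|('\''[^'\'']*'\''))+$' *.bsh)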
    

    It will fail on <<- syntax embedded within other here-docs, yet you can sort that out by line number: after finding the LIMITSTRING of the first one, you can tell that a subsequent returned limit string falls within its line range.
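
The line-range idea can be sketched without grep. In this illustration the hypothetical helper `heredoc_ranges` reads a script on stdin and prints `START-END LIMIT` for each here-doc. It assumes simple unquoted limit strings and does not handle `<<` inside quotes or comments. Because it consumes each body before looking for the next opener, a `<<` inside a here-doc is never reported.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "START-END LIMIT" for each here-doc on stdin.
heredoc_ranges() {
    local line limit dash cmp start n=0
    local re='(^|[^<])<<(-?)[[:space:]]*([A-Za-z_][A-Za-z_0-9]*)'
    while IFS= read -r line; do
        n=$((n + 1))
        [[ $line =~ $re ]] || continue
        dash=${BASH_REMATCH[2]} limit=${BASH_REMATCH[3]} start=$n
        # Consume the body here, so any "<<" inside it is not seen as an opener.
        while IFS= read -r line; do
            n=$((n + 1))
            cmp=$line
            if [[ $dash == - ]]; then          # <<- ignores leading tabs
                while [[ $cmp == $'\t'* ]]; do cmp=${cmp#?}; done
            fi
            if [[ $cmp == "$limit" ]]; then
                printf '%d-%d %s\n' "$start" "$n" "$limit"
                break
            fi
        done
    done
}

heredoc_ranges <<< $'cat <<-LIMITSTR | rev\n\thsab\n\tcat << INNER\n\tLIMITSTR\necho $((192 << 23))'
```

For this input the only line printed is `1-4 LIMITSTR`: the `cat << INNER` on line 3 lies inside the first here-doc's range, and the shift on line 5 is skipped only because this sketch requires a limit string to start with a letter or underscore.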

     
  • Kein-Hong Man

    Kein-Hong Man - 2017-08-09

    Thank you, it looks very interesting.

    I'm not sure what I can do with that for now. I guess C++14 or something or other has built-in regex capabilities, but I don't know if Scintilla lexers have started using such regexes, and my C++14 familiarity is zero at the moment. But I hope to get up to speed on C++14 and everything else, and start working on my TODO list.

     
  • AMDanischewski

    AMDanischewski - 2017-08-10

    When you get a chance maybe you can take a look at PCRE, http://www.pcre.org/. That's what Apache and PHP use for regexes.

     
  • Kein-Hong Man

    Kein-Hong Man - 2017-08-10

    Gee man, you don't have to school me on pcre, I'm not a newbie.
    Neil has positioned Scintilla 4.0 to use C++ 14 and above. So I checked, and it's C++ 11 that introduced regex. There has been talk of C++ regex that I haven't followed closely on the mailing list, so it would not be very intelligent of me to listen to certain advice.

    I will be less talky in the near future due to my trying to get things coded so I may not respond to further postings on this ticket. Thank you for your interest in these highlighting issues.

     
  • Zufu Liu

    Zufu Liu - 2023-06-18
    • labels: --> lexilla, bash
    • status: open --> closed-fixed
     
  • Zufu Liu

    Zufu Liu - 2023-06-18

    Fixed in Lexilla 5.2.5 (set lexer.bash.command.substitution to 1 or 2).

     
