#2469 matching word break also matches interior of word

Milestone: normal bug
Status: open
Owner: nobody
Priority: 4
Updated: 2012-04-22
Created: 2005-10-12
Creator: Alan G
Private: No

In 4.2 final

Searching with the regular expression \b. or \<. jumps to
the beginning of a word, unless the cursor is already inside
a word, in which case it finds the next character within
the word.

This makes it impossible to replace characters only at
the beginning of a word.

For instance:

search on
\<.

replace with BeanShell snippet
_0.toLowerCase()

should force the first character of every word to be
lower case. However, it actually forces *every*
character of every word to be lower case.

Discussion

  • Björn Kautler
    2005-10-13

    As far as I have found, \b, \< and some others are not
    supported by the gnu.regex-Package.
    One additional point for throwing it away and using
    java.util.regex instead. ;-)

     
  • Robert Schwenn
    2008-03-02

    The gnu.regex-Package is thrown away now. Not so the bug.

    For example:
    - The expression "\b." matches *every* single character in a word.
    - The expression ".\b" matches the last character in a word (as expected) but also a following space character.

    jEdit 4.3pre12
    JRE 1.6.0_03
    WinXP SP2

     
  • Alan Ezust
    2012-01-19

    • milestone: 101608 --> 101607
    • priority: 5 --> 7
     
  • Steve Jakob
    2012-01-19

    I don't think this is a bug. I see 2 issues with Alan G's approach:
    1) Note that the boundary matching characters match the boundary and not the characters themselves. Each word has two boundaries, one BEFORE the first character and one AFTER the last. Your regex will match the first boundary of your complete search string, which occurs before the first word character.
    2) The replace string you specify indicates that you want to replace every character with a lower-case character since "_0" refers to the complete contents of your searched text. To clarify, if my search text is the string "this text" Alan's BeanShell snippet is equivalent to "this text".toLowerCase().

    As an alternative, the following appears to work for me:
    Search regex: \b(\w)(\w*)
    Replace with BeanShell snippet: _1.toLowerCase() + _2

    By separating the first character of each word (following a boundary) from the rest I can transform just that one character.
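The suggested search and replace pair can be tried outside jEdit with plain java.util.regex. This is only a sketch: the class and method names are invented for illustration, and group(1) and group(2) stand in for the _1 and _2 variables of the BeanShell snippet.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LowerFirstLetter {
    // Same pattern as in the comment above: group 1 is the first word
    // character after a boundary, group 2 is the rest of the word.
    private static final Pattern WORD = Pattern.compile("\\b(\\w)(\\w*)");

    // Hypothetical helper: lower-cases only the first character of each word.
    static String lowerFirst(String text) {
        Matcher m = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Equivalent of the snippet "_1.toLowerCase() + _2".
            String replacement = m.group(1).toLowerCase() + m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowerFirst("This Text, HERE")); // this text, hERE
    }
}
```

Because each match consumes the whole word, the search can never resume in the middle of a word.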

     
  • Robert Schwenn
    2012-01-21

    Just tried again:

    1. ".\b" matches the word boundary and the preceding character *as expected*
    2. "\b." matches *every single character* in a word, which is *a bug*, isn't it?

    jEdit 4.4.2 and jEdit 4.5pre1
    JRE 1.6.0_24
    WinXP SP3

     
  • Bosse Iseborn
    2012-01-22

    As noted, \b matches word boundaries, i.e. both the beginning and the end of a "word". So \b. will indeed match the first letter of any word, but also the first whitespace character (or any other non-word character, such as the period after a sentence) AFTER every word. (That doesn't matter if you just want to upper-case it, though.)

    If you want to upper-case the first letter of every word you should use \b\w instead, or even \b[a-zA-Z] (or similar), depending on whether the search or the conversion is slower.

    My guess is that in jEdit, after each match has been handled, the pattern is applied again to whatever comes after the last matched character. That would indeed cause every character in a word to match, since the first position of any string that starts with a word character matches \b. In other words, the first match for \b in such a string is (and should be) at ^, i.e. the first position of the matched string.

    In that case it is not a bug, and the correct way is indeed to use something like \b(\w)\w* as suggested by Mr. Jakob. That would let each match consume the rest of the word so that it is not matched in the next iteration.
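The guess above can be checked against java.util.regex directly. The sketch below is not jEdit's code, and the class and method names are made up for illustration. It compares one matcher running over the whole text with a loop that re-applies the pattern to the remaining substring after each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryRescan {
    private static final Pattern P = Pattern.compile("\\b.");

    // One matcher over the whole text: \b can see the character
    // in front of every candidate position.
    static List<Integer> scanWhole(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    // Assumed behavior from the comment: after each match, search again in
    // the substring that follows it. The preceding character is cut off,
    // so position 0 of the remainder looks like the start of a word.
    static List<Integer> scanRemainders(String text) {
        List<Integer> starts = new ArrayList<>();
        int offset = 0;
        while (offset < text.length()) {
            Matcher m = P.matcher(text.substring(offset));
            if (!m.find()) {
                break;
            }
            starts.add(offset + m.start());
            offset += m.end(); // "." always consumes one character
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(scanWhole("ab cd"));      // [0, 2, 3]
        System.out.println(scanRemainders("ab cd")); // [0, 1, 3, 4]
    }
}
```

For "ab cd" the whole-text scan matches "a", the space and "c", while the remainder scan matches every word character, which is the behavior reported in this ticket.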

     
    • milestone: 101607 --> normal bug
    • priority: 7 --> 4
     
  • The search-and-replace design does not allow this to be fixed without a significant interface extension. The SearchMatcher class has a findNext method that always starts from index 0 and does not accept a different start index. Because findNext has no access to the preceding characters, it cannot perform a word-boundary search correctly. So I don't expect this to be fixed soon.

    A fix would need a lot of care because there are many reverse-search cases that must be taken into account. I'm not going to do it.
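For comparison, java.util.regex itself can start a search at an arbitrary index while still seeing the characters before it: Matcher.find(int) resets the matcher and searches from the given index of the full input. The sketch below only illustrates that idea. nextMatchStart is a hypothetical helper, not part of jEdit's SearchMatcher API, and it ignores reverse search entirely.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetAwareSearch {
    private static final Pattern P = Pattern.compile("\\b.");

    // Hypothetical helper: returns the start of the next match at or after
    // 'from', or -1 if there is none. The matcher is given the full text,
    // so \b at 'from' is decided with knowledge of the previous character.
    static int nextMatchStart(CharSequence text, int from) {
        Matcher m = P.matcher(text);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Resuming after the "a" in "ab cd" skips "b" and finds the space.
        System.out.println(nextMatchStart("ab cd", 1)); // 2
    }
}
```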

    I don't think this is really crucial functionality, so I'm lowering the priority. There is even a workaround: first replace all "\b" with "X" (this works), then replace all "X." using a suitable Java snippet. X must of course be replaced with something that does not occur in the file.
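The two-step workaround can be sketched in plain Java as follows. This is an illustration, not jEdit code, and it makes several assumptions: the marker is a control character, the class and method names are invented, and the replacement lower-cases the marked character as in the original report. The second pattern makes the character after the marker optional and lets "." match line breaks, so that markers placed at the end of a line or at the end of the text are removed as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkerWorkaround {
    // Assumed marker ("X" in the comment): must not occur in the text.
    private static final String MARKER = "\u0001";

    static String lowerAfterBoundary(String text) {
        if (text.contains(MARKER)) {
            throw new IllegalArgumentException("marker already occurs in text");
        }
        // Step 1: insert the marker at every word boundary.
        String marked = text.replaceAll("\\b", MARKER);
        // Step 2: replace each marker plus the following character (if any)
        // with the lower-cased character, which also removes the marker.
        Matcher m = Pattern.compile(MARKER + "(.?)", Pattern.DOTALL).matcher(marked);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1).toLowerCase()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowerAfterBoundary("Ab Cd")); // ab cd
    }
}
```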