Scintilla / Feature Requests / #1305 Optimize Lexer's WordList.Set() method

#1305 Optimize Lexer's WordList.Set() method

Milestone: Committed

Status: closed

Owner: Neil Hodgson

Labels: lexer (114) wordlist (1) optimization (8)

Priority: 5

Updated: 2019-10-24

Created: 2019-07-16

Creator: Zufu Liu

Private: No

See these code snippets

// LexerBase::WordListSet()
WordList wlNew;
wlNew.Set(wl);
if (*keyWordLists[n] != wlNew) {
    keyWordLists[n]->Set(wl);
    return 0;
}

// LexerCPP::WordListSet()
WordList wlNew;
wlNew.Set(wl);
if (*wordListN != wlNew) {
    wordListN->Set(wl);

after calling SCI_SETKEYWORDS, the keywords is split and sorted twice if it's initialized empty (default) or set to a different.
Copy wlNew's internal state into target with a method like wordListN->Reset(wlNew), or directly call wordListN->Set(wl) without comparison is more efficient, especially for large set of unsorted keywords.

Discussion

Zufu Liu - 2019-07-16

Patches added:

1305-strlen.diff
avoid strlen() in ArrayFromWordList() since the length is already known.

1305-Reset.diff
Added WordList::Reset(WordList &other) and applied in LexerBase::WordListSet(). Apply Reset() to other files in lexers folder will need individual commits.

1305-Reset.diff

1305-strlen.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-07-16

Replaced nullptr with 0.

1305-Reset-2.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-07-16

Another patch: move equal comparison into bool WordList::Reset(const char *s), replace void Reset(WordList &other).

1305-Reset-bool.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-09-27

From the point of view of the client code, is there a benefit from having both Set and Reset or should Set's implementation work like Reset and return a bool?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-09-28

I found the name Reset is just confusing, let Set return bool is good.

Possible, the following (without the complicated to compare every word in the set) would good enough for most application:
1. most keywords are fixed (hard-coded in application or it's configuration file) when applying a lexer.
2. words that dynamic collected and set by application trend to changed on each set

bool WordList::Set(const char *s) { if (list != nullptr && strcmp(list, s) == 0) { return false; } // rest code return true; }
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2019-09-28
  
  list differs from s in that it has NULs replacing spaces after each word (WordList.cxx line 53) so a simple strcmp wont work.
  
  An easy implementation is to always call ArrayFromWordList and sort the input and then compare the result.
  
  BTW, onlyLineEnds was for SciTE's use when WordList was shared with SciTE. It should probably be deprecated then removed in the next major release.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-09-29

summary: Optimize Lexer's WordListSet() method --> Optimize Lexer's WordList.Set() method
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-09-29

OK, update the patch. onlyLineEnds and other lexers is changed.

WordList-Set.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-09-30

labels: lexer --> lexer, wordlist, optimization

Group: Completed --> Committed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-09-30

Committed with a minor formatting change and some unit tests as [f08486].

If there is one lexer where you think this is strongly beneficial then that can be updated. Please do not send patches for other lexers.

Related

Commit: [f08486]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Zufu Liu - 2019-09-30
  
  Other lexers (that inherited from DefaultLexer) will not updated by me, I not use these lexers: they yield bigger binary than old stateless ColouriseDoc/FoldDoc lexers.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Zufu Liu - 2019-09-30
    
    Maybe LexSQL (many keywords), LexCPP (used by many language), LexHTML (many keywords) can be changed to use the new method.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Zufu Liu - 2019-09-30

Since adopting the new Set method doesn't break lexer's functionality, I think LexPython and other lexers not maintained by external developers can also adopting the changes. When people write new lexers or update existing lexers, they are most likely follow these well-written lexers. This is an opportunity to remove all double-set codes.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Neil Hodgson - 2019-09-30
  
  This is an opportunity to remove all double-set codes.
  
  Which is what I do not want to see. Making multiple similar changes over a project leads to mistakes as the developer stops looking closely and testing each instance. Global changes of technique are one of the biggest causes of instability.
  
  The possibility of new bugs has to be balanced against a small increase in performance that is likely not noticeable to users.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-10-24

status: open --> closed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Neil Hodgson - 2019-10-24

Committed with minor formatting change and some unit tests as [f08486].

If there is one lexer where you think this is strongly beneficial then that can be updated. Please do not send patches for other lexers.

Related

Commit: [f08486]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Log in to post a comment.