#1734 [PATCH] Use binary search for WordList::InList()

Bug
Status: closed-rejected
Priority: 3
Updated: 2015-06-20
Created: 2015-06-08
Creator: Jiri Techet
Private: No

In Geany we use keyword lists to syntax-highlight types in the document. In big projects there can be a huge number of types (for the Linux kernel there are about 100000), and linear search in the list is too slow. The attached patch uses binary search instead.

Note: I have tested only the ordinary search and the prefix search starting with ^. However, I haven't tested the "abbreviated" search using ~ because I don't know how to trigger it (if I checked correctly, it's only used by the SQL lexer).

I'll attach some profiles to show performance differences before/after the patch.
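
For reference, the core of the change is an ordinary binary search over the word array, which WordList already keeps sorted to build its per-first-letter index. A sketch of the idea (illustrative, not the literal patch):

    #include <cstring>

    // Sketch: binary search over a sorted array of C strings.
    // 'words' must be sorted with strcmp ordering.
    static bool InListBinary(const char *const *words, int count,
                             const char *s) {
        int lo = 0;
        int hi = count - 1;
        while (lo <= hi) {
            const int mid = lo + (hi - lo) / 2;
            const int cmp = strcmp(words[mid], s);
            if (cmp == 0)
                return true;
            if (cmp < 0)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return false;
    }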

1 Attachment

Discussion

  • Jiri Techet

    Jiri Techet - 2015-06-08

    These two profiles show the performance with the mentioned Linux kernel project with 100000 type keywords. This is where the patch helps most, because those 100000 keywords (or the subset starting with a given character) don't have to be compared one by one.

     
  • Jiri Techet

    Jiri Techet - 2015-06-08

    The binary search is clearly a bit slower when used only for "normal" keywords. I tried to simulate this situation by using only the standard C keywords on a file with many words, so that many keyword checks have to be done. The time spent in WordList::InList() increased from 5% to about 12%, so there's about a 7% slow-down. However, the majority of the time is still spent lexing, and the 7% difference is marginal. It would be possible to use the old search method when there is just a small number of keywords and switch to binary search for a bigger number, but IMO the performance difference isn't worth the work (and the amount of code).

     
  • Neil Hodgson

    Neil Hodgson - 2015-06-08

    The SCI_SETIDENTIFIERS mechanism is meant to be used for highlighting large numbers of identifiers, using a std::map to map between word and identifier style, and this is supported by the C++/C lexer. What does performance look like when using SCI_SETIDENTIFIERS?
    http://www.scintilla.org/ScintillaDoc.html#SCI_SETIDENTIFIERS
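
    Roughly, for illustration (the send() call stands in for however the
    application dispatches Scintilla messages; the style numbers and word
    list are the C++ lexer's and made up, respectively):

        // Allocate one sub-style of SCE_C_IDENTIFIER and attach a word list.
        const int sub = send(SCI_ALLOCATESUBSTYLES, SCE_C_IDENTIFIER, 1);
        send(SCI_SETIDENTIFIERS, sub, (sptr_t)"MyType OtherType size_t");
        // Sub-styles are ordinary styles, so colour them the usual way.
        send(SCI_STYLESETFORE, sub, 0x00AA00);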

     
  • Jiri Techet

    Jiri Techet - 2015-06-08

    If I understand correctly, this will only work for lexers that support it - and in Geany I see only CPP and Verilog. What about all the scripting languages like Python, Ruby, etc.?

     
    • Neil Hodgson

      Neil Hodgson - 2015-06-09

      When applications add support, it's likely more lexers will add support too.

      Large C projects like Linux tend to have larger numbers of global identifiers than projects in languages with better modularity.

       
  • Jiri Techet

    Jiri Techet - 2015-06-09

    Well, I don't want to wait several years for other lexers to be updated (and I don't plan to do it myself). Geany tries to offer a consistent set of features for all languages, and I don't want to create some "privileged" set of languages with better support. So I'd prefer a solution which works now.

    What about keeping the existing implementation as it is and using binary search only when the number of keywords is large enough? We need just the non-prefix, non-abbreviated case, so binary search can be introduced in a single place if you prefer keeping the existing implementation. Have a look at the attached patch.
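
    In sketch form, the idea is just this (names illustrative, not the
    literal diff; InListBinary as in the sketch in the description above,
    InListLinear standing in for the existing code path):

        // Keep the existing per-first-character scan for small lists and
        // switch to binary search over the sorted 'words' array otherwise.
        bool WordList::InList(const char *s) const {
            if (len > 1000)                          // proposed cut-off
                return InListBinary(words, len, s);
            return InListLinear(s);                  // existing code path
        }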

     
    • Neil Hodgson

      Neil Hodgson - 2015-06-09

      Geany does not offer a consistent set of features for all languages, since it relies upon lexers which vary widely in features, quality, and performance.

      If language equity is really of overriding importance, then you should implement your identifier highlighting scheme independently. For example, you could use indicators and a background pass run after lexing, similar to how spell-checking and highlight-current-word are often implemented.
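
      As a sketch, assuming a Scintilla recent enough to have INDIC_TEXTFORE
      (the indicator number is arbitrary, and the word scan plus the
      wordStart/wordLength positions are the application's own):

          // One-time setup: an indicator that recolours just the text.
          const int ind = 8;
          send(SCI_INDICSETSTYLE, ind, INDIC_TEXTFORE);
          send(SCI_INDICSETFORE, ind, 0x0000FF);
          // Background pass after lexing: for each identifier occurrence
          // the application finds in the visible range:
          send(SCI_SETINDICATORCURRENT, ind);
          send(SCI_INDICATORFILLRANGE, wordStart, wordLength);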

       
  • Jiri Techet

    Jiri Techet - 2015-06-10

    As I said, Geany tries (which doesn't mean it always succeeds) to offer a consistent set of features, and this particular feature is supported by all languages having a ctags parser, which is most of them. I'm afraid Colomban and others won't be very happy if I suggest basically removing this feature for everything not using the CPP lexer.

    Doing another pass is theoretically possible, but why, when the lexers already do the same work? Moreover, we would be missing the lexers' information about which characters may appear in an identifier, so this extra pass would have to include language-specific parts - an additional thing to maintain. Effort-wise, a patch doing this in Geany would be much, much bigger than the patch above. I'm afraid I won't do this.

    Is there any problem with the second patch I provided, where the binary search kicks in only when the keyword list grows beyond 1000 words? From the calculation I made, this is guaranteed to be at least as fast as the existing implementation in all cases.

    If you are completely against using the binary search, I'm afraid I will just keep Geany's implementation of this feature as it is even though it gets slower for longer keyword lists. I understand your point this is kind of misuse of the API but I don't want to write yet another language-specific parser for Geany (apart from ctags and the scintilla lexers we already use).

     
    • Neil Hodgson

      Neil Hodgson - 2015-06-10

      "to basically remove this feature for everything not using the CPP lexer."

      Implementing SCI_SETIDENTIFIERS for lexers that support it does not require removing functionality from other lexers.

       
  • Jiri Techet

    Jiri Techet - 2015-06-10

    I'm not talking about removing any functionality from lexers, I'm talking about removing functionality from Geany. The type coloring feature is currently supported for almost all languages using SCI_SETKEYWORDS, and unless I'm missing something, SCI_SETIDENTIFIERS is supported only by the CPP/Verilog lexers, so doing it this way would remove this functionality from Geany for all languages using other lexers - am I right?

     
    • Neil Hodgson

      Neil Hodgson - 2015-06-10

      Test whether SCI_SETIDENTIFIERS is present and use it if it is, otherwise take the current SCI_SETKEYWORDS branch. To see if SCI_SETIDENTIFIERS can be used, call SCI_GETSUBSTYLEBASES and see if the identifier style is present.

      You could also white-list lexers in your application configuration, as you probably already do to specify which style number and keyword list are used for this, but that wouldn't adapt as easily to improved lexers.
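
      A sketch of the check (SCE_C_IDENTIFIER is the C++ lexer's identifier
      style; other lexers would report their own style numbers, and send()
      again stands in for the application's message dispatch):

          // Ask the lexer which base styles can be sub-styled.
          char bases[256] = {0};
          send(SCI_GETSUBSTYLEBASES, 0, (sptr_t)bases);
          const bool canSubStyle =
              strchr(bases, (char)SCE_C_IDENTIFIER) != nullptr;
          // When false, take the SCI_SETKEYWORDS branch instead.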

       
  • Jiri Techet

    Jiri Techet - 2015-06-11

    Yes, this is probably doable, but it still falls back to the possibly slow SCI_SETKEYWORDS branch. I still haven't heard your opinion on using binary search when the number of keywords gets sufficiently large. IMO it's worth the extra 14 lines of code, but you may have a different opinion.

     
    • Neil Hodgson

      Neil Hodgson - 2015-06-11

      Having an uncertain performance profile (possibly faster for large sets and slower for small sets) which is then avoided with an arbitrary cut-off troubles me. It's likely there are better implementations.

      BTW, the local variables 'key' and 'value' have poor names.

       
  • Jiri Techet

    Jiri Techet - 2015-06-11

    Sure, I can rename the variables.

    As for the cut-off - it's not arbitrary at all. Let's get formal :-)

    Theorem:
    Binary search is always faster than the existing algorithm for more than 1000 keywords.

    Proof:
    Consider a 1000-element list. Binary search will need at most 10 comparisons to find the keyword: log_2(1024) = 10.

    Now consider the current algorithm. Assume the most favourable scenario for it: the keywords are evenly distributed by first character (in reality the distribution will be uneven, so for instance there will be more keywords under 'a' than under 'b', etc.). Take the alphabet size to be 2*26 = 52 (lower and upper case); again this is chosen to favour the current algorithm, since in reality keywords starting with some characters will be missing and the chunks to be searched linearly will be longer. With 1000 elements we get 1000/52 = 19.2 as the length of the chunk that has to be searched linearly, and on average we only need to go through half of it, i.e. about 10 elements - exactly the 10 comparisons the binary search needs. So at about 1000 elements (1024 or whatever, the difference is minimal) binary search starts getting better.

    In the above we considered the most favourable scenario for the existing algorithm. In reality, binary search will get faster even earlier because of the uneven first-character distribution. So at about 1000 elements binary search is guaranteed to be faster than the existing algorithm.

    Q.E.D.
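
    In equation form, the break-even point under this model is the n solving

        log_2(n) = n / (2 * 52)

    which gives n ≈ 1024 (check: log_2(1024) = 10 and 1024/104 ≈ 9.8).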

    Theorem:
    Existing languages won't be affected by this change when SCI_SETKEYWORDS is used for their keywords.

    Proof:
    There are about 85 keywords in C++, which is about the maddest language in this respect, and even if people add a few more nice-to-have-highlighted words and there are a few more C++ standard revisions, this will still be well below 1000.

    Q.E.D.

     
  • Jiri Techet

    Jiri Techet - 2015-06-11

    "Its likely there are better implementations."

    I don't really have proof of this, I'm afraid, but I can't imagine how you could algorithmically find where an O(log(n)) algorithm becomes faster than an O(n) one without finding the intersection of the y = log(n) curve and the (scaled) y = n curve. The number 1000 is the solution of log_2(n) = n / (alphabet_size * 2) with alphabet_size = 52, and while you could solve this in the code, it doesn't make much sense IMO.

    This is a hybrid algorithm, similar in spirit to e.g. Timsort, one of the fastest real-world sorting algorithms:

    http://en.wikipedia.org/wiki/Timsort

    but here it's up to the developer to decide which of the algorithms to use in which situation.

     
    • Neil Hodgson

      Neil Hodgson - 2015-06-12

      This analysis shows little understanding of the actual performance of this code. While the complexity difference should make the binary search win with large keyword sets, at smaller sizes like 1000 elements the constant factors play a large role. It's not possible to determine the point at which a binary search will perform better without a much deeper understanding.

      I performed an experiment with a 1092-element keyword list created by combining all the keywords in SciTE's cpp.properties and fortran.properties. This was run against a list of 2184 words, half from the keyword list and half not. Building with the MSVC compiler version and options normally used to build releases of SciTE, the linear search came out 38% faster than the binary search.

      While we could play benchmark games with different data sets, compilers and optimisation options, it's fairly plain that your analysis is insufficient to explain this observed result. I don't want to spend more time on this, so for now I am rejecting the patch due to its poor performance in some cases.
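
      A measurement of this kind can be reproduced with a minimal harness
      along these lines (the word lists are placeholders; note that the real
      WordList::InList only scans words sharing the first character, so a
      whole-list linear scan overstates the linear cost):

          #include <algorithm>
          #include <chrono>
          #include <cstdio>
          #include <string>
          #include <vector>

          static bool LinearFind(const std::vector<std::string> &kw,
                                 const std::string &s) {
              for (const std::string &k : kw)
                  if (k == s)
                      return true;
              return false;
          }

          int main() {
              std::vector<std::string> keywords = { /* ~1000 keywords */ };
              std::vector<std::string> probes = { /* half hits, half misses */ };
              std::sort(keywords.begin(), keywords.end());
              size_t hits = 0;  // accumulate so the loops aren't optimised away
              const auto t0 = std::chrono::steady_clock::now();
              for (const std::string &s : probes)
                  hits += LinearFind(keywords, s);
              const auto t1 = std::chrono::steady_clock::now();
              for (const std::string &s : probes)
                  hits += std::binary_search(keywords.begin(), keywords.end(), s);
              const auto t2 = std::chrono::steady_clock::now();
              using us = std::chrono::microseconds;
              std::printf("hits=%zu linear=%lldus binary=%lldus\n", hits,
                  (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                  (long long)std::chrono::duration_cast<us>(t2 - t1).count());
          }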

       
  • Neil Hodgson

    Neil Hodgson - 2015-06-12
    • labels: --> scintilla, performance
    • status: open --> open-rejected
    • assigned_to: Neil Hodgson
    • Priority: 5 --> 3
     
  • Jiri Techet

    Jiri Techet - 2015-06-12

    Thanks for conducting the experiment, Neil. To make it completely fair, though, the words from the symbol list appearing in the document should be randomized and several measurements made: by accident the words could have been selected from the beginnings of the lists, making the linear search faster (but really, don't waste your time on this - I'm just mentioning it for completeness).

    You are right that one should take the platform's characteristics into account. Going through arrays linearly is definitely better for cache locality, and it also allows the compiler to optimize the code and possibly vectorize it (I can't imagine that for binary search). So yes, the point where binary search starts winning might be slightly off.

    But such a point definitely exists (as the kernel example shows, where the CPU time spent in the function goes from 37% to invisible in the trace), and even if the limit were set to some really high number to be safe, say 10000, it would still help the cases with very many keywords.

    And don't worry, I'm not going to prolong the discussion any further. The end :-)

     
  • Neil Hodgson

    Neil Hodgson - 2015-06-20
    • status: open-rejected --> closed-rejected
     
