#313 customizable regular expression support

Completed
closed
None
2
2009-04-28
2006-04-12
Anonymous
No

Hi all there :-)

I would like to have better support for regular
expressions in the Scintilla library.

I use Scintilla as part of a larger project (the text
editor of my dreams :-)

In the project I need to use regexp outside of the
Scintilla controls as well. Also, all the text is
converted to UTF-8 when edited and loaded into the
Scintilla control.

So it would be great to have a way to make Scintilla
use the same regexp library I use (BTW it's PCRE)
instead of the built-in regexp support, for two
reasons:
1. The built-in regexp does not support UTF-8 properly.
2. I need to use regexp outside of the Scintilla
control as well. I think having two different
implementations of regexp in one program is just a bad thing.

I don't know if there is any effort in this direction
(I found only a post in a very old discussion which
mentioned PCRE). If not, I'm ready to try to implement
this functionality myself and post a patch or
something.

Short description of my suggestion (written just after
a short investigation into Scintilla's sources, so
please don't beat me if I misunderstood something):

(step 1) Rewrite class CellBuffer so that chars and
styles are stored in two separately allocated blocks of
memory (each one using the same scheme: two blocks of
valid bytes divided by a gap). The current interface of
the class would be kept unchanged and only one new
method added. The new method would be a read-only
relative of GetCharRange(): it would return a pointer
into the CellBuffer (after calling gapTo(0)).

Maybe this is the most controversial step of the
suggestion. It can make Scintilla somewhat slower
because locality of reference is lost: styles would be
stored far away from their chars in memory, so the CPU
cache would be less effective.

On the other hand, it would make it simpler to call any
3rd party regexp functions, which usually work on a
plain array of chars. (No more need for class
CharIndexer or temporary buffers.) As a side effect it
would also save memory when no styling is used.

Any better idea for solving the CPU cache problem is
welcome.
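
To make step 1 more concrete, here is a minimal sketch of
the idea, not Scintilla's real CellBuffer code. The class
names GapBuffer and SplitCellBuffer and all method names
are hypothetical. Chars and styles live in two separate
gap buffers, and TextPointer() plays the role of the
proposed read-only method: it moves the gap to position 0
and returns a pointer to the now contiguous text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical byte gap buffer; not Scintilla's actual implementation.
class GapBuffer {
public:
    std::size_t Length() const { return data_.size() - GapLength(); }

    char At(std::size_t pos) const {
        assert(pos < Length());
        return data_[Physical(pos)];
    }

    void Set(std::size_t pos, char value) {
        assert(pos < Length());
        data_[Physical(pos)] = value;
    }

    void Insert(std::size_t pos, const char *s, std::size_t n) {
        assert(pos <= Length());
        if (n == 0) return;
        if (GapLength() < n) Grow(n);
        GapTo(pos);
        std::memcpy(data_.data() + gapStart_, s, n);
        gapStart_ += n;
    }

    // Moves the gap to position 0 so all valid bytes form one contiguous
    // run, then returns a read-only pointer to that run. The pointer is
    // only valid until the next modification.
    const char *ContiguousData() {
        GapTo(0);
        return data_.data() + gapEnd_;
    }

private:
    std::size_t GapLength() const { return gapEnd_ - gapStart_; }

    // Maps a logical position to an index in data_, skipping the gap.
    std::size_t Physical(std::size_t pos) const {
        return pos < gapStart_ ? pos : pos + GapLength();
    }

    void GapTo(std::size_t pos) {
        const std::size_t gapLen = GapLength();
        if (pos < gapStart_) {
            // Shift [pos, gapStart_) right so it ends where the gap ended.
            std::memmove(data_.data() + pos + gapLen, data_.data() + pos,
                         gapStart_ - pos);
        } else if (pos > gapStart_) {
            // Shift the bytes just after the gap left, into the gap start.
            std::memmove(data_.data() + gapStart_, data_.data() + gapEnd_,
                         pos - gapStart_);
        }
        gapStart_ = pos;
        gapEnd_ = pos + gapLen;
    }

    void Grow(std::size_t extra) {
        GapTo(Length());                       // gap now sits at the end
        data_.resize(data_.size() + extra + 16);
        gapEnd_ = data_.size();                // new space extends the gap
    }

    std::vector<char> data_;
    std::size_t gapStart_ = 0;
    std::size_t gapEnd_ = 0;
};

// Hypothetical split cell buffer: chars and styles in separate blocks.
class SplitCellBuffer {
public:
    std::size_t Length() const { return chars_.Length(); }

    void InsertString(std::size_t pos, const char *s, std::size_t n) {
        chars_.Insert(pos, s, n);
        const std::vector<char> zeros(n, 0);   // new text starts unstyled
        styles_.Insert(pos, zeros.data(), n);
    }

    char CharAt(std::size_t pos) const { return chars_.At(pos); }
    char StyleAt(std::size_t pos) const { return styles_.At(pos); }
    void SetStyleAt(std::size_t pos, char style) { styles_.Set(pos, style); }

    // The proposed new method: contiguous, read-only view of the text.
    const char *TextPointer() { return chars_.ContiguousData(); }

private:
    GapBuffer chars_;
    GapBuffer styles_;
};
```

Note that every call to TextPointer() may move up to the
whole text across the gap, so the cost depends on where
the last edit happened.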

(step 2) Create new type and messages for registering
callback functions implementing the regexp support.
Probably it would be something similar to this:

struct SCRegExp {
... // pointers to functions
};

struct SCRegExp* SCI_GETREGEXPSTRUCT()
SCI_SETREGEXPSTRUCT(struct SCRegExp * api);

Structure SCRegExp would hold pointers to callback
functions provided by the caller. The functions would
implement the regexp functionality.

Prototypes of the functions pointed to by the structure
should be designed so that it would be very
straightforward to call the Posix regexp or PCRE library
from them. (This requires a better analysis than what I
have done so far.)

Class Document would have one new member added: a
pointer to the struct SCRegExp.

Message SCI_GETREGEXPSTRUCT would return a pointer to
the currently registered structure (or NULL if the
built-in functions are in use and you don't want to
expose the built-in functions outside of the Scintilla
library).

SCI_SETREGEXPSTRUCT would set the pointer to the
structure. NULL would reset to the built-in -- see step (3).
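
As an illustration of step 2, the following sketch shows
what such a callback table might look like. Everything in
it is an assumption: the struct name SCRegExpSketch, its
three members and their signatures are invented here, and
std::regex from the C++ standard library stands in for
PCRE or Posix regexp so that the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>

// Hypothetical callback table; member names and signatures are
// assumptions for illustration, not part of any Scintilla release.
struct SCRegExpSketch {
    // Compiles a pattern; returns an opaque handle, or nullptr on error.
    void *(*compile)(const char *pattern);
    // Searches text[0, length). On success stores the match bounds
    // (start inclusive, end exclusive) and returns true.
    bool (*find)(void *handle, const char *text, std::size_t length,
                 std::size_t *matchStart, std::size_t *matchEnd);
    // Frees a handle returned by compile.
    void (*release)(void *handle);
};

// Adapter functions that translate the callbacks to std::regex calls.
static void *StdCompile(const char *pattern) {
    try {
        return new std::regex(pattern, std::regex::ECMAScript);
    } catch (const std::regex_error &) {
        return nullptr;  // invalid pattern
    }
}

static bool StdFind(void *handle, const char *text, std::size_t length,
                    std::size_t *matchStart, std::size_t *matchEnd) {
    assert(handle && text && matchStart && matchEnd);
    const std::regex &re = *static_cast<std::regex *>(handle);
    std::cmatch m;
    if (!std::regex_search(text, text + length, m, re)) return false;
    *matchStart = static_cast<std::size_t>(m.position(0));
    *matchEnd = *matchStart + static_cast<std::size_t>(m.length(0));
    return true;
}

static void StdRelease(void *handle) {
    delete static_cast<std::regex *>(handle);
}

// The table an application would hand over when registering its engine.
static const SCRegExpSketch kStdRegexApi = {StdCompile, StdFind, StdRelease};
```

The find callback takes a plain pointer and length, which
is why step 1 (a contiguous view of the text) matters for
this design.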

(step 3) Rewrite RESearch so that it would be
compatible with new generalized regexp support i.e.
there would be some static const instance of struct
SCRegExp implementing the built-in regexp support.

(step 4) When searching/replacing with the flag
SCFIND_REGEXP set, the appropriate messages would use
the functions in the struct SCRegExp (either set by
SCI_SETREGEXPSTRUCT or in the built-in struct instance).
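
Steps 2 to 4 together amount to a fallback scheme, which
can be sketched roughly as below. All names are
hypothetical (SearchApiSketch, DocumentSketch and the two
search functions), the table is reduced to a single
callback, and the "built-in" engine is a plain substring
search standing in for RESearch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical, reduced callback table with a single search function.
struct SearchApiSketch {
    // Returns the offset of the first match in text[0, length), or -1.
    long (*find)(const char *pattern, const char *text, std::size_t length);
};

// Stand-in for the built-in engine: literal substring search, NOT RESearch.
static long BuiltInFind(const char *pattern, const char *text,
                        std::size_t length) {
    const std::size_t n = std::strlen(pattern);
    for (std::size_t i = 0; i + n <= length; ++i)
        if (std::memcmp(text + i, pattern, n) == 0)
            return static_cast<long>(i);
    return -1;
}

// Stand-in for a user-supplied engine: '.' matches any single byte.
static long DotWildcardFind(const char *pattern, const char *text,
                            std::size_t length) {
    const std::size_t n = std::strlen(pattern);
    for (std::size_t i = 0; i + n <= length; ++i) {
        std::size_t j = 0;
        while (j < n && (pattern[j] == '.' || pattern[j] == text[i + j])) ++j;
        if (j == n) return static_cast<long>(i);
    }
    return -1;
}

// Step 3: one static const instance implements the built-in support.
static const SearchApiSketch kBuiltInApi = {BuiltInFind};
static const SearchApiSketch kDotWildcardApi = {DotWildcardFind};

class DocumentSketch {
public:
    // Like the proposed SCI_SETREGEXPSTRUCT: nullptr restores the built-in.
    void SetRegExpStruct(const SearchApiSketch *api) {
        api_ = api ? api : &kBuiltInApi;
    }

    // Like the proposed SCI_GETREGEXPSTRUCT: nullptr means "built-in".
    const SearchApiSketch *GetRegExpStruct() const {
        return api_ == &kBuiltInApi ? nullptr : api_;
    }

    // Step 4: a regexp search always goes through the current table.
    long FindRegExp(const char *pattern, const char *text,
                    std::size_t length) const {
        return api_->find(pattern, text, length);
    }

private:
    const SearchApiSketch *api_ = &kBuiltInApi;
};
```

Because the built-in engine is just another table, the
search code has a single call path whether or not the
application registered its own functions.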

So to conclude:

If not set otherwise by SCI_SETREGEXPSTRUCT, Scintilla
would still work the same way it does now (it would use
the built-in regexp algorithms) so there should be no
compatibility issue. The built-in regexp support would
only be adapted to the new generalised interface. (The
current interface is internal to Scintilla so it's not
an issue).

After new regexp functions are registered with
SCI_SETREGEXPSTRUCT, Scintilla would call the registered
functions. In the most common case the functions
provided by the user would probably just translate their
arguments and call the regexp functions of a 3rd party
library (e.g. PCRE or the Posix regexp library).

Sure, it requires a lot of changes in Scintilla's
internals, but I do believe it would really make the
Scintilla library even better than it is already.

Can you tell me if there is any chance of such changes
being accepted into the Scintilla project someday? (I
don't want to maintain a fork of the Scintilla project,
so I will code only if you answer "Yes" to this question.)

Also as noted above I'm open to any discussion about
the problem.

Mity
<mity[at]morous[dot]org>

P.S. Anyway, many thanks for your work on the Scintilla
project.

Discussion

  • Neil Hodgson

    Neil Hodgson - 2006-04-13

    Logged In: YES
    user_id=12579

    This should be discussed on the mailing list rather than
    through an RFE and Simon Steele has already commented there.
    The cell structure is not the only issue with the current
    structure as the gap also has to be handled for the RE code.
    Consider the additional costs in moving text to one side of
    the gap to make all the bytes adjacent when running a global
    replace. The *Accessor classes exist to implement things
    like RE searching outside Scintilla.

     
  • Neil Hodgson

    Neil Hodgson - 2006-04-13
    • priority: 5 --> 2
    • assigned_to: nobody --> nyamatongwe
     
  • Neil Hodgson

    Neil Hodgson - 2009-04-28

    The text buffer can now be exposed and there is also a mechanism to incorporate a different regular expression engine into Scintilla.

     
  • Neil Hodgson

    Neil Hodgson - 2009-04-28
    • milestone: --> Completed
    • status: open --> closed
     
