Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

#434 Drop the constant limit of gap size which slows insertion

closed-accepted
None
5
2012-06-27
2012-06-09
Thomas Meyer
No

Attached patch drastically speeds up search&replace-all actions with millions of changes.

Discussion

  • Could you please show the measure?

    While I think the constant size of increase (1024 as it is now) will
    likely cause quadratic time on a flow of insertion, I wonder why
    ContentManager.insert() was not shown as a bottle neck in my previous
    profile. See tha attachment jProfiler-CPU-after-many-replace.PNG in
    patch #3531538.
    http://sourceforge.net/support/tracker.php?aid=3531538

    And it is known that use of a good tuned factor, for example 0.3, to get
    a new gap size from current capacity solves this kind of quadratic time
    in general. For this case, it can be used with min (1024) and max (1M?)
    limits.
    http://en.wikipedia.org/wiki/Dynamic_array#Geometric_expansion_and_amortized_cost

    Please explain (also as a comment in code) why the time period should
    come into play, if it is really wanted.

     
  • Thomas Meyer
    Thomas Meyer
    2012-06-09

    Thanks for the link. Updated patch attached. What you you think about this easier approach?

     
  • Still the measure is wanted.

    The base of new gap size should be (length + len) instead of length.

    With this change, the gap can become larger. So more optimization will
    be wanted on allocating case. Current code seems doing unnecessary move
    in moveGapStart(), copying the contents in gap (which are not really
    contents) in ensureCapacity(), and unnecessary move again in
    moveGapEnd(). It should be done just in this order; allocate the new
    array, copy contents before gap, copy contents after gap.

     
    • assigned_to: nobody --> k_satoda
     
  • Thomas Meyer
    Thomas Meyer
    2012-06-10

    1.) "Still the measure is wanted."

    Some numbers with updated patch attached:

    a.) 226.000 changes - 1.000 lines

    without patch:
    10:35:52 [AWT-EventQueue-0] [debug] AWT-EventQueue-0: Call= 1 Diff=3.360.357.976ns
    with patch:
    10:19:38 [AWT-EventQueue-0] [debug] AWT-EventQueue-0: Call= 1 Diff=2.943.860.319ns

    b.) 11.300.000 changes - 50.000 lines

    without patch:
    10:32:52 [AWT-EventQueue-0] [debug] AWT-EventQueue-0: Call= 1 Diff=283.550.871.735ns

    with patch:
    10:38:05 [AWT-EventQueue-0] [debug] AWT-EventQueue-0: Call= 1 Diff=31.606.097.186ns

    2.) "The base of new gap size should be (length + len) instead of length."

    Fixed. New patch attached

    3.) "With this change, the gap can become larger. So more optimization will
    be wanted on allocating case."

    Yes this its true, but for small files the overhead is minimal, because of factor 0.3 and for big files the maximum is capped to 1MB. This numbers are fine for me, compared to the speed up result reached.

    4. "Current code seems doing unnecessary move
    in moveGapStart(), copying the contents in gap (which are not really
    contents) in ensureCapacity(), and unnecessary move again in
    moveGapEnd(). It should be done just in this order; allocate the new
    array, copy contents before gap, copy contents after gap."

    I think the current gap code works quite good. What do you mean by "It should be done just in this order; allocate the new array, copy contents before gap, copy contents after gap." What new array? The gap is a sliding window in the ContentManager buffer. I'm not sure if I understand you, can you please explain?

    Anyway, can we put these further optimizations in a different patch, or do you want to fix this in one change?

    with kind regards
    thomas

     
  • I committed the elimination of duplicate codes in r21815.

    Now I found that ensureCapacity() does allocate (capacity * 2) which
    should be equivalent to have the factor 1.0 in our idea.

    Then the real problem seems in moveGapEnd(). Though the capacity is
    already expanded in a geometric progression, the quadratic operation
    happens in moveGapEnd(start + gapSize).

    I think there is no reason to have smaller gap size than it is possible,
    and we should eliminate gapEnd field by establishing a new invariant:
    the end of gap is always at (text.length - (length - gapStart)).

    What do you think? Would you please try to implement?

     
  • Thomas Meyer
    Thomas Meyer
    2012-06-17

    Hello k_satoda. Yes, excellent idea. I'll implement it.

     
  • Thomas Meyer
    Thomas Meyer
    2012-06-17

    Updated patch attached.

     
  • Thank you for the quick update.

    I looked through the patch. Here are my comments. Please excuse me for
    their terseness. I think it's better here to be short and clear. I'm
    looking forward to the next version.

    Did the new idea also achieve the expected speed up?

    Please submit a svn patch against current trunk.

    Don't align columns unless it achieve clear benefits. It increase
    spurious diff in review, and increase maintainance costs on renames or
    such.

    Prefer fully spelled gapLength() over gapLen().

    Keep the broken line as it was at the last case in getSegment().

    gapOffset should be just length in _setContent(), and nothing should be
    copied. I couldn't understand what "array has always two extra chars"
    means. If it is really correct, please explain a bit more.

    Do you want to commit the logging code? I don't want it. If you still
    want, please add a comment to show how useful it is.

    Local variables in gapEnd() and gapLen() are both unnecessary.

    I think there is no such strong reason to rename gapStart to gapOffset.
    The diff would be much smaller without the rename, and now it looks
    unbalanced to have moveGapStart(), gapEnd() as its opposite, newStart,
    newEnd, etc.

    The comment "move content after gapEndOld to gapEndNew" says nothing
    except the fact which is already clear with the code itself. Don't put
    such comment.

    Why the "final" was added to capacity parameter of ensureCapacity(),
    leaving many other parameters and local variables untouched?

    The change in JEditBuffer.java looks irrelevant. Please separate it.
    If the change is wanted for some reason, please explain. Feel free to
    create a new patch item if it is an irrelevant issue.

     
  • Thomas Meyer
    Thomas Meyer
    2012-06-19

    Hi k_satoda,

    1.) "Did the new idea also achieve the expected speed up?"
    Yes, looks good - 11.300.000 changes - 50.000 lines:
    18:47:07 [AWT-EventQueue-0] [debug] SearchAndReplace: Call= 1 Diff= 33789582810 Avg=33789582810

    Down to 33s!

    2.) "Please submit a svn patch against current trunk."
    Done. Patch against latest svn revision attached.

    3.) "Don't align columns unless it achieve clear benefits. It increase
    spurious diff in review, and increase maintainance costs on renames or
    such."
    Done.

    4.) "Prefer fully spelled gapLength() over gapLen()."
    Fixed.

    5.) "Keep the broken line as it was at the last case in getSegment()."
    Okay, fixed.

    6.) "gapOffset should be just length in _setContent(), and nothing should be
    copied. I couldn't understand what "array has always two extra chars"
    means. If it is really correct, please explain a bit more."

    The array passed to _setContent method is always two chars longer than the text.length. the two additional chars contains "\n ". I don't know why this is.
    So we use these two extra chars as gap for newly loaded files, and extend the backing array only when you insert more than 2 chars.

    7.) "Do you want to commit the logging code? I don't want it. If you still
    want, please add a comment to show how useful it is."

    This is some debugging stuff I left over. Removed.

    8.) "Local variables in gapEnd() and gapLen() are both unnecessary."
    Left over for better debugging. Removed.

    9.) "I think there is no such strong reason to rename gapStart to gapOffset.
    The diff would be much smaller without the rename, and now it looks
    unbalanced to have moveGapStart(), gapEnd() as its opposite, newStart,
    newEnd, etc."
    I did undo these change.

    10.) "The comment "move content after gapEndOld to gapEndNew" says nothing
    except the fact which is already clear with the code itself. Don't put
    such comment."
    Okay, removed.

    11.) "Why the "final" was added to capacity parameter of ensureCapacity(),
    leaving many other parameters and local variables untouched?"

    I'm not sure anymore why I did this :-) Removed again.

    12.) "The change in JEditBuffer.java looks irrelevant. Please separate it.
    If the change is wanted for some reason, please explain. Feel free to
    create a new patch item if it is an irrelevant issue."

    This change is needed as JEditBuffer.loadText() calls contentMgr.remove(0, 0) before setting the backing array of UndoManager via _setContent(),so ContentManager.text is null. But moveGapStart() called from contentMgr.remove() now needs a set text array to determine it's length and so the gap length. Without this change we get a NPE in contentMgr.remove().

    The correct thing to do was to only remove content from the buffer in JEditBuffer.loadText(), when any text is already in it.

     
  • Thank you for the quick update again. The v5 patch looks close to the
    goal.

    > The array passed to _setContent method is always two chars longer than the
    > text.length. the two additional chars contains "\n ". I don't know why this
    > is.

    You have probably [Global Options] > [Text Area] > [Hide final end of
    lines (if any)] enabled. It is enabled by default, and they are loaded
    from disk but dropped from buffer contents.

    > So we use these two extra chars as gap for newly loaded files, and extend
    > the backing array only when you insert more than 2 chars.

    Setting just gapStart = length should do that, and that is also OK for
    the case length == text.length . I don't see the need of the comment
    and branch.

    > 12.) "The change in JEditBuffer.java looks irrelevant.
    (snip)
    > This change is needed as JEditBuffer.loadText() calls contentMgr.remove(0,
    > 0) before setting the backing array of UndoManager via _setContent(),so
    > ContentManager.text is null.

    Then let's establish another invariant to keep things simple:
    text != null
    An optimal implementation will be:
    private static final char[] EMPTY_TEXT = new char[0];
    private char[] text = EMPTY_TEXT;
    And put a guard on _setContents() for a case someone passes null as the
    text. Just an assert(text != null) is likely enough since ContentManager
    is an internal class.

    I will be more happy if you would write a proposed log message and an
    entry in CHANGES.txt, as a potential new committer in the future.

    FYI, as I said before, the allocating case in prepareGapForInsertion()
    can be a bit more optimized. I'll try that after this patch and show it
    as a separate commit.

     
  • Thomas Meyer
    Thomas Meyer
    2012-06-21

    Updated patch attached.
    Proposed message for CHANGES.txt in section "Miscellaneous":

    - Always use the complete remaining buffer size in ContentManager as gap and
    drop the artificial limitation to 1024 chars.
    The buffer is still expanded in a a geometric progression, but the
    quadratic operation in moveGapEnd has been removed.
    This change greatly speeds up big search&replace-all operations.
    (#3533838 Kazutoshi Satoda, Thomas Meyer)

    Proposed commit message:
    Drop artificial limitation of gap to 1024 chars and remove quadratic operation in moveGapEnd(). #3533838

     
  • Now the patch looks acceptable for me. I'll commit it in a few days.

    However I have some points I'd like to tweak. I'll do them and the
    previously mentioned optimization about allocating case after accepting
    the patch. If you find any problems on them, please let me know. Also,
    updates for them by yourself are also welcome.

    I see this change as a fix for a scalability bug. Then I'll put the
    change entry in Bug fixes, and will post a merge request for 5.0.x after
    soaking it in trunk for a few days.

    Since the change entry is presented for users, it seems a bit too
    verbose in technical terms (class name, method name, etc). It will
    become better if it can be written with UI terms.

    I think there is no need to make a new char[0] for each ContentManager
    instance. I still prefere a more optimal implementation as shown in my
    previous comment.

    I think the invariant, text != null, should be justified in
    ContentManager itself, without mentioning a specific caller in comment.
    Not having this invariant was rather a surprise for me.

    A guard for null text passed for _setContent() is still wanted.

     
  • Thomas Meyer
    Thomas Meyer
    2012-06-22

    1.) "Since the change entry is presented for users, it seems a bit too
    verbose in technical terms (class name, method name, etc). It will
    become better if it can be written with UI terms."

    Okay. Maybe we could use my proposed text for Changes.txt as commit message and
    use something like: "Optimize runtime for large search&replace-all operations"
    as message for Changes.txt

    2.) "I think there is no need to make a new char[0] for each ContentManager
    instance. I still prefere a more optimal implementation as shown in my
    previous comment."

    Okay, I see. My code allocates a new Object array for each ContentManager
    object. I didn't think of that. So your code is the better solution.

    3.) "I think the invariant, text != null, should be justified in
    ContentManager itself, without mentioning a specific caller in comment.
    Not having this invariant was rather a surprise for me."

    I'm fine with the definition of a new constant like EMPTY_TEXT. But I have a
    different opinion on the initial setup of ConstantManager.text. Either
    ContentManager is an internal class or not. So we should treat ContentManager
    as an internal class and we should just fix the caller of ContentManager in
    Buffer/JEditBuffer.

    4.) "A guard for null text passed for _setContent() is still wanted."

    Same here. Either "internal" class or check all arguments passed to a
    public method. Also I once learned to not check arguments to public methods
    with asserts.

     
  • > Maybe we could use my proposed text for Changes.txt as commit
    > message and use something like: "Optimize runtime for large
    > search&replace-all operations" as message for Changes.txt

    The proposed new changed entry sounds too abstract to distinguish with
    other upcoming changes. I'll take the middle between it and the previous
    one. I'll commit with the followings.

    CHANGES.txt (Bug fixes):
    - Fixed unreasonable quadratic operations which could be observable on
    big search&replace-all. (Patch #3533838 by Thomas Meyer)

    Log:
    Dropped the artificial limitation to 1024 chars on extending the gap
    in ContentManager. Now it always use the complete remaining buffer which
    is expanded in a geometric progression. This change greatly speeds up
    big search&replace-all operations in which case copying in moveGapEnd()
    exhibited quadratic operation. Now it is amortized linear.
    (Patch #3533838 by Thomas Meyer, tweaked by me)

    > Either ContentManager is an internal class or not. So we should treat
    > ContentManager as an internal class and we should just fix the caller
    > of ContentManager in Buffer/JEditBuffer.

    I think remove(0,0) should be a valid call for any state of
    ContentManager. If you don't think so, please explain what's wrong and
    to be fixed on callers.

    > Either "internal" class or check all arguments passed to a
    > public method. Also I once learned to not check arguments to public
    > methods with asserts.

    ContentManager is internal, and then we don't check the arguments.
    Assertion is applicable on such case. It is a very lightweight debugging
    facility and also helps readability of the code. Without this, someone
    may worry about possible violation of the invariant.

     
    • summary: Dynamically adjsut gap size on insert speed --> Drop the constant limit of gap size which slows insertion
    • status: open --> closed-accepted
     
  • Accepted as r21885.