Menu

#1458 Improve UniqueString and CaseConvert

Committed
closed
5
2024-03-07
2022-10-14
Zufu Liu
No

https://github.com/zufuliu/notepad2/commit/d4039b7e4fe1509f9ef8816a6b9ea95dfeaca532
Add UniqueCopy() into UniqueString.h, which can be reused for Action::Create(), UniqueStringCopy() and UniqueStringSet::Save().

template <typename T>
inline std::unique_ptr<T[]> UniqueCopy(const T *data, size_t length) {
    std::unique_ptr<T[]> copy(new T[length]); // std::make_unique_for_overwrite<T[]>(length)
    memcpy(copy.get(), data, length * sizeof(T));
    return copy;
}

UniqueStringCopy() could be rewrote as:

UniqueString UniqueStringCopy(const char *text) {
    if (!text) {
        return {};
    }
    return UniqueCopy(text, std::char_traits<char>::length(text) + 1);
}

UniqueStringSet::Save() can be changed into following to avoid compute string length again.

    strings.push_back(UniqueCopy(sv.data(), sv.length() + 1));
    return strings.back().get();

https://github.com/zufuliu/notepad2/commit/43c2718d4cb1531145aeec978368948452521bc8
Replace std::vector<char> with std::unique_ptr<char[]>, this would reduce code size a lot, especially for Document::FindText(), where vector small size optimization may not works well (e.g. UTF-8 case insensitive search requires at least 33 bytes), though VarBuffer like structure can be added for optimize for short search string (e.g. use 256 bytes stack for about 15 characters string).

All platform's implementation for ListBox::SetList() can be changed into strlen() + UniqueCopy() (e.g. Win32 LineToItem).

https://github.com/zufuliu/notepad2/commit/6efaf40ada8804b67c4a3acbc81f00459f6f100c
Optimize Win32 ListBoxX implementation to avoid repeated compute item text length, e.g. inside ListBoxX::Draw() and ListBoxX::GetDesiredRect(), the length is computed twice (strlen() + TextWide() constructor).
The 4 bytes padding after ListItemData can be used to store the text length computed inside AppendListItem(). item text string length compute can be further optimized out by using pointer subtraction inside ListBoxX::SetList(), so here only needs a single call to strlen().

Related

Feature Requests: #1512

Discussion

  • Neil Hodgson

    Neil Hodgson - 2022-10-22

    I understand a desire to standardize on one approach to similar issues but don't think all of these are worthwhile. The patch does not apply cleanly to current Scintilla and contains the non-portable __builtin_strlen.

    In particular, the use of UniqueString for the undo stack is a low benefit change where there are large potential improvements to make. Many undo actions are just for a single byte (user typing or pressing delete multiple times) and the memory allocation for that is wasteful with an independent allocation of a single byte string. This is aligned up, generally to 16 bytes and allocators commonly use around 2 pointer/size items for management: so 32 bytes of memory to store 1 byte of data. As the undo stack is strictly LIFO, the text of all actions can be stored in one large allocation much more efficiently. Creating a dependence on UniqueString (and an inverse requirement for UniqueString to handle undo actions) is an unnecessary sideways step.

    UniqueCopy uses new instead of std::make_unique, perhaps (given the make_unique_for_overwrite comment) to avoid initialisation but new always initializes. Having a templated UniqueCopy implies it will be instantiated over more types but its only used over char.

    Recalculating string length is insignificant when there was just a string compare loop and there is about to be an allocation.

    Stack allocation for searchThing could be an improvement but I'd want to see measurements as its likely to be swamped by other code.

    For the list box, it may be worthwhile simplifying the code at the expense of some memory by shifting to std::string in ListItemData as its unlikely the benefit of the current approach is significant.

     
  • Zufu Liu

    Zufu Liu - 2022-10-23

    UniqueCopy() indeed not benefits much (only avoided memset after call new).

    For undo action, maybe it could store short data inline?

    class Action {
    public:
        ActionType at;
        bool mayCoalesce;
        char inlineData[3];
        Sci::Position position;
        char *data;
        Sci::Position lenData;
    
        const char *getData() const noexcept {
            return data;
        }
    }
    
    void Action::Clear() noexcept {
        if (lenData > sizeof(inlineData)) {
            delete[] data;
        }
        data = nullptr;
        lenData = 0;
    }
    

    Edit: the Action change does not work (random crash).

     

    Last edit: Zufu Liu 2022-10-23
    • Zufu Liu

      Zufu Liu - 2022-10-28

      The crash is due to lacking custom move constructor (with copy constructor, copy/move assignment operators deleted).

      Following is a small change (make sizeof(Action) == 4*sizeof(size_t) and default constructor as Action() noexcept = default;) which makes Scintilla.dll 512 bytes smaller (VS2022 x64).

       
      • Neil Hodgson

        Neil Hodgson - 2024-01-27

        Committed similar to Action-1028.diff as [341a21].

         

        Related

        Commit: [341a21]

  • Zufu Liu

    Zufu Liu - 2022-10-29

    Following is some changes for CaseConvert.cxx
    https://github.com/zufuliu/notepad2/commit/d862dda14bc08a87083f24d67d06db1aa7a7fe64

    1. AddSymmetric() and SetupConversions() changed to member function of CaseConverter to remove extra conversion comparison.
    2. SetupConversions() moved into ConverterForConversion() to remove duplicated code. It might better to use std::unreachable() for the switch.
    3. maxConversionLength increased to 7 to simplify CharacterConversion constructor.
    4. code simplification for AddSymmetric() and SetupConversions().
     
    • Neil Hodgson

      Neil Hodgson - 2022-11-11

      // use 7 to remove padding bytes on ConversionString structure

      Adding an extra byte so there isn't padding has no justification. The byte isn't going to be used. This appears to me to be motivated by a dumb linter that is pointing out where there is padding. If so, just turn it off.

      memcpy(conversion.conversion, conversion_, maxConversionLength + 1)

      This is copying garbage uninitialized memory (0xcc in MSVC debug) into conversion.conversion and making the code more fragile to changing requirements. Its a string: use its actual length. Isn't reading uninitialized memory like this undefined behaviour?

       
  • Zufu Liu

    Zufu Liu - 2022-10-29

    noexcept on ConverterForConversion() needs to be removed.

     
  • Zufu Liu

    Zufu Liu - 2022-11-11

    Updated comment for maxConversionLength to // use 7 to make sizeof(ConversionString)==8 and simplify CharacterConversion's constructor.

    Use memcpy() for CharacterConversion's constructor simplified the code (as the 8 bytes copy will be inlined), const char *conversion_ parameter is at least 8 bytes and always nul-termiated in both AddSymmetric() and SetupConversions().
    zeroed char converted[maxConversionLength + 1]{}; in AddSymmetric() is not needed as UTF8FromUTF32Character() always puts a NUL.
    If const char *conversion_ should be treated as untrusted string, then memcpy() can be replace with strcpy() or strncpy().

     
  • Zufu Liu

    Zufu Liu - 2022-11-28

    Changes without touch ConversionString and CharacterConversion structures.

     
    • Neil Hodgson

      Neil Hodgson - 2022-11-28

      All the detail code for copying chars around is repetitive and easy to get wrong so it may be better to extend the use of string_view to raise the level of abstraction. Here's a version based on an earlier patch that uses string_view more for decoding the data.

       
      • Zufu Liu

        Zufu Liu - 2022-11-29

        OK, your patch is modern than mine.

         
  • Neil Hodgson

    Neil Hodgson - 2022-12-01
    • summary: Some code refactoring and optimization [2022-10] --> Improve UniqueString and CaseConvert
     
  • Neil Hodgson

    Neil Hodgson - 2022-12-01

    Changed title since the original didn't say anything

     
  • Neil Hodgson

    Neil Hodgson - 2022-12-01

    Committed updated CaseConvert changes with [50eec5].

    Improving undo is a larger project.

    Couldn't see any clear benefits in changes to UniqueString / UniqueCopy.

     

    Related

    Commit: [50eec5]

  • Neil Hodgson

    Neil Hodgson - 2022-12-01
    • Group: Initial --> Committed
     
  • Neil Hodgson

    Neil Hodgson - 2022-12-06
    • status: open --> closed
     
  • Zufu Liu

    Zufu Liu - 2024-01-27

    Here is my currently used Action, that avoid allocation for small data.
    https://github.com/zufuliu/notepad2/blob/main/scintilla/src/CellBuffer.h#L35

    class Action {
        static constexpr Sci::Position smallSize = sizeof(size_t) - 2;
    public:
        ActionType at = ActionType::insert;
        bool mayCoalesce = false;
        char smallData[smallSize]{};
        Sci::Position position = 0;
        Sci::Position lenData = 0;
        std::unique_ptr<char[]> data;
    
        Action() noexcept = default;
        void Create(ActionType at_, Sci::Position position_ = 0, const char *data_ = nullptr, Sci::Position lenData_ = 0, bool mayCoalesce_ = true);
        void Clear() noexcept;
        const char *Data() const noexcept {
            return (lenData <= smallSize) ? smallData : data.get();
        }
    };
    
     
  • Zufu Liu

    Zufu Liu - 2024-03-07

    The new UndoHistory is little slow, compare Notepad2's develop branch (with new UndoHistory) and main branch (use above Action class), replacing all dot to comma for 0N 9V.txt (see [feature-requests:#1502], https://github.com/notepad-plus-plus/notepad-plus-plus/issues/10930#issuecomment-998760967, renaming the file to 0N 9V.log to use monospaed font), it's 50ms slow (tested on i3 and i5).

    apply following change to show time in command prompt:

    diff --git a/src/Edit.c b/src/Edit.c
    index ba0cb979..1ed1f9bd 100644
    --- a/src/Edit.c
    +++ b/src/Edit.c
    @@ -5845,7 +5845,7 @@ bool EditReplaceAll(HWND hwnd, LPCEDITFINDREPLACE lpefr, bool bShowInfo) {
        // Show wait cursor...
        BeginWaitCursor();
        SendMessage(hwnd, WM_SETREDRAW, FALSE, 0);
    -#if 0
    +#if 1
        StopWatch watch;
        StopWatch_Start(watch);
     #endif
    @@ -5887,7 +5887,7 @@ bool EditReplaceAll(HWND hwnd, LPCEDITFINDREPLACE lpefr, bool bShowInfo) {
            }
        }
    
    -#if 0
    +#if 1
        StopWatch_Stop(watch);
        StopWatch_ShowLog(&watch, "EditReplaceAll() time");
     #endif
    diff --git a/src/Notepad2.c b/src/Notepad2.c
    index 138d696c..9e2e28be 100644
    --- a/src/Notepad2.c
    +++ b/src/Notepad2.c
    @@ -519,7 +519,7 @@ BOOL WINAPI ConsoleHandlerRoutine(DWORD dwCtrlType) {
     int WINAPI wWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPWSTR lpCmdLine, int nShowCmd) {
        UNREFERENCED_PARAMETER(hPrevInstance);
        UNREFERENCED_PARAMETER(lpCmdLine);
    -#if 0 // used for Clang UBSan or printing debug message on console.
    +#if 1 // used for Clang UBSan or printing debug message on console.
        if (AttachConsole(ATTACH_PARENT_PROCESS)) {
            SetConsoleCtrlHandler(ConsoleHandlerRoutine, TRUE);
            freopen("CONOUT$", "w", stdout);
    
     

    Related

    Feature Requests: #1502


Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.