Scintilla / Feature Requests / #1458 Improve UniqueString and CaseConvert

Neil Hodgson - 2022-10-22

I understand a desire to standardize on one approach to similar issues but don't think all of these are worthwhile. The patch does not apply cleanly to current Scintilla and contains the non-portable __builtin_strlen.

In particular, the use of UniqueString for the undo stack is a low benefit change where there are large potential improvements to make. Many undo actions are just for a single byte (user typing or pressing delete multiple times) and the memory allocation for that is wasteful with an independent allocation of a single byte string. This is aligned up, generally to 16 bytes and allocators commonly use around 2 pointer/size items for management: so 32 bytes of memory to store 1 byte of data. As the undo stack is strictly LIFO, the text of all actions can be stored in one large allocation much more efficiently. Creating a dependence on UniqueString (and an inverse requirement for UniqueString to handle undo actions) is an unnecessary sideways step.
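
To make the single-allocation idea concrete, here is a minimal sketch. It is not Scintilla's implementation: UndoTextStack and its members are hypothetical names. It assumes only that actions are added and removed at the top of the stack, so the text of the newest action is always the tail of one shared buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical sketch: the text of every undo action lives in one growing buffer.
// Each action records only its length; because the stack is strictly LIFO,
// the newest action's text is always the last `length` bytes of the buffer.
class UndoTextStack {
    std::string text;                 // concatenated text of all actions
    std::vector<std::size_t> lengths; // text length of each action, oldest first
public:
    // Add the text of a new action at the top of the stack.
    void Push(std::string_view actionText) {
        text.append(actionText);
        lengths.push_back(actionText.size());
    }
    // View the newest action's text; the view is valid until the next Push or Pop.
    std::string_view Top() const {
        assert(!lengths.empty());
        return std::string_view(text).substr(text.size() - lengths.back());
    }
    // Discard the newest action together with its text.
    void Pop() {
        assert(!lengths.empty());
        text.resize(text.size() - lengths.back());
        lengths.pop_back();
    }
    std::size_t Actions() const noexcept { return lengths.size(); }
    std::size_t Bytes() const noexcept { return text.size(); }
};
```

With this layout a one-byte action costs one byte of text plus one length entry, instead of a separate heap allocation per action.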

UniqueCopy uses new instead of std::make_unique, perhaps (given the make_unique_for_overwrite comment) to avoid initialisation but new always initializes. Having a templated UniqueCopy implies it will be instantiated over more types but it's only used over char.

Recalculating string length is insignificant when there was just a string compare loop and there is about to be an allocation.

Stack allocation for searchThing could be an improvement but I'd want to see measurements as it's likely to be swamped by other code.

For the list box, it may be worthwhile simplifying the code at the expense of some memory by shifting to std::string in ListItemData as it's unlikely the benefit of the current approach is significant.

- Zufu Liu - 2023-06-09
  
  https://github.com/zufuliu/notepad2/commit/29e8b146929ef368e64c408d3b06c8694f6b8819 and https://github.com/zufuliu/notepad2/commit/18aa425697bf6ffa718ae8ef16ec92ffcf994be0 implement a small string optimization for undo actions by using the structure padding (sizeof(size_t) - 2 bytes, so 2 bytes on 32-bit systems and 6 bytes on 64-bit systems). No custom move constructor is needed.
  

Zufu Liu - 2022-10-23

UniqueCopy() indeed does not provide much benefit (it only avoids the memset after calling new).

For undo actions, maybe short data could be stored inline?

class Action {
public:
    ActionType at;
    bool mayCoalesce;
    char inlineData[3];
    Sci::Position position;
    char *data;
    Sci::Position lenData;
    const char *getData() const noexcept {
        return data;
    }
};

void Action::Clear() noexcept {
    if (lenData > sizeof(inlineData)) {
        delete[] data;
    }
    data = nullptr;
    lenData = 0;
}

Edit: the Action change does not work (random crash).

Last edit: Zufu Liu 2022-10-23
- Zufu Liu - 2022-10-28
  
  The crash was caused by the lack of a custom move constructor (with the copy constructor and the copy/move assignment operators deleted).
  
  The following is a small change (making sizeof(Action) == 4*sizeof(size_t) and declaring the default constructor as Action() noexcept = default;) that makes Scintilla.dll 512 bytes smaller (VS2022 x64).
  
  Action-1028.diff
  
  - Neil Hodgson - 2024-01-27
    
    Committed similar to Action-1028.diff as [341a21].
    
    Related
    
    Commit: [341a21]
    

Zufu Liu - 2022-10-29

The following are some changes for CaseConvert.cxx:
https://github.com/zufuliu/notepad2/commit/d862dda14bc08a87083f24d67d06db1aa7a7fe64

AddSymmetric() and SetupConversions() are changed to member functions of CaseConverter to remove the extra conversion comparison.

SetupConversions() is moved into ConverterForConversion() to remove duplicated code. It might be better to use std::unreachable() for the switch.

maxConversionLength is increased to 7 to simplify the CharacterConversion constructor.

Code simplification for AddSymmetric() and SetupConversions().

CaseConvert-1029.diff
- Neil Hodgson - 2022-11-11
  
  // use 7 to remove padding bytes on ConversionString structure
  
  Adding an extra byte so there isn't padding has no justification. The byte isn't going to be used. This appears to me to be motivated by a dumb linter that is pointing out where there is padding. If so, just turn it off.
  
  memcpy(conversion.conversion, conversion_, maxConversionLength + 1)
  
  This is copying garbage uninitialized memory (0xcc in MSVC debug) into conversion.conversion and making the code more fragile to changing requirements. It's a string: use its actual length. Isn't reading uninitialized memory like this undefined behaviour?
  

Zufu Liu - 2022-10-29

noexcept on ConverterForConversion() needs to be removed.


Zufu Liu - 2022-11-11

Updated the comment for maxConversionLength to // use 7 to make sizeof(ConversionString)==8 and simplify CharacterConversion's constructor.

Using memcpy() in CharacterConversion's constructor simplifies the code (the 8-byte copy will be inlined). The const char *conversion_ parameter is at least 8 bytes long and always NUL-terminated in both AddSymmetric() and SetupConversions().
The zero-initialized char converted[maxConversionLength + 1]{}; in AddSymmetric() is not needed as UTF8FromUTF32Character() always writes a NUL.
If const char *conversion_ should be treated as an untrusted string, then memcpy() can be replaced with strcpy() or strncpy().
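
For the untrusted-string case, a bounded copy that reads only the source's own length could look like the sketch below. It is an assumption-laden illustration, not code from either patch: ConversionString here is a simplified stand-in, MakeConversion is a hypothetical helper, and std::string_view::copy is used in place of strcpy() or strncpy() (strncpy() does not write a terminating NUL when it truncates).

```cpp
#include <cstddef>
#include <string_view>

// Stand-in value for illustration; the thread debates 6 versus 7.
constexpr std::size_t maxConversionLength = 6;

// Simplified stand-in for the structure discussed above.
struct ConversionString {
    char conversion[maxConversionLength + 1]{}; // zero-filled, so always NUL-terminated
};

// Hypothetical helper: copy at most maxConversionLength bytes of the source.
// Only bytes that belong to the source string are read, and the final
// array element is never overwritten, so the result stays NUL-terminated.
inline ConversionString MakeConversion(std::string_view source) {
    ConversionString result;
    source.copy(result.conversion, maxConversionLength);
    return result;
}
```

The cost relative to a fixed-size memcpy() is a length check, in exchange for never reading past the end of the source.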

CaseConvert-1111.diff


Zufu Liu - 2022-11-28

Changes that do not touch the ConversionString and CharacterConversion structures.

CaseConvert-1128.diff

- Neil Hodgson - 2022-11-28
  
  All the detail code for copying chars around is repetitive and easy to get wrong so it may be better to extend the use of string_view to raise the level of abstraction. Here's a version based on an earlier patch that uses string_view more for decoding the data.
  
  CaseConvert.patch
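
As an illustration of the string_view approach (a sketch, not the contents of CaseConvert.patch): NextField is a hypothetical helper, and the '|' separator is an assumption about how the packed conversion data is delimited.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: return the text before the next separator and
// advance `remaining` past that separator. When no separator is left,
// the rest of the input is returned and `remaining` becomes empty.
inline std::string_view NextField(std::string_view &remaining, char separator = '|') {
    const std::size_t end = remaining.find(separator);
    const std::string_view field = remaining.substr(0, end);
    remaining.remove_prefix(end == std::string_view::npos ? remaining.size() : end + 1);
    return field;
}
```

Each call yields one field as a view into the original data, so no characters are copied and no manual index arithmetic is needed at the call site.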
  
  - Zufu Liu - 2022-11-29
    
    OK, your patch is more modern than mine.
    

Neil Hodgson - 2022-12-01

summary: Some code refactoring and optimization [2022-10] --> Improve UniqueString and CaseConvert

Neil Hodgson - 2022-12-01

Changed title since the original didn't say anything


Neil Hodgson - 2022-12-01

Committed updated CaseConvert changes with [50eec5].

Improving undo is a larger project.

Couldn't see any clear benefits in changes to UniqueString / UniqueCopy.

Related

Commit: [50eec5]


Neil Hodgson - 2022-12-01

Group: Initial --> Committed

Neil Hodgson - 2022-12-06

status: open --> closed

Here is the Action class I currently use, which avoids allocation for small data.
https://github.com/zufuliu/notepad2/blob/main/scintilla/src/CellBuffer.h#L35

class Action {
    static constexpr Sci::Position smallSize = sizeof(size_t) - 2;
public:
    ActionType at = ActionType::insert;
    bool mayCoalesce = false;
    char smallData[smallSize]{};
    Sci::Position position = 0;
    Sci::Position lenData = 0;
    std::unique_ptr<char[]> data;

    Action() noexcept = default;
    void Create(ActionType at_, Sci::Position position_ = 0, const char *data_ = nullptr, Sci::Position lenData_ = 0, bool mayCoalesce_ = true);
    void Clear() noexcept;
    const char *Data() const noexcept {
        return (lenData <= smallSize) ? smallData : data.get();
    }
};
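
The class above only declares Create() and Clear(). The following is a minimal sketch of how they might be defined. It is an assumption for illustration, not the Notepad2 source: the Sci::Position and ActionType definitions are stand-ins so the block compiles on its own, and the bodies of Create() and Clear() are guesses at the obvious behaviour.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Stand-ins for illustration only; the real definitions live in Scintilla.
namespace Sci { using Position = std::ptrdiff_t; }
enum class ActionType : unsigned char { insert, remove, container };

class Action {
    static constexpr Sci::Position smallSize = sizeof(size_t) - 2;
public:
    ActionType at = ActionType::insert;
    bool mayCoalesce = false;
    char smallData[smallSize]{};
    Sci::Position position = 0;
    Sci::Position lenData = 0;
    std::unique_ptr<char[]> data;

    Action() noexcept = default;
    void Create(ActionType at_, Sci::Position position_ = 0, const char *data_ = nullptr, Sci::Position lenData_ = 0, bool mayCoalesce_ = true);
    void Clear() noexcept;
    const char *Data() const noexcept {
        return (lenData <= smallSize) ? smallData : data.get();
    }
};

// Assumed behaviour: short text is copied into the padding bytes,
// longer text goes into a heap allocation owned by `data`.
void Action::Create(ActionType at_, Sci::Position position_, const char *data_, Sci::Position lenData_, bool mayCoalesce_) {
    data.reset();
    at = at_;
    mayCoalesce = mayCoalesce_;
    position = position_;
    lenData = lenData_;
    if (lenData_ > smallSize) {
        data = std::make_unique<char[]>(static_cast<size_t>(lenData_));
        std::memcpy(data.get(), data_, static_cast<size_t>(lenData_));
    } else if (lenData_ > 0) {
        std::memcpy(smallData, data_, static_cast<size_t>(lenData_));
    }
}

// Assumed behaviour: release any heap text and mark the action as empty.
void Action::Clear() noexcept {
    data.reset();
    lenData = 0;
}
```

Because Data() selects the buffer from lenData instead of storing a pointer into the object itself, the compiler-generated move operations leave a moved-to Action in a valid state.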

The new UndoHistory is a little slower. Comparing Notepad2's develop branch (with the new UndoHistory) against the main branch (which uses the Action class above) when replacing every dot with a comma in 0N 9V.txt (see [feature-requests:#1502] and https://github.com/notepad-plus-plus/notepad-plus-plus/issues/10930#issuecomment-998760967; rename the file to 0N 9V.log to use a monospaced font), the develop branch is 50 ms slower (tested on i3 and i5).

Apply the following change to show the time in the command prompt:

diff --git a/src/Edit.c b/src/Edit.c
index ba0cb979..1ed1f9bd 100644
--- a/src/Edit.c
+++ b/src/Edit.c
@@ -5845,7 +5845,7 @@ bool EditReplaceAll(HWND hwnd, LPCEDITFINDREPLACE lpefr, bool bShowInfo) {
    // Show wait cursor...
    BeginWaitCursor();
    SendMessage(hwnd, WM_SETREDRAW, FALSE, 0);
-#if 0
+#if 1
    StopWatch watch;
    StopWatch_Start(watch);
 #endif
@@ -5887,7 +5887,7 @@ bool EditReplaceAll(HWND hwnd, LPCEDITFINDREPLACE lpefr, bool bShowInfo) {
        }
    }

-#if 0
+#if 1
    StopWatch_Stop(watch);
    StopWatch_ShowLog(&watch, "EditReplaceAll() time");
 #endif
diff --git a/src/Notepad2.c b/src/Notepad2.c
index 138d696c..9e2e28be 100644
--- a/src/Notepad2.c
+++ b/src/Notepad2.c
@@ -519,7 +519,7 @@ BOOL WINAPI ConsoleHandlerRoutine(DWORD dwCtrlType) {
 int WINAPI wWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPWSTR lpCmdLine, int nShowCmd) {
    UNREFERENCED_PARAMETER(hPrevInstance);
    UNREFERENCED_PARAMETER(lpCmdLine);
-#if 0 // used for Clang UBSan or printing debug message on console.
+#if 1 // used for Clang UBSan or printing debug message on console.
    if (AttachConsole(ATTACH_PARENT_PROCESS)) {
        SetConsoleCtrlHandler(ConsoleHandlerRoutine, TRUE);
        freopen("CONOUT$", "w", stdout);

Feature Requests: ~~#1502~~
