better string library / Discussion / Open Discussion: bstrList editability

gwideman - 2007-03-06

Paul: In using bstrList I found myself wanting to be able to append, insert or delete strings from the list. bstrlib doesn't support this -- bstrLists only come about as a result of "fully supervised" split operation, and are then readable and joinable, but no append/insert/delete.

So I considered adding those functions... until I noticed that expansion leads to realloc which might move the actual tagbstrings, invalidating any existing pointers an app might have to those tagbstrings. So it appears that bstrLists are intended for one-time create-and-populate, thereafter read-only.

I wondered if you had considered an array-list version where the list contains bstrings rather than tagbstrings, and thus would be more amenable to append/insert/delete?

FWIW, I sorely miss the ubiquitous and highly useful TStringList of Delphi (including generic Object pointer associated with each string), and since you've implemented more-or-less Pascal-style strings... :-)

Well, just a thought.

Graham

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Paul Hsieh - 2007-03-06
  
  Ok, ok, since you are the second person to ask, I have changed struct bstrList * to include a ->mlen entry, and indirected entry to be a bstring * pointer. These have been checked into CVS.
  
  I have not written any "list management functions"; and there is a complication here. Bstrlib supports the idea of a "backdoor" through the use of the BSTRLIB_MEMORY_DEBUG macro define. In the end what this lets you do is force Bstrlib to use some other malloc/realloc/free compatible set of functions with the flip of a switch. This is fine so long as there is no synchronization problem with upper level code and Bstrlib's internal code. Ordinarily there is no synch problem, because Bstrlib handles all memory issues inside of its module. However, in special cases, I have found that sometimes its best to manage an array of struct tagbstring top level structs, and allocate each ->data pointer individually at the callsite level. To make this work, what I do, is I allocate via calls to bstr2cstr() (I start by declaring a static empty struct tagbstring, and convert that to an empty char * string) and bcstrfree(). In this way, the malloc/free functions used are the ones inside of the Bstrlib module (other modules may use other malloc/free functions, including the standard one.) In this way all my problems are solved -- I can use different allocators for Bstrlib, and I can manage bstring headers in a very custom way that is not directly supported by the Bstrlib API, and yet I can still use the Bstrlib API in all its glory in all other aspects of the string manipulation.
  
  Now, with struct bstrList, things are different. The list is allocated inside the Bstrlib module essentially at the time it is created, with no obvious incremental auto-growing functionality that is exposed. So even if you went off and made some external utilities to manage these lists, the problem is that at the realloc() step, you would not be able to guarantee that your realloc() was in synch with the one used by Bstrlib (except by external enforcement of your build mechanism.)
  
  There are three possible ways of dealing with this: 1) Don't worry about it (since the scenario is fairly uncommon, except to people like me), 2) Expose bstr__realloc() semantics somehow in the API, 3) Expose a function which concatenates two struct bstrLists.
  
  The first method is what is currently (03/06/2007) in the CVS tree. Because of the inherent problem of desynchronized memory functions being used (which *I* consider to be a big deal, BTW, as I very commonly do this sort of trick) this option is not really desirable. The second method is really gross. It would encourage people to do direct manipulations of bstrings at their lowest level, which I don't want to do. The third method, while cleanest, is a bit of a slippery slope. If I include one such function, then why not include a whole host of others (and do I *need* to also include an allocmin)? At which point this might as well become an API for all sorts of data structures which happen to include bstrings. See, I think that ideally the plan would be to include just the one bstrList concatenation function, then provide all the other list manipulation functions in bstraux. But then this splits the API for struct bstrLists over two modules.
  
  Anyhow, since you are likely to be a consumer of such functionality, I am wondering what thoughts you have on it from your perspective.
  
  --
  Paul
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - gwideman - 2007-03-07
    
    Paul: Thanks for the thoughtful answer. Some observations:
    
    1. You are describing complementary and possibly competing concerns regarding one the one hand a clean and attractively capable "normal user" API, versus the back-door use involving knowledge of internals.
    
    I'm surmising that your comment:
    "I have found that sometimes its best to manage an array of struct tagbstring top level structs, and allocate each ->data pointer individually at the callsite level."
    ... is motivated by situations where you want to optimize speed or memory fragmentation, and then need special allocators either as part of the performance refinement or for troubleshooting.
    
    If I'm right, then it's OK for the "advanced performance" user to see a more complicated API (in effect you are already using a more-complicated API with your trick anyway).
    
    But I may not be seeing the whole picture here.
    
    2. As to what funcionality a bstrList API should offer. I'm highly sympathetic to the "slippery slope" caution :-). That said, my long experience with Delphi (and VB and C#) convinces me that the string array API is so hugely useful that it represents good value-for-effort, and a good plateau. I'll attach the class definition for TStringList in separate posting, but a couple of points:
    
    a) The array API is easy to understand and use with low vigilance.
    
    b) Internally, the underlying TStringList structure is array of pointers (more below) -- convenient for sequential access by loop or random access, but O(N) for insert, delete, and append beyond current alloc.
    Certainly string-intensive apps could use a data structure that's a better fit to what they do, which might entail additional member fields along with each bstring (and thus more custom to the particular app).
    But in many scenarios with modest length lists (including slurp entire text files), the speed rarely seems to matter (and Sort/Find helps) -- in other words it's a good tradeof vs the productivity of the developer not having to worry about managing the data structure.
    
    Ie: It gets at exactly the same productivity increase as bstrings vs plain C-strings.
    
    3) I note a missed parallel between current bstring and bstrList.
    
    We have
    typedef struct tagbstring * bstring;
    why not
    typedef struct tagbstrList * bstrList?
    
    ... and if they were more parallel, does that lead to the a bstrList version of your tagbstring trick for custom alloc?
    
    Graham
    
    ---------
    
    (TStringList info in subsequent posting)
    
    --------
    A Delphi string is a Pascal string in the sense of Length+array of chars, but it's a variable-length allocation, and the runtime manages allocation, reallocation and garbage collection (based on ref counting), plus it has copy-on-write semantics, so no copies for just StringA := StringB;
    So Array of Delphi strings is behind the scenes an array of pointers to the actual string data.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- gwideman - 2007-03-07
  
  //====================================================================
  // Public parts of TStrings and TStringList, annotated
  //====================================================================
  //----------------------------------
  TStrings = class(TPersistent)
  //----------------------------------
  [...]
  public
      //---------------- basics -------------
      property Capacity: Integer read GetCapacity write SetCapacity;
      property Count: Integer read GetCount;
      property Strings[Index: Integer]: string read Get write Put; default;
      property Objects[Index: Integer]: TObject read GetObject write PutObject;
  
      //------------- Nmae=Value functions -------------
      property Names[Index: Integer]: string read GetName;
      property Values[const Name: string]: string read GetValue write SetValue;
  
      //---------------- List manipulation --------------
      procedure Clear; virtual; abstract;
      function Add(const S: string): Integer; virtual;
      function AddObject(const S: string; AObject: TObject): Integer; virtual;
      procedure Append(const S: string);
      procedure AddStrings(Strings: TStrings); virtual;
      procedure Insert(Index: Integer; const S: string); virtual; abstract;
      procedure InsertObject(Index: Integer; const S: string; AObject: TObject); virtual;
      procedure Delete(Index: Integer); virtual; abstract;
      procedure Move(CurIndex, NewIndex: Integer); virtual;
      procedure Exchange(Index1, Index2: Integer); virtual;
      function IndexOf(const S: string): Integer; virtual;
      function IndexOfName(const Name: string): Integer; virtual;
      function IndexOfObject(AObject: TObject): Integer; virtual;
      function Equals(Strings: TStrings): Boolean;
  
      //------------- List as a single string (in or out) -------------
      property Text: string read GetTextStr write SetTextStr;
      property CommaText: string read GetCommaText write SetCommaText;
      property Delimiter: Char read GetDelimiter write SetDelimiter;
      property DelimitedText: string read GetDelimitedText write SetDelimitedText;
      property QuoteChar: Char read GetQuoteChar write SetQuoteChar;
      function GetText: PChar; virtual;
      procedure SetText(Text: PChar); virtual;
  
      //------------- I/O -------------
      procedure Assign(Source: TPersistent); override;
      procedure LoadFromFile(const FileName: string); virtual;
      procedure LoadFromStream(Stream: TStream); virtual;
      procedure SaveToFile(const FileName: string); virtual;
      procedure SaveToStream(Stream: TStream); virtual;
  
      //--------- Misc ------------
      property StringsAdapter: IStringsAdapter read FAdapter write SetStringsAdapter;
      procedure BeginUpdate;
      procedure EndUpdate;
  end;
  
  //----------------------------------
  TStringList = class(TStrings)
  //----------------------------------
  // TStringList inherits TStrings functions and implements actual storage.
  private
      FList: PStringItemList;
      FCount: Integer;
      FCapacity: Integer;
      FSorted: Boolean;
      FDuplicates: TDuplicates;
      FCaseSensitive: Boolean;
      FOnChange: TNotifyEvent;
      FOnChanging: TNotifyEvent;
  public
      //---------------- basics --------------
      // Most of the functionality is defined by TStrings, and implemented by
      // TStrings public virtual functions calling TStringList protected override functions.
      // Ie: See TStrings.
  
      //---------------- List manipulation --------------
      function Add(const S: string): Integer; override;
      function AddObject(const S: string; AObject: TObject): Integer; override;
      procedure Clear; override;
      procedure Delete(Index: Integer); override;
      procedure Exchange(Index1, Index2: Integer); override;
      procedure Insert(Index: Integer; const S: string); override;
      procedure InsertObject(Index: Integer; const S: string; AObject: TObject); override;
      function IndexOf(const S: string): Integer; override;
  
      //---------------- Sort, Find --------------
      // API functionality added by TStringList
      function Find(const S: string; var Index: Integer): Boolean; virtual;
      procedure Sort; virtual;
      procedure CustomSort(Compare: TStringListSortCompare); virtual;
      property Duplicates: TDuplicates read FDuplicates write FDuplicates;
      property Sorted: Boolean read FSorted write SetSorted;
      property CaseSensitive: Boolean read FCaseSensitive write SetCaseSensitive;
  
      //---------------- Events --------------
      property OnChange: TNotifyEvent read FOnChange write FOnChange;
      property OnChanging: TNotifyEvent read FOnChanging write FOnChanging;
  end;
  
  //====================================================================
  // Unedited declarations of TStrings and TStringList
  //====================================================================
  
  { TStrings class }
  
  TStringsDefined = set of (sdDelimiter, sdQuoteChar);
  
  TStrings = class(TPersistent)
  private
      FDefined: TStringsDefined;
      FDelimiter: Char;
      FQuoteChar: Char;
      FUpdateCount: Integer;
      FAdapter: IStringsAdapter;
      function GetCommaText: string;
      function GetDelimitedText: string;
      function GetName(Index: Integer): string;
      function GetValue(const Name: string): string;
      procedure ReadData(Reader: TReader);
      procedure SetCommaText(const Value: string);
      procedure SetDelimitedText(const Value: string);
      procedure SetStringsAdapter(const Value: IStringsAdapter);
      procedure SetValue(const Name, Value: string);
      procedure WriteData(Writer: TWriter);
      function GetDelimiter: Char;
      procedure SetDelimiter(const Value: Char);
      function GetQuoteChar: Char;
      procedure SetQuoteChar(const Value: Char);
  protected
      procedure DefineProperties(Filer: TFiler); override;
      procedure Error(const Msg: string; Data: Integer); overload;
      procedure Error(Msg: PResStringRec; Data: Integer); overload;
      function ExtractName(const S: string): string;
      function Get(Index: Integer): string; virtual; abstract;
      function GetCapacity: Integer; virtual;
      function GetCount: Integer; virtual; abstract;
      function GetObject(Index: Integer): TObject; virtual;
      function GetTextStr: string; virtual;
      procedure Put(Index: Integer; const S: string); virtual;
      procedure PutObject(Index: Integer; AObject: TObject); virtual;
      procedure SetCapacity(NewCapacity: Integer); virtual;
      procedure SetTextStr(const Value: string); virtual;
      procedure SetUpdateState(Updating: Boolean); virtual;
      property UpdateCount: Integer read FUpdateCount;
      function CompareStrings(const S1, S2: string): Integer; virtual;
  public
      destructor Destroy; override;
      function Add(const S: string): Integer; virtual;
      function AddObject(const S: string; AObject: TObject): Integer; virtual;
      procedure Append(const S: string);
      procedure AddStrings(Strings: TStrings); virtual;
      procedure Assign(Source: TPersistent); override;
      procedure BeginUpdate;
      procedure Clear; virtual; abstract;
      procedure Delete(Index: Integer); virtual; abstract;
      procedure EndUpdate;
      function Equals(Strings: TStrings): Boolean;
      procedure Exchange(Index1, Index2: Integer); virtual;
      function GetText: PChar; virtual;
      function IndexOf(const S: string): Integer; virtual;
      function IndexOfName(const Name: string): Integer; virtual;
      function IndexOfObject(AObject: TObject): Integer; virtual;
      procedure Insert(Index: Integer; const S: string); virtual; abstract;
      procedure InsertObject(Index: Integer; const S: string;
        AObject: TObject); virtual;
      procedure LoadFromFile(const FileName: string); virtual;
      procedure LoadFromStream(Stream: TStream); virtual;
      procedure Move(CurIndex, NewIndex: Integer); virtual;
      procedure SaveToFile(const FileName: string); virtual;
      procedure SaveToStream(Stream: TStream); virtual;
      procedure SetText(Text: PChar); virtual;
      property Capacity: Integer read GetCapacity write SetCapacity;
      property CommaText: string read GetCommaText write SetCommaText;
      property Count: Integer read GetCount;
      property Delimiter: Char read GetDelimiter write SetDelimiter;
      property DelimitedText: string read GetDelimitedText write SetDelimitedText;
      property Names[Index: Integer]: string read GetName;
      property Objects[Index: Integer]: TObject read GetObject write PutObject;
      property QuoteChar: Char read GetQuoteChar write SetQuoteChar;
      property Values[const Name: string]: string read GetValue write SetValue;
      property Strings[Index: Integer]: string read Get write Put; default;
      property Text: string read GetTextStr write SetTextStr;
      property StringsAdapter: IStringsAdapter read FAdapter write SetStringsAdapter;
  end;
  
  { TStringList class }
  
  TStringList = class;
  
  PStringItem = ^TStringItem;
  TStringItem = record
      FString: string;
      FObject: TObject;
  end;
  
  PStringItemList = ^TStringItemList;
  TStringItemList = array[0..MaxListSize] of TStringItem;
  TStringListSortCompare = function(List: TStringList; Index1, Index2: Integer): Integer;
  
  TStringList = class(TStrings)
  private
      FList: PStringItemList;
      FCount: Integer;
      FCapacity: Integer;
      FSorted: Boolean;
      FDuplicates: TDuplicates;
      FCaseSensitive: Boolean;
      FOnChange: TNotifyEvent;
      FOnChanging: TNotifyEvent;
      procedure ExchangeItems(Index1, Index2: Integer);
      procedure Grow;
      procedure QuickSort(L, R: Integer; SCompare: TStringListSortCompare);
      procedure SetSorted(Value: Boolean);
      procedure SetCaseSensitive(const Value: Boolean);
  protected
      procedure Changed; virtual;
      procedure Changing; virtual;
      function Get(Index: Integer): string; override;
      function GetCapacity: Integer; override;
      function GetCount: Integer; override;
      function GetObject(Index: Integer): TObject; override;
      procedure Put(Index: Integer; const S: string); override;
      procedure PutObject(Index: Integer; AObject: TObject); override;
      procedure SetCapacity(NewCapacity: Integer); override;
      procedure SetUpdateState(Updating: Boolean); override;
      function CompareStrings(const S1, S2: string): Integer; override;
      procedure InsertItem(Index: Integer; const S: string; AObject: TObject); virtual;
  public
      destructor Destroy; override;
      function Add(const S: string): Integer; override;
      function AddObject(const S: string; AObject: TObject): Integer; override;
      procedure Clear; override;
      procedure Delete(Index: Integer); override;
      procedure Exchange(Index1, Index2: Integer); override;
      function Find(const S: string; var Index: Integer): Boolean; virtual;
      function IndexOf(const S: string): Integer; override;
      procedure Insert(Index: Integer; const S: string); override;
      procedure InsertObject(Index: Integer; const S: string;
        AObject: TObject); override;
      procedure Sort; virtual;
      procedure CustomSort(Compare: TStringListSortCompare); virtual;
      property Duplicates: TDuplicates read FDuplicates write FDuplicates;
      property Sorted: Boolean read FSorted write SetSorted;
      property CaseSensitive: Boolean read FCaseSensitive write SetCaseSensitive;
      property OnChange: TNotifyEvent read FOnChange write FOnChange;
      property OnChanging: TNotifyEvent read FOnChanging write FOnChanging;
  end;
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Paul Hsieh - 2007-03-07
  
  Ok, there is no way I am adding all that to the core of Bstrlib. The most sensible thing that I can think of that I can do is to follow a scheme close to option 2) that I suggested above, but to make it a little cleaner by adding the following functions:
  
      int bstrListAlloc (struct bstrList * l, int msz);
      int bstrListAllocMin (struct bstrList * l, int msz);
      struct bstrList * bstrListCreate (void);
  
  which solves the creation and reallocation problem while maintaining the Bstrlib backdoor. Now, if you really really want a whole string list API, it is buildable from these primitives.
  
  Delphi/Pascal has automatically garbage collected strings. In C this kind of thing is not really possible to do automatically (unless you use the Boehm GC.) When bstrListDestroy() is called on the list, it will call bdestroy() on each of the strings. This suggests that any API which adds bstrings to a struct bstrList, should copy the bstring first. But this can sometimes be a redundant performance drain if, for example, its known that a bstring has no lifetime outside of the lifetime of the bstrList, then you might as well just use it directly, rather than making a copy of it.
  
  The performance problems continue when one considers splitting a list or joining lists together. In order to keep things sane it only makes sense to do whole list operations which necessarily retains list counts of 1 (i.e., at most one bstrList is containing any given bstring.) This means something like a bstrListConcat(bl1, bl2) would either have tofree the bl2 header (which is gross) or make more redundant copies of the bstrings for bl1. The latter is probably the only sensible route to go.
  
  Ok, but perhaps the more important point is that all of this has been solved in the C++ API by using struct CBStringList which is really just a public std::vector<CBString> with some split methods. The point being that no matter what I did in a general API it would be essentially a subset of what struct CBStringList already offers. So the only real value add that a potential C implementation would have is that it could enable a completely reference based implementation, where nothing is copied (this would generally be a lot faster). But then it is not really sensible to present a clean API since it wouldn't have the "closure property" that Bstrlib has (i.e., the property that no use of the API as a sequence of calls to it can lead to UB.)
  
  So the main choices are something reasonably clean and generally usable (with a closure property) that is slow and redundant with the C++ API, or to make something which is fast, but dangerous by exposing a lot of "by reference" programmer management at the higher levels. But the truth is that if speed is the only thing we are after, then in fact we should go an even more extreme version of my original implementation:
  
      struct bstrListFast {
          int qty;
          struct tagbstring entry[1];
      };
  
  which would manage the all bstring headers as a contiguous array, and whose data pointers would *reference* an original bstring, which would be required to have a lifetime that encloses the lifetime of the struct bstrListFast. Of course, I am not going to go with that, since its so ridiculously dangerous. OTOH, the user can easily implement exactly that based on the callback versions of the split functions.
  
  So anyways, I am leaning towards the idea of just supplying the three functions shown above and leaving the problem of implementing string lists as an "exercise to the user". If you have an other insight into this I'd be happy to hear it.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - gwideman - 2007-03-07
    
    Ah of course, we're back to the "who owns the string" problem, avoided in Delphi and other places by combination of GC and with the hit of copying minimized by copy-on-write. And as you say, not feasible in C (without special GC).
    And ultimately no one solution that fits all. Hence leads to inclination to have application programmer develop their own list so they're fully aware of the implications.
    
    > which solves the creation and reallocation
    So you were intending bstrListAllocMin() to also realloc, right?
    
    FWIW on: "there is no way I am adding all that to the core of Bstrlib."
    Actually, I was hoping this would show that there's *not that much* to it -- the core is just the obvious CRUD for this particular data structure: Append, Insert, Remove, Delete, Move, Swap plus file I/O (split on newline). My point was that *just* this basic functionality gets used over and over again. (As similar does in VB, C# etc).
    But, no amount of usefulness counters the basic physics of the situation, so to speak.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- gwideman - 2007-03-07
  
  So, continuing to stumble down the path that you've no doubt travelled long ago...
  
  On the speed, safety, simplicity-of-use dimensions bstrlib manages to provide closure (hence a level of safety) with actually reduced complexity-of-use and either minimal performance penalty or performance improvement (and overall vast performance increase if programmer performance is factored in.)
  
  It seems to achieve this by making a distinction between bstrings ("safe") and C strings ("engage brain while using"). It's clear when bstrlib gives you a copy and when it gives you a reference. Many string read or edit functions are provided through bstrlib (so caller doesn't worry about memory mgt). In other cases custom processing that reads bstring data can proceed at full speed because the C-strings within bstrings are exposed, yet accessed via macro based on the bstring, so that the app doesn't have its own char * that might get freed or be hanging around after freeing the bstring.
  
  Trying to get at the same tradeoff with data structures that hold bstrings (eg: string array) seems to be at best challenging and probably there's no clearly winning speed-safety-simplicity tradeoff.
  
  I see that the safety requirement seems to lead to a data structure library policy of "always copy": Ie: copy strings IN to the stucture *and* copy them OUT... no handing out duplicate references to bstrings stored in the structure. This would be safe and serve for some scenarios, but of course precipitate a whole lot of copying, memory fragmentation etc for code that actually has to perform any intensive *use* of the strings, eg: even just using the string table for lookups or whatever.
  
  The non-parallel between bstrlib and a string array-list lib (append, insert, ItemAt[index] etc) is that the latter is missing the ability to interact with bstring items without "getting" them from the list (...and getting them from the list leads to the ugly reference-or-copy issue).
  
  This gap would be filled by providing ways to operate on array list items without "getting" them from the list, which in turn suggests a parallel set of bstring functions that operate on (bstrlist, index), or a bstrlist function that takes a callback function to be performed on item[index] or the entire list (ie: iterator style).
  
  I think I'm convinced not to expect any particular approach to data structures as part of core bstrlib... but with bstrlib factoring out part of the problem and making it easy, the mental spotlight shifts over to thinking through how to make using lots of bstrings a happy experience!
  
  Graham
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

bstrList editability

Forums

Help

bstrList editability

bstrList editability

Forums

Help

bstrList editability document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

bstrList editability