#9 Store rope data as a byte array when possible

closed
nobody
None
5
2010-03-13
2010-03-12
Nathan Reynolds
No

The rope template allows for CharT to be defined as a variety of data types from byte-sized to 2-byte or even 4-byte sized. When sizeof(CharT) > 1, many strings could be compacted to save memory.

For example, all ropes containing characters with a value < 0x100 could be compacted so that each character takes only a single byte. This does not mean that each character is converted to UTF-8. It means that the upper bytes are simply thrown away (since they are 0x00 anyway). When the program requires a character, the byte is reinterpret_cast to CharT and the upper bytes are set to 0 again.

Let's say a program has English strings and Japanese strings. The Japanese strings will have characters which require 2-bytes. These strings won't be compacted. The English strings will have characters which require only 1-byte each. These strings will be compacted. Long strings with a single Japanese character might be stored as 3 substrings. 1 compacted substring that comes before the Japanese character. 1 compacted substring that comes after the Japanese character. 1 substring that has the Japanese character.

In a server where rope is defined with a 2-byte CharT, compact strings will save 1/2 of the string data memory. HotSpot JVM did this and saw not only the memory savings but also a performance improvement for applications where performance is memory bandwidth constrained.

Of course, c_str() will force a copy of the rope data to be an array of CharT. If the copy is used as the sole copy of the rope, then it could be reverted back to byte format when delete_c_str() is called.

Discussion

  • Ok. Please, describe _algorithm_ of 'compacting' and recovery of string (assume, chars converted in some multibyte representation): "Вопрос на $0.02".

     
  • If a character requires a multi-byte representation, then this feature can not be used. The standard rope implementation would be used. This feature would only be used if each character only requires a single byte.

     
  • Here's a char_producer that converts from wchar_t into a byte array and then back again as needed.

    class CharProducerUsingBytes : public char_producer<wchar_t>
    {
    public:
    CharProducerUsingBytes(const wchar_t *source, size_t length)
    {
    unsigned char *pos;

    m_data = new unsigned char[length]; // Assumes sizeof(unsigned char) == 1
    pos = m_data; // Use a local variable for optimization speed.

    while (length-- > 0) // Post-decrement used so the logic can compare to 0 and deal with underflow
    *pos++ = (unsigned char) *source++; // Copy each character truncating to use only the lower 8-bits.
    }

    virtual ~CharProducerUsingBytes()
    {
    delete [] m_data;
    m_data = NULL;
    }

    virtual void operator()(size_t startPos, size_t length, wchar_t *buffer) const
    {
    unsigned char *pos;

    pos = m_data; // Use a local variable for optimization speed.

    while (length-- > 0) // Post-decrement used so the logic can compare to 0 without worrying about underflow
    *buffer++ = (wchar_t) *pos++; // Copy each character extending the value so that only the lower 8-bits can be non-zero.
    }

    private:
    unsigned char *m_data;
    };

     
  • CharProducerUsingBytes implmentation

     
  • Fantastic raving.

     
    • status: open --> closed