Menu

#571 TOcDocument::LoadParts/SaveParts cause crashes with UNICODE character set

6.44
closed
1
2024-04-15
2024-03-20
No

Hi,

Recently my team migrated our project from an old version of OWL to OWL 7.0.12 (with UNICODE character set). After the migration we encountered a crash when calling TOcDocument::LoadParts function.

After some investigation we found out that the reason for this crash is that in new version of OWL, the TOcPart name is written and read as TCHAR array instead of char array.

Please, take a look at part of TOcDocument::SaveParts function:

TOcPart& part = *i.Current();
int16 len = int16(part.GetName().Length());

// Write the part name string, pascal style [len]+chars, no 0
//
ok = ToBool(HRSucceeded(stream.Write(&len, sizeof len, &count)) &&
            HRSucceeded(stream.Write((_TCHAR *)part.GetName(), len, &count)) &&
            part.Save(sameAsLoaded, remember));

The issue is that len is equal to a number of wchars (if the UNICODE character set is being used), however ocf::TOcStream::Write method expects a number of BYTES (it calls ISequentialStream::Write). This causes only half of the name to be written (and later read).

Of course, the TOcDocument::LoadParts also has to be adjusted.

Related

News: 2024/04/owlnext-7013-and-64423-updates
Wiki: OWLNext_Stable_Releases

Discussion

1 2 > >> (Page 1 of 2)
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-24
    • assigned_to: Marceli Sokólski
    • Group: unspecified --> 7
     

    Last edit: Vidar Hasfjord 2024-03-29
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-24

    @pralkopodobny wrote:

    The possible fix could look like this [...]

    Your analysis and solution look good to me. But please clarify with comments in the code. And I suggest renaming "len" to "byteCount", as length is often referring to string length (unit count in the character encoding).

    I have assigned this ticket to you as well, so that you can take control of fixing it. Since this is an internal issue and fix, not affecting the API, I have set milestone to 7.

    Again, thanks for your contribution!

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-24

    On second review, I spotted a potential issue with your fix.

    // Write the part name string, pascal style [len]+chars, no 0

    Presumably, "len" here should remain the length of the string in number of _TCHARs (not byte count), so make sure your fix doesn't change that. Which in turn means that LoadParts needs to also convert the length read back in, into byte count, to correctly read the string.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-24

    Note that beyond a simple minimalistic patch, there is an opportunity to improve the code here, thereby also fixing exception-safety issues with this code. Since these functions should return bool on failure, they should handle exceptions within. For example:

    //
    // Private function implementation wrapper providing common exception handling.
    //
    template <typename F>
    auto Try(F functionImplementation, std::source_location loc = std::source_location::current()) -> bool
    {
      const auto function_name = [&] { return owl::to_tstring(loc.function_name()); };
      try
      {
        return functionImplementation();
      }
      catch (const owl::TDiagException&)
      {
        throw; // Let diagnostics through.
      }
      catch ([[maybe_unused]] const owl::TXBase& x)
      {
        WARN(true, function_name() << _T(": TXBase: ") << x.why());
      }
      catch ([[maybe_unused]] const std::exception& x)
      {
        WARN(true, function_name() << _T(": exception: ") << owl::to_tstring(x.what()));
      }
      catch (...)
      {
        WARN(true, function_name() << _T(": unknown exception"));
      }
      return false;
    }
    
    //
    /// Saves the embedded parts to the provided storage.
    //
    auto TOcDocument::SaveParts(IStorage* iStorage, bool sameAsLoaded, bool remember) -> bool
    {
      return Try([&]
        {
          PRECONDITION(iStorage || sameAsLoaded);
          if (!Storage) return true;
    
          // If the storage passed is not the same as the one we loaded, wrap the new one and use it.
          // If we are switching storages, make sure parts are all pulled in now.
          //
          auto newStorage = sameAsLoaded ? nullptr : make_unique<TOcStorage>(iStorage);
          const auto storage = sameAsLoaded ? Storage : newStorage.get();
          if (!sameAsLoaded)
            for (auto i = TOcPartCollectionIter{PartCollection}; i; i++)
              i.Current()->FinishLoading();
    
          // Create a stream for part information.
          //
          auto statstg = STATSTG{};
          const auto rStatstg = storage->Stat(&statstg, STATFLAG_NONAME);
          if (FAILED(rStatstg)) throw TXOle{_T("TOcStorage::Stat failed"), rStatstg};
    
          auto stream = TOcStream{*storage, DocStreamName, true, statstg.grfMode};
    
          // Write TOcDocument data into stream.
          //
          auto byteCountWritten = ulong{};
    
          const auto rPartId = stream.Write(&PartID, sizeof PartID, &byteCountWritten);
          if (FAILED(rPartId)) throw TXOle{_T("Write (ID) failed"), rPartId};
    
          const auto partCount = PartCollection.Count();
          const auto rPartCount = stream.Write(&partCount, sizeof partCount, &byteCountWritten);
          if (FAILED(rPartCount)) throw TXOle{_T("Write (part count) failed"), rPartCount};
    
          for (auto i = TOcPartCollectionIter{PartCollection}; i; i++)
          {
            auto& part = *i.Current();
    
            // Write name as a Pascal-style string (length + characters, no null-terminator).
            //      
            const auto& name = part.GetName();
            const auto length = name.Length();
            const auto rLength = stream.Write(&length, sizeof length, &byteCountWritten);
            if (FAILED(rLength)) throw TXOle{_T("Part write (name length) failed"), rLength};
    
            const auto byteCount = static_cast<ulong>(length * sizeof(tchar));
            const auto rCharacters = stream.Write(static_cast<LPCTSTR>(name), byteCount, &byteCountWritten);
            if (FAILED(rCharacters)) throw TXOle{_T("Part write (name characters) failed"), rCharacters};
    
            // Save part.
            //
            const auto okSave = part.Save(sameAsLoaded, remember);
            if (!okSave) throw TXBase{_T("Part save failed")};
          }
    
          // Done. If we have new storage, commit it and deallocate the old.
          //
          if (newStorage)
          {
            const auto oldStorage = unique_ptr<TOcStorage>{Storage};
            Storage = newStorage.release();
          }
          return true;
        });
    }
    
     

    Last edit: Vidar Hasfjord 2024-03-27
  • Marceli Sokólski

    @vattila wrote:

    Your analysis and solution look good to me. But please clarify with comments in the code. And I suggest renaming "len" to "byteCount", as length is often referring to string length (unit count in the character encoding).

    I'll take that into account when I prepare a fix for this issue.

    @vattila wrote:

    Note that beyond a simple minimalistic patch, there is an opportunity to improve the code here, thereby also fixing exception-safety issues with this code.

    Good idea. I'll do something like what you suggested in a separate commit.

     
    👍
    1

    Last edit: Marceli Sokólski 2024-03-25
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-25

    @pralkopodobny wrote:

    Good idea. I'll do something [about the exception-safety] in a separate commit.

    Yes, it is wise of you to separate the fix for the Unicode issue from further code improvement. We probably should only make the latter changes on the trunk, since they will subtly change the API (by changing which exceptions are propagated), thereby possibly affecting client code. The 7 branch is now stable and should not have any changes that may require client code rewrite.

    Improving exception-safety is applicable to the core OWL library as well, so my Try wrapper may belong in "owl/except.h" for wider reuse. As written, it also requires C++20 (due to the use of std::source_location), which is only supported by the Microsoft compiler, so far.

    For more about exception-safety, see Exceptions and OWLNext.

     

    Related

    Wiki: Exceptions_and_OWLNext

  • Marceli Sokólski

    Hi,

    I committed [r6901].
    It's my first time working with svn so could you please check if everything is ok?

    I locally merged this change with 7.0.12 and tested it in our application. It seems to work.

     
    🎉
    1

    Related

    Commit: [r6901]

  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-25

    @pralkopodobny wrote:

    It's my first time working with svn so could you please check if everything is ok?

    Good, your commit looks OK here. Your revision [r6901] has been committed on the branch correctly, and your commit message is following our conventions (except for missing the period/full stop we like to use, to indicate that the message is not truncated). Congratulations on your first commit — may there be many more!

    Here's further code review:

    • Nice and minimal, hence applicable to branches 7 and 6.44. Good!
    • TOcDocument::LoadParts: Buffer check is wrong. len >= sizeof(name) should be len >= std::size(name) or len >= COUNTOF(name) (for compatibility with 6.44).
    • TOcDocument::LoadParts: Consider increasing the buffer size for 'name', unless 33 is guaranteed to be sufficient (in which case a named constant should ideally be used, not a magic number).
     

    Related

    Commit: [r6901]

  • Marceli Sokólski

    @vattila wrote:

    Congratulations on your first commit

    Thank you :)

    I believe the name limitation is (or at least was) 32 characters (here's old post complaining about this).

    @vattila wrote:

    Buffer check is wrong.

    You're right, I accidentally doubled number of allowed characters in name when unicode character set is used.

    Could you tell me what is preferred convention for named constants (I was thinking about something like this

    static constexpr auto MaxPartNameSize = 32
    

    and then

    _TCHAR name[MaxPartNameSize + 1];
    int16 len;
    
    // Read the part name string, pascal style [len]+chars, no 0
    //
    if (!HRSucceeded(stream.Read(&len, sizeof(len))))
      return false;
    
    if (len >= std::size(name)) {
      // NOTE: Should *never* happen!
      //
      return false;
    }
    
     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-25

    static constexpr auto MaxPartNameSize = 32

    Looks good to me, if defined in a class. If as a local variable, start with a lower-case letter. If as an implementation detail in file scope, don't use the static-keyword, but use an anonymous namespace instead, and add a trailing underscore.

    And add a code comment explaining the 32 character limit. Also, document the limit it in the SaveParts function documentation (a Doxygen-style comment, using "\note").

     

    Last edit: Vidar Hasfjord 2024-03-25
  • Marceli Sokólski

    The len check should be fixed in commit [r6902].

     
    👍
    1

    Related

    Commit: [r6902]

  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-25

    The len check should be fixed in commit [r6902].

    Quick review (including nitpicking, so you get the hang of our conventions):

    • Looks functionally good to me.
    • Indentation is 2 spaces.
    • Code comments follow general writing conventions (not programmer speak) and should use normal capitalisation ("API" not "api") and punctuation (you are missing full stop at the end of the comment).
    • Comments should be separated from the commented upon entity by an empty comment line ("//").
    • I like to close namespaces with a comment "// namespace", and anonymous namespaces by "// anonymous namespace". But for a small namespace like this it isn't needed.
    • Follow general writing conventions (not programmer speak) in commit messages as well, including full stop. Note that we have set up Subversion so that you can edit and improve log messages retrospectively.

    Additional notes about unrelated issues:

    • Include DocStreamName in your new namespace as well. Leaving it out makes the code odd and inconsistent. You can rename it with a trailing underscore to follow our conventions. To be diligent to conventions about focused commits, make this change in a separate commit.
    • Take the opportunity to move "using namespace owl" outside of the "ocf" namespace. I don't think it matters, but it would be more consistent with the OWL source, I think. Make this change in a separate commit, though, and for all of the source in one go.
    • Not sure about the DIAG_DECLARE_GROUP, should it be in or out?
     

    Related

    Commit: [r6902]


    Last edit: Vidar Hasfjord 2024-03-25
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-25

    @pralkopodobny, it just occurred to me that we should check the size of the written part name as well, so that the code never writes an invalid file (with a part name length larger that 32). Truncation is not an option, so we'll have to introduce an error here (return 'false')? If so, a WARN statement should be added, also.

     

    Last edit: Vidar Hasfjord 2024-03-25
  • Marceli Sokólski

    I checked on svn website how to change message of a commit. I found out that I should use

    svn propset -r N --revprop svn:log "new log message"
    

    where N is a revision number. So I tried to add a full stop at the end of message by running

    svn propset -r 6902 --revprop svn:log "BUG: OCF: TOcDocument::LoadParts incorrect len check when UNICODE is set."
    

    But I can not see the change on sourceforge. Have I made a mistake?

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-25

    @pralkopodobny wrote:

    I can not see the [log message] change on sourceforge. Have I made a mistake?

    It is a quirk of the Allura software at SourceForge: The Commit pages are static and based on the original log messages. However, you can see your edited message in the Code Commit Log (press the History icon in the Code section).

    So, no mistake, you are doing very well!

    PS. I highly recommend TortoiseSVN, if you are working on Windows.

     

    Last edit: Vidar Hasfjord 2024-03-29
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-25

    @pralkopodobny, regarding tickets, another convention you should follow is to document your changes to the code here in the ticket comments by referring to the revisions and the branches updated and merged. See my recent tickets for how I do it.

    Alternatively, you can add this documentation in the ticket description itself, which is especially beneficial if the comment section is long and it is hard to find the revision documentation in amongst all the posts.

    That said, you are already doing pretty good on this!

     

    Last edit: Vidar Hasfjord 2024-03-29
  • Marceli Sokólski

    Small refactor regarding convention was committed to trunk in [r6903].

     
    👍
    1

    Related

    Commit: [r6903]

  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-26

    @pralkopodobny wrote:

    len >= std::size(name)

    To eliminate warnings, this requires a static_cast. The statement should also be extended to check for len < 0. And it would be a good thing to add a WARN statement here as well. And the buffer can be created after the check, thus closer to use.

    For example:

    -    for (int i = (int)value; i; i--) {
    -      _TCHAR name[MaxPartNameSize_ + 1];
    +    for (auto i = value; i; --i)
    +    {
    +      // Read the part name string (Pascal-style len + chars, no null-termination).
    +      //
           int16 len;
    -
    -      // Read the part name string, pascal style [len]+chars, no 0
    -      //
           if (!HRSucceeded(stream.Read(&len, sizeof(len))))
             return false;
    
    -      if (len >= std::size(name)) {
    -        // NOTE: Should *never* happen!
    -        //
    -        return false;
    -      }
    +      const auto isOutOfRange = len < 0 || len > MaxPartNameSize_;
    +      WARN(isOutOfRange, _T("TOcDocument::LoadParts: Part name length (") << len
    +        << _T(") is out of range [0, ") << MaxPartNameSize_ << _T(']'));
    +      if (isOutOfRange) return false;
    
    +      _TCHAR name[MaxPartNameSize_ + 1];
           if (!HRSucceeded(stream.Read(name, len * sizeof(_TCHAR))))
             return false;
    -      name[len] = 0;  // 0 was not written
    +      name[len] = _T('\0'); // Null-terminate, as 0 was not written.
    

    Note that I removed the old comment here, since this can happen (the file may be corrupt).

    In the future suggested overhaul for exception-safety, the WARN and return statements can be replaced by throwing an exception.

    PS. Note that I presume len == 0 is allowed here. But is it? If an empty string is not allowed, update my range check above. Also, if an empty string is not allowed, it should also never be written, so SaveParts needs to be updated to check for empty string, as well.

     

    Last edit: Vidar Hasfjord 2024-04-11
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-27

    I spotted yet another issue in LoadParts: It performs suspect exception handling for catching the case where TOcStream construction fails. Apparently, in this case, failure is OK and just means that there are no parts (if we trust the code comment). However, the exception handler is too broad. It catches all TXObjComp exceptions and treats them all as OK.

    catch (TXObjComp&) {
      // There is no part related info stream. Thats OK, we then have a document
      // with no parts
    }
    return true;
    

    It may be that this is just fine here, but I doubt it. In any case, it would be better to only handle the exception due to stream open failure and let any other exceptions through unhandled. Fortunately, TXObjComp has an ErrorCode that can be checked:

    catch (const TXObjComp& x)
    {
      if (x.ErrorCode == TXObjComp::xStreamOpenError)
      {
        // There is probably no part related info stream, but that is OK. It just means that we
        // have a document with no parts. So, we assume things are fine, and return successfully.
        //
        return true;
      }
      else throw; // Let other error codes through.
    }
    

    Note that this code improvement needs to be reviewed and tested, to ensure that stream construction failure actually throws TXObjComp with the xStreamOpenError code.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-29
    • summary: TOcDocument::ReadParts/SaveParts cause crashes with UNICODE character set --> TOcDocument::LoadParts/SaveParts cause crashes with UNICODE character set
     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-29

    Another issue: File format.

    Currently, and with the committed fix [r6901], the string encoding, and hence file format, used for saving and restoring TOcDocument changes between ANSI and UNICODE build modes. It may be argued that the file format should be stable and unaffected by the build mode, in which case we should do string conversion, if needed, when writing and reading strings.

    To me it would make sense to always read and write Unicode (UTF-8, preferably), since that makes the file format consistent. Writing ANSI text makes the file format dependent on the particular ANSI code page in effect at runtime on the user's particular system.

    However, using Unicode consistently would be a breaking change for client applications using the ANSI build mode.

    What do you think?

    PS. Is there currently any kind of indication to the OCF application about the ANSI code page used to save a document when you try to read it back in?

     

    Related

    Commit: [r6901]


    Last edit: Vidar Hasfjord 2024-04-11
  • Marceli Sokólski

    @vattila said:

    Another issue: File format.

    This is actually a real problem I encountered in our application.
    In OWL 6.20 the document format was always in char.
    After a migration, I have to detect if the old stored data is in old format and manually convert it to a new format if necessary.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-29

    @pralkopodobny wrote:

    [File format inconsistency] is actually a real problem I encountered in our application.

    That is convincing evidence to me that changing the string encoding, depending on the build mode, is the wrong thing to do.

    If we chose to use UTF-8 consistently, we should get pretty good backwards compatibility for applications sticking primarily to ASCII. Does your application use characters outside the ASCII set for its documents?

    It would be good to hear from other OCF users on this (@elotter and @sebas_ledesma, in particular).

    Edit: Here is my proposal for how to deal with this issue:

    • For backwards compatibility, always read and write narrow strings for part names.
    • Always use UTF-8 to write part names.
    • If a part name that has been read back in is valid UTF-8, treat it as such and convert it to ANSI/UNICODE, depending on the build mode.
    • If it is not valid UTF-8 (and hence not valid ASCII, either), assume it is a legacy ANSI string, and try to convert it to ANSI/UNICODE depending on the build mode. If the encoding cannot be deduced and/or the conversion fails, use replacement characters for the part of the string that failed conversion.
     

    Last edit: Vidar Hasfjord 2024-03-30
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-03-29

    See also open ticket "TStorageDocument::GetProperty fails in Unicode build mode" [bugs:#330].

     

    Related

    Bugs: #330

1 2 > >> (Page 1 of 2)

Log in to post a comment.