stringByMatching... seg fault

mchartier
2007-10-31
2013-04-24
  • mchartier

    mchartier - 2007-10-31

    Is there a maximum size for the input string passed to the stringByMatching:replace:withReferenceString: method?

    With smaller input strings, it works fine. However, with an input NSString of about 30K, the method consistently seg faults.

    Here is a snippet of the code:

    NSError* err = nil;  // declared here so the snippet stands alone
    NSXMLDocument* doc = [[NSXMLDocument alloc] initWithContentsOfURL:url options:NSXMLDocumentTidyXML error:&err];
    // Escape the pretty-printed XML so that tags appear as &lt;...&gt; in the string.
    NSString* preXML = (NSString*)CFXMLCreateStringByEscapingEntities(NULL, (CFStringRef)[doc XMLStringWithOptions:NSXMLNodePrettyPrint], NULL);
    NSString* elementMod = @"<b>${1}</b>";

    // Wrap every escaped tag in <b>...</b>.
    NSString* formattedXML = [preXML stringByMatching:@"(&lt;/?\\w*[^\\&]*&gt;)" replace:RKReplaceAll withReferenceString:elementMod];

    Any clues???

     
    • John Engelhart

      John Engelhart - 2007-11-01

      Off the top of my head, I'd say you're running into a stack size issue.

      You mentioned you were working on a Quick Look plugin, so that means 10.5.  You can use the NSThread instance method stackSize, for example [[NSThread currentThread] stackSize], to check the size of the stack for the thread that you're executing under.

      RegexKit uses the stack extensively for speed, especially when doing bulk replacements.  Unfortunately, what happens when you walk off the end of the stack is sorta undefined, and there's no simple, portable way to determine how much stack space is left (10.5 finally added stackSize, but there are no methods like 'stackRemaining' or 'stackUsed').
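
      If the stack really is the limit, one possible workaround on the caller's side is to run the stack-hungry work on a thread created with a larger stack. The following is only a sketch in plain C with POSIX threads, not anything RegexKit provides; run_with_stack, touch_big_buffer, and BIG_BUFFER_BYTES are hypothetical names made up for the illustration, and some toolchains need -pthread to build it.

      ```c
      #include <pthread.h>
      #include <stddef.h>

      /* Hypothetical size used only by the demo worker below. */
      #define BIG_BUFFER_BYTES ((size_t)4 * 1024 * 1024)

      /* Hypothetical helper: run fn(arg) on a new thread whose stack is
         stack_bytes large, wait for it to finish, and store its return
         value in *result. Returns 0 on success or a pthread error code. */
      int run_with_stack(size_t stack_bytes, void *(*fn)(void *), void *arg,
                         void **result)
      {
          pthread_attr_t attr;
          pthread_t thread;
          int rc;

          rc = pthread_attr_init(&attr);
          if (rc != 0) return rc;

          rc = pthread_attr_setstacksize(&attr, stack_bytes);
          if (rc == 0) rc = pthread_create(&thread, &attr, fn, arg);
          pthread_attr_destroy(&attr);
          if (rc != 0) return rc;

          return pthread_join(thread, result);
      }

      /* Demo worker: touches a 4 MB local buffer, which would not fit on a
         512K thread stack. Writes the buffer size through arg. */
      void *touch_big_buffer(void *arg)
      {
          volatile char buffer[BIG_BUFFER_BYTES];
          size_t i;

          for (i = 0; i < BIG_BUFFER_BYTES; i += 4096) buffer[i] = 1;
          *(size_t *)arg = BIG_BUFFER_BYTES;
          return arg;
      }
      ```

      Calling run_with_stack with, say, a 16 MB stack and touch_big_buffer as the worker lets the large local buffer fit. The right stack size for a real workload is something you would have to measure.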

      Under 10.4, threads had a default stack size of 512K.  RegexKit uses alloca() to get its stack allocations, and should fail gracefully or fall back to malloc() if alloca() returns NULL.  Whether alloca() ever returns NULL is sorta iffy (answering that involves digging into the guts of the compiler, but my hunch is that it never does; it always 'succeeds').  Let me poke around at a few things and see if something obvious pops out.  The solution will involve tweaking the allocation strategies for certain things, or adding workarounds to sense the stack size and choose accordingly.
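
      Since alloca() generally has no way to report failure, a fallback cannot rely on a NULL return; a common alternative is to pick the allocator by request size. This is a minimal sketch of that idea in plain C, not RegexKit's actual code; STACK_SCRATCH_MAX and reverse_into are hypothetical names, and the 4096-byte threshold is an arbitrary assumption.

      ```c
      #include <alloca.h>
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical threshold: requests above this go to malloc(). */
      #define STACK_SCRATCH_MAX 4096

      /* Writes the bytes of src (len bytes) into dst in reverse order, using
         a scratch copy that lives on the stack for small inputs and on the
         heap for large ones. Sets *used_heap to 1 or 0 accordingly.
         Returns 0 on success, -1 if the heap allocation fails. */
      int reverse_into(const char *src, size_t len, char *dst, int *used_heap)
      {
          int on_heap = (len > STACK_SCRATCH_MAX);
          char *scratch;
          size_t i;

          if (on_heap) {
              scratch = malloc(len);
              if (scratch == NULL) return -1;   /* only malloc() can say no */
          } else {
              /* Freed automatically when this function returns. */
              scratch = alloca(len + 1);
          }

          memcpy(scratch, src, len);
          for (i = 0; i < len; i++) dst[i] = scratch[len - 1 - i];

          if (used_heap) *used_heap = on_heap;
          if (on_heap) free(scratch);
          return 0;
      }
      ```

      The alloca() call has to sit in the function that uses the memory, because the space disappears when that function returns; that is why the choice is made inline rather than inside a separate allocation helper.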

      alloca() is such a huge win for speed that it's worth the trouble.  A 'call' to alloca() (it's more of a compiler intrinsic) does its work and 'returns' before even the first instruction of the first line of code of malloc() executes, to say nothing of actually finding and allocating a piece of memory (alloca() essentially just adjusts the stack pointer by the amount requested and returns the address of the space that opens up).

       
      • mchartier

        mchartier - 2007-11-01

        Thanks for the response.

        For the record, I lifted the code out and placed it into a standalone app on 10.5 and get the same fault, so we can eliminate the quick look portion of the project.

        I understand your use of the stack for performance reasons, but doesn't that decision severely impact the usefulness of your framework? I mean, as a user of the framework and specific methods, I have no idea what stack overhead each particular method might require. So when I make this call, I am always rolling the dice: either it will work, or it will seg fault.

        I would think that the framework should not crash regardless of the length of the string it's dealing with. I, for one, would rather sacrifice some speed for robustness.

         
        • John Engelhart

          John Engelhart - 2007-11-01

          Why don't you send me a copy of the stack trace, maybe that'll have some clues in it.  jengelhart at users dot sourceforge dot net.

          As to the usefulness question, no, not really.  The requests for space are done such that if alloca() cannot fulfill the request, then it uses an alternative allocation method (in RegexKit's case, it creates an autoreleased NSMutableData object of the required size, which simplifies memory management and avoids leaks).  PCRE itself also makes heavy use of the stack because the regex matching algorithm is recursive, so some seemingly simple patterns can consume vast amounts of memory (see http://regexkit.sourceforge.net/Documentation/pcre/pcrestack.html which, coincidentally, uses an XML parsing pattern as an example).  Depending on the specific method and needs in question, I may start with a large fixed-size array on the stack (say, at the entry of a method, an "NSRange matches[4096]").  When the fixed array is exceeded, it makes a malloc allocation of twice the size, copies the old work over, resets the number of free slots left, and keeps going until it needs more memory, at which point it repeats the malloc/copy/continue cycle.
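
          The fixed-array-then-double pattern described above can be sketched roughly as follows. This is a plain C illustration under assumptions, not RegexKit's code: Range stands in for NSRange, and FIXED_SLOTS and count_matches are hypothetical names. It "matches" a single byte instead of a regex so that the example stays self-contained.

          ```c
          #include <stdlib.h>
          #include <string.h>

          /* Stand-in for NSRange. */
          typedef struct { size_t location, length; } Range;

          /* Size of the fixed on-stack array, as in "NSRange matches[4096]". */
          #define FIXED_SLOTS 4096

          /* Records the range of every occurrence of byte c in s. Starts with
             a fixed stack array and, each time it fills up, moves to a malloc
             block of twice the capacity. Returns the number of matches, or
             (size_t)-1 if malloc fails. *capacity_out receives the final
             capacity so the growth is observable. */
          size_t count_matches(const char *s, size_t len, char c,
                               size_t *capacity_out)
          {
              Range fixed[FIXED_SLOTS];
              Range *ranges = fixed;
              size_t capacity = FIXED_SLOTS, count = 0, i;

              for (i = 0; i < len; i++) {
                  if (s[i] != c) continue;
                  if (count == capacity) {
                      /* Out of slots: allocate double, copy old work over. */
                      Range *bigger = malloc(2 * capacity * sizeof *bigger);
                      if (bigger == NULL) {
                          if (ranges != fixed) free(ranges);
                          return (size_t)-1;
                      }
                      memcpy(bigger, ranges, count * sizeof *bigger);
                      if (ranges != fixed) free(ranges);
                      ranges = bigger;
                      capacity *= 2;
                  }
                  ranges[count].location = i;
                  ranges[count].length = 1;
                  count++;
              }

              if (capacity_out) *capacity_out = capacity;
              if (ranges != fixed) free(ranges);
              return count;
          }
          ```

          With fewer than 4096 matches, the function never touches malloc at all, which is the common case the fixed array is meant to cover.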

          As for not crashing, I completely agree.  I've thrown some huge files (600-700K) at it, with tests that would search HTML files for tags with font=".*" and rewrite those tags with updated information, among other things.  In the particular case of the method we're talking about, stringByMatching:replace:withReferenceString:, not much stack space is used.  As a matter of fact, off the top of my head, the structures that it uses are fairly beefy fixed-size arrays declared at the top of a function, and if it exceeds the fixed array size, it copies everything over to a larger dynamically allocated block of memory.  The array is essentially an array of NSRanges, each paired with a pointer to the string in question, and it builds up a sequence of "append to the string we're creating the characters from this string at location x of length y" entries, at ~16 bytes for each "copy instruction".  Once it's reached the end, it then knows the length of the final, assembled string and makes a single request to malloc for that size and does the required copies.  So in reality, the size of the string isn't what's driving stack usage; it's the number of matches and replacements.  And each match and replacement only consumes a handful of bytes to keep track of.
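
          To make the "copy instruction" idea concrete, here is a minimal sketch in plain C. It is an illustration under assumptions rather than RegexKit's implementation: CopyOp and assemble are hypothetical names, and the struct layout is only a guess at "a pointer plus a range".

          ```c
          #include <stdlib.h>
          #include <string.h>

          /* One "copy instruction": append `length` bytes of `source`,
             starting at `location`, to the string being built. */
          typedef struct {
              const char *source;
              size_t location;
              size_t length;
          } CopyOp;

          /* Sums the lengths, makes a single malloc for the final size, then
             performs the copies in order. Returns a NUL-terminated string the
             caller must free(), or NULL if malloc fails. */
          char *assemble(const CopyOp *ops, size_t op_count)
          {
              size_t total = 0, offset = 0, i;
              char *out;

              for (i = 0; i < op_count; i++) total += ops[i].length;

              out = malloc(total + 1);
              if (out == NULL) return NULL;

              for (i = 0; i < op_count; i++) {
                  memcpy(out + offset, ops[i].source + ops[i].location,
                         ops[i].length);
                  offset += ops[i].length;
              }
              out[total] = '\0';
              return out;
          }
          ```

          For the input "x<a>y", wrapping the tag in bold would be five instructions: "x" from the input, "<b>" from the replacement, "<a>" from the input, "</b>" from the replacement, and "y" from the input. The bookkeeping grows with the number of matches, not with the length of the input.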

          Sooo, the short answer is that things are written so that a fixed-size stack allocation covers 99% of all use cases, with a switch to traditional memory allocation when the fixed sizes are exceeded, and in all cases it follows a stack -> malloc -> fail (return NULL, throw exception) fallback strategy.

          It's actually fairly robust.  The code itself targets two different APIs (C-based Core Foundation and regular Objective-C-based Foundation) and works on two different implementations of Foundation (Apple and GNUstep), on little and big endian (ppc, x86), 32 and 64 bits, and different OSes (whatever GNUstep is running on).  And the multithreading tests execute all the individual unit tests sequentially on up to 20 threads concurrently while constantly having the regex cache flushed randomly, and end up with zero memory leaks (or crashes) on all platforms.  This obviously doesn't do you any good with your problem, :), I'm just saying I've tried to really hammer it and put a lot of effort into it.  As the 0.3 version indicates, it's still new, and you're the first person to post, so there's not much widespread usage yet.  :)

           
