#9 Insane memory usage (over 2GB)

Milestone: v0.6.0 Beta
Status: open-accepted
Priority: 8
Updated: 2008-04-01
Created: 2008-04-01
Creator: exscape
Private: No

When applying a semi-advanced regex to a long string (about 400 kB), RegexKit gobbles up all available memory, and then some. If I don't quit it in time, OS X keeps swapping and swapping, using at least another GB. I haven't seen it finish yet.

Perl finishes the very same regex, on the very same text, in 16 milliseconds (on a slower computer ;).
Test case:

-(IBAction) goButton: (id)sender {
    RKRegex *regex = [RKRegex regexWithRegexString:@"<td width=\"60%\">\\s*<a href=\"artist\\.php\\?aid=(\\d+)\">(.*?)</a>\\s*</td>"
                                           options:RKCompileDotAll];

    NSData *data = [NSData dataWithContentsOfURL:[NSURL URLWithString:@"http://www.songmeanings.net/artist.php?letter=d"]];
    NSString *html = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    if(!data || !html) return;

    RKEnumerator *enu = [html matchEnumeratorWithRegex:regex];
    NSLog(@"ENTERING LOOP");
    while ([enu nextRanges] != NULL) {
    }
    NSLog(@"DONE");
}

Same problem in 0.6.0 and SVN (as of about 16 hours ago).

Discussion

  • exscape

    exscape - 2008-04-01
    • priority: 5 --> 8
     
  • John Engelhart

    John Engelhart - 2008-04-01
    • status: open --> open-accepted
     
  • John Engelhart

    John Engelhart - 2008-04-01

    Logged In: YES
    user_id=1879513
    Originator: NO

    A quick status update:

    Took a look at the problem. You hit on something I was planning on addressing in the next release of RegexKit. Unfortunately, you ran into it a bit before I've had time to address it.

    You've hit a snag with the whole UTF-8 / UTF-16 back and forth conversion issue. When I added an NSLog statement to check the string's length and fastestEncoding, I got back:

    string length: 403869 Unicode (UTF-16)

    This means a fairly large string is getting converted back and forth between UTF-8 and UTF-16 for each matching operation (or, in this case, every call to nextRanges). This is pretty painful.

    The following is an extremely crude, not thread safe AT ALL, but effective temporary band-aid for your problem. Again, it's not multi-threaded safe at all, so be sure RegexKit is only being used by one thread at a time (fairly typical, in practice). In the file NSStringPrivate.h, look for the following two lines:

    ----
    RKREGEX_STATIC_INLINE RKStringBuffer RKStringBufferWithString(NSString * const string) RK_ATTRIBUTES(nonnull(1), const);

    RKREGEX_STATIC_INLINE RKStringBuffer RKStringBufferWithString_(NSString * const RK_C99(restrict) string) {
    ----

    And replace them with:

    ----
    RKREGEX_STATIC_INLINE RKStringBuffer RKStringBufferWithString(NSString * const string) RK_ATTRIBUTES(nonnull(1), const);
    RKREGEX_STATIC_INLINE RKStringBuffer RKStringBufferWithString_(NSString * const string) RK_ATTRIBUTES(nonnull(1), const);

    RKREGEX_STATIC_INLINE RKStringBuffer RKStringBufferWithString(NSString * const RK_C99(restrict) string) {
      static RKStringBuffer lastStringBuffer;

      if(string != lastStringBuffer.string) { lastStringBuffer = RKStringBufferWithString_(string); }
      return(lastStringBuffer);
    }

    RKREGEX_STATIC_INLINE RKStringBuffer RKStringBufferWithString_(NSString * const RK_C99(restrict) string) {
    ----

    This forms an effective, but very kludgey, "last conversion cache". When I tested this, the test case now completes successfully, and although it takes a few seconds on my 1.5GHz G4 laptop, it's no longer a memory pig. The reason it takes a few seconds is that RegexKit must constantly convert the PCRE UTF-8 range results into UTF-16 and vice versa, which means scanning the string up to the point of interest, and that takes longer and longer as the matches move further into the string. To fix this bottleneck, you can modify the file RKUnicode.m (in the latest SVN checkin) and alter the UTF-8 <-> UTF-16 conversion routines to cache the tail of the last conversion and pick things up from there.
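For readers without the RegexKit source at hand, the "last conversion cache" idea can be illustrated in isolation. The following is only a minimal sketch in plain C, not RegexKit code: `string_buffer`, `convert_slow`, `buffer_for_string` and the `conversions` counter are hypothetical names made up for the illustration, and the "conversion" is just a copy standing in for the real UTF-16 to UTF-8 work.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for RKStringBuffer: the source string plus its converted form. */
typedef struct {
    const char *string;      /* identity of the source string, compared by pointer */
    char       *characters;  /* converted copy, owned by the cache */
    size_t      length;      /* length of the converted copy in bytes */
} string_buffer;

unsigned long conversions = 0;  /* instrumentation: how often the slow path ran */

/* Slow path: a plain copy standing in for a full encoding conversion. */
string_buffer convert_slow(const char *string) {
    string_buffer b;
    b.string = string;
    b.length = strlen(string);
    b.characters = malloc(b.length + 1);
    if (b.characters != NULL) { memcpy(b.characters, string, b.length + 1); }
    conversions++;
    return b;
}

/* Single-entry cache: convert again only when a different string is passed in.
   Like the band-aid above, this is NOT thread safe. */
string_buffer buffer_for_string(const char *string) {
    static string_buffer last;  /* zero-initialized, so the first call always misses */
    assert(string != NULL);
    if (string != last.string) {
        free(last.characters);  /* free(NULL) is a no-op on the first call */
        last = convert_slow(string);
    }
    return last;
}
```

Calling `buffer_for_string` a thousand times with the same pointer runs the slow path once. Note that the cache keys on pointer identity, so it can return stale data if the original string is freed and a different string is later allocated at the same address.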

    In RKConvertUTF8ToUTF16RangeForStringBuffer, find the lines that start with the RK_PROBE()... and add

    ----
    RK_PROBE(PERFORMANCENOTE, NULL, 0, NULL, 0, -1, 1, "UTF8 to UTF16 requires slow conversion.");
    const unsigned char RK_STRONG_REF *p = (const unsigned char *)stringBuffer->characters;
    NSRange utf16Range = NSMakeRange(NSNotFound, 0);
    RKUInteger utf16len = 0;

    static NSRange lastUTF16Range, lastUTF8Range;
    static RKUInteger lastUTF16Len;
    static NSString *lastString;
    static const unsigned char RK_STRONG_REF *lastP;

    if((stringBuffer->string == lastString) && (NSMaxRange(lastUTF8Range) <= utf8Range.location)) { utf16Range = lastUTF16Range; p = lastP; utf16len = lastUTF16Len; }
    ----

    And in RKConvertUTF16ToUTF8RangeForStringBuffer, essentially the same thing, but flipped around for the inverse conversion:

    ----
    RK_PROBE(PERFORMANCENOTE, NULL, 0, NULL, 0, -1, 1, "UTF16 to UTF8 requires slow conversion.");

    const unsigned char RK_STRONG_REF *p = (const unsigned char *)stringBuffer->characters;
    NSRange utf8Range = NSMakeRange(NSNotFound, 0);
    RKUInteger utf16len = 0;

    static NSRange lastUTF16Range, lastUTF8Range;
    static RKUInteger lastUTF16Len;
    static NSString *lastString;
    static const unsigned char RK_STRONG_REF *lastP;

    if((stringBuffer->string == lastString) && (NSMaxRange(lastUTF16Range) < utf16Range.location)) { utf8Range = lastUTF8Range; p = lastP; utf16len = lastUTF16Len; }
    ----

    In BOTH RKConvertUTF8ToUTF16RangeForStringBuffer and RKConvertUTF16ToUTF8RangeForStringBuffer, just before the return() statement, add:

    ----
    lastString = stringBuffer->string;
    lastP = p;
    lastUTF8Range = utf8Range;
    lastUTF16Range = utf16Range;
    lastUTF16Len = utf16len;

    return(....);
    ----

    On my G4 laptop, without the UTF tweaks, it takes about 37.429 seconds, and with the UTF conversion tweaks, it takes about 2.420 seconds according to the NSLog time stamps.
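The "pick up from the tail of the last conversion" technique can be shown outside of RegexKit as well. This is a rough sketch in plain C under stated assumptions: the input is well-formed UTF-8, the requested offset lies on a character boundary within the buffer, and `utf_checkpoint`, `utf16_offset_for_utf8_offset` and `chars_scanned` are hypothetical names, not part of RegexKit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical checkpoint: where the previous conversion stopped. */
typedef struct {
    const unsigned char *text;  /* buffer the checkpoint belongs to */
    size_t utf8_offset;         /* bytes already scanned */
    size_t utf16_offset;        /* UTF-16 code units those bytes map to */
} utf_checkpoint;

unsigned long chars_scanned = 0;  /* instrumentation only */

/* Convert a UTF-8 byte offset into a UTF-16 code unit offset.
   Assumes well-formed UTF-8 and an offset on a character boundary. */
size_t utf16_offset_for_utf8_offset(const unsigned char *text, size_t utf8_offset,
                                    utf_checkpoint *cp) {
    size_t i = 0, units = 0;
    assert(text != NULL && cp != NULL);

    /* Resume only if the checkpoint is for this buffer and not past the target. */
    if (cp->text == text && cp->utf8_offset <= utf8_offset) {
        i = cp->utf8_offset;
        units = cp->utf16_offset;
    }

    while (i < utf8_offset) {
        unsigned char c = text[i];
        if (c < 0x80)      { i += 1; units += 1; }  /* ASCII */
        else if (c < 0xE0) { i += 2; units += 1; }  /* 2-byte sequence */
        else if (c < 0xF0) { i += 3; units += 1; }  /* 3-byte sequence */
        else               { i += 4; units += 2; }  /* 4-byte sequence, surrogate pair */
        chars_scanned++;
    }

    /* Remember the tail so the next call can start here instead of at 0. */
    cp->text = text;
    cp->utf8_offset = i;
    cp->utf16_offset = units;
    return units;
}
```

With a zero-initialized `utf_checkpoint`, a sequence of increasing offsets scans each character once in total. An offset that lies before the checkpoint falls back to a scan from the start of the buffer, which is the same restriction the `NSMaxRange(...) <=` test in the patch expresses.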

    This should at least allow you to limp by until a better, more permanent solution is developed.

    PS- these changes were made against the last SVN check in. If you're using the 0.6 source, the UTF-8 <-> UTF-16 conversion routines are kept in NSString.m. They were moved to RKUnicode just after 0.6 / latest SVN check in.

    Since the string is in UTF-16 form to begin with, I gave RegexKitLite a shot at it since it can only deal with UTF-16 encoded strings (the opposite of PCRE, which can only deal with UTF-8), and doesn't have to do any kind of UTF8 <-> UTF16 type range conversions:

    ----
    #import <Foundation/NSAutoreleasePool.h>
    #import <Foundation/NSData.h>
    #import <Foundation/NSURL.h>
    #import "RegexKitLite.h"

    void bug(void);

    int main(int argc, char *argv[]) {
      NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

      bug();

      [pool release]; pool = NULL;
      return(0);
    }

    void bug(void) {
      NSString *regex = @"<td width=\"60%\">\\s*<a href=\"artist\\.php\\?aid=(\\d+)\">(.*?)</a>\\s*</td>";

      NSData *data = [NSData dataWithContentsOfURL:[NSURL URLWithString:@"http://www.songmeanings.net/artist.php?letter=d"]];
      NSString *html = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];

      if(!data || !html) return;

      NSLog(@"ENTERING LOOP");
      NSRange matchedRange, searchRange = NSMakeRange(0, [html length]);

      while((matchedRange = [html rangeOfRegex:regex options:RKLDotAll inRange:searchRange capture:0 error:NULL]).location != NSNotFound) {
        searchRange.length -= NSMaxRange(matchedRange) - searchRange.location;
        searchRange.location = NSMaxRange(matchedRange);
        if(matchedRange.length == 0) { searchRange.location++; searchRange.length--; }
      }
      NSLog(@"DONE");
    }
    ----

    [johne@LAPTOP_10_5] rkl% gcc -o matching_test matching_test.m RegexKitLite.m -framework Foundation -licucore -Os
    [johne@LAPTOP_10_5] rkl% ./matching_test
    2008-04-01 14:11:28.352 matching_test[30441:807] ENTERING LOOP
    2008-04-01 14:11:28.384 matching_test[30441:807] DONE

    Or 0.032 seconds. If you need the speed right now, this is your best bet. You should be able to use both RegexKit and RegexKitLite at the same time with no or very few modifications. I don't think any of the methods overlap, but you'd have to double check.

     
  • John Engelhart

    John Engelhart - 2008-04-01

    Logged In: YES
    user_id=1879513
    Originator: NO

    Took a look at why things were taking so long with Shark, and the answer popped out. The way that the UTF8 <-> UTF16 conversion is cached is sub-optimal if the regex contains any capture groups.

    Using the last recommended code changes, replace the part that caches the last UTF range (just before the return statement in both the 8to16 and 16to8 functions) with:

    if((lastString != stringBuffer->string) || ((NSMaxRange(utf8Range) - NSMaxRange(lastUTF8Range)) > 2048)) {
      lastString = stringBuffer->string;
      lastP = p;
      lastUTF8Range = utf8Range;
      lastUTF16Range = utf16Range;
      lastUTF16Len = utf16len;
    }

    This acts as a decent filter to cut down on unnecessary conversions. Timing drops to:

    2008-04-01 15:05:43.792 hog[30918:807] ENTERING LOOP
    2008-04-01 15:05:44.234 hog[30918:807] DONE

    Or 0.442 seconds, which is a decent improvement. It's still sub-optimal, but it's now within the realm of "reasonable".
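To get a feel for why filtering the cache updates helps, here is a small cost model in plain C. It is only a sketch under loud assumptions: the thread does not spell out the exact access pattern, so the one used here (each match converts the end of the whole match, then the start of a capture group that lies earlier in the string) is a guess at what capture groups cause, and `checkpoint`, `scan_to` and `simulate` are hypothetical names. ASCII-only text is assumed, so UTF-8 and UTF-16 offsets coincide and only the number of characters scanned is counted.

```c
#include <assert.h>
#include <stddef.h>

#define CHECKPOINT_STRIDE 2048  /* same threshold as the patch above */

/* Hypothetical checkpoint: the buffer and the offset of the last cached conversion. */
typedef struct { const char *text; size_t offset; } checkpoint;

unsigned long chars_scanned = 0;  /* instrumentation only */

/* Model one range conversion ending at `target`.
   If `filtered` is zero, the checkpoint moves on every call (first patch).
   If nonzero, it moves only when the buffer changes or the unsigned difference
   exceeds CHECKPOINT_STRIDE. size_t subtraction wraps around, so a target that
   lies behind the checkpoint also moves it. */
void scan_to(const char *text, size_t target, checkpoint *cp, int filtered) {
    size_t start = 0;
    if (cp->text == text && cp->offset <= target) { start = cp->offset; }  /* resume */
    chars_scanned += (unsigned long)(target - start);

    if (!filtered || cp->text != text ||
        (size_t)(target - cp->offset) > CHECKPOINT_STRIDE) {
        cp->text = text;
        cp->offset = target;
    }
}

/* Assumed access pattern: 1000 matches, 100 characters apart. */
unsigned long simulate(int filtered) {
    static const char text[1] = "";  /* only the address is used as an identity */
    checkpoint cp = { NULL, 0 };
    chars_scanned = 0;
    for (size_t m = 0; m < 100000; m += 100) {
        scan_to(text, m + 80, &cp, filtered);  /* end of the whole match */
        scan_to(text, m + 20, &cp, filtered);  /* capture group, earlier in the string */
    }
    return chars_scanned;
}
```

In this model the unfiltered cache is invalidated by every capture group, because the capture lies behind the cached tail and forces a scan from the start of the string. The filtered version keeps a checkpoint that trails the matches by up to 2048 characters, so most conversions only scan a short distance.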

     
