From: Lynn A. <l_d...@ad...> - 2006-04-17 19:42:56
I'm attempting to put together a "hybrid C class/ADT" for a simple USA zip code scanner. I'm very much a re2c newbie and "not the brightest bulb in the box", so feedback is appreciated. I realize there is almost certainly lots of room for improvement from re2c veterans.

The intent is to be a relatively simple template from which to "clone". It attempts to use semi-standardized regex notation: Match, Pos, and Search. 'Scan' is low level and is used by Match and Search.

The (very preliminary) API for ZipCodeRe (abbreviated Zcr) is:

- void ZcrConstruct(const char* pzStrToScan, const int lenStrToScan);
  // Might not require len and could calculate it instead. Or -1 might indicate that Construct should use strlen to calculate it, as a user convenience?
- void ZcrDestruct(void);
  // Empty for this regex. Would apply if malloc/calloc were used.
- BOOL ZcrBasicScan(void);
  // Low level scanner in a separate ZcrBasicScan.re file. It returns 1=TRUE if a valid zip-code is at the beginning of pzStrToScan. There can be characters after the zip-code, which distinguishes it from 'Match'. This is the ONLY .re file; everything else is .c and .h to facilitate debugging. Once ZcrBasicScan is solid, you should be able to ignore it from that point on.
- BOOL ZcrMatch(void);
  // Uses ZcrBasicScan. Must 'Match' a valid zip-code exactly at the start of null terminated pzStrToScan.
- int ZcrSearch(void);
  // Uses ZcrBasicScan to do the equivalent of strstr. Returns pos/offset of where found, or -1.
- inline int ZcrGetMatchPos(void);
  // Returns offset/pos of where Match (or Search) found. Will be 0 for Match, and -1 if not found for Search.
- inline int ZcrGetMatchLength(void);
  // Returns length of Match. Will be 5 or 10 for zip-code.
- inline char* ZcrGetMatchStr(int* pLen);
  // Returns ptr to start of Match in the actual buffer supplied to ZcrConstruct. Not null terminated. As a convenience, also returns len of matchStr, for subsequent use with snprintf, strncpy, strncmp, etc.
- inline char* ZcrGetMatchCopy(char pzCopyBuf[]);
  // Returns null terminated copy of Match in user supplied buffer. Maybe have something like ZcrGetMatchSubstr()???

For best performance, there are almost no parameters for the functions. The ZcrConstruct function has the equivalent of static class variables. My understanding is that this will improve performance of the ZcrSearch function, which will typically be the most time-consuming and the biggest factor in having better performance than other libraries such as pcre, boost::regex, boost::xpressive, boost::spirit, etc.

Some other comments:

* USA-centric zip code, but my thinking is that most people will know what they are, and it is pretty simple, but not completely basic, since both the 12345 and 12345-6789 patterns match.
* My inclination is to use C rather than C++. Data hiding and inline would be easier and more natural with C++, however. The Lessons could include usage with MFC and/or std::string to illustrate how to do this. I suppose there could be a separate ZcrMfcZipCode and ZcrStdZipCode.
* I'm not a fan of Hungarian/Simonyi variable naming, except I find it helps clarify strings: is it a null terminated char[] (_pzStrToScan), an MFC CString (csStrToScan), or std::string (strStrToScan)?
* __forceinline is Microsoft specific for C source code, but inline and/or __inline may be portable. The ZcrGetMatch* functions benefit from being inline, and reduce the possible performance decline from using the absolutely basic, stripped to the minimum ZcrBasicScan. Otherwise, I'd be inclined to capture the MatchLength, MatchPos, and MatchStr inside the ZcrBasicScan function.
  - Generally I like to minimize the use of #define for portability, but there could be something like:

        #ifdef WIN32
        #define __INLINE __forceinline
        #else
        #define __INLINE
        #endif

* If the fastest code is required, could modify ZcrSearch and/or ZcrMatch to have the scanner embedded in the function's loop, and duplicate the code used by ZcrBasicScan. Depending on the app, the matchLength, matchPos, and matchStr could be done in ZcrSearch without using the inline accessor functions. Probably not too significant an improvement, but "it depends."
* My very limited experience is that YYLIMIT works fine without the length. My inclination is to leave it off because the app may supply the length, and then it is redundant and slows the scan (slightly, but people will probably be using re2c for performance).

        #define YYLIMIT pzStrToScan

* Might use _DEBUG to have extensive cppunit-like tests to validate it works ok. Or a separate "helper" function or separate file?
* "Package" will be something like seven files:
  - ZcrCommon.h: has things like BOOL, TRUE, FALSE, etc.
  - ZcrBasicScan.re: has bare-bones scanner with nothing beyond the absolute minimum
  - ZcrBasicScan.h: signature for ZcrBasicScan, to be included by ZipCodeRe.c
  - ZipCodeRe.c: routines for Construct, Match and Search (maybe Replace eventually?)
  - ZipCodeRe.h: signatures for ZipCodeRe.c
  - ZcrTest.c: cppunit-like testing and maybe HiResTimer. Has main and separate project files.
  - ZcrConsoleApp.c: actual program that exercises ZipCodeRe
  - maybe ZcrMfcConsoleApp.c
  - maybe ZcrStdStringConsoleApp.c
* Possibly not thread safe??? Haven't looked at it with this in mind.
* Sample usage:

        #include <stdio.h>
        #include "ZipCodeRe.h"

        int main(void)
        {
            int pos;

            ZcrConstruct("This has zip-code 12345-6789 in it at pos = 18.", 47);
            pos = ZcrSearch();
            if (pos >= 0) {
                int len;
                char* pMatchStr = ZcrGetMatchStr(&len);
                printf("ZipCode: %.*s at pos: %d\n", len, pMatchStr, pos);
            } else {
                printf("Not found\n");
            }
            return 0;
        }
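
To make the proposed API easier to discuss, here is a rough sketch of how the pieces might fit together in a single C file. It is only an illustration under several assumptions, not the actual ZipCodeRe code: the helper ZcrScanAt is a hypothetical, hand-written stand-in for what the re2c-generated ZcrBasicScan.re scanner would do; plain int is used where ZcrCommon.h would supply BOOL; ZcrGetMatchStr returns const char* because the buffer passed to ZcrConstruct is const; "Match" is read as "the whole string is a zip-code"; and a length of -1 means "use strlen", which the post only floats as an idea.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Static "class variables" shared by all Zcr functions (not thread safe). */
static const char *s_pzStr;     /* buffer handed to ZcrConstruct */
static int s_len;               /* length of that buffer */
static int s_matchPos = -1;     /* offset of the last match, or -1 */
static int s_matchLen;          /* 5 or 10 after a match, else 0 */

static void ZcrSetMatch(int pos, int len) { s_matchPos = pos; s_matchLen = len; }

void ZcrConstruct(const char *pzStrToScan, int lenStrToScan)
{
    assert(pzStrToScan != NULL);
    s_pzStr = pzStrToScan;
    /* Assumption: -1 asks Construct to compute the length itself. */
    s_len = (lenStrToScan < 0) ? (int)strlen(pzStrToScan) : lenStrToScan;
    ZcrSetMatch(-1, 0);
}

/* Hypothetical stand-in for the re2c-generated scanner.
   Returns 10 for "ddddd-dddd", 5 for "ddddd", 0 if no zip-code starts at p. */
static int ZcrScanAt(const char *p, const char *limit)
{
    int i;
    if (limit - p < 5) return 0;
    for (i = 0; i < 5; i++)
        if (!isdigit((unsigned char)p[i])) return 0;
    if (limit - p >= 10 && p[5] == '-') {
        for (i = 6; i < 10; i++)
            if (!isdigit((unsigned char)p[i])) return 5;
        return 10;
    }
    return 5;
}

/* 1 if a zip-code starts the buffer; trailing characters are allowed. */
int ZcrBasicScan(void)
{
    return ZcrScanAt(s_pzStr, s_pzStr + s_len) != 0;
}

/* 1 only if the whole buffer is exactly one zip-code. */
int ZcrMatch(void)
{
    int n = ZcrScanAt(s_pzStr, s_pzStr + s_len);
    if (n == 0 || n != s_len) { ZcrSetMatch(-1, 0); return 0; }
    ZcrSetMatch(0, n);
    return 1;
}

/* strstr-like search: offset of the first zip-code, or -1. */
int ZcrSearch(void)
{
    const char *limit = s_pzStr + s_len;
    int pos;
    for (pos = 0; pos + 5 <= s_len; pos++) {
        int n = ZcrScanAt(s_pzStr + pos, limit);
        if (n != 0) { ZcrSetMatch(pos, n); return pos; }
    }
    ZcrSetMatch(-1, 0);
    return -1;
}

int ZcrGetMatchPos(void)    { return s_matchPos; }
int ZcrGetMatchLength(void) { return s_matchLen; }

/* Pointer into the original buffer (not null terminated), or NULL. */
const char *ZcrGetMatchStr(int *pLen)
{
    *pLen = s_matchLen;
    return (s_matchPos >= 0) ? s_pzStr + s_matchPos : NULL;
}

/* Null terminated copy; pzCopyBuf must hold at least 11 bytes. */
char *ZcrGetMatchCopy(char pzCopyBuf[])
{
    if (s_matchPos >= 0) memcpy(pzCopyBuf, s_pzStr + s_matchPos, (size_t)s_matchLen);
    pzCopyBuf[s_matchLen] = '\0';
    return pzCopyBuf;
}
```

With the sample string from the usage example, ZcrSearch() in this sketch returns 18 and ZcrGetMatchLength() returns 10. Keeping the match state in file-level statics is what lets the functions take almost no parameters, and it is also the reason the sketch is not thread safe.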