From: Pascal J. B. <pj...@in...> - 2016-06-19 17:55:15
Daniel Jour <dan...@gm...> writes:

> I'm currently working on the regexp module (I'm still here in case
> anyone wondered). I've a few questions / suggestions:
>
> I think it would be good to have regexp-exec fail when there are more
> matches than the number of possible return values and it was specified
> to return the matches as multiple values.

This would go contrary to the CL specification, which says that when a
function returns more values than expected, the superfluous values are
just ignored:

    (setf r (truncate 3 2)) ; would fail otherwise…

> Currently, regexp-exec then
> just returns the results as a list:
>
> // ...
> switch (rettype) {
>   case ret_values:
>     if (re_count < fixnum_to_V(Symbol_value(S(multiple_values_limit)))) {
>       STACK_to_mv(re_count);
>       break;
>     } /* else FALLTHROUGH */
>   case ret_list: VALUES1(listof(re_count)); break;
> // ...
>
> This could lead to surprises, when e.g. calling:
>
> (multiple-value-list (match some-pattern some-string))
>
> When match returns multiple values, one gets a list with the
> matches. But when - because of a different pattern - there are more
> matches, then suddenly this returns a list with the list of matches,
> which could confuse code using the result. It's probably not that
> critical as one needs to have a pattern with 127 sub-expressions
> or more, though.

Indeed, this behavior is bad. IMO, it should always return multiple
values, or always return a sequence.

Returning groups in multiple values
----------------------------------------

In CLISP, multiple-values-limit is 128 (the standard only requires it
to be at least 20).

The number of groups in a POSIX regexp is not limited (only the number
of groups that can be back-referenced is limited, to 9).

Furthermore, the caller of regexec specifies the number of matches it
expects, through the nmatch argument. regexec will not return
information about the remaining groups.

These semantics are very compatible: just limit nmatch to
multiple-values-limit or the actual number of expected values.
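To illustrate the nmatch clamping, here is a minimal C sketch against
the plain POSIX API. It is not the CLISP module code: match_groups and
MAX_GROUPS are hypothetical names, and MAX_GROUPS merely stands in for
multiple-values-limit.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical cap, standing in for CLISP's multiple-values-limit. */
#define MAX_GROUPS 128

/* Match `text` against the POSIX extended regexp `pattern`.
 * At most `cap` entries of `out` are filled (further clamped to
 * MAX_GROUPS); entry 0 is the whole match, entry i is group i.
 * Returns the number of entries filled, 0 if there is no match
 * (or regexec reports an error), -1 if the pattern does not compile. */
int match_groups(const char *pattern, const char *text,
                 regmatch_t *out, size_t cap)
{
    regex_t re;
    size_t nmatch;
    int rc;

    assert(pattern != NULL && text != NULL && out != NULL && cap >= 1);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    nmatch = re.re_nsub + 1;      /* whole match plus every group */
    if (nmatch > cap)
        nmatch = cap;             /* what the caller asked for */
    if (nmatch > MAX_GROUPS)
        nmatch = MAX_GROUPS;      /* what the return convention allows */

    /* regexec fills only the first nmatch entries; the remaining
     * groups are simply not reported. */
    rc = regexec(&re, text, nmatch, out, 0);
    regfree(&re);
    return rc == 0 ? (int)nmatch : 0;
}
```

Calling it with cap = 2 on a pattern with three groups fills only the
whole match and the first group, and the other groups are dropped
silently, which mirrors how CL discards surplus values.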
(nmatch = 1 + re_nsub; this includes the group representing the whole
regexp.)

Returning groups in sequences
----------------------------------------

However, if we wanted to match regexps with more than 127 groups, we'd
have to return the group matches in vectors.

cl-ppcre returns 4 values, including two vectors to hold the group
matches: http://weitz.de/cl-ppcre/#scan

In Emacs Lisp, the results are stored in a (hidden) global variable
that can be accessed through functions such as match-beginning,
match-end, and match-string, so basically in an internal sequence of
unlimited size.

On one hand, multiple values are a little more practical to use, and
it's rare to have more than 127 groups in a regexp.

On the other hand, returning vectors is a common API, and allows for
any number of groups without limitation. This is also basically what
the underlying C API does.

In conclusion, it might be better to return the same thing as
cl-ppcre:scan: match-start, match-end, reg-starts, reg-ends.

> Next, currently regexp-compile returns a foreign pointer, and
> regexp-exec expects a foreign pointer. Thus I could supply regexp-exec
> with a foreign pointer acquired from some other function, resulting in
> undefined behavior (because regexp-exec blindly assumes it to be a
> regex_t * ). AFAIK the policy for CLISP is to not just crash on such
> errors. As a solution, I'd wrap the foreign pointer in a (defstruct
> pattern pointer). Are there any objections against this?

Since the name of the C structure is regex_t, and since it contains
one useful slot (re_nsub), perhaps you could name the type regex, and
provide a regex-nsub reader.

-- 
__Pascal Bourguignon__ http://www.informatimago.com/
“The factory of the future will have only two employees, a man and a
dog. The man will be there to feed the dog. The dog will be there to
keep the man from touching the equipment.” -- Carl Bass CEO Autodesk