On Wednesday 23 January 2008 08:46, Michael Gerdau wrote:
> > This is true, of course, but for the sake of five extra characters
> > in the expression, do you not think it preferable to keep the
> > complete form? It may not really matter here, but if copied
> > elsewhere, the less complete form may just not suffice.
> I'm really not favouring either way. On one hand I do prefer simple
> if not trivial regexps -too many people do have difficulties
> understanding regexps anyway ;)
Well, here I fear we must agree to disagree, to some extent. Certainly,
I am in favour of keeping things simple, but IMO, simplification to the
point of sacrificing correctness may actually be counterproductive; if
anything it must surely lead to confusion.
I'm sure you, Michael, don't have a problem interpreting the regexps we
have been discussing, but for the benefit of other readers, who may be
less `clued-up' than we are, let's review the example in question. The
requirement is to match, and delete, a sequence of characters, none of
which is the assignment operator, and which is immediately followed by
one such operator, at the beginning of a text pattern. (In the event that
the one assignment operator is immediately followed by a second, that
second operator becomes the first character of the remaining pattern
after deletion of the initial substring.)
My first attempt at this used `?' as an arbitrary regex delimiter. It
says to match the pattern against the basic regular expression (the
flavour recognised by sed) delimited by the first pair of `?' marks,
and to substitute the text delimited by the second pair (the middle
`?' being common to both pairs); since the substitute text is the empty
string, this puts the empty string in place of the matched part
of the pattern, i.e. it effectively deletes it.
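To illustrate only the delimiter mechanics, here is a minimal sketch. The
helper name `strip_foo' and the literal prefix `foo=' are made up for the
example; they are not the expression under discussion.

```shell
#!/bin/sh
# Hypothetical helper: delete a literal leading `foo=' from its argument.
# `?' serves as the s-command delimiter; the replacement between the
# second and third `?' is empty, so the matched text is simply removed.
strip_foo() {
  printf '%s\n' "$1" | sed 's?^foo=??'
}

strip_foo 'foo=bar'                         # prints: bar

# The same command with the conventional `/' delimiter behaves identically.
printf '%s\n' 'foo=bar' | sed 's/^foo=//'   # prints: bar
```

Any character other than backslash or newline may serve as the delimiter;
`?' is convenient here only because it does not occur in the expression.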
This first attempt clearly isn't correct, for it matches exactly *one*
character which is *not* the assignment operator, (the `[^=]'), which
is followed by an assignment operator, (the second `='), and which is
located at the start of the pattern, (the initial `^'). This matches
an assignment such as `X=anything', with only a single character name
for the variable, but it fails for any variable name of more than one
character in length.
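A small sketch of that behaviour, assuming the first attempt had the form
`^[^=]=' (this is reconstructed from the description above, and the function
name `first_attempt' is mine):

```shell
#!/bin/sh
# Assumed form of the first attempt: start of pattern, exactly one
# character that is not `=', then `='; the match is replaced by nothing.
first_attempt() {
  printf '%s\n' "$1" | sed 's?^[^=]=??'
}

first_attempt 'X=anything'     # prints: anything
first_attempt 'VAR=anything'   # prints: VAR=anything  (no match, unchanged)
```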
Now, you suggested replacing my original expression with...
but this is also incorrect, because it now matches any sequence of
*zero* or more characters which are *not* the assignment operator, and
followed by one such operator, (the `*' means zero or more repeats of
the preceding expression, in this case the `[^=]'). Thus, this will
match the first `=' character it encounters, even if there is nothing
before it, and so it will incorrectly match an expression such as
`=anything', i.e. an assignment with the variable name missing!
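To make the zero-or-more problem concrete, here is a sketch assuming an
expression of the form `^[^=]*=' (whether the suggestion carried the leading
`^' is an assumption on my part; for these inputs the result is the same
without it):

```shell
#!/bin/sh
# Assumed form: zero or more non-`=' characters followed by `='.
zero_or_more() {
  printf '%s\n' "$1" | sed 's?^[^=]*=??'
}

zero_or_more 'VAR=anything'   # prints: anything
zero_or_more '=anything'      # prints: anything  (empty name accepted!)
```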
To correct this deficiency, we may add an additional `[^=]'...
and now there must be exactly one character, which is *not* `=', and
followed by zero or more additional such characters, then `='; (this is
literally how the expression is parsed; more succinctly described, it is
equivalent to `match a sequence of one or more characters *excluding*
`=', followed by an `=' character'). This now looks like it should do
the trick, for it now requires a variable name comprising at least one
character, but it will also allow for any longer name. However, this
*still* is not correct, for it will also match `ABC=anything' in a
pattern such as `=ABC=anything', leaving a substituted result of
`=anything', instead of just `anything'; to fix this properly, we
which you may notice is exactly what I proposed yesterday.
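The difference can be sketched as follows. Both expressions are my
assumptions: the unanchored `[^=][^=]*=' follows the description above, and
the anchored `^[^=][^=]*=' is one plausible way of tying the match to the
start of the pattern; the function names are hypothetical.

```shell
#!/bin/sh
# One or more non-`=' characters, then `=', matched anywhere in the input.
one_or_more_unanchored() {
  printf '%s\n' "$1" | sed 's?[^=][^=]*=??'
}

# The same expression, but required to match at the start of the input.
one_or_more_anchored() {
  printf '%s\n' "$1" | sed 's?^[^=][^=]*=??'
}

one_or_more_unanchored '=ABC=anything'   # prints: =anything
one_or_more_anchored   '=ABC=anything'   # prints: =ABC=anything  (no match)
one_or_more_anchored   'ABC=anything'    # prints: anything
one_or_more_anchored   'A==b'            # prints: =b
```

With the anchor, an input whose name part is empty is left untouched rather
than having a later `name=' stripped out of its middle.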
> -, on the other hand I see your point
> as well though I would argue that copying either regexp without
> checking whether it actually suits the new usecase is a bug :)
Well, yes, that's a valid point. However, it isn't just the problem of
maybe copying the ambiguous expression to a context where it is not
appropriate; it's in its very ambiguity that it fails to convey its
precise intent, so leading to diminished clarity of the code.
> Anyway, my proposal -without actually checking whether it works-
> would be to change the label in the case from
> or something to that meaning.
Hmm. `case' statements don't match on basic regexes, they match on
shell globbing patterns. The above would fail to match any variable
name of fewer than *two* characters; it needs to be...
to require an initial character which is *not* `=', followed by any
expression containing at least one `='. To preserve sanity, I would
even go further, and attempt to exclude other characters which are not
valid in variable names, by making it more restrictive...
(which isn't perfect, but perfection may be rather difficult to
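As a sketch of the globbing point: in a `case' label, `*' matches any string
by itself and is not a repeat count, and a negated bracket is written `[!=]'.
The patterns and function names below are my assumptions, chosen to fit the
description above; in particular `[A-Za-z_]*=*' is only one possible way of
being more restrictive.

```shell
#!/bin/sh
# One non-`=' character, then anything, then `=', then anything.
is_assignment() {
  case $1 in
    [!=]*=*) echo yes ;;
    *)       echo no ;;
  esac
}

# Hypothetical stricter variant: the first character must be a letter
# or underscore; the rest of the name is still not checked.
is_named_assignment() {
  case $1 in
    [A-Za-z_]*=*) echo yes ;;
    *)            echo no ;;
  esac
}

is_assignment 'X=1'           # prints: yes
is_assignment '=anything'     # prints: no
is_assignment 'novalue'       # prints: no
is_named_assignment '1X=2'    # prints: no
is_named_assignment '_x=2'    # prints: yes
```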
> The idea is to not get a runtime error from parsing the cmdline.
Sure. IMO, we should do both; use a case expression to filter out
obvious invalid matches, but also use a sed expression which clearly
and unambiguously specifies the intent of the code, even if that is
partly redundant because of `case' matching.
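Doing both might look something like the sketch below. The function name
`assignment_value', the glob label and the anchored sed expression are all
my assumptions for illustration, not code taken from the script under
discussion.

```shell
#!/bin/sh
# Print the value part of a NAME=value argument, or fail with a message.
assignment_value() {
  case $1 in
    [!=]*=*)
      # The label has already excluded an empty name; the sed expression
      # still spells out the full intent: one or more non-`=' characters
      # at the start, followed by a single `='.
      printf '%s\n' "$1" | sed 's?^[^=][^=]*=??'
      ;;
    *)
      echo "not an assignment: $1" >&2
      return 1
      ;;
  esac
}

assignment_value 'VAR=value'              # prints: value
assignment_value 'VAR=a=b'                # prints: a=b
assignment_value '=value' || echo 'rejected'
```

The redundancy is deliberate: the `case' label keeps malformed arguments away
from sed, while the sed expression remains correct if it is ever copied
somewhere without that guard.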