Okay, here are the distilled rules, which we must at least stay within.

Format of a mimetype
type/subtype; name1=value1; name2=value2

media-type = type "/" subtype *( ";" parameter )
type       = token
subtype    = token
parameter  = attribute "=" value
attribute  = token
value      = token | quoted-string



Axioms:
The order of the name/value pairs is not important.
Whitespace may appear between name/value pairs, but not within them.
Tokens cannot contain whitespace.
Values can be case sensitive; type, subtype and parameter names are not.
Every value, except the last, should end in a ;
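
To make the format and axioms above concrete, here is a minimal sketch in Python of a parser that rejects badly formatted mimetypes. The function name parse_mimetype and the regular expressions are my own illustration, not anything that exists in ECM; the token character set is my reading of RFC 2616.

```python
import re

# token (RFC 2616): printable ASCII without separators, space or controls.
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
QUOTED = r'"(?:[^"\\]|\\.)*"'
MEDIA = re.compile(rf"({TOKEN})/({TOKEN})")
PARAM = re.compile(rf"\s*;\s*({TOKEN})=({TOKEN}|{QUOTED})")


def parse_mimetype(text):
    """Return (type, subtype, params) or raise ValueError if malformed."""
    text = text.strip()
    head = MEDIA.match(text)
    if not head:
        raise ValueError(f"bad type/subtype in {text!r}")
    # type and subtype are case insensitive, so normalise them
    mtype, subtype = head.group(1).lower(), head.group(2).lower()
    params = {}
    pos = head.end()
    while pos < len(text):
        param = PARAM.match(text, pos)
        if not param:
            raise ValueError(f"bad parameter at offset {pos} in {text!r}")
        name, value = param.group(1).lower(), param.group(2)
        if value.startswith('"'):
            # strip the quotes and undo backslash escapes
            value = re.sub(r"\\(.)", r"\1", value[1:-1])
        params[name] = value  # values keep their case
        pos = param.end()
    return mtype, subtype, params
```

With this, parse_mimetype("text/xml; charset=UTF-8") gives ("text", "xml", {"charset": "UTF-8"}), while "text xml" or "text/xml; charset" raise ValueError.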

Okay, the rules for mime-types are a bit more complex than I originally thought.

1. Should we implement these rules in the rules for the content model, so that people can validate their mimetype specifications? I.e., to avoid a content model requiring that objects use a wrongly formatted mimetype.

Next, how to compare two mime-types for equality.
Basically, compare textually up to the first ;
then split the remainder on ;
split each piece on =
sort on the parameter names
and compare the two sorted lists.
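
The comparison steps above could be sketched in Python roughly like this. The names normalise and mimetypes_equal are made up for the example, and lower-casing the type and the parameter names is my assumption based on RFC 2616. The naive split on ; would break on a quoted value that itself contains a ;

```python
def normalise(mimetype):
    """Return (type/subtype, sorted list of (name, value) pairs)."""
    # everything before the first ; is the type/subtype
    head, _, rest = mimetype.partition(";")
    pairs = []
    for chunk in rest.split(";"):
        if not chunk.strip():
            continue  # no parameters, or a stray ;
        name, _, value = chunk.partition("=")
        # names are case insensitive, values are compared as written
        pairs.append((name.strip().lower(), value.strip()))
    return head.strip().lower(), sorted(pairs)


def mimetypes_equal(a, b):
    """True if both mimetypes have the same type and the same parameters."""
    return normalise(a) == normalise(b)
```

Here "text/xml; charset=UTF-8; version=1.0" and "text/xml;version=1.0;charset=UTF-8" compare equal, while charset=utf-8 and charset=UTF-8 do not, since values are compared case sensitively.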

Now, how would this help with the original problem? Steve wanted to have a more specific specification in the data object than in the content model. In general, we would need to create an inheritance tree for the mime-types.
If the content model requires text/plain, then text/xml should be ok. We would also need to define equivalent mime types, such as text/xml and application/xml. There is a sizable document about this at
http://tools.ietf.org/html/rfc3023
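
As a rough Python sketch of what such an inheritance tree could look like: the two tables below are purely hypothetical and only contain the examples mentioned above, and is_a is a name I made up.

```python
# Hypothetical tables, for illustration only.
PARENT = {"text/xml": "text/plain"}       # child -> more general type
ALIAS = {"application/xml": "text/xml"}   # equivalent type -> canonical type


def is_a(object_type, required_type):
    """True if object_type equals required_type or descends from it."""
    current = ALIAS.get(object_type, object_type)
    required = ALIAS.get(required_type, required_type)
    while current is not None:
        if current == required:
            return True
        current = PARENT.get(current)  # walk one step up the tree
    return False
```

So is_a("application/xml", "text/plain") holds, but is_a("text/plain", "text/xml") does not.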

If the content model declares a mimetype parameter, it should be required in the data objects, but should the data objects be allowed to have additional parameters?
Should the content model have a way to specify that the data object must have the exact mimetype, with no additional parameters?
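
One possible answer to both questions, sketched in Python: the parameters from the content model must be present in the data object, extra parameters are allowed unless an exact flag is set. The names satisfies and exact are hypothetical; nothing like this exists in the validator today.

```python
def split_mimetype(mimetype):
    """Return (type/subtype, dict of parameters); naive split on ; and =."""
    head, _, rest = mimetype.partition(";")
    params = {}
    for chunk in rest.split(";"):
        name, sep, value = chunk.partition("=")
        if sep:
            params[name.strip().lower()] = value.strip()
    return head.strip().lower(), params


def satisfies(model_mimetype, object_mimetype, exact=False):
    """True if the object's mimetype is acceptable for the content model."""
    model_head, model_params = split_mimetype(model_mimetype)
    object_head, object_params = split_mimetype(object_mimetype)
    if model_head != object_head:
        return False
    if exact:
        # no additional parameters allowed
        return model_params == object_params
    # every parameter the model declares must be present with the same value
    return all(object_params.get(k) == v for k, v in model_params.items())
```

With this, a content model saying "text/xml" accepts an object with "text/xml; charset=UTF-8", unless exact=True is given, and a content model saying "text/xml; charset=UTF-8" rejects an object with plain "text/xml".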


Should we implement the same lenient rules for format_uri? I.e., should the validator understand how some format URIs can be descendants of others?


 
These are just some thoughts on the issue. I do feel that the current design, where you can specify a number of mimetypes in the content model and the object is valid if at least one of them matches, is fine for all the use cases I can think of.

Regards






On 12/19/2011 04:45 PM, Stephen Bayliss wrote:
Hi Asger
 
That's a good point.  Presumably a definition as per http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.17 with the media types as per http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7
 
So from that the main type has to go first, and my guess is that the order of the parameters is not important.  Though I wonder if there are further levels of complexity; e.g. could charset be utf-8 and UTF-8 and both be equivalent?  (It looks like utf-8 is the canonical form for text/xml, though.)
 
Steve
-----Original Message-----
From: Asger Askov Blekinge [mailto:abr@statsbiblioteket.dk]
Sent: 19 December 2011 14:29
To: fedora-commons-users@lists.sourceforge.net
Subject: Re: [fcrepo-user] ECM validation of MIMETYPE

Yes, but we would need to specify some rules then.

charset is not the only parameter allowed; I believe this is an open-ended set. I know people have been using "version" as well.

So, I would need to know how to split a mime-type, and whether the order of the parameters is important.

Secondly, you can of course specify multiple form statements in the content model; the only requirement is that ONE of them matches. So, specify the various allowed charsets, and one without charset, and you should be safe.

Regards

On 12/15/2011 01:03 PM, Stephen Bayliss wrote:

As far as I can tell, ECM validation of a datastream’s MIMETYPE is strict – the entire MIMETYPE property contents have to match that declared in the content model.

What about the case where one might want to specify the MIMETYPE of a datastream in the CModel, but not the character set?  If I specify MIMETYPE as “text/xml” in the CModel and as “text/xml; charset=UTF-8” in the object, it fails validation.

Would it make sense to only validate charset if it is defined in the content model?

Regards

Steve