MetadataCatalogService
This is a project proposal for a metadata catalog service based on requirements drawn from the particle physics community, although I believe it would be generally useful for most scientists. Many particle physics experiments produce results files (containing events) which are categorised and analysed at a later date. Typically, a scientist will want to search for the data files containing events that match some set of parameters relevant to the scientist's work. These parameters are recorded as metadata associated with the file. The association may be made through implicit naming conventions for results files, through storing both metadata and results in a database or, more commonly, through storing the metadata in a database and the results in an associated storage system.
The general requirements on a metadata system are that:
- databases should be self-describing
- scientists' queries should be freeform and easy to express
- queries should not depend on the underlying structure or implementation of the databases
- it should be easy to use the same query on several metadata DBs
For instance, the AMI system (http://ami.ln3p2.fr -> dataset search) developed by the ATLAS project implements a model where individual metadata databases are self-describing (using a few common tables) and introspects the database tables. It implements a version of the EGEE-defined Metadata Query Language (MQL; see Section 4 of the EGEE GLITE METADATA CATALOG USER'S GUIDE). This presents a subset of SQL which essentially allows queries of the form:
SELECT attribute_list WHERE (conditions on attributes)
In the MQL definition, all attributes must be namespaced. In the AMI implementation a simplification has been made which effectively flattens the tables into a single table.
Thus a query like:
SELECT dataset FROM dataset s, properties p WHERE s.identifier=p.datasetID AND p.phi > 10
can be expressed as:
SELECT dataset WHERE phi > 10
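The equivalence of the two queries above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions, not how AMI does it: the view name `flat`, the helper `expand_flat_query`, and the sample rows are all hypothetical, and the flattening is done here with a hand-written view rather than by introspection.

```python
import re
import sqlite3


def build_demo_db():
    """Create the two tables from the example query plus a flattening view.

    The sample rows and the view name 'flat' are made up for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dataset (identifier INTEGER PRIMARY KEY, dataset TEXT);
        CREATE TABLE properties (datasetID INTEGER, phi REAL);
        INSERT INTO dataset VALUES (1, 'run_a'), (2, 'run_b'), (3, 'run_c');
        INSERT INTO properties VALUES (1, 5.0), (2, 12.5), (3, 40.0);
        -- One virtual table exposing every attribute without a namespace.
        CREATE VIEW flat AS
            SELECT s.dataset AS dataset, p.phi AS phi
            FROM dataset s JOIN properties p ON s.identifier = p.datasetID;
    """)
    return conn


def expand_flat_query(query):
    """Turn 'SELECT attrs WHERE cond' into SQL against the flat view.

    Only the simple two-clause form is handled; anything else is rejected.
    """
    match = re.fullmatch(
        r"\s*SELECT\s+(.+?)\s+WHERE\s+(.+?)\s*", query, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError("expected a query of the form SELECT ... WHERE ...")
    attrs, cond = match.groups()
    return f"SELECT {attrs} FROM flat WHERE {cond}"


if __name__ == "__main__":
    conn = build_demo_db()
    sql = expand_flat_query("SELECT dataset WHERE phi > 10")
    print(sql)
    print(conn.execute(sql).fetchall())
```

The user-facing query never mentions a table or a join condition; those live in the view definition, which is the part a real service would have to derive from the database's self-description.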
In OGSA-DAI, this splits into three pieces of functionality which are probably generally useful outside the metadata catalog scenario, and I believe that they can be implemented independently of one another:
DISTRIBUTED JOIN: the ability to treat tables from many databases as being within the same database. This has generic application in other contexts (e.g. AstroGrid?) and has many possible implementations, each of which typically has its own disadvantages.
FLATTENED TABLE: the ability to flatten all tables in a database into a single virtual table. This probably requires some knowledge of the relationships between tables and their schema. I believe AMI does this via a combination of required tables and values, tables containing mappings, and introspection. It is an interesting add-on which allows people to "naively query" and therefore may be difficult to implement efficiently.
VERTICAL TABLE INTEGRATION: the ability to treat two tables with the same name from different databases as being a single virtual table (we assume in this scenario that tables with the same name have the same type of contents). This requires thought about how to cope with clashes if the two tables have partially overlapping contents. (Is this vertical integration, or is it horizontal?)
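As a rough sketch of the last item, the following uses SQLite's ATTACH DATABASE to stand in for two separate metadata databases that each hold a `dataset` table, and builds the single virtual table as a union. It is not OGSA-DAI code; the schema names `site_a` and `site_b`, the columns, and the rows are invented for illustration, and the "clash" rule used here (same identifier, different `phi`) is just one possible definition.

```python
import sqlite3

# The virtual table: rows from both same-named tables, tagged with their origin.
VIRTUAL_DATASET = """
    SELECT 'site_a' AS source, identifier, phi FROM site_a.dataset
    UNION ALL
    SELECT 'site_b' AS source, identifier, phi FROM site_b.dataset
"""


def open_federation():
    """Attach two independent in-memory databases with partially overlapping rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS site_a")
    conn.execute("ATTACH DATABASE ':memory:' AS site_b")
    for schema in ("site_a", "site_b"):
        conn.execute(
            f"CREATE TABLE {schema}.dataset (identifier TEXT PRIMARY KEY, phi REAL)"
        )
    conn.executemany(
        "INSERT INTO site_a.dataset VALUES (?, ?)",
        [("ds1", 5.0), ("ds2", 12.5)],
    )
    conn.executemany(
        "INSERT INTO site_b.dataset VALUES (?, ?)",
        [("ds1", 7.0), ("ds2", 12.5), ("ds3", 40.0)],
    )
    return conn


def merged_rows(conn):
    """Rows of the virtual table with exact duplicates collapsed."""
    return conn.execute(
        "SELECT DISTINCT identifier, phi FROM (" + VIRTUAL_DATASET + ") "
        "ORDER BY identifier, phi"
    ).fetchall()


def clashes(conn):
    """Identifiers that appear with more than one distinct phi value."""
    return conn.execute(
        "SELECT identifier FROM (" + VIRTUAL_DATASET + ") "
        "GROUP BY identifier HAVING COUNT(DISTINCT phi) > 1 "
        "ORDER BY identifier"
    ).fetchall()


if __name__ == "__main__":
    conn = open_federation()
    print(merged_rows(conn))
    print(clashes(conn))
```

Rows that agree in both databases collapse to one, while an identifier that carries different values in each is reported rather than silently resolved. Which source should win in that case is exactly the open design question raised above.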
The general requirements on a production service are that it should be able to catalogue ~25 million datasets and handle 100 queries per second (and ~1 write per second).
If time is available, it would be useful to attempt to implement one or more parts of this as part of a read-only Metadata Catalog Service demonstrator. It should be easy to get hold of sample data and queries.
Copyright (c) 2007, The University of Edinburgh.
