As I've been working on adding more diverse data sets to IGB, I've come to the conclusion that the lack of standardized configuration data for quickloads (and possibly other portions of IGB) will become more of a problem. For example, IGB currently uses a set of regular expressions on the annotation name to configure its web lookup for gene model information. This is counter-intuitive.
Additionally, as IGB becomes more complex, the lack of a standard definition makes it difficult to automate checking that a quickload directory has the proper information and that the information is valid.
As a first attempt, I've whipped up some XSD (XML Schema Definition) files to show where I'm going with this. These are barely a first draft.
A contents.xml file will replace the contents.txt file.
The XML file has basically the same information as contents.txt, but also adds a field for the data format (2BIT or BNIB), a count of the number of sequences for each genome, and a version date and time to specify when the directory was last updated.
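To make the idea concrete, here is a rough sketch of what checking such a file could look like. The element and attribute names (`contents`, `genome`, `format`, `sequence_count`, `updated`) and the genome entries are made up for illustration; they are not taken from the attached contents.xsd.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents.xml. All element names, attribute names and
# genome entries are illustrative assumptions, not the real schema.
CONTENTS_XML = """\
<contents updated="2010-06-01T12:00:00">
  <genome name="Example_genome_v1" description="Example genome, version 1"
          format="2BIT" sequence_count="7"/>
  <genome name="Example_genome_v2" description="Example genome, version 2"
          format="BNIB" sequence_count="25"/>
</contents>
"""

ALLOWED_FORMATS = {"2BIT", "BNIB"}


def check_contents(xml_text):
    """Return a list of human-readable problems found in a contents.xml string."""
    problems = []
    root = ET.fromstring(xml_text)
    if "updated" not in root.attrib:
        problems.append("missing version date/time on <contents>")
    for genome in root.findall("genome"):
        name = genome.get("name", "<unnamed>")
        if genome.get("format") not in ALLOWED_FORMATS:
            problems.append(f"{name}: format must be one of {sorted(ALLOWED_FORMATS)}")
        count = genome.get("sequence_count", "")
        if not count.isdigit() or int(count) < 1:
            problems.append(f"{name}: sequence_count must be a positive integer")
    return problems


print(check_contents(CONTENTS_XML))
```

A check like this is the kind of automation that is hard to write against the current free-form text files.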
A sequences.xml file will replace both mod_chromInfo.txt and annots.xml.
Sequences.xml will contain all of the annotation information in annots.xml, but also adds fields for the gene annotation URL and name (the information that currently lives in igb_default_prefs.xml).
Sequences.xml will also contain the sequence information in mod_chromInfo.txt. Since IGB is moving towards accessing sequence information from a single file, I thought it would be best if the filename of the sequence file was also included.
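As a sketch of how a client might read such a file, assuming a layout I have invented for illustration (the attribute names `sequence_file`, `lookup_name` and `lookup_url`, the `{id}` placeholder, the file names and the sequence lengths are all hypothetical and not taken from sequences.xsd):

```python
import xml.etree.ElementTree as ET

# Hypothetical sequences.xml combining sequence names/lengths, the single
# sequence file name, and per-annotation gene lookup settings.
# Every name and value below is an illustrative assumption.
SEQUENCES_XML = """\
<sequences sequence_file="example_genome.2bit">
  <sequence name="chr1" length="1000000"/>
  <sequence name="chr2" length="2500000"/>
  <annotation file="gene_models.bed" title="Gene models"
              lookup_name="Example gene page"
              lookup_url="http://example.org/gene?id={id}"/>
</sequences>
"""


def load_sequences(xml_text):
    """Parse the sketch format into plain Python structures."""
    root = ET.fromstring(xml_text)
    return {
        "sequence_file": root.get("sequence_file"),
        "lengths": {
            seq.get("name"): int(seq.get("length"))
            for seq in root.findall("sequence")
        },
        "annotations": [dict(ann.attrib) for ann in root.findall("annotation")],
    }


def gene_lookup_url(annotation, gene_id):
    """Build the web lookup URL from the annotation's own configuration."""
    return annotation["lookup_url"].replace("{id}", gene_id)


info = load_sequences(SEQUENCES_XML)
print(info["sequence_file"], info["lengths"])
print(gene_lookup_url(info["annotations"][0], "GENE0001"))
```

The point of the second function is that the lookup URL comes from the data set's own configuration rather than from a regular expression matched against the annotation name.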
The files are in XML as IGB already uses this file format. However, the resultant XML files are not very human friendly. JSON is an alternative to XML that could be considered for this configuration information.
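For comparison, the same kind of contents information might look like this in JSON. The key names and genome entries are again invented for illustration only.

```python
import json

# Hypothetical JSON equivalent of a contents file; key names and
# genome entries are illustrative assumptions.
CONTENTS_JSON = """\
{
  "updated": "2010-06-01T12:00:00",
  "genomes": [
    {"name": "Example_genome_v1", "format": "2BIT", "sequence_count": 7},
    {"name": "Example_genome_v2", "format": "BNIB", "sequence_count": 25}
  ]
}
"""

contents = json.loads(CONTENTS_JSON)
for genome in contents["genomes"]:
    print(genome["name"], genome["format"], genome["sequence_count"])
```

One trade-off worth noting: XML has XSD for validation, so moving to JSON would mean choosing a different way to check files for correctness.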
Let me know what you think.
Adam Baxter
Loraine Lab
contents.xsd
sequences.xsd
I think it is a mistake to push QuickLoad further than it already is. Why aren't you building this functionality into DAS/2? In many cases, it likely already has it. Do you really want to create yet another closed custom one off data distribution system? Do you really want to force users to use IGB to access your data? It would be more helpful if you pushed the capabilities of DAS/2 so that people can connect these islands of data with one another and use the application best suited to a particular analysis for the task. You're putting up walls instead of building bridges.
Please realize that my request has nothing to do with the direction of IGB's data connectivity. That's well above my pay grade.
QuickLoad is already a one-off custom closed data distribution system. I'm simply trying to standardize it in such a way that it's easier to create and check for correctness.
Going along with this, it would be great if we could allow servers to back each other up. It would be nice to be able to mirror the Bioviz server at multiple locations without confusing IGB. For example, all of them might implement a Whole Sequence load hint for foundation data sets, but IGB would be smart enough not to load them all. It is so incredibly easy to set up a QL site that it would be brilliant if we could mirror them in lots of different places, thus making IGB feel much more responsive and flexible.
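A minimal sketch of how mirror selection could work, assuming each server advertises some shared identifier for the site it mirrors. The `mirror_of` field, the URLs and the reachability check are all hypothetical; nothing like this exists in IGB as far as this thread states.

```python
def pick_servers(servers, is_reachable):
    """Choose one reachable server per mirrored site, in preference order.

    `servers` is a list of dicts with hypothetical keys "url" and
    "mirror_of"; `is_reachable` is any callable taking a URL and
    returning True or False.
    """
    chosen = {}
    for server in servers:
        site = server["mirror_of"]
        # Skip servers whose site is already covered by an earlier mirror.
        if site not in chosen and is_reachable(server["url"]):
            chosen[site] = server["url"]
    return chosen


# Example with made-up URLs; the first mirror is treated as down.
servers = [
    {"url": "http://mirror-a.example.org/quickload", "mirror_of": "main-site"},
    {"url": "http://mirror-b.example.org/quickload", "mirror_of": "main-site"},
    {"url": "http://other.example.org/quickload", "mirror_of": "other-site"},
]
down = {"http://mirror-a.example.org/quickload"}
print(pick_servers(servers, lambda url: url not in down))
```

With something like this, foundation data sets would be loaded once from whichever mirror answers first instead of once per server.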