I also forgot to mention my m/z range: 300-2000 amu

Moving to BLOBs will require refactoring the ER diagram a little and defining some conventions for the binary structure.
But it could improve performance by an order of magnitude and reduce the disk space used.

I have a Perl script here that can generate a SQLite file following the ER diagram I called mzDB, which I sent you in a previous email.
I am going to create this file from the mzXML file I used for the statistics described in my last email.

I can give you the generated file when it's done.
The ER diagram includes BLOB fields for m/z values and intensities.
There is no R-tree index because I use the default indexing features of SQLite.
I currently split scans using a 1 amu window, but I can easily change this value.

The main drawback of using BLOBs is the need to add some post-processing code for the range query operation.
So an additional layer on top of the SQL queries is needed to get a usable result.
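
As a rough illustration of that extra layer, the sketch below decodes a pair of BLOBs and keeps only the points inside the requested m/z range. It assumes each BLOB holds consecutive little-endian 64-bit doubles; the byte order and the function names (decode_doubles, filter_mz_range) are assumptions made for illustration, not conventions already agreed for mzDB.

```python
import struct

def decode_doubles(blob):
    """Unpack a BLOB of consecutive little-endian 64-bit doubles (assumed layout)."""
    count = len(blob) // 8
    return struct.unpack("<%dd" % count, blob)

def filter_mz_range(mz_blob, intensity_blob, mz_min, mz_max):
    """Return the (mz, intensity) pairs with mz_min <= mz <= mz_max."""
    mz_values = decode_doubles(mz_blob)
    intensities = decode_doubles(intensity_blob)
    return [(mz, inten) for mz, inten in zip(mz_values, intensities)
            if mz_min <= mz <= mz_max]

# Hypothetical slice covering 500-501 amu, queried for 500.2-500.6 only.
mz_blob = struct.pack("<4d", 500.1, 500.3, 500.5, 500.9)
intensity_blob = struct.pack("<4d", 10.0, 20.0, 30.0, 40.0)
points = filter_mz_range(mz_blob, intensity_blob, 500.2, 500.6)
```

The SQL query would return whole slices, and a filter of this kind would then trim the first and last slice down to the exact range requested.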

I'll try some range query benchmarks on this file.
I think it may be interesting to compare these benchmarks with those obtained using R-tree indexing.

David

On 02/12/2010 12:01, Sara Nasso wrote:
I forgot to mention this about the data:

m/z range: 400-1800 Da
MS1 scans: 2130

Sara

----- Forwarded message -----
From: Sara Nasso <apeir0n@yahoo.it>
To: proteowizard-mzrtree@lists.sourceforge.net
Sent: Thu 2 December 2010, 11:52:21
Subject: Re: [proteowizard-mzrtree] next week working on it

Hi all,

I saw very similar numbers on a low-resolution profile dataset from a controlled sample: LTQ Ion Trap (0.04 Da data resolution along m/z) and 1 spectrum per second (1 s resolution along retention time).

I wrote some draft code for building the DB but, even using batch inserts to populate it, it takes too much time and requires too much disk space. I assume we have to move to BLOBs, even if the DB will be less flexible and some of the references will be lost.

Do you have any experience with this?

Tomorrow Francesco and I will have a further look at it, so if you have any suggestions please let us know.

Cheers
Sara
 


Sara Nasso, Ph.D. student
Department of Information Engineering (DEI)
University of Padova
Via Ognissanti 72 35129 PADOVA, ITALY
Voice: +39-049-8277834
Fax: +39-049-8277826





From: David Bouyssié <david.bouyssie@ipbs.fr>
To: proteowizard-mzrtree@lists.sourceforge.net
Sent: Thu 2 December 2010, 10:07:08
Subject: Re: [proteowizard-mzrtree] next week working on it

Hi,

I have computed the total number of data points in an mzXML file acquired on an LTQ-Orbitrap.
The analyzed sample is a complex mixture of peptides.

The run duration is 120 minutes. It contains 4174 MS1 scans.
The total number of data points is 163554621, i.e. an average of about 39184 peaks per scan.

David

On 20/11/2010 00:08, Sara Nasso wrote:
Hi David,

sorry for the late reply, I'm very busy.
I haven't yet. It won't take me much time, but I haven't found any free time so far; maybe this weekend...

For the number of data points, it depends on the experimental setup and on the data resolution, as far as I know, and it should be:

((mz_f - mz_i) / data_resolution) * (scan_number_f - scan_number_i) * data_density

data_resolution: e.g., 0.001 Da (sampling step along the m/z dimension)

data_density: I have only seen low-density files (no more than 10%), but my experience is limited... I guess they can reach higher densities. What have you seen?
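
To make the estimate concrete, here is a minimal sketch of the formula above. The 0.001 Da resolution and the 10% density are the example values mentioned above; the m/z range and the scan numbers are hypothetical values chosen only for illustration, and the function name is an assumption.

```python
def estimate_data_points(mz_i, mz_f, data_resolution,
                         scan_number_i, scan_number_f, data_density):
    """Estimate the number of data points in a run from the formula above."""
    # Number of sampling steps along the m/z dimension.
    mz_steps = (mz_f - mz_i) / data_resolution
    # Number of scans covered.
    scans = scan_number_f - scan_number_i
    # Only a fraction (data_density) of the grid holds data.
    return mz_steps * scans * data_density

# Hypothetical run: 400-1800 Da, 2000 scans, 0.001 Da step, 10% density.
estimate = estimate_data_points(400.0, 1800.0, 0.001, 0, 2000, 0.10)
```

With these inputs the estimate is about 280 million points, which gives an idea of how many rows a one-row-per-point table would need.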

Sara


From: David Bouyssié <david.bouyssie@ipbs.fr>
To: proteowizard-mzrtree@lists.sourceforge.net
Sent: Thu 18 November 2010, 13:33:46
Subject: Re: [proteowizard-mzrtree] next week working on it

Hi !

I have some time to work on this project.

I have some questions for Sara:
- have you set up some Java classes to test the feasibility? If so, can you share them?
- do you know the order of magnitude of the inserts that have to be done on the data table for a big raw file (i.e. the number of data points)? I think this could be an issue.

David

On 11/11/2010 16:29, Sara Nasso wrote:
Hi!

sorry for my late reply, but I had dental surgery, as David knows.

@David: ok, I see why you used it. We thought of leveraging mzML for metadata, but we still have to define how, if testing goes well.

This week I can't work on this project, but next week I will. So, I'll let you know as soon as possible!

cheers
Sara

-------------------------------------------------------------------------------------

Hi Sara,

Actually, this format is really used to store the m/z data of a single run. We have other schemas for lab experiments.
The instrument and run tables each hold a single record.
This is not conventional, but it is the way I found to attach the data annotations.

The goal of this schema was very close to the one you solved with the R-tree.
In effect, I divide the m/z acquisition range into slices (by default one run_slice for each amu).
Scans are cut into "scan slices", and each one is linked to its corresponding run_slice.
The m/z data points (mz_list and intensity_list) are stored in the scan_slice in a binary structure (consecutive DOUBLE numbers).

Using the indexing mechanism of SQLite on the run_slice and scan_slice tables, it is thus possible to make fast range queries on the m/z data.
However, some post-processing m/z filtering has to be done on the data contained in the retrieved slices to keep only the wanted data points.
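
The slicing scheme described above could be sketched roughly as follows. The table names run_slice and scan_slice and the fields mz_list and intensity_list come from the description; every other column name, the little-endian byte order, and the helper functions are assumptions made for illustration, not the actual mzDB schema.

```python
import sqlite3
import struct
from collections import defaultdict

def pack_doubles(values):
    """Pack floats as consecutive little-endian doubles (assumed layout)."""
    return struct.pack("<%dd" % len(values), *values)

def slice_scan(mz_values, intensities, width=1.0):
    """Group the points of one scan by slice number = floor(mz / width)."""
    slices = defaultdict(lambda: ([], []))
    for mz, inten in zip(mz_values, intensities):
        number = int(mz // width)
        slices[number][0].append(mz)
        slices[number][1].append(inten)
    return slices

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE run_slice (id INTEGER PRIMARY KEY, number INTEGER UNIQUE);
CREATE TABLE scan_slice (
    id INTEGER PRIMARY KEY,
    scan_id INTEGER,
    run_slice_id INTEGER REFERENCES run_slice(id),
    mz_list BLOB,
    intensity_list BLOB
);
CREATE INDEX scan_slice_run_slice_idx ON scan_slice (run_slice_id);
""")

def store_scan(conn, scan_id, mz_values, intensities, width=1.0):
    """Cut one scan into scan slices and link each to its run_slice."""
    for number, (mzs, intens) in slice_scan(mz_values, intensities, width).items():
        conn.execute("INSERT OR IGNORE INTO run_slice (number) VALUES (?)", (number,))
        (run_slice_id,) = conn.execute(
            "SELECT id FROM run_slice WHERE number = ?", (number,)).fetchone()
        conn.execute(
            "INSERT INTO scan_slice (scan_id, run_slice_id, mz_list, intensity_list) "
            "VALUES (?, ?, ?, ?)",
            (scan_id, run_slice_id, pack_doubles(mzs), pack_doubles(intens)))

def slices_for_range(conn, mz_min, mz_max, width=1.0):
    """Fetch every scan slice whose run_slice overlaps [mz_min, mz_max]."""
    first, last = int(mz_min // width), int(mz_max // width)
    return conn.execute(
        "SELECT s.scan_id, r.number, s.mz_list, s.intensity_list "
        "FROM scan_slice s JOIN run_slice r ON r.id = s.run_slice_id "
        "WHERE r.number BETWEEN ? AND ? ORDER BY s.scan_id, r.number",
        (first, last)).fetchall()

# Two small hypothetical scans, then a query for the 500.5-501.5 range.
store_scan(conn, 1, [500.1, 500.7, 501.2, 650.4], [10.0, 20.0, 30.0, 40.0])
store_scan(conn, 2, [500.3, 502.5], [5.0, 6.0])
rows = slices_for_range(conn, 500.5, 501.5)
```

The query only narrows the result down to whole slices (here slices 500 and 501), so points such as 500.1 still come back and have to be removed by the m/z filtering step.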

David

_______________________________________________ proteowizard-mzrtree mailing list proteowizard-mzrtree@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/proteowizard-mzrtree