Hi Matt,

I am populating the data table by batching all the inserts for a strip and then committing them in a single transaction.
The number of transactions is:

N=number of strips (e.g., 50)
1 for spectrumList table
1xN for Data table
1xN for BBsToSpectrumList table
1xN for BBs table

So: one strip, one transaction (for this file, the DB grows by roughly 80 MB per transaction when N=50).
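In case it helps, this is roughly the pattern I am using for each strip (just a sketch; the DATA table and the scan/mz/intensity columns below are simplified placeholders, not our exact schema, which has two integer and two double columns):

import java.sql.Connection;
import java.sql.PreparedStatement;

public class StripInserter {

    // One strip = one transaction: batch all of the strip's rows, then commit once.
    public static void insertStrip(Connection conn, int[] scans, double[] mzs, double[] intensities)
            throws Exception {
        conn.setAutoCommit(false); // start the per-strip transaction
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO DATA (scan, mz, intensity) VALUES (?, ?, ?)")) {
            for (int i = 0; i < scans.length; i++) {
                ps.setInt(1, scans[i]);
                ps.setDouble(2, mzs[i]);
                ps.setDouble(3, intensities[i]);
                ps.addBatch();        // queue the insert
            }
            ps.executeBatch();        // run all queued inserts for the strip
            conn.commit();            // single commit per strip
        } catch (Exception e) {
            conn.rollback();          // discard the strip if anything fails
            throw e;
        }
    }
}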

Now I am trying your PRAGMA settings, but I don't see any significant difference: it takes about 15 minutes less, but still more than an hour. I also added:

stat.executeUpdate("PRAGMA page_size=4096;");
stat.executeUpdate("PRAGMA default_cache_size=200000;");
stat.executeUpdate("PRAGMA cache_size=200000;");
stat.executeUpdate("PRAGMA count_changes = OFF;");

and that gained us another 15 minutes. But it is still not sufficient: the whole thing takes almost 2 hours. I am using the Xerial JDBC driver for Windows 64-bit, and I checked that it is actually running on the 64-bit JVM. Am I doing something wrong?
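For completeness, this is more or less how the connection is set up now, combining your pragmas with mine (a sketch; the method name and error handling are just illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PragmaSetup {

    public static Connection openTunedConnection(String dbPath) throws Exception {
        Class.forName("org.sqlite.JDBC");                 // Xerial driver
        Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
        try (Statement stat = conn.createStatement()) {
            // Your bulk-load suggestions:
            stat.executeUpdate("PRAGMA synchronous=OFF;");
            stat.execute("PRAGMA journal_mode=OFF;");     // returns the new mode, so execute() rather than executeUpdate()
            stat.executeUpdate("PRAGMA temp_store=MEMORY;");
            // The settings I already had:
            stat.executeUpdate("PRAGMA page_size=4096;"); // only effective if set before the database is created (or after VACUUM)
            stat.executeUpdate("PRAGMA cache_size=200000;");
            stat.executeUpdate("PRAGMA count_changes=OFF;");
        }
        return conn;
    }
}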

We are talking about 4 GB under Windows (MS1 only: 72,900,624 entries in the DATA table), starting from a 780 MB (MS1+MS2) mzXML file... it's huge! I am storing m/z and intensities as double values and scan numbers as integer values.

mzRes=0.001;
dataDensity=0.02499296;
maxDataNum=(1400/mzRes)*2130;

expecDataNum=maxDataNum*dataDensity  % --> ~74*10^6 (expected data number)
actualDataNum=72900624;              % counted by the query: SELECT count(1) FROM DATA
% expecDataNum and actualDataNum are consistent.

occupDouble=8;           % REAL: the value is stored as an 8-byte IEEE floating point number
occupInteger=[2 4 6 8];  % INTEGER: signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value
integerColumns=2;
doubleColumns=2;
totBytes=doubleColumns*actualDataNum*occupDouble+integerColumns*actualDataNum*occupInteger;
gb=1024^3;               % GiB (Windows convention)
totGiB=totBytes./gb

totGiB =

    1.39    1.67    1.94    2.22

So the raw payload should be somewhere between about 1.4 and 2.2 GiB, yet the DB occupies 4 GB (Windows convention).
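To get a feeling for how much of the gap SQLite's own per-row bookkeeping could explain, here is a rough Java version of the same estimate with an assumed per-row overhead. The ~13 bytes/row figure is only my guess for the cell pointer, payload-length/rowid varints and record header, and it ignores indexes and unused page space:

public class SizeEstimate {

    public static void main(String[] args) {
        long rows = 72900624L;              // actualDataNum
        int doubleColumns = 2;
        int occupDouble = 8;                // REAL: 8-byte IEEE float
        int integerColumns = 2;
        int[] occupInteger = {2, 4, 6, 8};  // possible INTEGER widths
        long overheadPerRow = 13;           // ASSUMED bytes/row of SQLite bookkeeping (my guess)
        double gib = 1024.0 * 1024.0 * 1024.0;

        for (int occ : occupInteger) {
            long payload = rows * (doubleColumns * occupDouble + integerColumns * occ);
            long total = payload + rows * overheadPerRow;
            System.out.printf("int width %d B: payload %.2f GiB, with row overhead ~%.2f GiB%n",
                    occ, payload / gib, total / gib);
        }
    }
}

Even with that assumed overhead the estimate stays under 4 GB, so I suspect the rest comes from indexes and page slack, but that is just a guess.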

What's your opinion?

Sara



From: Matt Chambers <matt.chambers42@gmail.com>
To: proteowizard-mzrtree@lists.sourceforge.net
Sent: Sunday, December 5, 2010, 23:13:36
Subject: Re: [proteowizard-mzrtree] access times

How are you populating the table? Make sure it is inside a transaction. Also, for bulk inserts to SQLite, try:
PRAGMA synchronous=off;
PRAGMA journal_mode=off;
PRAGMA temp_store=MEMORY;

This won't help file size though. What sizes are we talking about?

If we have to move to BLOB, it will be much harder to propose it as a standard because that will significantly complicate the reading logic. At that point, the only reason I can think of to stick with SQLite is the built-in R*Tree support.

-Matt


On 12/5/2010 2:09 PM, Sara Nasso wrote:
Hi guys,

I tested the range query times on the DB using the DATA table (no BLOBs).

The problems of DB size and the time required to populate it still need to be solved; I haven't looked into them yet. I guess populating would be quicker using the C-language interface, but I have no clue how to deal with the size! Probably the only solution is to move to BLOBs. Any ideas?

What to do next?

Thanks,
Sara