Batch size limit
It would be great if we could specify a limit on the
number of rows that openDbCopy will attempt to copy in a
single batch when using the copy plugin.
Larger batches require more memory. I recently attempted
to migrate a 1.5 GB table, and that was a bit too much for
my server to handle; openDbCopy appears to load the entire
table into memory before moving the data to the
destination.
I needed to split my table into 30 batches to get
openDbCopy to work, and defining those 30 batches in
openDbCopy was of course very time consuming.
I'll give it a shot next week, when I'll be working on a
server with 4 GB of RAM.
I haven't analysed your code in detail (and I haven't
worked with Java for 6-7 years, so I doubt I'd be able to
anyway), but I would suggest an alternative technique:
fetch the data row by row, collect NN rows, and then
batch-insert them into the destination database. This is
obviously slower, but this alternative access method would
only be used when required, i.e. when the user has actually
specified a limit (as a per-table option) and that limit is
below the number of rows in the table.
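
Roughly what I have in mind, in JDBC terms (my Java is
rusty, so take this as a sketch only; the table and column
names are made up, and I don't know how openDbCopy's copy
loop is actually structured):

    import java.sql.*;

    public class BatchedCopy {
        // Copy rows in batches of batchSize, so only batchSize rows
        // are ever held in memory at the same time.
        static void copy(Connection src, Connection dst, int batchSize)
                throws SQLException {
            dst.setAutoCommit(false);
            try (Statement read = src.createStatement(
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
                // Hint to the driver to stream rows instead of buffering
                // the whole result set; how well this works is driver-dependent.
                read.setFetchSize(batchSize);
                try (ResultSet rs = read.executeQuery(
                         "SELECT id, payload FROM big_table");
                     PreparedStatement write = dst.prepareStatement(
                         "INSERT INTO big_table (id, payload) VALUES (?, ?)")) {
                    int pending = 0;
                    while (rs.next()) {
                        write.setLong(1, rs.getLong(1));
                        write.setString(2, rs.getString(2));
                        write.addBatch();
                        if (++pending == batchSize) {
                            write.executeBatch();
                            dst.commit();
                            pending = 0;
                        }
                    }
                    if (pending > 0) { // flush the last, partial batch
                        write.executeBatch();
                        dst.commit();
                    }
                }
            }
        }
    }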
Another option would be to check whether the source RDBMS
supports LIMIT, and to use it where it does.
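
If LIMIT is available, the copy could page through the
table instead of streaming it. Another sketch, for the same
hypothetical class as above (MySQL/PostgreSQL syntax shown;
MSSQL and Oracle use TOP and ROWNUM instead, which is why
this can't be the only mechanism):

    static void copyPaged(Connection src, Connection dst, int batchSize)
            throws SQLException {
        // ORDER BY a key is needed so the pages are stable across queries.
        String page = "SELECT id, payload FROM big_table"
                    + " ORDER BY id LIMIT ? OFFSET ?";
        int offset = 0;
        int fetched;
        do {
            fetched = 0;
            try (PreparedStatement ps = src.prepareStatement(page)) {
                ps.setInt(1, batchSize);
                ps.setInt(2, offset);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        fetched++;
                        // batch-insert into dst exactly as in the sketch above
                    }
                }
            }
            offset += batchSize;
        } while (fetched == batchSize); // a short page means we are done
    }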
A third way is to work with filters: you could allow
several filters per table, where each filter selects a
disjoint set of rows (it would be the user's responsibility
to ensure this), and then run one batch per filter.
Currently, the filter functionality is not usable in my
case, since setting it up involves too much work.
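
To make the filter idea concrete: the filters could simply
be disjoint ranges over a numeric key. Generating them
would be easy; another hypothetical helper for the same
sketch, assuming a numeric key column:

    // Split [min, max] into up to `parts` disjoint key ranges. It stays
    // the user's responsibility that the key range covers the whole table.
    static java.util.List<String> rangeFilters(String keyColumn,
                                               long min, long max, int parts) {
        java.util.List<String> filters = new java.util.ArrayList<>();
        long step = (max - min + parts) / parts; // ceiling division
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step - 1, max);
            // BETWEEN is inclusive at both ends, so consecutive ranges
            // do not overlap.
            filters.add(keyColumn + " BETWEEN " + lo + " AND " + hi);
        }
        return filters;
    }

For example, rangeFilters("id", 1, 3000000, 30) would
produce 30 WHERE clauses for a job like mine (the numbers
here are made up).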
I had the idea of setting up a single filter and then
copy/pasting the relevant sections of the job XML file,
changing only the filter spec. Would this work? The XML
file is huge, as it contains all the mappings etc. per
plugin instance, so it was a bit difficult to get an
overview.
I don't think you can let the application decide the
maximum number of rows to fetch. I guess different drivers
allocate memory differently, so it would be impossible to
know in advance how much memory a row of a given table
would use (and even harder with varchar/binary columns).
I would need this for MSSQL and Oracle, but as you say,
this functionality should be RDBMS independent if possible.