Thread: [cx-oracle-users] speeding up TB data download
From: Chuck W. <chu...@ch...> - 2010-07-26 03:44:37
Hello -- I am trying to export data from an Oracle 10 database to a Netezza database. The Oracle database has ~100 tables (uncompressed ~5TB), which range in size from a few KB to 1TB.

I am using Python 2.7/cx_Oracle 5.0.4 to download the data as bzipped CSV files and then load them into Netezza. The code to download the data looks like this:

=====================
self.connection = cx_Oracle.Connection(conninfo)
...
cursor = self.connection.cursor()
cursor.numbersAsStrings = True
cursor.arraysize = 500
...
cursor.execute(query)

fil_out = bz2.BZ2File(lower("%s.csv.bz2" % fname), "wb")
csv.register_dialect("test", delimiter="|", quoting=csv.QUOTE_NONE, escapechar="\\", doublequote=False)
csv_out = csv.writer(fil_out, "test")

for row in cursor:
    for n, (pars, cell) in enumerate(zip(col_parser, row)):
        if cell:
            ...
        else:
            ...
    csv_out.writerow(rowparsed)
cursor.close()
=====================

The reason the data is being parsed is that I am trying to obtain the minimum and maximum field sizes so that I can make the Netezza schema efficient. Also, the reason I am using numbersAsStrings is so that I can easily get the size of the numbers.

The above method is transferring about 3.5GB/hour (uncompressed) or 0.6GB/hour (bz2 compressed). Using the multiprocessing library, I am able to spawn four processes on a quad-core machine, which definitely helps!

I was wondering whether there is any way to speed this up further?

Thanks.
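(For reference, a minimal sketch of a batched fetch loop of the kind discussed in this thread. This is not the poster's code: conninfo, query and fname below are placeholders, the per-cell parsing is omitted, and the sketch simply relies on cursor.fetchmany() returning cursor.arraysize rows per call together with csv's writerows(). Whether it helps depends on where the time actually goes; if the bottleneck is bz2 compression or the per-cell parsing, batching the fetch alone will not change much.)

=====================
import bz2
import csv

import cx_Oracle

conninfo = "user/password@host/service"   # placeholder, not the real DSN
query = "select * from some_table"        # placeholder query
fname = "SOME_TABLE"                      # placeholder file name stem

connection = cx_Oracle.Connection(conninfo)
cursor = connection.cursor()
cursor.numbersAsStrings = True
cursor.arraysize = 10000                  # larger batches, fewer round trips

csv.register_dialect("test", delimiter="|", quoting=csv.QUOTE_NONE,
                     escapechar="\\", doublequote=False)
fil_out = bz2.BZ2File(("%s.csv.bz2" % fname).lower(), "wb")
csv_out = csv.writer(fil_out, "test")

cursor.execute(query)
while True:
    rows = cursor.fetchmany()             # fetches cursor.arraysize rows
    if not rows:
        break
    csv_out.writerows(rows)               # write the whole batch in one call

cursor.close()
fil_out.close()
=====================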
From: Anthony T. <ant...@gm...> - 2010-07-26 13:55:40
Hi,

Just a suggestion: avoid the parsing completely if you can. If all you're trying to do is calculate the maximum size of fields, you can easily write something that does this:

select max(length(<COL_NAME>)) from <TABLE_NAME>

or, for numbers,

select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>

This sort of query is far more efficient and can be done independently of your extracts. You can quite easily generate all of these queries by reading from user_tab_columns, all_tab_columns or dba_tab_columns. If you need help with that, let me know.

Anthony

On Sun, Jul 25, 2010 at 9:44 PM, Chuck White <chu...@ch...> wrote:
> Hello -- I am trying to export data from an Oracle 10 database to a Netezza database. The Oracle database has ~100 tables (uncompressed ~5TB), which range in size from a few KB to 1TB.
>
> I am using Python 2.7/cx_Oracle 5.0.4 to download the data as bzipped CSV files and then load them into Netezza. The code to download the data looks like this:
>
> =====================
> self.connection = cx_Oracle.Connection(conninfo)
> ...
> cursor = self.connection.cursor()
> cursor.numbersAsStrings = True
> cursor.arraysize = 500
> ...
> cursor.execute(query)
>
> fil_out = bz2.BZ2File(lower("%s.csv.bz2" % fname), "wb")
> csv.register_dialect("test", delimiter="|", quoting=csv.QUOTE_NONE, escapechar="\\", doublequote=False)
> csv_out = csv.writer(fil_out, "test")
>
> for row in cursor:
>     for n, (pars, cell) in enumerate(zip(col_parser, row)):
>         if cell:
>             ...
>         else:
>             ...
>     csv_out.writerow(rowparsed)
> cursor.close()
> =====================
>
> The reason the data is being parsed is that I am trying to obtain the minimum and maximum field sizes so that I can make the Netezza schema efficient. Also, the reason I am using numbersAsStrings is so that I can easily get the size of the numbers.
>
> The above method is transferring about 3.5GB/hour (uncompressed) or 0.6GB/hour (bz2 compressed). Using the multiprocessing library, I am able to spawn four processes on a quad-core machine, which definitely helps!
>
> I was wondering whether there is any way to speed this up further?
>
> Thanks.
From: Chuck W. <chu...@ch...> - 2010-07-26 16:46:46
Thanks for your response. Can you please send me examples which query those tables (user_tab_columns, all_tab_columns or dba_tab_columns)?

---- Anthony Tuininga <ant...@gm...> wrote:
> Hi,
>
> Just a suggestion: avoid the parsing completely if you can. If
> all you're trying to do is calculate the maximum size of fields, you
> can easily write something that does this:
>
> select max(length(<COL_NAME>)) from <TABLE_NAME>
>
> or, for numbers,
>
> select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>
>
> This sort of query is far more efficient and can be done independently
> of your extracts. You can quite easily generate all of these queries
> by reading from user_tab_columns, all_tab_columns or dba_tab_columns.
> If you need help with that, let me know.
>
> Anthony
From: Anthony T. <ant...@gm...> - 2010-07-26 17:42:06
Sure. You want something like this -- untested, but it should give you the right idea.

connection = cx_Oracle.Connection(....)
columnCursor = connection.cursor()
queryCursor = connection.cursor()
columnCursor.execute("""
        select table_name, column_name, data_type
        from user_tab_columns
        order by table_name, column_id""")
for tableName, columnName, dataType in columnCursor:
    if dataType in ("CHAR", "VARCHAR2"):
        selectItem = "length(%s)" % columnName
    elif dataType == "NUMBER":
        selectItem = "length(to_char(%s))" % columnName
    else:
        continue
    queryCursor.execute("select max(%s) from %s" % (selectItem, tableName))
    result, = queryCursor.fetchone()
    print "Maximum length for %s.%s is %s" % (tableName, columnName, result)

Anthony

On Mon, Jul 26, 2010 at 10:46 AM, Chuck White <chu...@ch...> wrote:
> Thanks for your response. Can you please send me examples which query those tables (user_tab_columns, all_tab_columns or dba_tab_columns)?
>
> ---- Anthony Tuininga <ant...@gm...> wrote:
>> Hi,
>>
>> Just a suggestion: avoid the parsing completely if you can. If
>> all you're trying to do is calculate the maximum size of fields, you
>> can easily write something that does this:
>>
>> select max(length(<COL_NAME>)) from <TABLE_NAME>
>>
>> or, for numbers,
>>
>> select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>
>>
>> This sort of query is far more efficient and can be done independently
>> of your extracts. You can quite easily generate all of these queries
>> by reading from user_tab_columns, all_tab_columns or dba_tab_columns.
>> If you need help with that, let me know.
>>
>> Anthony
From: Chuck W. <chu...@ch...> - 2010-07-27 20:56:24
Thanks again for your response and code. We do have a view called all_tab_cols. The issue I am running into is that some of our tables have 3 billion+ records and 80+ columns, and it is taking 10+ minutes to compute the max on a single column.

It is trivial to generate SQL which will get the information for all columns in one pass:

select min(sz.fld1), max(sz.fld1), min(sz.fld2), max(sz.fld2), ...
from
(
    select length(to_char(fld1)) fld1, length(fld2) fld2, ...
    from tbl1
) sz

with length() or length(to_char()) chosen appropriately depending on the type.

Is there an alternative/cleaner approach? Thanks.

---- Anthony Tuininga <ant...@gm...> wrote:
> Sure. You want something like this -- untested, but it should give you
> the right idea.
>
> connection = cx_Oracle.Connection(....)
> columnCursor = connection.cursor()
> queryCursor = connection.cursor()
> columnCursor.execute("""
>         select table_name, column_name, data_type
>         from user_tab_columns
>         order by table_name, column_id""")
> for tableName, columnName, dataType in columnCursor:
>     if dataType in ("CHAR", "VARCHAR2"):
>         selectItem = "length(%s)" % columnName
>     elif dataType == "NUMBER":
>         selectItem = "length(to_char(%s))" % columnName
>     else:
>         continue
>     queryCursor.execute("select max(%s) from %s" % (selectItem, tableName))
>     result, = queryCursor.fetchone()
>     print "Maximum length for %s.%s is %s" % (tableName, columnName, result)
>
> Anthony
>
> On Mon, Jul 26, 2010 at 10:46 AM, Chuck White <chu...@ch...> wrote:
>> Thanks for your response. Can you please send me examples which query those tables (user_tab_columns, all_tab_columns or dba_tab_columns)?
>>
>> ---- Anthony Tuininga <ant...@gm...> wrote:
>>> Hi,
>>>
>>> Just a suggestion: avoid the parsing completely if you can. If
>>> all you're trying to do is calculate the maximum size of fields, you
>>> can easily write something that does this:
>>>
>>> select max(length(<COL_NAME>)) from <TABLE_NAME>
>>>
>>> or, for numbers,
>>>
>>> select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>
>>>
>>> This sort of query is far more efficient and can be done independently
>>> of your extracts. You can quite easily generate all of these queries
>>> by reading from user_tab_columns, all_tab_columns or dba_tab_columns.
>>> If you need help with that, let me know.
>>>
>>> Anthony
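(A minimal, untested sketch of one way such a single-pass query could be generated automatically, following the same user_tab_columns pattern as the script quoted above. The connection string is a placeholder, only CHAR/VARCHAR2/NUMBER columns are handled, and the variable names are illustrative rather than taken from the thread.)

import cx_Oracle

connection = cx_Oracle.Connection("user/password@host/service")   # placeholder DSN
columnCursor = connection.cursor()
queryCursor = connection.cursor()

columnCursor.execute("""
        select table_name, column_name, data_type
        from user_tab_columns
        order by table_name, column_id""")

# Group the length expressions per table so that each table is scanned once.
tables = {}
for tableName, columnName, dataType in columnCursor:
    if dataType in ("CHAR", "VARCHAR2"):
        expr = "length(%s) %s" % (columnName, columnName)
    elif dataType == "NUMBER":
        expr = "length(to_char(%s)) %s" % (columnName, columnName)
    else:
        continue
    tables.setdefault(tableName, []).append((columnName, expr))

for tableName, columns in tables.items():
    # Outer query: min/max of each length; inner query: one length column per field.
    outerItems = ", ".join("min(sz.%s), max(sz.%s)" % (name, name)
                           for name, _ in columns)
    innerItems = ", ".join(expr for _, expr in columns)
    sql = "select %s from (select %s from %s) sz" % (outerItems, innerItems, tableName)
    queryCursor.execute(sql)
    row = queryCursor.fetchone()
    # Results come back as (min1, max1, min2, max2, ...) in column order.
    for (name, _), minLen, maxLen in zip(columns, row[0::2], row[1::2]):
        print "%s.%s: min length %s, max length %s" % (tableName, name, minLen, maxLen)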
From: D.R. B. <da...@as...> - 2010-07-28 08:33:19
Hoi Chuck,

Although it's not using cx_Oracle, you might want to try Oracle SQL Developer:

http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html

It will allow you to export to .csv quickly and - if you're lucky - it may even allow you to migrate your data to Netezza directly. If you have a Netezza JDBC driver and it works, it's probably the simplest and fastest way to copy your data.

Good luck,

Danny

On Tue, Jul 27, 2010 at 04:56:16PM -0400, Chuck White wrote:
> Thanks again for your response and code. We do have a view called all_tab_cols. The issue I am running into is that some of our tables have 3 billion+ records and 80+ columns, and it is taking 10+ minutes to compute the max on a single column.
>
> It is trivial to generate SQL which will get the information for all columns in one pass:
>
> select min(sz.fld1), max(sz.fld1), min(sz.fld2), max(sz.fld2), ...
> from
> (
>     select length(to_char(fld1)) fld1, length(fld2) fld2, ...
>     from tbl1
> ) sz
>
> with length() or length(to_char()) chosen appropriately depending on the type.
>
> Is there an alternative/cleaner approach? Thanks.
>
> ---- Anthony Tuininga <ant...@gm...> wrote:
>> Sure. You want something like this -- untested, but it should give you
>> the right idea.
>>
>> connection = cx_Oracle.Connection(....)
>> columnCursor = connection.cursor()
>> queryCursor = connection.cursor()
>> columnCursor.execute("""
>>         select table_name, column_name, data_type
>>         from user_tab_columns
>>         order by table_name, column_id""")
>> for tableName, columnName, dataType in columnCursor:
>>     if dataType in ("CHAR", "VARCHAR2"):
>>         selectItem = "length(%s)" % columnName
>>     elif dataType == "NUMBER":
>>         selectItem = "length(to_char(%s))" % columnName
>>     else:
>>         continue
>>     queryCursor.execute("select max(%s) from %s" % (selectItem, tableName))
>>     result, = queryCursor.fetchone()
>>     print "Maximum length for %s.%s is %s" % (tableName, columnName, result)
>>
>> Anthony
>>
>> On Mon, Jul 26, 2010 at 10:46 AM, Chuck White <chu...@ch...> wrote:
>>> Thanks for your response. Can you please send me examples which query those tables (user_tab_columns, all_tab_columns or dba_tab_columns)?
>>>
>>> ---- Anthony Tuininga <ant...@gm...> wrote:
>>>> Hi,
>>>>
>>>> Just a suggestion: avoid the parsing completely if you can. If
>>>> all you're trying to do is calculate the maximum size of fields, you
>>>> can easily write something that does this:
>>>>
>>>> select max(length(<COL_NAME>)) from <TABLE_NAME>
>>>>
>>>> or, for numbers,
>>>>
>>>> select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>
>>>>
>>>> This sort of query is far more efficient and can be done independently
>>>> of your extracts. You can quite easily generate all of these queries
>>>> by reading from user_tab_columns, all_tab_columns or dba_tab_columns.
>>>> If you need help with that, let me know.
>>>>
>>>> Anthony