Thread: [cx-oracle-users] speeding up TB data download
From: Chuck W. <chu...@ch...> - 2010-07-26 03:44:37
Hello -- I am trying to export data from an Oracle 10 database to a Netezza database. The Oracle database has ~100 tables (uncompressed ~5TB), which range in size from a few KB to 1TB.

I am using Python 2.7/cx_Oracle 5.0.4 to download the data as bzipped CSV files and then load them into Netezza. The code to download the data looks like this:

=====================
self.connection = cx_Oracle.Connection(conninfo)
...
cursor = self.connection.cursor()
cursor.numbersAsStrings = True
cursor.arraysize = 500
...
cursor.execute(query)

fil_out = bz2.BZ2File(lower("%s.csv.bz2" % fname), "wb")
csv.register_dialect("test", delimiter="|", quoting=csv.QUOTE_NONE, escapechar="\\", doublequote=False)
csv_out = csv.writer(fil_out, "test")

for row in cursor:
    for n, (pars, cell) in enumerate(zip(col_parser, row)):
        if cell:
            ...
        else:
            ...
    csv_out.writerow(rowparsed)
cursor.close()
=====================

The reason the data is being parsed is that I am trying to obtain the minimum and maximum field sizes so that I can make the Netezza schema efficient. Also, the reason I am using numbersAsStrings is so that I can easily get the size of the numbers.

The above method is transferring about 3.5GB/hour (uncompressed) or 0.6GB/hour (bz2 compressed). Using the multiprocessing library, I am able to spawn four processes on a quad-core machine, which definitely helps!

I was wondering whether there is any way to speed this up further?

Thanks.
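(For reference, a minimal sketch of a batched fetch loop of the kind discussed in this thread. This is not the poster's code: conninfo, query and fname below are placeholders, the per-cell parsing is omitted, and the sketch simply relies on cursor.fetchmany() returning cursor.arraysize rows per call together with csv's writerows(). Whether it helps depends on where the time actually goes; if the bottleneck is bz2 compression or the per-cell parsing, batching the fetch alone will not change much.)

=====================
import bz2
import csv

import cx_Oracle

conninfo = "user/password@host/service"   # placeholder, not the real DSN
query = "select * from some_table"        # placeholder query
fname = "SOME_TABLE"                      # placeholder file name stem

connection = cx_Oracle.Connection(conninfo)
cursor = connection.cursor()
cursor.numbersAsStrings = True
cursor.arraysize = 10000                  # larger batches, fewer round trips

csv.register_dialect("test", delimiter="|", quoting=csv.QUOTE_NONE,
                     escapechar="\\", doublequote=False)
fil_out = bz2.BZ2File(("%s.csv.bz2" % fname).lower(), "wb")
csv_out = csv.writer(fil_out, "test")

cursor.execute(query)
while True:
    rows = cursor.fetchmany()             # fetches cursor.arraysize rows
    if not rows:
        break
    csv_out.writerows(rows)               # write the whole batch in one call

cursor.close()
fil_out.close()
=====================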
From: Anthony T. <ant...@gm...> - 2010-07-26 13:55:40
Hi,

Just a suggestion: avoid the parsing completely if you can. If all you're trying to do is calculate the maximum size of fields, you can easily write something that does this:

select max(length(<COL_NAME>)) from <TABLE_NAME>

or, for numbers,

select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>

This sort of query is far more efficient and can be done independently of your extracts. You can quite easily generate all of these queries by reading from user_tab_columns, all_tab_columns or dba_tab_columns. If you need help with that, let me know.

Anthony

On Sun, Jul 25, 2010 at 9:44 PM, Chuck White <chu...@ch...> wrote:
> Hello -- I am trying to export data from an Oracle 10 database to a Netezza database. The Oracle database has ~100 tables (uncompressed ~5TB), which range in size from a few KB to 1TB.
>
> I am using Python 2.7/cx_Oracle 5.0.4 to download the data as bzipped CSV files and then load them into Netezza. The code to download the data looks like this:
>
> =====================
> self.connection = cx_Oracle.Connection(conninfo)
> ...
> cursor = self.connection.cursor()
> cursor.numbersAsStrings = True
> cursor.arraysize = 500
> ...
> cursor.execute(query)
>
> fil_out = bz2.BZ2File(lower("%s.csv.bz2" % fname), "wb")
> csv.register_dialect("test", delimiter="|", quoting=csv.QUOTE_NONE, escapechar="\\", doublequote=False)
> csv_out = csv.writer(fil_out, "test")
>
> for row in cursor:
>     for n, (pars, cell) in enumerate(zip(col_parser, row)):
>         if cell:
>             ...
>         else:
>             ...
>     csv_out.writerow(rowparsed)
> cursor.close()
> =====================
>
> The reason the data is being parsed is that I am trying to obtain the minimum and maximum field sizes so that I can make the Netezza schema efficient. Also, the reason I am using numbersAsStrings is so that I can easily get the size of the numbers.
>
> The above method is transferring about 3.5GB/hour (uncompressed) or 0.6GB/hour (bz2 compressed). Using the multiprocessing library, I am able to spawn four processes on a quad-core machine, which definitely helps!
>
> I was wondering whether there is any way to speed this up further?
>
> Thanks.
From: Chuck W. <chu...@ch...> - 2010-07-26 16:46:46
Thanks for your response. Can you please send me examples which query those tables (user_tab_columns, all_tab_columns or dba_tab_columns)?

---- Anthony Tuininga <ant...@gm...> wrote:
> Hi,
>
> Just a suggestion: avoid the parsing completely if you can. If
> all you're trying to do is calculate the maximum size of fields, you
> can easily write something that does this:
>
> select max(length(<COL_NAME>)) from <TABLE_NAME>
>
> or, for numbers,
>
> select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>
>
> This sort of query is far more efficient and can be done independently
> of your extracts. You can quite easily generate all of these queries
> by reading from user_tab_columns, all_tab_columns or dba_tab_columns.
> If you need help with that, let me know.
>
> Anthony
From: Anthony T. <ant...@gm...> - 2010-07-26 17:42:06
Sure. You want something like this -- untested, but it should give you the right idea.

connection = cx_Oracle.Connection(....)
columnCursor = connection.cursor()
queryCursor = connection.cursor()
columnCursor.execute("""
        select table_name, column_name, data_type
        from user_tab_columns
        order by table_name, column_id""")
for tableName, columnName, dataType in columnCursor:
    if dataType in ("CHAR", "VARCHAR2"):
        selectItem = "length(%s)" % columnName
    elif dataType == "NUMBER":
        selectItem = "length(to_char(%s))" % columnName
    else:
        continue
    queryCursor.execute("select max(%s) from %s" % (selectItem, tableName))
    result, = queryCursor.fetchone()
    print "Maximum length for %s.%s is %s" % (tableName, columnName, result)

Anthony

On Mon, Jul 26, 2010 at 10:46 AM, Chuck White <chu...@ch...> wrote:
> Thanks for your response. Can you please send me examples which query those tables (user_tab_columns, all_tab_columns or dba_tab_columns)?
>
> ---- Anthony Tuininga <ant...@gm...> wrote:
>> Hi,
>>
>> Just a suggestion: avoid the parsing completely if you can. If
>> all you're trying to do is calculate the maximum size of fields, you
>> can easily write something that does this:
>>
>> select max(length(<COL_NAME>)) from <TABLE_NAME>
>>
>> or, for numbers,
>>
>> select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>
>>
>> This sort of query is far more efficient and can be done independently
>> of your extracts. You can quite easily generate all of these queries
>> by reading from user_tab_columns, all_tab_columns or dba_tab_columns.
>> If you need help with that, let me know.
>>
>> Anthony
From: Chuck W. <chu...@ch...> - 2010-07-27 20:56:24
Thanks again for your response and code. We do have a view called all_tab_cols. The issue I am running into is that some of our tables have 3 billion+ records and 80+ columns, and it is taking 10+ minutes to compute the max on a single column.

It is trivial to generate SQL which will get the information for all columns in one pass:

select min(sz.fld1), max(sz.fld1), min(sz.fld2), max(sz.fld2), ...
from
(
    select length(to_char(fld1)) fld1, length(fld2) fld2, ...
    from tbl1
) sz

with length() or length(to_char()) chosen appropriately depending on the type.

Is there an alternative/cleaner approach? Thanks.

---- Anthony Tuininga <ant...@gm...> wrote:
> Sure. You want something like this -- untested, but it should give you
> the right idea.
>
> connection = cx_Oracle.Connection(....)
> columnCursor = connection.cursor()
> queryCursor = connection.cursor()
> columnCursor.execute("""
>         select table_name, column_name, data_type
>         from user_tab_columns
>         order by table_name, column_id""")
> for tableName, columnName, dataType in columnCursor:
>     if dataType in ("CHAR", "VARCHAR2"):
>         selectItem = "length(%s)" % columnName
>     elif dataType == "NUMBER":
>         selectItem = "length(to_char(%s))" % columnName
>     else:
>         continue
>     queryCursor.execute("select max(%s) from %s" % (selectItem, tableName))
>     result, = queryCursor.fetchone()
>     print "Maximum length for %s.%s is %s" % (tableName, columnName, result)
>
> Anthony
>
> On Mon, Jul 26, 2010 at 10:46 AM, Chuck White <chu...@ch...> wrote:
>> Thanks for your response. Can you please send me examples which query those tables (user_tab_columns, all_tab_columns or dba_tab_columns)?
>>
>> ---- Anthony Tuininga <ant...@gm...> wrote:
>>> Hi,
>>>
>>> Just a suggestion: avoid the parsing completely if you can. If
>>> all you're trying to do is calculate the maximum size of fields, you
>>> can easily write something that does this:
>>>
>>> select max(length(<COL_NAME>)) from <TABLE_NAME>
>>>
>>> or, for numbers,
>>>
>>> select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>
>>>
>>> This sort of query is far more efficient and can be done independently
>>> of your extracts. You can quite easily generate all of these queries
>>> by reading from user_tab_columns, all_tab_columns or dba_tab_columns.
>>> If you need help with that, let me know.
>>>
>>> Anthony
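(A minimal, untested sketch of one way such a single-pass query could be generated automatically, following the same user_tab_columns pattern as the script quoted above. The connection string is a placeholder, only CHAR/VARCHAR2/NUMBER columns are handled, and the variable names are illustrative rather than taken from the thread.)

import cx_Oracle

connection = cx_Oracle.Connection("user/password@host/service")   # placeholder DSN
columnCursor = connection.cursor()
queryCursor = connection.cursor()

columnCursor.execute("""
        select table_name, column_name, data_type
        from user_tab_columns
        order by table_name, column_id""")

# Group the length expressions per table so that each table is scanned once.
tables = {}
for tableName, columnName, dataType in columnCursor:
    if dataType in ("CHAR", "VARCHAR2"):
        expr = "length(%s) %s" % (columnName, columnName)
    elif dataType == "NUMBER":
        expr = "length(to_char(%s)) %s" % (columnName, columnName)
    else:
        continue
    tables.setdefault(tableName, []).append((columnName, expr))

for tableName, columns in tables.items():
    # Outer query: min/max of each length; inner query: one length column per field.
    outerItems = ", ".join("min(sz.%s), max(sz.%s)" % (name, name)
                           for name, _ in columns)
    innerItems = ", ".join(expr for _, expr in columns)
    sql = "select %s from (select %s from %s) sz" % (outerItems, innerItems, tableName)
    queryCursor.execute(sql)
    row = queryCursor.fetchone()
    # Results come back as (min1, max1, min2, max2, ...) in column order.
    for (name, _), minLen, maxLen in zip(columns, row[0::2], row[1::2]):
        print "%s.%s: min length %s, max length %s" % (tableName, name, minLen, maxLen)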
From: D.R. B. <da...@as...> - 2010-07-28 08:33:19
Hoi Chuck,

Although it's not using cx_Oracle, you might want to try Oracle SQL Developer:

http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html

It will allow you to export to .csv quickly and - if you're lucky - it may even allow you to migrate your data to Netezza directly. If you have a Netezza JDBC driver and it works, it's probably the simplest and fastest way to copy your data.

Good luck,

Danny

On Tue, Jul 27, 2010 at 04:56:16PM -0400, Chuck White wrote:
> Thanks again for your response and code. We do have a view called all_tab_cols. The issue I am running into is that some of our tables have 3 billion+ records and 80+ columns, and it is taking 10+ minutes to compute the max on a single column.
>
> It is trivial to generate SQL which will get the information for all columns in one pass:
>
> select min(sz.fld1), max(sz.fld1), min(sz.fld2), max(sz.fld2), ...
> from
> (
>     select length(to_char(fld1)) fld1, length(fld2) fld2, ...
>     from tbl1
> ) sz
>
> with length() or length(to_char()) chosen appropriately depending on the type.
>
> Is there an alternative/cleaner approach? Thanks.
>
> ---- Anthony Tuininga <ant...@gm...> wrote:
>> Sure. You want something like this -- untested, but it should give you
>> the right idea.
>>
>> connection = cx_Oracle.Connection(....)
>> columnCursor = connection.cursor()
>> queryCursor = connection.cursor()
>> columnCursor.execute("""
>>         select table_name, column_name, data_type
>>         from user_tab_columns
>>         order by table_name, column_id""")
>> for tableName, columnName, dataType in columnCursor:
>>     if dataType in ("CHAR", "VARCHAR2"):
>>         selectItem = "length(%s)" % columnName
>>     elif dataType == "NUMBER":
>>         selectItem = "length(to_char(%s))" % columnName
>>     else:
>>         continue
>>     queryCursor.execute("select max(%s) from %s" % (selectItem, tableName))
>>     result, = queryCursor.fetchone()
>>     print "Maximum length for %s.%s is %s" % (tableName, columnName, result)
>>
>> Anthony
>>
>> On Mon, Jul 26, 2010 at 10:46 AM, Chuck White <chu...@ch...> wrote:
>>> Thanks for your response. Can you please send me examples which query those tables (user_tab_columns, all_tab_columns or dba_tab_columns)?
>>>
>>> ---- Anthony Tuininga <ant...@gm...> wrote:
>>>> Hi,
>>>>
>>>> Just a suggestion: avoid the parsing completely if you can. If
>>>> all you're trying to do is calculate the maximum size of fields, you
>>>> can easily write something that does this:
>>>>
>>>> select max(length(<COL_NAME>)) from <TABLE_NAME>
>>>>
>>>> or, for numbers,
>>>>
>>>> select max(length(to_char(<COL_NAME>))) from <TABLE_NAME>
>>>>
>>>> This sort of query is far more efficient and can be done independently
>>>> of your extracts. You can quite easily generate all of these queries
>>>> by reading from user_tab_columns, all_tab_columns or dba_tab_columns.
>>>> If you need help with that, let me know.
>>>>
>>>> Anthony