JDBC batch updates are faster than a single update per row. I saw up to a 5x performance improvement in some cases on our cluster. I just replaced 'executeUpdate()' with 'addBatch()' and 'executeBatch()', and I call 'executeBatch()' once per file. This is a straightforward implementation: I do not consider batch size, commit size, or anything like that, but it has worked fine so far.
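For anyone following along, here is a rough sketch of the addBatch()/executeBatch() pattern with an optional batch size. It is not the actual Cloudbase code: the class name BatchInsertSketch, the method insertBatched, and the batchSize parameter are made up for illustration, and it assumes every column is bound as a string.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper, not part of Cloudbase or Hadoop.
public class BatchInsertSketch {

    /**
     * Queues each row with addBatch() and flushes with executeBatch()
     * every batchSize rows, plus once at the end for any remainder.
     * Returns the number of executeBatch() calls that were made.
     */
    public static int insertBatched(PreparedStatement ps, List<String[]> rows, int batchSize)
            throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int flushes = 0;
        int pending = 0;
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                ps.setString(i + 1, row[i]); // JDBC parameter indexes are 1-based
            }
            ps.addBatch(); // queue the row instead of calling executeUpdate()
            pending++;
            if (pending == batchSize) {
                ps.executeBatch(); // send the queued rows in one call
                flushes++;
                pending = 0;
            }
        }
        if (pending > 0) {
            ps.executeBatch(); // flush the rows left over after the last full batch
            flushes++;
        }
        return flushes;
    }
}
```

Passing a batchSize at least as large as the number of rows in a file gives the "one executeBatch() per file" behaviour described above. Whether a smaller value helps probably depends on the JDBC driver and the database, so it would need measuring.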
I think we can configure the batch size for export in Cloudbase's configuration.
That's a good point. Further, I was thinking of doing the inserts into the RDBMS in parallel: once the mappers or reducers finish execution, they push the data directly into the RDBMS.
I guess a new output format, DBOutputFormat, can be created for this. Hadoop already has something like that, but I have not looked at the code yet. If it is suitable, we can use it; otherwise we create a new class. In that class, we can make use of your suggestion and use executeBatch() instead of executeUpdate().
Any thoughts?
You're right. DBOutputFormat would be better because it is a straightforward interface built on Hadoop's MapReduce API. But I think it is not urgent; CB's current implementation works fine.