SymmetricDS / Discussion / Open Discussion: Easy way to makes batch extract faster, in PostgreSQL?

Hi,

We are using SymmetricDS 3.5.3 and PostgreSQL 9.3.

In db logs we can find execution details for SDS query 'selectEventDataToExtractSql' (defined in class DataServiceSqlMap)

[2014-04-10 10:27:11.526 GMT] symmetricds@127.0.0.1(51609):si_euro [2909] 53467086.b5d/261:0 LOG:  duration: 153.243 ms  execute S_15496/C_15497: select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id from sym_data d inner join sym_data_event e on d.data_id = e.data_id inner join sym_outgoing_batch o on o.batch_id=e.batch_id where o.batch_id = $1 and o.node_id = $2    order by d.data_id asc 
[2014-04-10 10:27:11.526 GMT] symmetricds@127.0.0.1(51609):si_euro [2909] 53467086.b5d/262:0 DETAIL:  parameters: $1 = '14994635', $2 = '000'

As you can see was executed in 153.243 ms. When I'm trying doing the same in psgl, it's more quicker:

explain analyse select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id 
from sym_data d 
inner join sym_data_event e on d.data_id = e.data_id 
inner join sym_outgoing_batch o on o.batch_id=e.batch_id 
where o.batch_id = '14994635' and o.node_id = '000' order by d.data_id asc;

RESULT:

"Sort  (cost=309.86..309.92 rows=24 width=1011) (actual time=0.183..0.183 rows=2 loops=1)"
"  Sort Key: d.data_id"
"  Sort Method: quicksort  Memory: 26kB"
"  ->  Nested Loop  (cost=1.27..309.31 rows=24 width=1011) (actual time=0.146..0.163 rows=2 loops=1)"
"        ->  Index Only Scan using sym_outgoing_batch_pkey on sym_outgoing_batch o  (cost=0.42..8.44 rows=1 width=8) (actual time=0.026..0.027 rows=1 loops=1)"
"              Index Cond: ((batch_id = 14994635::bigint) AND (node_id = '000'::text))"
"              Heap Fetches: 0"
"        ->  Nested Loop  (cost=0.85..300.63 rows=24 width=1019) (actual time=0.116..0.128 rows=2 loops=1)"
"              ->  Index Scan using sym_idx_de_batchid on sym_data_event e  (cost=0.42..97.71 rows=24 width=37) (actual time=0.080..0.082 rows=2 loops=1)"
"                    Index Cond: (batch_id = 14994635::bigint)"
"              ->  Index Scan using sym_idx_d_channel_id on sym_data d  (cost=0.42..8.45 rows=1 width=990) (actual time=0.018..0.018 rows=1 loops=2)"
"                    Index Cond: (data_id = e.data_id)"
"Total runtime: 0.251 ms"

Even using prepare statement it's still faster:

prepare sds as select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id 
from sym_data d 
inner join sym_data_event e on d.data_id = e.data_id 
inner join sym_outgoing_batch o on o.batch_id=e.batch_id 
where o.batch_id = $1 and o.node_id = $2 order by d.data_id asc;

explain analyze execute sds('14994635', '000');

RESULT:

"Sort  (cost=309.86..309.92 rows=24 width=1011) (actual time=0.072..0.072 rows=2 loops=1)"
"  Sort Key: d.data_id"
"  Sort Method: quicksort  Memory: 26kB"
"  ->  Nested Loop  (cost=1.27..309.31 rows=24 width=1011) (actual time=0.039..0.055 rows=2 loops=1)"
"        ->  Index Only Scan using sym_outgoing_batch_pkey on sym_outgoing_batch o  (cost=0.42..8.44 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1)"
"              Index Cond: ((batch_id = 14994635::bigint) AND (node_id = '000'::text))"
"              Heap Fetches: 0"
"        ->  Nested Loop  (cost=0.85..300.63 rows=24 width=1019) (actual time=0.021..0.033 rows=2 loops=1)"
"              ->  Index Scan using sym_idx_de_batchid on sym_data_event e  (cost=0.42..97.71 rows=24 width=37) (actual time=0.010..0.011 rows=2 loops=1)"
"                    Index Cond: (batch_id = 14994635::bigint)"
"              ->  Index Scan using sym_idx_d_channel_id on sym_data d  (cost=0.42..8.45 rows=1 width=990) (actual time=0.005..0.007 rows=1 loops=2)"
"                    Index Cond: (data_id = e.data_id)"
"Total runtime: 0.156 ms"

In class JdbcSqlReadCursor, after prepareStatement, you are setting values using sqlTemplate.setValues(..). In this method org.springframework.jdbc.core.StatementCreatorUtils.setParameterValue(..) to determine sqlType of parameters. I guess that this is Types.NUMERIC for batch_id and Types.VARCHAR for node_id.

Doing the same in psql, query execution is slower:

prepare sds_de (decimal, varchar) as select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id 
from sym_data d 
inner join sym_data_event e on d.data_id = e.data_id 
inner join sym_outgoing_batch o on o.batch_id=e.batch_id 
where o.batch_id = $1 and o.node_id = $2 order by d.data_id asc;

explain analyze execute sds_de('14994635', '000');

RESULT:

"Sort  (cost=38935.59..38948.15 rows=5026 width=1007) (actual time=1293.282..1293.284 rows=2 loops=1)"
"  Sort Key: d.data_id"
"  Sort Method: quicksort  Memory: 26kB"
"  ->  Nested Loop  (cost=5142.56..38626.61 rows=5026 width=1007) (actual time=306.025..1293.268 rows=2 loops=1)"
"        ->  Hash Join  (cost=5142.13..34751.16 rows=5078 width=29) (actual time=306.004..1293.234 rows=2 loops=1)"
"              Hash Cond: (e.batch_id = o.batch_id)"
"              ->  Seq Scan on sym_data_event e  (cost=0.00..25741.91 rows=1017691 width=37) (actual time=0.005..615.269 rows=1019325 loops=1)"
"              ->  Hash  (cost=5136.37..5136.37 rows=461 width=8) (actual time=31.920..31.920 rows=1 loops=1)"
"                    Buckets: 1024  Batches: 1  Memory Usage: 1kB"
"                    ->  Index Only Scan using sym_outgoing_batch_pkey on sym_outgoing_batch o  (cost=0.42..5136.37 rows=461 width=8) (actual time=29.833..31.915 rows=1 loops=1)"
"                          Index Cond: (node_id = '000'::text)"
"                          Filter: ((batch_id)::numeric = 14994635::numeric)"
"                          Rows Removed by Filter: 92302"
"                          Heap Fetches: 1415"
"        ->  Index Scan using sym_data_pkey on sym_data d  (cost=0.42..0.75 rows=1 width=986) (actual time=0.009..0.010 rows=1 loops=2)"
"              Index Cond: (data_id = e.data_id)"
"Total runtime: 1293.347 ms"

Is it possible to change batch_id type to Types.BIGINT, to optimize this query execution from even 0,5 sec to less then 1 ms?

Thanks in advance for any answer!

Last edit: Pawel Omelko 2014-04-25

Eric Long - 2014-04-10

Does it help if you add the "protocolVersion=2" parameter to your database connection URL? For example, in your SymmetricDS properties file, it might look like this:

db.url=jdbc:postgresql://localhost/corp?protocolVersion=2&stringtype=unspecified

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pawel Omelko - 2014-04-11

We already tried this with no effect.

I did some more investigation and I found that in class DataService, method selectDataFor(Batch batch) is execution of query 'selectEventDataToExtractSql' with parameters BatchId and TargetNodeId. For first param there is defined sqlType as Types.NUMERIC. Query plan for this parameters is same as for decimal.

prepare sds_nu(numeric, varchar) as select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id from sym_data d inner join sym_data_event e on d.data_id = e.data_id inner join sym_outgoing_batch o on o.batch_id=e.batch_id where o.batch_id = $1 and o.node_id = $2 order by d.data_id asc; explain analyze execute sds_nu('14994635', '000');

On the other hand SDS generate sym_outgoing_batch table with batch_id column type set to bigint. So why there is no Types.BIGINT set for query 'selectEventDataToExtractSql'?

Is it possible to change this settings in next SDS version, customize this type in some way? As you can see execution without types can be thousands times faster.

Last edit: Pawel Omelko 2014-04-11
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Hi again,

Waiting for yours answer, I modified method selectDataFor(Batch batch) from class DataService to change parameter type from Types.NUMERIC to Types.BIGINT. After that I made symmetric-core-3.5.3.jar and replaced original jar by it in our SDS instalations.

The result was incredible, execution time of 'selectEventDataToExtractSql' query was less then 1 ms - before change about 30-160 ms (depends on current query plany). Without that change execution time was even 0,5 sec when sym_data has about 3 millions data!.

Average extract time (from sym_outgoing_batch.extract_millis) decreased from about 200 to 10 ms.

select channel_id, round(avg(extract_millis), 2) as extract,
count(*),
round(avg(byte_count), 0) as bytes
from sym_outgoing_batch
where last_update_time >='2014-04-15 12:00'
and last_update_time <='2014-04-15 13:00'
group by channel_id order by channel_id;

RESULT

channel	extract	count	bytes
ch_1	11.81	820	5969
ch_2	9.31	52	382
ch_3	7.67	3	691
ch_4	8.47	664	660
ch_5	10.00	7	850
ch_6	12.03	302	2418
ch_7	8.00	754	1383
ch_8	8.55	434	1053
ch_9	11.12	254	697
ch_10	15.48	278	13045
ch_11	7.09	32	1121
heartbeat	3.13	8	333

Is it possible to change type Types.BIGINT for described parameter in next SDS version? As you can see performance impact is huge.

Thanks in advance.

Last edit: Pawel Omelko 2014-04-16

Chris Henson - 2014-04-21

Thanks for the suggestion. We'll incorporate the improvement in the next version of SymmetricDS.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Chris Henson - 2014-04-21

http://www.symmetricds.org/issues/view.php?id=1684

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pawel Omelko - 2014-04-22

Thanks Chris for your reply!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Easy way to makes batch extract faster, in PostgreSQL?

Database Replication, Filtered Synchronization, and Transformation

Forums

Help

Easy way to makes batch extract faster, in PostgreSQL?

Easy way to makes batch extract faster, in PostgreSQL?

Database Replication, Filtered Synchronization, and Transformation

Forums

Help

Easy way to makes batch extract faster, in PostgreSQL? document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

Easy way to makes batch extract faster, in PostgreSQL?