Menu

Easy way to makes batch extract faster, in PostgreSQL?

2014-04-10
2014-04-22
  • Pawel Omelko

    Pawel Omelko - 2014-04-10

    Hi,

    We are using SymmetricDS 3.5.3 and PostgreSQL 9.3.

    In db logs we can find execution details for SDS query 'selectEventDataToExtractSql' (defined in class DataServiceSqlMap)

    [2014-04-10 10:27:11.526 GMT] symmetricds@127.0.0.1(51609):si_euro [2909] 53467086.b5d/261:0 LOG:  duration: 153.243 ms  execute S_15496/C_15497: select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id from sym_data d inner join sym_data_event e on d.data_id = e.data_id inner join sym_outgoing_batch o on o.batch_id=e.batch_id where o.batch_id = $1 and o.node_id = $2    order by d.data_id asc 
    [2014-04-10 10:27:11.526 GMT] symmetricds@127.0.0.1(51609):si_euro [2909] 53467086.b5d/262:0 DETAIL:  parameters: $1 = '14994635', $2 = '000'
    

    As you can see was executed in 153.243 ms. When I'm trying doing the same in psgl, it's more quicker:

    explain analyse select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id 
    from sym_data d 
    inner join sym_data_event e on d.data_id = e.data_id 
    inner join sym_outgoing_batch o on o.batch_id=e.batch_id 
    where o.batch_id = '14994635' and o.node_id = '000' order by d.data_id asc;
    
    RESULT:
    
    "Sort  (cost=309.86..309.92 rows=24 width=1011) (actual time=0.183..0.183 rows=2 loops=1)"
    "  Sort Key: d.data_id"
    "  Sort Method: quicksort  Memory: 26kB"
    "  ->  Nested Loop  (cost=1.27..309.31 rows=24 width=1011) (actual time=0.146..0.163 rows=2 loops=1)"
    "        ->  Index Only Scan using sym_outgoing_batch_pkey on sym_outgoing_batch o  (cost=0.42..8.44 rows=1 width=8) (actual time=0.026..0.027 rows=1 loops=1)"
    "              Index Cond: ((batch_id = 14994635::bigint) AND (node_id = '000'::text))"
    "              Heap Fetches: 0"
    "        ->  Nested Loop  (cost=0.85..300.63 rows=24 width=1019) (actual time=0.116..0.128 rows=2 loops=1)"
    "              ->  Index Scan using sym_idx_de_batchid on sym_data_event e  (cost=0.42..97.71 rows=24 width=37) (actual time=0.080..0.082 rows=2 loops=1)"
    "                    Index Cond: (batch_id = 14994635::bigint)"
    "              ->  Index Scan using sym_idx_d_channel_id on sym_data d  (cost=0.42..8.45 rows=1 width=990) (actual time=0.018..0.018 rows=1 loops=2)"
    "                    Index Cond: (data_id = e.data_id)"
    "Total runtime: 0.251 ms"
    

    Even using prepare statement it's still faster:

    prepare sds as select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id 
    from sym_data d 
    inner join sym_data_event e on d.data_id = e.data_id 
    inner join sym_outgoing_batch o on o.batch_id=e.batch_id 
    where o.batch_id = $1 and o.node_id = $2 order by d.data_id asc;
    
    explain analyze execute sds('14994635', '000');
    
    RESULT:
    
    "Sort  (cost=309.86..309.92 rows=24 width=1011) (actual time=0.072..0.072 rows=2 loops=1)"
    "  Sort Key: d.data_id"
    "  Sort Method: quicksort  Memory: 26kB"
    "  ->  Nested Loop  (cost=1.27..309.31 rows=24 width=1011) (actual time=0.039..0.055 rows=2 loops=1)"
    "        ->  Index Only Scan using sym_outgoing_batch_pkey on sym_outgoing_batch o  (cost=0.42..8.44 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1)"
    "              Index Cond: ((batch_id = 14994635::bigint) AND (node_id = '000'::text))"
    "              Heap Fetches: 0"
    "        ->  Nested Loop  (cost=0.85..300.63 rows=24 width=1019) (actual time=0.021..0.033 rows=2 loops=1)"
    "              ->  Index Scan using sym_idx_de_batchid on sym_data_event e  (cost=0.42..97.71 rows=24 width=37) (actual time=0.010..0.011 rows=2 loops=1)"
    "                    Index Cond: (batch_id = 14994635::bigint)"
    "              ->  Index Scan using sym_idx_d_channel_id on sym_data d  (cost=0.42..8.45 rows=1 width=990) (actual time=0.005..0.007 rows=1 loops=2)"
    "                    Index Cond: (data_id = e.data_id)"
    "Total runtime: 0.156 ms"
    

    In class JdbcSqlReadCursor, after prepareStatement, you are setting values using sqlTemplate.setValues(..). In this method org.springframework.jdbc.core.StatementCreatorUtils.setParameterValue(..) to determine sqlType of parameters. I guess that this is Types.NUMERIC for batch_id and Types.VARCHAR for node_id.

    Doing the same in psql, query execution is slower:

    prepare sds_de (decimal, varchar) as select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id 
    from sym_data d 
    inner join sym_data_event e on d.data_id = e.data_id 
    inner join sym_outgoing_batch o on o.batch_id=e.batch_id 
    where o.batch_id = $1 and o.node_id = $2 order by d.data_id asc;
    
    explain analyze execute sds_de('14994635', '000');
    
    RESULT:
    
    "Sort  (cost=38935.59..38948.15 rows=5026 width=1007) (actual time=1293.282..1293.284 rows=2 loops=1)"
    "  Sort Key: d.data_id"
    "  Sort Method: quicksort  Memory: 26kB"
    "  ->  Nested Loop  (cost=5142.56..38626.61 rows=5026 width=1007) (actual time=306.025..1293.268 rows=2 loops=1)"
    "        ->  Hash Join  (cost=5142.13..34751.16 rows=5078 width=29) (actual time=306.004..1293.234 rows=2 loops=1)"
    "              Hash Cond: (e.batch_id = o.batch_id)"
    "              ->  Seq Scan on sym_data_event e  (cost=0.00..25741.91 rows=1017691 width=37) (actual time=0.005..615.269 rows=1019325 loops=1)"
    "              ->  Hash  (cost=5136.37..5136.37 rows=461 width=8) (actual time=31.920..31.920 rows=1 loops=1)"
    "                    Buckets: 1024  Batches: 1  Memory Usage: 1kB"
    "                    ->  Index Only Scan using sym_outgoing_batch_pkey on sym_outgoing_batch o  (cost=0.42..5136.37 rows=461 width=8) (actual time=29.833..31.915 rows=1 loops=1)"
    "                          Index Cond: (node_id = '000'::text)"
    "                          Filter: ((batch_id)::numeric = 14994635::numeric)"
    "                          Rows Removed by Filter: 92302"
    "                          Heap Fetches: 1415"
    "        ->  Index Scan using sym_data_pkey on sym_data d  (cost=0.42..0.75 rows=1 width=986) (actual time=0.009..0.010 rows=1 loops=2)"
    "              Index Cond: (data_id = e.data_id)"
    "Total runtime: 1293.347 ms"
    

    Is it possible to change batch_id type to Types.BIGINT, to optimize this query execution from even 0,5 sec to less then 1 ms?

    Thanks in advance for any answer!

     

    Last edit: Pawel Omelko 2014-04-25
  • Eric Long

    Eric Long - 2014-04-10

    Does it help if you add the "protocolVersion=2" parameter to your database connection URL? For example, in your SymmetricDS properties file, it might look like this:

    db.url=jdbc:postgresql://localhost/corp?protocolVersion=2&stringtype=unspecified

     
  • Pawel Omelko

    Pawel Omelko - 2014-04-11

    We already tried this with no effect.

    I did some more investigation and I found that in class DataService, method selectDataFor(Batch batch) is execution of query 'selectEventDataToExtractSql' with parameters BatchId and TargetNodeId. For first param there is defined sqlType as Types.NUMERIC. Query plan for this parameters is same as for decimal.

    prepare sds_nu(numeric, varchar) as select d.data_id, d.table_name, d.event_type, d.row_data, d.pk_data, d.old_data, d.create_time, d.trigger_hist_id, d.channel_id, d.transaction_id, d.source_node_id, d.external_data, d.node_list, e.router_id 
    from sym_data d 
    inner join sym_data_event e on d.data_id = e.data_id 
    inner join sym_outgoing_batch o on o.batch_id=e.batch_id 
    where o.batch_id = $1 and o.node_id = $2 order by d.data_id asc;
    
    explain analyze execute sds_nu('14994635', '000');
    

    On the other hand SDS generate sym_outgoing_batch table with batch_id column type set to bigint. So why there is no Types.BIGINT set for query 'selectEventDataToExtractSql'?

    Is it possible to change this settings in next SDS version, customize this type in some way? As you can see execution without types can be thousands times faster.

     

    Last edit: Pawel Omelko 2014-04-11
  • Pawel Omelko

    Pawel Omelko - 2014-04-16

    Hi again,

    Waiting for yours answer, I modified method selectDataFor(Batch batch) from class DataService to change parameter type from Types.NUMERIC to Types.BIGINT. After that I made symmetric-core-3.5.3.jar and replaced original jar by it in our SDS instalations.

    The result was incredible, execution time of 'selectEventDataToExtractSql' query was less then 1 ms - before change about 30-160 ms (depends on current query plany). Without that change execution time was even 0,5 sec when sym_data has about 3 millions data!.

    Average extract time (from sym_outgoing_batch.extract_millis) decreased from about 200 to 10 ms.

    select channel_id, round(avg(extract_millis), 2) as extract,
    count(*),
    round(avg(byte_count), 0) as bytes
    from sym_outgoing_batch
    where last_update_time >='2014-04-15 12:00'
    and last_update_time <='2014-04-15 13:00'
    group by channel_id order by channel_id;
    

    RESULT

    channel extract count bytes
    ch_1 11.81 820 5969
    ch_2 9.31 52 382
    ch_3 7.67 3 691
    ch_4 8.47 664 660
    ch_5 10.00 7 850
    ch_6 12.03 302 2418
    ch_7 8.00 754 1383
    ch_8 8.55 434 1053
    ch_9 11.12 254 697
    ch_10 15.48 278 13045
    ch_11 7.09 32 1121
    heartbeat 3.13 8 333

    Is it possible to change type Types.BIGINT for described parameter in next SDS version? As you can see performance impact is huge.

    Thanks in advance.

     

    Last edit: Pawel Omelko 2014-04-16
  • Chris Henson

    Chris Henson - 2014-04-21

    Thanks for the suggestion. We'll incorporate the improvement in the next version of SymmetricDS.

     
  • Pawel Omelko

    Pawel Omelko - 2014-04-22

    Thanks Chris for your reply!

     

Log in to post a comment.