Hi,
My team is currently testing Postgres-XL for possible use as a BI database.
Our testing involves accumulating data files into one table. It works as follows (a simplified SQL sketch is given after the list):
1. We first load a CSV file into a staging table.
2. We then insert the data from this staging table into a "cumulative" table.
3. We then update (flag as obsolete) any former entries in the cumulative table that also exist in the staging table.
4. We then truncate the staging table and load the next file.
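For reference, here is a simplified SQL sketch of one load cycle. Table and column names (staging_cdr, cumulative_cdr, cdr_id, load_date, obsolete) are illustrative, not our real schema:

-- 1. load the CSV file into the staging table
COPY staging_cdr FROM '/data/in/file_001.csv' WITH (FORMAT csv, HEADER true);

-- 2. append the staged rows to the cumulative table
INSERT INTO cumulative_cdr (cdr_id, load_date, payload, obsolete)
SELECT cdr_id, load_date, payload, false
FROM staging_cdr;

-- 3. flag as obsolete any older cumulative entries that also exist in staging
UPDATE cumulative_cdr c
SET obsolete = true
FROM staging_cdr s
WHERE c.cdr_id = s.cdr_id
  AND c.load_date < s.load_date;

-- 4. empty the staging table before loading the next file
TRUNCATE staging_cdr;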
We see inconsistent results: when the files contain more than 100,000 rows, the third step (the UPDATE query) can run in a few seconds, or it can get stuck for half an hour on one specific load and then return to normal timing for the following executions. Example test results are shown below.
No issue:
+ Meta tables deployment in 0s
+ Test tables deployment in 0s
+ RBS_MOBCDR : initial load in ODS in 17s
+ RBS_MOBCDR : second load in 17s
+ RBS_MOBCDR : third load in 16s
+ RBS_MOBCDR : fourth load in 16s
+ RBS_MOBCDR : fifth load in 17s
+ RBS_MOBCDR : sixth load in 17s
+ RBS_MOBCDR : seventh load in 19s
+ RBS_MOBCDR : eighth load in 20s
+ RBS_MOBCDR : ninth load in 17s
+ RBS_MOBCDR : tenth load in 17s
TESTS ENDED SUCCESSFULLY
Slow at the fifth load:
+ Meta tables deployment in 0s
+ Test tables deployment in 0s
+ RBS_MOBCDR : initial load in 17s
+ RBS_MOBCDR : second load in 18s
+ RBS_MOBCDR : third load in 16s
+ RBS_MOBCDR : fourth load in 17s
+ RBS_MOBCDR : fifth load in 6291s
+ RBS_MOBCDR : sixth load in 17s
+ RBS_MOBCDR : seventh load in 17s
+ RBS_MOBCDR : eighth load in 17s
+ RBS_MOBCDR : ninth load in 16s
+ RBS_MOBCDR : tenth load in 17s
TESTS ENDED SUCCESSFULLY
Slow at the eighth load:
+ Meta tables deployment in 0s
+ Test tables deployment in 0s
+ RBS_MOBCDR : initial load in 16s
+ RBS_MOBCDR : second load in 16s
+ RBS_MOBCDR : third load in 19s
+ RBS_MOBCDR : fourth load in 16s
+ RBS_MOBCDR : fifth load in 17s
+ RBS_MOBCDR : sixth load in 16s
+ RBS_MOBCDR : seventh load in 19s
+ RBS_MOBCDR : eighth load in 6425s
+ RBS_MOBCDR : ninth load in 18s
+ RBS_MOBCDR : tenth load in 17s
Slow at the seventh and tenth loads:
+ Meta tables deployment in 0s
+ Test tables deployment in 0s
+ RBS_MOBCDR : initial load in 16s
+ RBS_MOBCDR : second load in 17s
+ RBS_MOBCDR : third load in 17s
+ RBS_MOBCDR : fourth load in 18s
+ RBS_MOBCDR : fifth load in 16s
+ RBS_MOBCDR : sixth load in 17s
+ RBS_MOBCDR : seventh load in 6394s
+ RBS_MOBCDR : eighth load in 17s
+ RBS_MOBCDR : ninth load in 16s
+ RBS_MOBCDR : tenth load in 5480s
We see several types of results:
- in some cases, there is no bottleneck and all 10 loads complete in 16-17s.
- most of the time, one load is very slow, but not always the same one; sometimes it is the 6th, 7th, 8th or 9th file loaded.
- we also noticed that sometimes two loads take more than an hour each.
We cannot find any errors in the server logs, and we did not notice any particular peaks in I/O rates.
Coordinator and datanode configuration:
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_min_duration_statement = 3000
log_checkpoints = off
log_line_prefix = '%t-%r-%x-%c '
log_statement = 'none'
auto_explain.log_min_duration=4000
checkpoint_segments = 128
shared_buffers = 2048MB
shared_queues = 64
shared_queue_size = 2048kB
wal_buffers = 16384kB
effective_cache_size = 3000MB
work_mem = 128MB # min 64kB
maintenance_work_mem = 256MB # min 1MB
max_stack_depth = 5MB
seq_page_cost = 0.8
random_page_cost = 1.5
listen_addresses = '*'
max_connections = 100
The current setup is 2 datanodes, 1 coordinator and 1 GTM running on the same server, with 16 GB of RAM and a 1 TB SSD disk.
We have two processes (one per datanode) each taking 100% CPU:
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
25516 postgres 20 0 2168M 181M 172M R 100. 1.1 26:49.96 postgres: harmonie pgbi X.X.X.X(35501) REMOTE SUBPLAN
25518 postgres 20 0 2168M 190M 180M R 100. 1.2 26:50.45 postgres: harmonie pgbi X.X.X.X (50993) REMOTE SUBPLAN
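As a side note, a generic way to see what such backends are running (not part of our test script) is to look the PIDs up in pg_stat_activity on the node that owns them:

-- map the busy backend PIDs to their queries (column names as in PostgreSQL 9.2+)
SELECT pid, state, query_start, query
FROM pg_stat_activity
WHERE pid IN (25516, 25518);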
After some days of analysis, we found that autovacuum was causing this bottleneck. After changing the autovacuum configuration, we no longer face this issue. You can close this ticket.
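For anyone hitting the same behaviour: the tuning is done via the autovacuum parameters in postgresql.conf. The values below are only an illustrative sketch of the kind of settings involved, not necessarily the exact values we kept:

# illustrative autovacuum tuning, not our exact final values
autovacuum = on
autovacuum_naptime = 60s                 # how often the autovacuum launcher wakes up
autovacuum_vacuum_scale_factor = 0.05    # fraction of a table that must change before vacuum triggers
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_cost_delay = 10ms      # less throttling per autovacuum worker
autovacuum_vacuum_cost_limit = 2000      # more work allowed per cost cycle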