Blazegraph DataLoader Performance

2016-10-24
2016-10-28
  • Alexey Emtsov

    Alexey Emtsov - 2016-10-24

    Hello!
    We are trying to load about 6 billion triples into the database, but during data loading we ran into a performance degradation problem. A new database was created, and
    at the beginning performance was quite good -

    ... - file:: 46964 stmts added in 0.519 secs, rate= 90489, commitLatency=0ms, {failSet=0,goodSet=1}; totals:: 1324166 stmts added in 21.805 secs, rate= 60727, commitLatency=0ms, {failSet=0,goodSet=31};
    ... - file:: 41112 stmts added in 0.454 secs, rate= 90555, commitLatency=0ms, {failSet=0,goodSet=1}; totals:: 1365278 stmts added in 22.259 secs, rate= 61335, commitLatency=0ms, {failSet=0,goodSet=32}
    ... - file:: 63538 stmts added in 0.892 secs, rate= 71230, commitLatency=0ms, {failSet=0,goodSet=1}; totals:: 1868122 stmts added in 28.763 secs, rate= 64946, commitLatency=0ms, {failSet=0,goodSet=48};
    

    But performance degraded steadily over time, and after loading only about 200 million statements the DataLoader was reporting the following -

    ... - file:: 66476 stmts added in 35.012 secs, rate= 1898, commitLatency=0ms, {failSet=0,goodSet=1}; totals:: 190802 stmts added in 108.809 secs, rate= 1753, commitLatency=0ms, {failSet=0,goodSet=3};
    ... - file:: 61795 stmts added in 28.199 secs, rate= 2191, commitLatency=0ms, {failSet=0,goodSet=1}; totals:: 252597 stmts added in 137.008 secs, rate= 1843, commitLatency=0ms, {failSet=0,goodSet=4}; 
    ... - file:: 62623 stmts added in 24.992 secs, rate= 2505, commitLatency=0ms, {failSet=0,goodSet=1}; totals:: 315220 stmts added in 162.0 secs, rate= 1945, commitLatency=0ms, {failSet=0,goodSet=5};
    

    We use a server running Red Hat Linux 6.6 with 128 GB of RAM and a couple of SSD disks. CPU speed and core count do not look like the bottleneck, because overall CPU load did not exceed 20% during processing.

    Here are the command line and properties file we use

    *** command line 
    java -server -Xmx64g -XX:+UseG1GC -XX:MaxDirectMemorySize=16000m -Dlog4j.configuration=log4j.properties -Djetty.port=9996 -Dbigdata.propertyFile=RWStore.properties -jar blazegraph.jar
    
    *** property file
    com.bigdata.journal.AbstractJournal.file=bigdata.jnl
    com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
    com.bigdata.service.AbstractTransactionService.minReleaseAge=0
    com.bigdata.btree.writeRetentionQueue.capacity=50000
    com.bigdata.btree.BTree.branchingFactor=1024
    com.bigdata.rdf.sail.bufferCapacity=500000
    com.bigdata.rdf.sail.truthMaintenance=false
    com.bigdata.rdf.store.AbstractTripleStore.quads=true
    com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=false
    com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.NoAxioms
    com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
    com.bigdata.journal.AbstractJournal.writeCacheBufferCount=14000
    
    *** dataloader properties
    com.bigdata.rdf.store.DataLoader.flush=true
    com.bigdata.rdf.store.DataLoader.bufferCapacity=30000
    com.bigdata.rdf.store.DataLoader.queueCapacity=5
    com.bigdata.rdf.store.DataLoader.commit=Batch
    com.bigdata.rdf.store.DataLoader.verbose=1
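    For context, these DataLoader properties are picked up when the bulk loader utility is run directly against the journal. A typical invocation looks roughly like the following (the exact flags depend on the Blazegraph version, so treat this as a sketch rather than verified usage; `/path/to/data` is a placeholder for the directory of RDF files):

    java -cp blazegraph.jar com.bigdata.rdf.store.DataLoader \
        -verbose RWStore.properties /path/to/data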
    

    We also adjusted all branching factors based on the recommendations provided by the status?dumpPages&dumpJournal command

    Here is also some information about the data distribution across the RWStore and blobs

    -------------------------
    RWStore Allocator Summary
    -------------------------
    AllocatorSize      AllocatorCount   SlotsAllocated  %SlotsAllocated    SlotsRecycled        SlotChurn       SlotsInUse      %SlotsInUse   MeanAllocation    SlotsReserved     %SlotsUnused    BytesReserved     BytesAppData       %SlotWaste         %AppData       %StoreFile      %TotalWaste       %FileWaste 
    64                           5127         51341146            30.01         21811089             1.40         29530057            40.24               40         36746240            19.64       2351759360       1889923648            19.64             3.75             4.31            10.97              0.85 
    128                          5266         37829689            22.11           312997             1.00         37516692            51.13               69         37743616             0.60       4831182848       4802136576             0.60             9.53             8.85             0.69              0.05 
    192                             8           343335             0.20           310715             5.99            32620             0.04              160            57344            43.12         11010048          6263040            43.12             0.01             0.02             0.11              0.01 
    320                            12           681628             0.40           635454             7.92            46174             0.06              255            86016            46.32         27525120         14775680            46.32             0.03             0.05             0.30              0.02 
    512                            15           966104             0.56           904292             8.99            61812             0.08              416           107520            42.51         55050240         31647744            42.51             0.06             0.10             0.56              0.04 
    768                            24          1277702             0.75          1170704             7.43           106998             0.15              637           172032            37.80        132120576         82174464            37.80             0.16             0.24             1.19              0.09 
    1024                           35          1288267             0.75          1124298             5.17           163969             0.22              892           249088            34.17        255066112        167904256            34.17             0.33             0.47             2.07              0.16 
    2048                           44          4381529             2.56          4084802            13.89           296727             0.40             1513           315392             5.92        645922816        607696896             5.92             1.21             1.18             0.91              0.07 
    3072                           63          3868907             2.26          3443739             8.57           425168             0.58             2586           451584             5.85       1387266048       1306116096             5.85             2.59             2.54             1.93              0.15 
    4096                           45          5798404             3.39          5519159            18.06           279245             0.38             3670           321024            13.01       1314914304       1143787520            13.01             2.27             2.41             4.06              0.31 
    8192                          742         63325312            37.01         58403549            11.91          4921763             6.71             6963          5318656             7.46      43570429952      40319082496             7.46            80.04            79.83            77.22              5.96 
    
    -------------------------
    BLOBS
    -------------------------
    Bucket(K)   Allocations    Allocated      Deletes      Current         Mean
    16             18480662 209270698950     17489631       991031        11323
    32              4050241  74923579852      4024017        26224        18498
    64                    9       310663            9            0        34518
    128                   0            0            0            0            0
    256                   0            0            0            0            0
    512                   0            0            0            0            0
    1024                  0            0            0            0            0
    2048                985   1071538160          985            0      1087856
    4096                  0            0            0            0            0
    8192                  0            0            0            0            0
    16384                 0            0            0            0            0
    32768                 0            0            0            0            0
    65536                 0            0            0            0            0
    2097151               0            0            0            0            0
    

    Could you please share your thoughts on what we are doing wrong and how performance could be improved?

     

    Last edit: Alexey Emtsov 2016-10-24
    • Brad Bebee

      Brad Bebee - 2016-10-24

      Alexey,

      Assuming that you are using a relatively fast SSD disk (if not, this is our
      first recommendation), you can likely benefit from developing a custom
      vocabulary for your data to enable inlining.

      See an example at
      https://github.com/blazegraph/database/tree/master/vocabularies.

      Thanks, --Brad
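      For reference, a custom vocabulary and inline URI factory are wired into the journal through the triple store properties, along these lines (the class names below are placeholders for your own implementations; verify the property names against your Blazegraph version):

      # placeholder class names -- substitute your own implementations
      com.bigdata.rdf.store.AbstractTripleStore.vocabularyClass=com.example.MyVocabulary
      com.bigdata.rdf.store.AbstractTripleStore.inlineURIFactory=com.example.MyInlineURIFactory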

  • Alexey Emtsov

    Alexey Emtsov - 2016-10-26

    Brad,
    Thank you for response and suggestion!
    We created a custom vocabulary in which we defined the common types and identifiers we use.
    We also extended InlineURIFactory in an attempt to optimize inlining.
    We executed several different loading tests, but applying these changes in different combinations seems to have no effect on performance; we got quite similar results, with maybe a 5-10% improvement at most.

    Alexey

     
    • Brad Bebee

      Brad Bebee - 2016-10-26

      Alexey,

      Can you post dumpJournal with -pages and your journal properties to verify
      inlining is taking place?

      Thanks, Brad
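      For reference, the page-level dump can also be produced offline against the journal file with the DumpJournal utility, roughly as follows (a sketch; check the supported options against your Blazegraph version, and note bigdata.jnl matches the journal file configured above):

      java -cp blazegraph.jar com.bigdata.journal.DumpJournal -pages bigdata.jnl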

       
  • Alexey Emtsov

    Alexey Emtsov - 2016-10-28

    Here it is

    -------------------------
    RWStore Allocator Summary
    -------------------------
    AllocatorSize      AllocatorCount   SlotsAllocated  %SlotsAllocated    SlotsRecycled        SlotChurn       SlotsInUse      %SlotsInUse   MeanAllocation    SlotsReserved     %SlotsUnused    BytesReserved     BytesAppData       %SlotWaste         %AppData       %StoreFile      %TotalWaste       %FileWaste 
    64                           4452         38900620            24.95         13334417             1.22         25566203            38.43               45         31910912            19.88       2042298368       1636236992            19.88             3.74             4.34            12.08              0.86 
    128                          4960         35569516            22.81           108642             1.00         35460874            53.31               69         35553280             0.26       4550819840       4538991872             0.26            10.39             9.67             0.35              0.03 
    192                             5           141887             0.09           116339             3.96            25548             0.04              158            35840            28.72          6881280          4905216            28.72             0.01             0.01             0.06              0.00 
    320                             6           264922             0.17           241155             6.16            23767             0.04              254            43008            44.74         13762560          7605440            44.74             0.02             0.03             0.18              0.01 
    512                             9           357273             0.23           318708             5.54            38565             0.06              417            64512            40.22         33030144         19745280            40.22             0.05             0.07             0.40              0.03 
    768                            32           553307             0.35           401500             2.43           151807             0.23              616           228096            33.45        175177728        116587776            33.45             0.27             0.37             1.74              0.12 
    1024                            7           397688             0.26           367347             7.93            30341             0.05              894            50176            39.53         51380224         31069184            39.53             0.07             0.11             0.60              0.04 
    2048                           11          1345273             0.86          1278633            17.06            66640             0.10             1513            78848            15.48        161480704        136478720            15.48             0.31             0.34             0.74              0.05 
    3072                          108          3753397             2.41          3019400             4.88           733997             1.10             2687           769024             4.55       2362441728       2254838784             4.55             5.16             5.02             3.20              0.23 
    4096                           55         16139046            10.35         15830487            40.94           308559             0.46             3631           394240            21.73       1614807040       1263857664            21.73             2.89             3.43            10.44              0.75 
    8192                          614         58510539            37.52         54397394            13.29          4113145             6.18             6263          4401152             6.54      36054237184      33694883840             6.54            77.10            76.60            70.20              5.01 
    
    -------------------------
    BLOBS
    -------------------------
    Bucket(K)   Allocations    Allocated      Deletes      Current         Mean
    16             10093108 125168179125      9654905       438203        12401
    32              3590088  66461722547      3566071        24017        18512
    64                   11       383617           10            1        34874
    128                   0            0            0            0            0
    256                   0            0            0            0            0
    512                   0            0            0            0            0
    1024                  0            0            0            0            0
    2048                955   1038902480          955            0      1087856
    4096                  0            0            0            0            0
    8192                  0            0            0            0            0
    16384                 0            0            0            0            0
    32768                 0            0            0            0            0
    65536                 0            0            0            0            0
    2097151               0            0            0            0            0
    

    Also, when we specified the inlineURIFactory property, performance dropped. So it looks like the vocabulary and inlining are active...

     
    • Brad Bebee

      Brad Bebee - 2016-10-28

      Martyn: Do you have any thoughts on this dump journal?

      Alexey: Can you please confirm the drive type you are using?

       
