From: Brad B. <be...@bl...> - 2016-06-01 00:56:07
Daniel, Edgar,

Adding to Bryan's comments, the general starting points for improving load speed are below. With fast disks, customized branching factors, and custom vocabularies, we've seen sustained loads at 85k stmts/s for data sets such as PubChem (~10B stmts).

- Setting branching factors in your properties file using the DumpJournal utility: https://wiki.blazegraph.com/wiki/index.php/IOOptimization#Branching_Factors. This involves doing a partial load of the data, running the utility, and then updating the configuration.
- Building a custom vocabulary and configuring inlining: https://wiki.blazegraph.com/wiki/index.php/InlineIVs. There's an example PubChem custom vocabulary at https://github.com/blazegraph/database/tree/master/vocabularies/src/main/java/com/blazegraph/vocab/pubchem

Thanks,
--Brad

On Tue, May 31, 2016 at 8:19 PM, Daniel Hernández <da...@de...> wrote:
> Thanks, Bryan, for your clarification. I will repeat my experiments with
> different hardware configurations in the future.
>
> Daniel
>
> On 31/05/16 at 20:00, Bryan Thompson wrote:
>
> SATA is a non-starter for Blazegraph unless it is SSD. The lack of write
> reordering combined with high seek latency significantly limits
> performance. This could be different for other engines. Blazegraph (the
> open-source platform) is pretty disk-oriented. The GPU platform is focused
> on high performance in fast memory.
>
> Bryan
> On May 31, 2016 7:00 PM, "Daniel Hernández" <da...@de...> wrote:
>
> Edgar,
>
> I confirm that the loading rate decreases as the database grows in size.
>
> I have loaded 500M triples and it took 23h using the triple store
> back-end. I loaded the same amount of quads using the quad store back-end
> and it took 67h. The resulting databases are 61 GB and 120 GB,
> respectively. My machine has 2x SATA disks in RAID 1, 32 GB of RAM, and
> 2x six-core Intel Xeon CPUs.
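Brad's two suggestions above boil down to properties-file settings. The fragment below is a sketch only: the option names are taken from the wiki pages linked in his message, but the namespace ("kb"), the index names, the branching-factor values, and the PubChemVocabulary class name are illustrative assumptions; real values must come from your own DumpJournal output and vocabulary class.

```properties
# Sketch only: derive actual branching factors from DumpJournal output after a
# partial load, e.g. (assumed invocation):
#   java -cp blazegraph.jar com.bigdata.journal.DumpJournal -pages blazegraph.jnl

# Per-index branching-factor overrides (namespace "kb" and values are assumptions):
com.bigdata.namespace.kb.spo.SPO.com.bigdata.btree.BTree.branchingFactor=1024
com.bigdata.namespace.kb.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=800

# Register a custom vocabulary class for inlining (class name assumed from the
# PubChem example repository linked above):
com.bigdata.rdf.store.AbstractTripleStore.vocabularyClass=com.blazegraph.vocab.pubchem.PubChemVocabulary
```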
> I used the parameter -Xmx6g when loading. (For small files, I got better
> results with 6g than with 5g or 8g.)
>
> I have seen that using an SSD improves the elapsed loading time by at
> least 3x. However, this could be true for every engine. Edgar, if you
> improve your loading times without changing your machine, I would be
> grateful if you told us how you did it.
>
> (By the way, I loaded the same files into Virtuoso and it required
> approximately 4 hours for each file.)
>
> Cheers,
> Daniel
>
> On 31/05/16 at 18:15, Bryan Thompson wrote:
>
> Edgar,
>
> There is no single configuration for maximum load throughput. Instead,
> there are a variety of steps you can take to improve load performance:
> for example, right-sizing the JVM, using fast disk, maximizing inlining,
> etc. Beyond these steps and those detailed on the wiki, we look at the
> entire system to identify and remove bottlenecks.
>
> Thanks,
> Bryan
> On May 31, 2016 5:28 PM, "Edgar Rodriguez-Diaz" <ed...@sy...> wrote:
>
>> A correction here on the data size: it's not 180G, it's 18G of a gzipped
>> TriG file exported by Blazegraph; the number of triples is correct.
>>
>> > On May 31, 2016, at 10:42 AM, Edgar Rodriguez-Diaz <ed...@sy...> wrote:
>> >
>> > Hi,
>> >
>> > I've been trying to use the DataLoader tool for bulk loading a very
>> > large file into Blazegraph (~180G with ~4 billion triples) with an
>> > empty journal file, but I'm noticing a performance degradation in the
>> > rate of triples/s loaded. It started at around 55K, and after 200M
>> > triples the rate is around 32K; the rate keeps going down consistently.
>> > What is the configuration to get the best performance out of a bulk
>> > load into Blazegraph?
>> >
>> > Thanks.
>> >
>> > - Edgar
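The figures quoted in this thread are easy to sanity-check with a little arithmetic. The sketch below simply recomputes the sustained rates from the numbers in the messages above; nothing here is measured, and the 4-billion-triple estimate assumes Edgar's degraded 32k stmts/s holds for the whole load, which the thread suggests it will not.

```python
# Back-of-the-envelope load-rate arithmetic for the figures in this thread.

def rate(statements: int, hours: float) -> float:
    """Sustained load rate in statements per second."""
    return statements / (hours * 3600)

# Daniel: 500M triples in 23 h (triple store) vs. 500M quads in 67 h (quad store).
print(f"triples: {rate(500_000_000, 23):,.0f} stmts/s")  # ~6,000 stmts/s
print(f"quads:   {rate(500_000_000, 67):,.0f} stmts/s")  # ~2,100 stmts/s

# Edgar: ~4 billion triples at the degraded 32k stmts/s rate:
print(f"4B at 32k stmts/s: {4_000_000_000 / 32_000 / 3600:.0f} h")  # ~35 h
```

Both measured rates are an order of magnitude below the 85k stmts/s Brad reports for tuned PubChem loads, which is the gap the branching-factor and vocabulary tuning is meant to close.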
>> _______________________________________________
>> Bigdata-developers mailing list
>> Big...@li...
>> https://lists.sourceforge.net/lists/listinfo/bigdata-developers
--
_______________
Brad Bebee
CEO
Blazegraph
e: be...@bl...
m: 202.642.7961
w: www.blazegraph.com

Blazegraph products help to solve the Graph Cache Thrash to achieve large-scale processing for graph and predictive analytics. Blazegraph is the creator of the industry's first GPU-accelerated high-performance database for large graphs and has been named one of the "10 Companies and Technologies to Watch in 2016" <http://insideanalysis.com/2016/01/20535/>. Blazegraph Database <https://www.blazegraph.com/> is our ultra-high-performance graph database that supports both RDF/SPARQL and Apache TinkerPop™ APIs. Blazegraph GPU <https://www.blazegraph.com/product/gpu-accelerated/> and Blazegraph DAS <https://www.blazegraph.com/product/gpu-accelerated/> are disruptive new technologies that use GPUs to enable extreme scaling that is thousands of times faster and 40 times more affordable than CPU-based solutions.

CONFIDENTIALITY NOTICE: This email and its contents and attachments are for the sole use of the intended recipient(s) and are confidential or proprietary to SYSTAP, LLC DBA Blazegraph. Any unauthorized review, use, disclosure, dissemination or copying of this email or its contents or attachments is prohibited. If you have received this communication in error, please notify the sender by reply email and permanently delete all copies of the email and its contents and attachments.