Why is bigdata not as well known as neo4j or other graph databases?

Help
fancy
2014-08-18
2014-08-20
  • fancy

    fancy - 2014-08-18

    Hi all. I want to build my own knowledge graph and store it in a graph database. I don't want to use RDF/RDFS/OWL because they are too complicated and I do not need them right now.
    I investigated many open source graph databases but did not know about bigdata until today.
    According to http://www.w3.org/2001/sw/wiki/Bigdata, the project was started in 2006, so it is not a new project.
    I am astonished by its performance. Bigdata scales to 50 billion edges on a single machine and will scale to even larger graphs with its horizontally scaled architecture.
    MapGraph can handle 30 billion traversed edges per second.
    But I can't find it on Google by searching for "graph database". It is not as well known as neo4j, orientdb, InfiniteGraph, or even the newer competitor titan.
    What's wrong with it?
    Why is it named "bigdata"? When we talk about big data, we mean something like hadoop/hbase/spark/hive. Does the name make Google think it is not a database, and is that why it is not well known?

     
    • Bryan Thompson

      Bryan Thompson - 2014-08-18

      The name of the project goes back to 2006, when Google had published their
      bigtable paper. We wanted to generalize that into a scale-out architecture, but
      bigtable (their key-value store) was a single-index system and we were
      focused on graphs. So we called it "bigdata". This was long before the
      word came to be used to describe things like hadoop.

      We've been mostly focused on a different market. The platform is very well
      known there. Over the last 2 years we have extended the platform to handle
      the property graph model and added vertex-centric traversal APIs. Enjoy!

      MapGraph was developed over the same time period. It provides a
      vertex-centric abstraction with up to 3 billion traversed edges per second on a
      single GPU. The numbers you quoted are from the multi-GPU version of the
      MapGraph code, which is designed to scale to large GPU clusters. This
      platform is insanely fast and a huge jump beyond today's main memory graph
      platforms.

      Bryan


       
      • fancy

        fancy - 2014-08-18

        Thank you. Are there any articles comparing bigdata with other graph databases? Why is bigdata performant and scalable? Is it built on top of something like bigtable (hbase/cassandra)? Are there any introductory resources on its implementation details?

         
        • Bryan Thompson

          Bryan Thompson - 2014-08-18

          Here are some performance comparisons.

          There are several factors when choosing a graph database. Most
          applications are read-mostly, but they can differ quite a bit on the query
          and update workloads.

          1. Scaling story

          2. Load throughput

          3. Graph traversal operations (parallel graph algorithms like shortest
            paths)

          4. Graph query operations (high-level, high-performance query language)

          Platform performance comparison

          graph platform   load (s)   query (ms)   comments
          titan              497.00          935   4 node cluster using Cassandra
          neo4j              608.00          668   single node community edition
          bigdata            396.00          281   single node (open source)
          MapGraph             0.08           27   NVIDIA K20 GPU using larger scale free graph (24M vertices and 25M edges)

          The data was a scale-free graph with 2.7M vertices and 5.6M edges (except
          for MapGraph, see above). The query was to identify a 5-degree subgraph
          (depth-limited BFS). The MapGraph result is for a single GPU.
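
For readers unfamiliar with the query shape, here is a minimal Python sketch of a depth-limited BFS. It is only an illustration of the technique: the function name, the adjacency-dict representation, and the toy graph are assumptions, and it is not the code that was run on any of the platforms in the table.

```python
from collections import deque


def depth_limited_bfs(adjacency, source, max_depth):
    """Return {vertex: depth} for every vertex within max_depth hops of source."""
    depth = {source: 0}
    frontier = deque([source])
    while frontier:
        vertex = frontier.popleft()
        if depth[vertex] == max_depth:
            continue  # do not expand vertices that already sit on the depth limit
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in depth:  # first visit gives the shortest hop count
                depth[neighbor] = depth[vertex] + 1
                frontier.append(neighbor)
    return depth


# Hypothetical toy graph, not the benchmark data set.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": ["f"], "f": []}
two_hop_subgraph = depth_limited_bfs(toy_graph, "a", 2)
```

With a limit of 5 instead of 2, the same call would return the 5-degree neighborhood that the benchmark query asks for.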

          The performance can obviously vary quite a bit depending on the data set,
          how the information is modeled, and the query workload. I think that there
          are two main takeaways from this.

          • First, the horizontal scaling for titan is very expensive, both in terms
            of machines and performance.

          • Second, GPUs are ridiculously fast.


          Bryan Thompson
          Chief Scientist & Founder
          SYSTAP, LLC
          4501 Tower Road
          Greensboro, NC 27410
          bryan@systap.com
          http://bigdata.com
          http://mapgraph.io



           
          • Bryan Thompson

            Bryan Thompson - 2014-08-18

            Here is a version of that table as an image. The formatting did not come
            through. Hopefully this will.

            [image: Inline image 1]

            Bryan


             
            • Bryan Thompson

              Bryan Thompson - 2014-08-18

              Try this link:
              https://www.dropbox.com/s/lj3nabf3mul44jq/Graph%20Database%20Use%20Case.pdf

              That should work.

              Bryan



               
        • Bryan Thompson

          Bryan Thompson - 2014-08-18

          There is a 30-40 page, book-length chapter [1] linked from our blog [2], along with a number of other resources.

          Due to the lack of standards, there is very little benchmarking across graph databases. The LDBC is trying to change that. We publish SPARQL benchmarks on our blog from time to time. There is a benchmarking guide on the wiki (wiki.bigdata.com).

          I will send along some performance comparisons separately.

          Thanks,
          Bryan
          [1] http://www.bigdata.com/whitepapers/bigdata_architecture_whitepaper.pdf
          [2] http://blog.bigdata.com/


           
          • fancy

            fancy - 2014-08-19

            I have read it roughly. I think it has a similar structure to titan, because both rely on a bigtable-like key-value system as the backend. But bigdata implements that layer itself, while titan uses cassandra/hbase directly.
            As this blog post (http://blog.bigdata.com/?p=711) says: "I suppose the key difference between Titan and Bigdata might be Bigdata’s query optimizer"
            I think this is the difference between sparql (SQL-like) and gremlin.
            sparql, like SQL, is a declarative language and gremlin is an imperative language.
            So a query planner/optimizer is necessary for sparql.
            So I am surprised by the poor scalability of titan. Which titan/cassandra versions were used in this comparison?

            Also, I think bigdata is more closely tied to the semantic web community, while other graph databases such as titan and neo4j are closer to the nosql community.
            Support for the tinkerpop stack was added to bigdata only recently.

            IMHO, I don't like semantic web related things because I think they are concerned much more with theory than with real usage. They care about the expressive power of different languages (RDFS/OWL Full, DL, Lite...) and about inference, but not about scalability and real data. They play with toy datasets and publish papers, and that's all.

            In real work, there is a lot of dirty work that they don't care about but that is important.
            1. human readable name to uri
              Although more and more structured data is becoming available, such as wikipedia/dbpedia, the large majority of data is unstructured or only semi-structured. We need search/NLP/text mining/machine learning to disambiguate strings, or to recognize entities from strings.
              I am glad to see bigdata extend sparql to add support for FullTextSearch http://wiki.bigdata.com/wiki/index.php/FullTextSearch
              Titan uses lucene/ElasticSearch and neo4j uses lucene. Does bigdata use its own full text search engine? Is it extensible? E.g. I want to provide my own analyzer (tokenizer) for Chinese.

            2. scalability
              Old semantic web libraries such as jena and sesame do not care about scalability. I am glad to know bigdata took this into consideration at the design stage.

            3. probabilistic reasoning/machine learning
              Most semantic web researchers come from computer science and logic.
              I do think logic/set theory is important. But recently, statistics (machine learning) has been much more active than symbolism (logic).
              After all, our data can be contradictory and everything is uncertain.
              I know there is the PR-OWL project, but it is not very active.
              The "big data" community cares more about machine learning. There are many projects related to machine learning and (realtime) big data, such as mahout/pregel/giraph/spark/graphlab/storm.

            4. too many concepts and hard to understand
              xml is boring (although there are turtle and other formats)

            5. no real world application
              People studying the semantic web ask too much from users but give back too little. To use the semantic web, people need to learn so many concepts, yet it does not improve current systems much.

            6. make data machine readable step by step
              The goal of the semantic web is too hard (or impossible) to achieve. rss/wiki take small steps, but they are useful. I also think it needs more "dirty" workers to use NLP/ML to construct more structured datasets such as dbpedia/freebase/conceptnet.

             
            • Bryan Thompson

              Bryan Thompson - 2014-08-19

              Bigdata is about data. We emphasize scalable graph data over everything else. We do not go in for highly expressive inference layers that are computationally intractable and not scalable.

              Declarative languages allow you to write query optimizers. This makes a huge difference in performance and is the main reason that SQL broke loose from competing technologies (ISAM, etc.). Stonebraker has a very relevant article from several years back about these issues. http://people.csail.mit.edu/tdanford/6830papers/stonebraker-what-goes-around.pdf

              I suggest you try some performance comparisons. Without a declarative layer the programmer has to code and optimize each query by hand. And the optimal solution depends on data skew, which in turn depends on the actual bindings for the variables in the query. A query optimizer can do this automatically and can automatically pick better query plans as new database revisions roll out.
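
To make the point about data skew and bindings concrete, here is a small Python sketch of greedy join ordering. It is a toy under stated assumptions, not bigdata's optimizer: the cardinality table, the predicate names, and the `order_patterns` helper are all made up for illustration.

```python
# Made-up cardinalities for (predicate, object) pairs; a real optimizer
# would obtain such estimates from its own indices or statistics.
CARDINALITY = {
    ("livesIn", "BigCity"): 9_000_000,
    ("livesIn", "SmallTown"): 500,
    ("worksFor", "TinyStartup"): 20,
    ("worksFor", "HugeCorp"): 2_000_000,
}


def variables(pattern):
    """Variables are the terms written with a leading '?'."""
    return {term for term in pattern if term.startswith("?")}


def estimate(pattern):
    """Estimated number of matches; unknown patterns are assumed expensive."""
    _, predicate, obj = pattern
    return CARDINALITY.get((predicate, obj), 10**9)


def order_patterns(patterns):
    """Greedy ordering: run the most selective pattern first, then prefer
    patterns that share a variable with what has already been joined."""
    remaining = list(patterns)
    ordered, bound = [], set()
    while remaining:
        connected = [p for p in remaining if bound & variables(p)]
        best = min(connected or remaining, key=estimate)
        ordered.append(best)
        bound |= variables(best)
        remaining.remove(best)
    return ordered


# Same query shape, different constants bound into it.
plan_a = order_patterns([("?p", "livesIn", "BigCity"), ("?p", "worksFor", "TinyStartup")])
plan_b = order_patterns([("?p", "livesIn", "SmallTown"), ("?p", "worksFor", "HugeCorp")])
```

The two plans start from different patterns even though the query text has the same shape, which is the kind of decision a hand-coded traversal would have to hard-wire.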

              The main feature lacking in Cassandra, HBase, Accumulo, etc. is the ability to execute user code on the tablet servers. This means that it is not possible to write a distributed query evaluation engine that executes the queries local to the data. That forces you to materialize the data for the access paths on the client. This approach reads much more from the disk and slams the network and the client with those data, much of which will then be filtered out by join processing on the client.

              The comparison used the version of Titan that was current in February.

              Yes. We use the lucene tokenizers. Chinese is automatically supported if you use language-tagged literals. You can also override the analyzers.

              To understand a thing, tear it down and look at what it does. Bigdata provides scalable graph data processing for disk based systems with indices and query optimizers. MapGraph provides extreme performance for graph structured data with a lower level vertex-centric API.

              Thanks,
              Bryan


               
              • fancy

                fancy - 2014-08-20

                Thank you. If I use it as a graph database through the blueprints/gremlin interface, are my queries also optimized by bigdata?
                BTW, is there a getting started tutorial for this? I found the wiki hard to read, and I can't find any documentation about MapGraph.

                 
                • Bryan Thompson

                  Bryan Thompson - 2014-08-20

                  MapGraph is documented at MapGraph.io. Zhisong (CC) can help if you have questions about that platform.

                  There is getting started information for blueprints and gremlin on the bigdata.com website:

                  http://www.bigdata.com/blueprints

                  Bigdata does not yet optimize much of the gremlin layer. There is only a limited capability to do this so far in the implementation stack. We will be optimizing the vertex programs in tinkerpop3 and mapping them directly onto the bigdata GASService, which provides fast native evaluation of the Gather Apply Scatter vertex programming model against the bigdata database. Mike (CC) is the lead person for the blueprints, gremlin, and rexster integrations.
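
As a rough picture of the Gather Apply Scatter model mentioned above, here is a plain Python sketch that computes single-source shortest paths in synchronous rounds. It only illustrates the programming model and assumes non-negative edge weights; it does not use or imitate the bigdata GASService API, and the function name and edge-list format are assumptions.

```python
import math


def gas_shortest_paths(edges, source):
    """edges is a list of (src, dst, weight) with non-negative weights."""
    in_edges, out_neighbors = {}, {}
    for u, v, w in edges:
        for x in (u, v):
            in_edges.setdefault(x, [])
            out_neighbors.setdefault(x, [])
        in_edges[v].append((u, w))
        out_neighbors[u].append(v)

    dist = {v: math.inf for v in in_edges}
    dist[source] = 0.0
    active = set(out_neighbors.get(source, ()))
    while active:
        updates = {}
        for v in active:
            # Gather: combine candidate distances over the incoming edges.
            gathered = min((dist[u] + w for u, w in in_edges[v]), default=math.inf)
            # Apply: keep the candidate only if it improves the vertex state.
            if gathered < dist[v]:
                updates[v] = gathered
        next_active = set()
        for v, d in updates.items():
            dist[v] = d
            # Scatter: wake up the out-neighbors of every vertex that changed.
            next_active.update(out_neighbors[v])
        active = next_active
    return dist


# Hypothetical toy graph.
toy_edges = [("a", "b", 1.0), ("a", "c", 4.0), ("b", "c", 2.0), ("c", "d", 1.0)]
distances = gas_shortest_paths(toy_edges, "a")
```

Each vertex only looks at its own edges, which is what lets an engine run the gather and scatter phases in parallel across vertices.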

                  MapGraph vertex programs are written in sequential C functions that are then invoked from CUDA kernels. We are looking at how to integrate MapGraph with tinkerpop3, but this is significantly more complicated. We are considering using avro to interchange data with MapGraph and then converting from array of structures (avro records) to the more memory bandwidth efficient structure of arrays pattern used by MapGraph (which is similar to what column stores use).
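
The array-of-structures to structure-of-arrays conversion described above can be sketched in a few lines of Python. This is a hedged illustration only: the record fields (`src`, `dst`, `weight`) and the helper name are assumptions, and the sketch uses neither the Avro library nor any MapGraph interface.

```python
from array import array


def to_structure_of_arrays(records):
    """Turn a list of edge records into one contiguous typed column per field."""
    return {
        "src": array("q", (r["src"] for r in records)),        # 64-bit integers
        "dst": array("q", (r["dst"] for r in records)),        # 64-bit integers
        "weight": array("d", (r["weight"] for r in records)),  # doubles
    }


# Array of structures: one record (dict) per edge, fields interleaved.
edge_records = [
    {"src": 0, "dst": 1, "weight": 0.5},
    {"src": 0, "dst": 2, "weight": 1.5},
    {"src": 1, "dst": 2, "weight": 2.0},
]

# Structure of arrays: each field stored contiguously.
columns = to_structure_of_arrays(edge_records)
```

A kernel that only needs `dst` can then scan one contiguous column instead of striding over whole records, which is the bandwidth argument made in the post.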

                  Thanks,
                  Bryan


                   
