State of Spark support?

Joe Bauer
2015-12-04
2015-12-08
  • Joe Bauer

    Joe Bauer - 2015-12-04

    Dear Sir or Madam,

    I'm about to start integrating JoBim Text into an existing QA pipeline (open source, university research) and would like to use Spark directly instead of Hadoop. I remember Dr. Gliozzo mentioning in an IBM lecture that migrating to Spark [regarding distributional semantics] is, or was, also a goal of the Watson team. Unfortunately, I'm not very experienced with either Hadoop or Spark yet.

    Is it possible for me to target Spark directly, or is Hadoop still my only really viable option? At first glance, the tutorials still seem to focus on Hadoop, and the provided VM also seems to support Hadoop only.

    Finally: Are there any special hints, starting points or other recommendations I should be aware of before I tackle this task? I currently have quite a few sources at hand including the books by Prof. Dr. Biemann and Dr. Gliozzo as well as all relevant papers, but my schedule is rather tight so I would highly appreciate any advice you can provide me with.

    Thank you very much in advance.

    Best wishes,
    Joe

     
  • Joe Bauer

    Joe Bauer - 2015-12-06

    OK, apparently Spark fully supports Hadoop's InputFormat, so it shouldn't be an issue to use it directly. Is that correct? Are there any pitfalls I should still watch out for?

     

    Last edit: Joe Bauer 2015-12-06
  • Eugen Ruppert

    Eugen Ruppert - 2015-12-08

    Hello Joe,

    Concerning the Spark implementation: Alexander (https://github.com/orgs/tudarmstadt-lt/people/alexanderpanchenko) and Gerold (https://github.com/orgs/tudarmstadt-lt/people/hintz) have some implementations of the JoBimText pipeline for Spark. Please ask them for details.

    The Hadoop pipeline is quite stable and works on large data collections. For step-by-step instructions, you can refer to the JoBimText tutorial slides (Part 2 is the practice part):
    https://sites.google.com/site/jobimtexttutorial/resources

    If you have additional questions, you can write me directly and I'll help you.

    Best,

    Eugen

     
  • Eugen Ruppert

    Eugen Ruppert - 2015-12-08

    Sorry for the wrong GitHub user links:
    This is for Gerold: https://github.com/hintz
    And this for Alexander: https://github.com/alexanderpanchenko

     
  • Martin Riedl

    Martin Riedl - 2015-12-08

    Hello Joe,

    We have some Spark code available that performs the DT computation. There seem to be some issues when processing larger amounts of data with that implementation, so I would advise using the regular Hadoop implementation for larger amounts of data. The Spark implementation is available at:
    https://github.com/tudarmstadt-lt/noun-sense-induction-scala

    Regards,
    Martin
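
For readers new to the term: a DT (distributional thesaurus) computation ranks terms by how many context features they share. The following is only a toy, in-memory sketch of that idea in plain Python. It is not the JoBimText Hadoop pipeline and not the Spark code linked above. The function name `toy_dt` and the parameter `top_features` are hypothetical, and raw counts stand in for whatever significance measure a real pipeline would use.

```python
from collections import Counter, defaultdict
from itertools import combinations


def toy_dt(pairs, top_features=1000):
    """Rank terms by the number of context features they share.

    pairs: iterable of (term, feature) observations.
    top_features: hypothetical cut-off on features kept per term.
    """
    # Count how often each (term, feature) pair was observed.
    counts = Counter(pairs)

    # Keep only the most frequent features per term
    # (assumption: raw counts instead of a significance score).
    scored_by_term = defaultdict(list)
    for (term, feature), n in counts.items():
        scored_by_term[term].append((n, feature))
    kept = {
        term: {f for _, f in sorted(scored, reverse=True)[:top_features]}
        for term, scored in scored_by_term.items()
    }

    # Invert the mapping: feature -> terms that kept it.
    terms_by_feature = defaultdict(set)
    for term, features in kept.items():
        for feature in features:
            terms_by_feature[feature].add(term)

    # Every pair of terms under the same feature shares one more feature.
    shared = Counter()
    for terms in terms_by_feature.values():
        for a, b in combinations(sorted(terms), 2):
            shared[(a, b)] += 1
            shared[(b, a)] += 1

    # Build the thesaurus: term -> similar terms, most shared features first.
    dt = defaultdict(list)
    for (a, b), n in shared.items():
        dt[a].append((b, n))
    for a in dt:
        dt[a].sort(key=lambda entry: (-entry[1], entry[0]))
    return dict(dt)


if __name__ == "__main__":
    observations = [
        ("cat", "eat"), ("cat", "purr"), ("cat", "pet"),
        ("dog", "eat"), ("dog", "bark"), ("dog", "pet"),
        ("mouse", "eat"),
    ]
    print(toy_dt(observations)["cat"])  # [('dog', 2), ('mouse', 1)]
```

Note that the pairwise step grows quadratically with the number of terms that share a feature, which is one general reason this kind of computation is usually run on a cluster for large corpora.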

     
