I'm about to start integrating JoBimText into an existing QA pipeline (open source, university research) and would like to target Spark directly instead of Hadoop. I remember Dr. Gliozzo mentioning in an IBM lecture that migrating to Spark [regarding distributional semantics] is, or was, also a goal of the Watson team. Unfortunately, I'm not very experienced with either Hadoop or Spark yet.
Is it possible for me to target Spark directly, or is Hadoop still my only viable option? At first glance, the tutorials seem rather focused on Hadoop, and the provided VM appears to support Hadoop only.
Finally: are there any particular hints, starting points, or other recommendations I should be aware of before tackling this task? I have quite a few sources at hand, including the books by Prof. Dr. Biemann and Dr. Gliozzo as well as the relevant papers, but my schedule is rather tight, so I would greatly appreciate any advice you can provide.
Thank you very much in advance.
Best wishes,
Joe
OK, apparently Spark fully supports Hadoop's InputFormat, so it shouldn't be an issue to use it directly. Is that correct? Are there any pitfalls I should still watch out for?
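That is correct: Spark can read any data source exposed through a Hadoop InputFormat via SparkContext. As a minimal sketch (the path and the plain-text TextInputFormat here are placeholders; a JoBimText-specific input format would be plugged in the same way):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

object HadoopInputInSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("jobimtext-input").setMaster("local[*]"))

    // newAPIHadoopFile accepts any org.apache.hadoop.mapreduce.InputFormat,
    // so existing Hadoop-oriented data layouts can be reused unchanged.
    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///corpus/part-*")
      .map { case (_, text) => text.toString } // drop byte offsets, keep the text

    println(lines.count())
    sc.stop()
  }
}
```

One pitfall worth knowing: Hadoop Writables such as Text are reused across records by the record reader, so convert them (e.g. via toString) before caching or collecting an RDD of them.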
Last edit: Joe Bauer 2015-12-06
Concerning the Spark implementation: Alexander (https://github.com/orgs/tudarmstadt-lt/people/alexanderpanchenko) and Gerold (https://github.com/orgs/tudarmstadt-lt/people/hintz) have some implementations of the JoBimText pipeline for Spark. Please ask them for details.
The Hadoop pipeline is quite stable and works on large data collections. For step-by-step instructions, you can refer to the JoBimText tutorial slides (Part 2 is the practice part): https://sites.google.com/site/jobimtexttutorial/resources
If you have additional questions, you can write me directly and I'll help you.
Best,
Eugen
We have some Spark code available that performs the DT computation. There seem to be some issues when processing larger amounts of data with that implementation, so I would advise using the regular Hadoop implementation for larger amounts of data. The Spark implementation is available at: https://github.com/tudarmstadt-lt/noun-sense-induction-scala
Regards,
Martin
Sorry for the wrong GitHub user links:
This is for Gerold: https://github.com/hintz
And this for Alexander: https://github.com/alexanderpanchenko