I'm about to start integrating JoBimText into an existing QA pipeline (open source, university research) and would like to target Spark directly instead of Hadoop. I remember Dr. Gliozzo mentioning in an IBM lecture that migrating to Spark [regarding distributional semantics] is, or was, also a goal of the Watson team. Unfortunately, I'm not very experienced with either Hadoop or Spark yet.
Is it possible for me to target Spark directly, or is Hadoop still my only viable option? At first glance, the tutorials seem rather focused on Hadoop, and the provided VM appears to support Hadoop only.
Finally: are there any particular hints, starting points, or other recommendations I should be aware of before tackling this task? I have quite a few sources at hand, including the books by Prof. Dr. Biemann and Dr. Gliozzo as well as the relevant papers, but my schedule is rather tight, so I would greatly appreciate any advice you can provide.
Thank you very much in advance.
Best wishes,
Joe
OK, apparently Spark fully supports Hadoop's InputFormat, so it shouldn't be an issue to use it directly. Is that correct? Are there any pitfalls I should still watch out for?
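That is correct: Spark can read any data source exposed through a Hadoop InputFormat via SparkContext. As a minimal sketch (the path and the plain-text TextInputFormat here are placeholders; a JoBimText-specific input format would be plugged in the same way):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

object HadoopInputInSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("jobimtext-input").setMaster("local[*]"))

    // newAPIHadoopFile accepts any org.apache.hadoop.mapreduce.InputFormat,
    // so existing Hadoop-oriented data layouts can be reused unchanged.
    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///corpus/part-*")
      .map { case (_, text) => text.toString } // drop byte offsets, keep the text

    println(lines.count())
    sc.stop()
  }
}
```

One pitfall worth knowing: Hadoop Writables such as Text are reused across records by the record reader, so convert them (e.g. via toString) before caching or collecting an RDD of them.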
Last edit: Joe Bauer 2015-12-06
Concerning the Spark implementation: Alexander (https://github.com/orgs/tudarmstadt-lt/people/alexanderpanchenko) and Gerold (https://github.com/orgs/tudarmstadt-lt/people/hintz) have some implementations of the JoBimText pipeline for Spark. Please ask them for details.
The Hadoop pipeline is quite stable and works on large data collections. For step-by-step instructions, you can refer to the JoBimText tutorial slides (Part 2 is the practice part): https://sites.google.com/site/jobimtexttutorial/resources
If you have additional questions, you can write me directly and I'll help you.
Best,
Eugen
We have some Spark code available that performs the DT computation. There seem to be some issues when processing larger amounts of data with that implementation, so I would advise using the regular Hadoop implementation for larger amounts of data. The Spark implementation is available at: https://github.com/tudarmstadt-lt/noun-sense-induction-scala
Regards,
Martin
Sorry for the wrong GitHub user links:
This is for Gerold: https://github.com/hintz
And this for Alexander: https://github.com/alexanderpanchenko