Hey,

I would still keep JSON. It is already much better than the default Pig output. However, I think that, at least for the DB import part, no full JSON parser will give very good performance. I tried both Lift and Jackson, which is supposed to be one of the fastest JSON parsers for Java, and neither was very satisfying. As ugly as it seems, just splitting the String with a regular expression was much faster for me:

tokens.tail.init.split("(\\[\"|\",|\"\\])").filter(pair => !pair.equals(",") && !pair.equals("")).grouped(2)

I think, as long as we stick to standard JSON, that should be okay. I am writing some tests for this part at the moment.
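
For illustration, here is a minimal sketch of the same idea: pulling token/count pairs out of a line with a regular expression instead of running a full JSON parser. The line format `[["token",count],...]`, the object name TokenCountParser and the pattern itself are assumptions made for this example. They are not the actual ImportPig code and may not match the real Pig output.

```scala
import scala.util.matching.Regex

object TokenCountParser {
  // Assumed (hypothetical) line format: [["berlin",12],["city",3]]
  // Group 1: the token, allowing backslash escapes inside the quotes.
  // Group 2: the non-negative integer count.
  private val Pair: Regex = """\["((?:[^"\\]|\\.)*)",(\d+)\]""".r

  // Returns the (token, count) pairs in the order they appear.
  // Escape sequences in tokens are kept as-is; they are not decoded.
  def parse(json: String): Seq[(String, Int)] =
    Pair.findAllMatchIn(json).map(m => (m.group(1), m.group(2).toInt)).toList
}
```

This only holds as long as the input really follows the assumed shape. Anything that does not match the pattern is silently skipped, so tests against a real JSON parser on sample data would be a sensible safeguard.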

Jo


On Wed, Aug 15, 2012 at 3:40 AM, Pablo N. Mendes <pablomendes@gmail.com> wrote:

Oh, I should have thought about that. In my suggestion to use lift-json "everywhere" I was completely ignoring performance. I had only the rest module in mind. So ease of parsing and generation of JSON was the only thing I considered. 

Of course, I forgot that we now use JSON for indexing too. Good to know that lift-json is not fast enough. We could surely use more efficient approaches. 

Chris, Jo, it would be worth thinking of ways that the Pig output can be generated such that the input can be parsed easily and quickly. We don't need to stick to JSON necessarily. Ideas?

Cheers,
Pablo


On Tue, Aug 14, 2012 at 1:24 AM, Chris Hokamp <chris.hokamp@gmail.com> wrote:
> I had used lift-json before but it was too slow

Yes, it's really slow for me too. Will probably end up switching back once I get things figured out.

Cheers,
Chris


On Mon, Aug 13, 2012 at 7:10 PM, Joachim Daiber <daiber.joachim@gmail.com> wrote:
Hey,

> Great thanks. I'll update my stuff and post the new results ASAP. I had modified TokenOccurrenceSource [1] to load the index with lift-json because ImportPig wasn't working for me.

Cool, thanks. I had used lift-json before, but it was too slow, so I switched to doing it by hand (it should work well with standard JSON; maybe my version on GitHub was old). I will try to merge your additions and make it an option.

Best,
Jo




On Mon, Aug 13, 2012 at 6:02 PM, Chris Hokamp <chris.hokamp@gmail.com> wrote:
Hi Jo,

Great thanks. I'll update my stuff and post the new results ASAP. I had modified TokenOccurrenceSource [1] to load the index with lift-json because ImportPig wasn't working for me.

Thanks,
Chris 


On Mon, Aug 13, 2012 at 6:00 PM, Joachim Daiber <daiber.joachim@gmail.com> wrote:
Hey Chris,

sorry for the delay, I had to fix some bugs.

The latest version is here:


This one includes page titles and redirects. The redirects don't seem to help; you can disable them in ImportPig if you want to 'index' yourself.

You can get the resources, SFs and candidate map here:



Using the DBCandidateSearcher, you should get:

Corpus: MilneWitten
Correct URI not found = 43 / 638 = 0.067

Corpus: AnnotatedTextSource
Correct URI not found = 618 / 10452 = 0.059

Corpus: AIDA
Correct URI not found = 2130 / 23299 = 0.091

Best,
Jo


On Sun, Aug 12, 2012 at 10:09 PM, Chris Hokamp <chris.hokamp@gmail.com> wrote:
Hi Jo,

I did use the SurfaceFormStore. If you could push the version with normalizing that would be great - I'll use it in the next tests. 

Thanks,
Chris

On Sun, Aug 12, 2012 at 12:09 AM, Joachim Daiber <daiber.joachim@gmail.com> wrote:
Hey Chris, Max,

if you used the SurfaceFormStore, the high number of URIs not found is because
there was no normalization in there yet (not even lowercasing). I have added SF normalization
to my version (but haven't pushed everything yet); it changes the numbers quite a bit. If you
want to use it, I can push it and send you the link.
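
To make the effect concrete, here is a rough sketch of what surface form normalization at lookup time could look like. The object name SurfaceFormNormalizer, the specific rules (trim, collapse whitespace, lowercase) and the in-memory map are assumptions for illustration. The normalization actually added to the SurfaceFormStore may differ.

```scala
import java.util.Locale

object SurfaceFormNormalizer {
  // Assumed normalization: trim, collapse runs of whitespace, lowercase.
  def normalize(sf: String): String =
    sf.trim.replaceAll("\\s+", " ").toLowerCase(Locale.ENGLISH)

  // Builds a candidate map keyed by the normalized surface form.
  // Input pairs are (surface form, URI).
  def index(pairs: Seq[(String, String)]): Map[String, Set[String]] =
    pairs.groupBy(p => normalize(p._1)).map { case (k, v) => k -> v.map(_._2).toSet }

  // Looks up candidates using the same normalization as at indexing time.
  def lookup(candidates: Map[String, Set[String]], sf: String): Set[String] =
    candidates.getOrElse(normalize(sf), Set.empty)
}
```

The key point is that the same function is applied when building the map and when querying it. Otherwise "Berlin" and "berlin" end up as different keys and the correct URI is never among the candidates.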


There is a lot of logging going on, which means the real times are a bit lower.


********************
Corpus: MilneWitten
Number of occs: 706 (original), 638 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity, UnweightedMixture)
Correct URI not found = 44 / 638 = 0.069
Accuracy = 541 / 638 = 0.848
Global MRR: 0.7945742383347648
Elapsed time: 12 sec
********************

********************
Corpus: AnnotatedTextSource
Number of occs: 12099 (original), 10452 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity, UnweightedMixture)
Correct URI not found = 652 / 10452 = 0.062
Accuracy = 7570 / 10452 = 0.724
Global MRR: 0.6943953819811728
Elapsed time: 136 sec
********************

********************
Corpus: AIDA
Number of occs: 34929 (original), 23299 (processed)
Disambiguator: Database-backed 2 Step disambiguator (GenerativeContextSimilarity, UnweightedMixture)
Correct URI not found = 2646 / 23299 = 0.114
Accuracy = 17394 / 23299 = 0.747
Global MRR: 0.5252856600163418
Elapsed time: 485 sec
********************
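
For reference, the three figures in each block can be computed from the rank of the gold URI per occurrence. The sketch below assumes the usual definitions: "not found" means the gold URI is missing from the candidates, accuracy counts rank 1, and MRR averages 1/rank with 0 for missing. The names RankingMetrics and Summary are made up for this example, and the evaluation code behind the numbers above may define "Global MRR" differently.

```scala
object RankingMetrics {
  final case class Summary(notFoundRate: Double, accuracy: Double, mrr: Double)

  // ranks: for each occurrence, Some(1-based rank of the gold URI among the
  // ranked candidates) or None if the gold URI was not among them.
  def summarize(ranks: Seq[Option[Int]]): Summary = {
    require(ranks.nonEmpty, "need at least one occurrence")
    val n = ranks.size.toDouble
    val notFound = ranks.count(_.isEmpty)
    val correct = ranks.count(_.contains(1))
    // Reciprocal rank is 0 when the gold URI is missing.
    val rrSum = ranks.map(r => r.fold(0.0)(k => 1.0 / k)).sum
    Summary(notFound / n, correct / n, rrSum / n)
  }
}
```

Under these definitions the not-found rate puts a hard ceiling on accuracy, which is why the candidate mapping matters so much for the comparison.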


Best,
Jo





On Sat, Aug 11, 2012 at 11:03 PM, Max Jakob <max.jakob@gmail.com> wrote:
Hi Chris,

On Sat, Aug 11, 2012 at 1:49 PM, Chris Hokamp <chris.hokamp@gmail.com> wrote:
> I was finally able to run the Explicit Semantic Analysis disambiguator on
> the whole wikidump. Here are the results for Milne-Witten:
>
> ********************
> Disambiguator: Database-backed 2 Step ESA Disambiguator
> Correct URI not found = 151 / 706 = 0.214
> Accuracy = 500 / 706 = 0.708
> Global MRR: 0.7356721170226624
> Elapsed time: 625 sec
> ********************

The number of cases where the correct URI was not found is in fact a
bit high. There is definitely an issue with the surface form mapping.
If we disregard this, the accuracy is quite good!

Generally, the merging of all the approaches into one system that for
sure deals with the same data still lies ahead of us.

@all: Could you give Chris some pointers on which data you used for the
surface form to URI candidate mapping? Most of you had a lower number
of unfound URIs.


> (1) ESA represents each resource using a vector of resources, where the
> weight in each cell represents the 'relatedness' between this resource and
> another resource [1]. Thus, as the system is scaled up, the potential length
> of each vector also increases linearly (of course I'm not storing if the
> value is 0). I solved this problem by extracting only the top 100 tfidf
> tokens from each document, and running ESA indexing using these only. Each
> resource vector is now much more sparse than it would be if indexing were
> run on the entire corpus with no filtering.

Scaling up to the size of the English Wikipedia is in fact not that
easy. But filtering is of course allowed and often desired regardless
of scaling issues. Tokens that do not occur at least two or three
times with a resource can be regarded as noise and thrown away.
Since you had this filter already in the first PigLatin script, I
suppose this step was not enough in this case.
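
The two filters discussed here (a minimum co-occurrence count and keeping only the top-k tf-idf tokens per resource) can be sketched as follows. This is an in-memory illustration only. The object name TokenFilter, the map layout and the plain tf * log(N/df) weighting are assumptions, not the PigLatin script or the ESA indexing code.

```scala
object TokenFilter {
  // counts: resource -> (token -> co-occurrence count). Hypothetical layout.
  // Keeps, per resource, the k highest-weighted tokens whose raw count is
  // at least minCount. Weight = count * log(numResources / documentFrequency).
  def topTfIdf(counts: Map[String, Map[String, Int]],
               k: Int,
               minCount: Int): Map[String, Map[String, Double]] = {
    val nDocs = counts.size.toDouble
    // Document frequency: in how many resources each token occurs
    // (computed before the minCount filter is applied).
    val df: Map[String, Int] =
      counts.values.flatMap(_.keys).groupBy(identity).map { case (t, ts) => t -> ts.size }

    counts.map { case (res, tf) =>
      val scored = tf.toSeq
        .filter { case (_, c) => c >= minCount }
        .map { case (t, c) => t -> c * math.log(nDocs / df(t)) }
        // Highest weight first; ties broken alphabetically for determinism.
        .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
        .take(k)
      res -> scored.toMap
    }
  }
}
```

With k = 100 this corresponds to the "top 100 tfidf tokens" filter described above. On the full Wikipedia the same logic would have to run as a distributed job rather than on in-memory maps.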

Unfortunately, my knowledge of ESA is very limited. Therefore, I can't
really give advice on how to optimize the implementation. At this
point in time, I can live with the pragmatic filtering measure you
took.
The most important thing is to evaluate both of your approaches with
CSAW and M&W and get the code in good shape (refactoring and
documentation).


> (2) I have two indexes that need to be persisted: the inverted index [2],
> and the EsaVectorStore [3]. I've spent a lot of time with Kryo [4] and JDBM
> [5] trying to make this work, but the nested map structures are difficult to
> serialize, and I can't seem to get it right. Right now I'm creating these on
> the fly, but running the whole thing takes ~8hrs, so it's really a waste
> just to test disambiguation.

Did you try out DiskContextStore from Jo's repo? It maps resources
onto a map. For his implementation, resources and tokens need to have
integer indexes, but you could change this (especially when
debugging).
https://github.com/jodaiber/dbpedia-spotlight-db/blob/master/core/src/main/scala/org/dbpedia/spotlight/db/disk/DiskContextStore.scala
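
To illustrate why integer indexes make persistence easier: once resources and tokens are plain Ints, a nested map can be written as a flat binary stream without any serialization framework. The sketch below uses only java.io. The object name NestedMapIO and the file layout are invented for this example. It is not how DiskContextStore, Kryo or JDBM work.

```scala
import java.io._

object NestedMapIO {
  // resource id -> (token id -> weight)
  type Store = Map[Int, Map[Int, Double]]

  // Layout: [#resources] then per resource [id][#entries] ([token][weight])*
  def write(store: Store, file: File): Unit = {
    val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
    try {
      out.writeInt(store.size)
      for ((res, vec) <- store) {
        out.writeInt(res)
        out.writeInt(vec.size)
        for ((tok, w) <- vec) { out.writeInt(tok); out.writeDouble(w) }
      }
    } finally out.close()
  }

  def read(file: File): Store = {
    val in = new DataInputStream(new BufferedInputStream(new FileInputStream(file)))
    try {
      val n = in.readInt()
      (0 until n).map { _ =>
        val res = in.readInt()
        val m = in.readInt()
        val vec = (0 until m).map { _ =>
          val tok = in.readInt()
          val w = in.readDouble()
          tok -> w
        }.toMap
        res -> vec
      }.toMap
    } finally in.close()
  }
}
```

This loads everything into memory on read, so it only helps with the "rebuild takes hours" problem, not with indexes that do not fit in RAM.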


Cheers,
Max

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Dbp-spotlight-developers mailing list
Dbp-spotlight-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-developers











--
---
Pablo N. Mendes