
Descriptions in Disambiguation Pages

Help
Anonymous
2012-03-21
2013-05-30
  • Anonymous

    Anonymous - 2012-03-21

    Hi all,

    I'm dealing with a curious issue regarding disambiguation pages. I need to extract every article linked from a Wikipedia disambiguation page, together with the description given for it on that page. So, for example, for the disambiguation page http://en.wikipedia.org/wiki/Michael_Jackson_(disambiguation) I would get the following list:

    - Michael Jackson - Michael Jackson (1958–2009) was an American pop singer, musician, songwriter, dancer, and entertainer.

    - Mick Jackson (singer) - (born 1947), British singer-songwriter, known for "Blame It on the Boogie"

    - Mike and Michelle Jackson - (born 1946), Australian children's singer, songwriter, musician, radio show hosts
    …….
    …….

    The description attached to each article on the disambiguation page is important for disambiguation purposes. I assumed the code to accomplish this would be:

    Article article = wikipedia.getArticleByTitle("Michael Jackson (disambiguation)");
    Page page = wikipedia.getPageById(article.getId());
    Map<String, String> disamMaP = new HashMap<String, String>();

    Article[] out = article.getLinksOut();

    for (int i = 0; i < out.length; i++) {
        Integer[] indexes = out[i].getSentenceIndexesMentioning(article);
        for (int j = 0; j < indexes.length; j++)
            disamMaP.put(out[i].getTitle(), article.getSentenceMarkup(indexes[j]));
    }
    

    The problem is that I always get an empty list of indexes. That makes no sense to me: if the article has a link out to another article, I would expect getSentenceIndexesMentioning to return at least one sentence index for that link.

    Is there at least another way to do that?

     
  • Edgar Meij

    Edgar Meij - 2012-03-22

    Hi Rafa(?),

    I can confirm the error, or at least the fact that you don't get any results. This is not surprising, however. If you look at page A, http://en.wikipedia.org/wiki/Michael_Jackson_%28disambiguation%29, you'll find, for example, a link to page B, http://en.wikipedia.org/wiki/Michael_A._Jackson.

    Now, if you want to retrieve the text on page B that refers to A you won't find it, since the actual text is not included in the contents of the article. The only reference is found in the top line "For other people named Michael Jackson, see Michael Jackson (disambiguation)." This is inserted by the disambiguation template at display time, however, and can't be extracted. It'd also be of limited use. Right?

    I guess what you want is either the first sentence from each ambiguous Michael Jackson, i.e., from all outlinks of page A, or the actual snippet on the disambiguation page. For the latter you need to "reverse" your code, along the lines of

      Article article = wikipedia.getArticleByTitle("Michael Jackson (disambiguation)");
      Article[] out = article.getLinksOut();
      for (int i = 0; i < out.length; i++) {
          // Ask the disambiguation page for its own sentences that mention each outlink
          Integer[] indexes = article.getSentenceIndexesMentioning(out[i]);
          System.out.println(out[i].getTitle());
          for (int j = 0; j < indexes.length; j++) {
              System.out.println(article.getSentenceMarkup(indexes[j]));
          }
      }
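
    To make the direction of the call concrete, here is a minimal, self-contained sketch that does not use Wikipedia Miner at all. ToyArticle, sentenceIndexesMentioning and describe are hypothetical stand-ins written only for this illustration, and the page contents are toy data; the real library may detect mentions quite differently. The sketch only shows that asking the disambiguation page about an outlink finds the description line, while asking the outlink about the disambiguation page finds nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DisambiguationSketch {

    // Hypothetical stand-in for a wiki page: a title plus its sentences as markup.
    static class ToyArticle {
        final String title;
        final List<String> sentences;

        ToyArticle(String title, List<String> sentences) {
            this.title = title;
            this.sentences = sentences;
        }

        // Indexes of sentences on THIS page that contain a wiki link to `target`.
        List<Integer> sentenceIndexesMentioning(ToyArticle target) {
            List<Integer> indexes = new ArrayList<>();
            for (int i = 0; i < sentences.size(); i++) {
                String s = sentences.get(i);
                if (s.contains("[[" + target.title + "]]")
                        || s.contains("[[" + target.title + "|")) {
                    indexes.add(i);
                }
            }
            return indexes;
        }
    }

    // Maps each linked title to the first line of the disambiguation page that links to it.
    static Map<String, String> describe(ToyArticle disambiguation, List<ToyArticle> linksOut) {
        Map<String, String> descriptions = new LinkedHashMap<>();
        for (ToyArticle target : linksOut) {
            for (int index : disambiguation.sentenceIndexesMentioning(target)) {
                descriptions.putIfAbsent(target.title, disambiguation.sentences.get(index));
            }
        }
        return descriptions;
    }

    public static void main(String[] args) {
        // Toy data: the target page never links back to the disambiguation page in its own text.
        ToyArticle singer = new ToyArticle("Mick Jackson (singer)",
                Arrays.asList("Mick Jackson is a British singer-songwriter."));
        ToyArticle disambiguation = new ToyArticle("Michael Jackson (disambiguation)",
                Arrays.asList(
                        "Michael Jackson may refer to:",
                        "[[Mick Jackson (singer)]] (born 1947), British singer-songwriter"));

        System.out.println(disambiguation.sentenceIndexesMentioning(singer)); // prints [1]
        System.out.println(singer.sentenceIndexesMentioning(disambiguation)); // prints []
        System.out.println(describe(disambiguation, Arrays.asList(singer)));
    }
}
```

    The sketch uses putIfAbsent so that the first matching line is kept for each title; a plain put, as in the original question, would keep the last one instead.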
    

    Hth,

    Edgar

     
  • Rafa Haro

    Rafa Haro - 2012-03-23

    Hi Edgar,

    Yeah, it was me :-). I logged on with an OpenID, so my name doesn't appear. Thank you very much for your help. I'm now getting exactly what I want. I realize that I misunderstood the API: I thought that getSentenceIndexesMentioning(article) returned the sentences in which the article passed as the parameter links to and mentions the article the method is called on.

    I never thought that it might work the other way around.

    Thanks again. I'll keep going with this.

     
