Rafael Odon Alencar
As you've seen, I processed the Portuguese dump of Wikipedia. Now I'm trying to set up the WikipediaMiner services on Ubuntu Linux. I could deploy the project correctly in Tomcat with all the JARs, but I've got an issue…
When the WikipediaMinerServlet is initialized (init() method), it tries to read some XSL files from disk (help.xsl, compare.xsl, wikify.xsl and others). But these files don't exist here, and the Transformer class doesn't seem to create them from scratch in this case. Are these files supposed to be located somewhere inside the project?
Thanks for the help!
Hmm, it seems the files are in the web/xsl folder; it was hidden in my Eclipse project. So I'll have to change the code to use a relative path, since the path given in the code is not correct. I'll report back here later.
So, I've changed some lines of WikipediaMinerServlet.java (near line 126) to resolve the XSL files relative to the deployed web application. Take a look:
// Get the real filesystem path of the web/xsl folder inside the deployed project.
// The leading "/" is what the Servlet API expects; note that getRealPath() can
// return null if the container does not unpack the WAR to disk.
File xslDirectory = new File(getServletContext().getRealPath("/xsl"));
// Use this folder in the "buildTransformer" method calls.
transformersByName = new HashMap<String,Transformer>();
transformersByName.put("help", buildTransformer("help", xslDirectory, tf));
transformersByName.put("loading", buildTransformer("loading", xslDirectory, tf));
transformersByName.put("search", buildTransformer("search", xslDirectory, tf));
transformersByName.put("compare", buildTransformer("compare", xslDirectory, tf));
transformersByName.put("wikify", buildTransformer("wikify", xslDirectory, tf));
It worked fine. Now I'm moving on to the other problems I have here, whose cause I don't know yet. Bye!
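In case it helps anyone with the same problem, here is a rough, standalone sketch of how a helper with the same shape as the buildTransformer(name, xslDirectory, tf) calls could load a stylesheet from a directory. This is not the actual WikipediaMiner code: the class name XslLoadingSketch, the method body, and the tiny test stylesheet are my own assumptions, and it uses only the standard javax.xml.transform API.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslLoadingSketch {

    // Hypothetical stand-in for the servlet's helper: compiles <xslDirectory>/<name>.xsl.
    // Fails early with a readable message instead of a cryptic Transformer error.
    public static Transformer buildTransformer(String name, File xslDirectory, TransformerFactory tf)
            throws TransformerException, IOException {
        File xslFile = new File(xslDirectory, name + ".xsl");
        if (!xslFile.canRead()) {
            throw new IOException("Cannot read stylesheet: " + xslFile.getAbsolutePath());
        }
        return tf.newTransformer(new StreamSource(xslFile));
    }

    // Writes a throwaway help.xsl into a temp directory, loads it, and applies it.
    public static String demo() {
        try {
            File dir = Files.createTempDirectory("xsl").toFile();
            String xsl =
                "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output method=\"text\"/>"
                + "<xsl:template match=\"/\">Hello, <xsl:value-of select=\"/greeting/@to\"/></xsl:template>"
                + "</xsl:stylesheet>";
            Files.write(new File(dir, "help.xsl").toPath(), xsl.getBytes(StandardCharsets.UTF_8));

            Transformer t = buildTransformer("help", dir, TransformerFactory.newInstance());
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader("<greeting to=\"wiki\"/>")),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the servlet, the directory would come from getServletContext().getRealPath(...) as in my change above; the explicit canRead() check is just there so that a wrong path shows up immediately in the Tomcat log.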
So, the system is up now. I've done some tests with text snippets and websites. The results seem acceptable, but I think there are a lot of issues related to character encoding (the most common encoding in Brazil is ISO-8859-1). I just don't know whether such problems also interfere with the link detection process, or only with the presentation.
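To illustrate the kind of encoding problem I mean, here is a small standalone sketch (not part of the toolkit; the class and method names are made up) that writes text in one charset and reads it back in another. I'm assuming the mismatch is between ISO-8859-1 and UTF-8, which I haven't confirmed is what happens inside the services.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Encodes the text with one charset, then decodes the same bytes with another.
    public static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // "S\u00e3o Paulo" is "Sao Paulo" with a tilde on the a, escaped so the
        // source file itself does not depend on any particular encoding.
        String original = "S\u00e3o Paulo";

        String same = roundTrip(original, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        String utf8ReadAsLatin1 = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        String latin1ReadAsUtf8 = roundTrip(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        // Matching charsets preserve the text.
        System.out.println(same.equals(original));
        // UTF-8 bytes read as ISO-8859-1: the accented letter becomes two characters.
        System.out.println(utf8ReadAsLatin1.length() - original.length());
        // ISO-8859-1 bytes read as UTF-8: the accented letter becomes U+FFFD.
        System.out.println(latin1ReadAsUtf8.contains("\uFFFD"));
    }
}
```

If the mangled form is what reaches the toolkit, an anchor text such as "São Paulo" would no longer match the label stored from the dump, so I suspect this could affect link detection and not only the presentation.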
As I see it, it seems a good idea to retrain the disambiguation and link detection models for non-English versions, as the logic may change. Do you agree?
Thanks for developing, organizing and sharing the toolkit!