Re: Fwd: Re: [Archive-access-discuss] Re: nutchwax


>
>
>
>
> ---------- Forwarded message ----------
> From: stack <st...@ar...>
> To: stack <st...@ar...>
> Date: Sat, 11 Feb 2006 10:59:07 -0800
> Subject: Re: [Archive-access-discuss] Re: nutchwax
> Lukáš:
>
> I committed code to undo any HTML entity encoding found in text to be
> emitted by OpenSearchServlet.  I committed on the nutchwax 'release-0_4'
> branch, so be careful to get this branch from CVS rather than HEAD if
> building from source.  If you just want the WAR with the fix, it's
> available here: http://archive.org/~stack/nutchwax.war.  Let me know if
> you want me to make up a complete nutchwax tarball.  Let me know if the
> fix works for you (Here's the bug:
> https://sourceforge.net/tracker/index.php?func=detail&aid=1429788&group_id=118427&atid=681137 ).

it works very well! good work.
I've just downloaded nutchwax.war and .. it seems to be ok:)

-lm

>
> This is a band-aid fix until the core issue gets addressed in nutch.
> I'll work on trying to get this done this week.
>
> This is a pretty serious issue.  Text snippets -- i.e. the 'description'
> field in the XML -- that have anything but plain ASCII are mangled,
> showing ugly numeric character references, '&#343;', etc., in place
> of legit UTF-8 characters.  It was also possible to bypass our
> legit-XML character checking by encoding illegal characters: e.g. '&#7;'.
> If the fix works for you, Lukáš, I'll make a new release of nutchwax with
> the band-aid incorporated later this week.  (Hopefully, by the release of
> the 0.6.0 mapreduce version of NutchWAX, we will have the real fix
> incorporated.)
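
A rough sketch of the kind of check that has to run after entity decoding, so that something like '&#7;' cannot slip through as harmless-looking ASCII. It follows the Char production of the XML 1.0 specification; the class and method names are made up for illustration and are not the NutchWAX code.

```java
// Hypothetical helper, not taken from NutchWAX or Nutch.
class XmlCharFilterSketch {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops every code point that XML 1.0 does not allow (e.g. 0x07, the bell).
    static String stripIllegal(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The bell character between 'a' and 'b' is removed.
        System.out.println(stripIllegal("a\u0007b"));
    }
}
```

The ordering is the point here: if the filter only ever sees the still-encoded text, the characters '&', '#', '7' and ';' are all legal and the bell survives until something decodes it.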
>
> Good stuff,
> St.Ack
>
> stack wrote:
> > Lukas Matejka wrote:
> >> On Thu, 9 February 2006 18:51, stack wrote:
> >>> ....
> >>> I see the 0x07 Bell character in the original page.  Below is an 'od'
> >>> dump of the relevant section with the ascii line underwritten by its
> >>> hex
> >>> representation.  The last line has the 0x07 character.
> >>
> >> You're absolutely right about the bell character, but I think there is
> >> another, different problem. I'll try to explain.
> >>
> >> I search for the word 'kniha' (which means 'book') through
>
> http://war.mzk.cz:8080/nutchwax/opensearch?query=kniha&start=0&hitsPerPage=10&hitsPerDup=1&dedupField=exacturl
>
> >> The answer is valid XML (57472 hits for the word 'kniha'), but in the
> >> result, the 'description' element contains HTML entities that represent
> >> Czech characters with diacritics, and that's the problem. The original
> >> site doesn't contain these HTML entities, just regular Czech characters.
> >>
> >> The interesting thing is that the 'title' element shows Czech characters
> >> correctly, but the 'description' element shows them as HTML entities
> >> (for instance, the HTML entity &yacute; represents the character y with
> >> an acute accent).
> >>
> >> Do you have any idea where the problem could be?
> >
> > Thanks for the extra info Lukas.
> >
> > Digging in, I see that the generation of summaries runs the text
> > through org.apache.nutch.html.Entities.  Here is the code for the
> > Entities#encode method that all summary text is run through:
> >
> >  static final public String encode(String s) {
> >    int length = s.length();
> >    StringBuffer buffer = new StringBuffer(length * 2);
> >    for (int i = 0; i < length; i++) {
> >      char c = s.charAt(i);
> >      int j = (int)c;
> >      if (j < 0x100 && encoder[j] != null) {
> >        buffer.append(encoder[j]);        // have a named encoding
> >        buffer.append(';');
> >      } else if (j < 0x80) {
> >        buffer.append(c);                 // use ASCII value
> >      } else {
> >        buffer.append("&#");              // use numeric encoding
> >        buffer.append((int)c);
> >        buffer.append(';');
> >      }
> >    }
> >    return buffer.toString();
> >  }
> >
> > Any character that is super-ASCII gets a numeric character encoding.
> > Assuming all is UTF-8 in nutch, then we probably don't want HTML
> > entity encoding when we're outputting UTF-8 XML.  In fact, it looks like
> > we don't want any HTML entity encoding at all when outputting XML.
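
To make the effect on Czech text concrete, here is a small self-contained stand-in that mirrors the three branches of the method quoted above. It is not the Nutch source: the class name is invented, and the three-entry named-entity table is a hypothetical substitute for the much larger `encoder` array.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the behaviour described above; not the Nutch class.
class EncodeDemo {

    // Hypothetical, tiny named-entity table (entity name without the ';').
    private static final Map<Character, String> NAMED = new HashMap<>();
    static {
        NAMED.put('&', "&amp");
        NAMED.put('<', "&lt");
        NAMED.put('\u00FD', "&yacute"); // y with acute accent
    }

    static String encode(String s) {
        StringBuilder out = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String named = NAMED.get(c);
            if (c < 0x100 && named != null) {
                out.append(named).append(';');             // named entity
            } else if (c < 0x80) {
                out.append(c);                             // plain ASCII passes through
            } else {
                out.append("&#").append((int) c).append(';'); // numeric reference
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0159 (r with caron) followed by U+00FD (y with acute)
        System.out.println(encode("\u0159\u00FD")); // prints "&#345;&yacute;"
    }
}
```

Every non-ASCII character ends up as either a named entity or a decimal reference, which matches the mangled 'description' text reported above.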
> >
> > The call to Entities#encode is buried in nutch inside the Fragment
> > inner class of Summary.  It would take a good bit of work making up a
> > NutchBean that called an alternate Summary-maker when outputting XML.
> >
> > Meantime, I have a quick fix that adds HTML entity decoding to the
> > Nutchwax OpenSearchServlet.  Let me do some more testing and hopefully
> > I can commit later today. I'll let you know.
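
For readers wondering what such a decoding step might look like, below is a minimal sketch under stated assumptions. It is not the code that was committed to OpenSearchServlet: the class name, the regular expression, and the five-entry named-entity table are all hypothetical, and a real table would need to cover every entity the encoder can emit.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the NutchWAX implementation.
class EntityDecodeSketch {

    // Assumed subset of named entities, keyed by name without '&' and ';'.
    private static final Map<String, String> NAMED = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "yacute", "\u00FD");

    // Matches &#123; (decimal), &#x7B; (hex) and &name; forms.
    private static final Pattern ENTITY =
        Pattern.compile("&(#\\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder(s.length());
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            String replacement = resolve(m.group(1));
            // Unknown entities are left exactly as they were.
            out.append(replacement != null ? replacement : m.group());
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }

    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        try {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int cp = hex ? Integer.parseInt(body.substring(2), 16)
                         : Integer.parseInt(body.substring(1), 10);
            return Character.isValidCodePoint(cp)
                ? new String(Character.toChars(cp)) : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("kniha &#345;&yacute; &amp; &bogus;"));
    }
}
```

Note that decoded text can contain '&' and '<' again, so whatever writes the XML still has to apply ordinary XML escaping afterwards.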
> >
> > St.Ack
> >
> >
> > -------------------------------------------------------
> > This SF.net email is sponsored by: Splunk Inc. Do you grep through log
> > files
> > for problems?  Stop!  Download the new AJAX search engine that makes
> > searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
> > http://sel.as-us.falkag.net/sel?cmd=3Dk&kid=103432&bid#0486&dat=121642
> > _______________________________________________
> > Archive-access-discuss mailing list
> > Arc...@li...
> > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
>

-- 
------------------------------
Bc.Lukas Matejka
email:mat...@ce...
GSM:+420777093233