From: SourceForge.net <no...@so...> - 2009-11-06 09:33:34
Bugs item #2891882, was opened at 2009-11-04 20:39
Message generated for change (Comment added) made by aditsu
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=113153&aid=2891882&group_id=13153

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Tidy functionality
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Alex Kainov (alexkainov)
Assigned to: Adrian Sandor (aditsu)
Summary: Incorrect parsing of <td> attributes

Initial Comment:
INPUT:
<td width="14%" bgcolor="#008000">

TIDY OUTPUT:
<td width="14%" bgcolor="#008000">

DESIRED OUTPUT:
<td style="width:14%;background-color:#008000;">

Error:
The tag: "td" doesn't have an attribute: "width" in currently active versions.
The tag: "td" doesn't have an attribute: "bgcolor" in currently active versions.

Found in version: r918

----------------------------------------------------------------------

>Comment By: Adrian Sandor (aditsu)
Date: 2009-11-06 17:33

Message:
Hi, I asked for the input, but you didn't provide it, you only explained
how you obtained it.
You're also using TOOOOOOOOOO many steps to convert the input before
passing it to JTidy, but that shouldn't affect the processing of the td
tag you showed.

You don't need any .NET stuff. You can find Tidy at
http://tidy.sourceforge.net/ and http://sourceforge.net/projects/tidy

----------------------------------------------------------------------

Comment By: Alex Kainov (alexkainov)
Date: 2009-11-06 16:58

Message:
Hi ! Thanks for the answer !
Well, I use URL u.openStream() as input for the parser:

URL u = new URL(url + doc.getUniversalID());
BufferedReader in = new BufferedReader(
        new InputStreamReader(u.openStream(), "UTF-8"));
String s;
StringBuffer htmlStr = new StringBuffer();
while ((s = in.readLine()) != null) {
    htmlStr.append(s);
}
String htmlString = htmlStr.toString();

The code is quite simple:

Tidy tidy = new Tidy(); // obtain a new Tidy instance
tidy.setDocType("strict");
tidy.setDropFontTags(true);
tidy.setFixBackslash(true);
tidy.setFixUri(true);
tidy.setJoinClasses(true);
tidy.setJoinStyles(true);
tidy.setLogicalEmphasis(true);
tidy.setQuiet(true);
tidy.setQuoteMarks(true);
tidy.setShowWarnings(false);
tidy.setTidyMark(false);
tidy.setXHTML(true);
tidy.setInputEncoding("UTF8");
tidy.setOutputEncoding("UTF8");

byte currentXMLBytes[] = htmlString.getBytes("UTF-8");
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(currentXMLBytes);
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
tidy.parse(byteArrayInputStream, byteArrayOutputStream);
String sBuffer = byteArrayOutputStream.toString("UTF-8");

Concerning your proposal of using tidy (the C program). I've found only
links to the program for .NET. .NET is not installed on my computer.
Any idea ?

Regards,
Alex.

----------------------------------------------------------------------

Comment By: Adrian Sandor (aditsu)
Date: 2009-11-06 10:32

Message:
I don't get those errors, maybe you haven't included the whole input
(especially the doctype). You also haven't provided the code.
Anyway, check if tidy (the C program) behaves differently.

----------------------------------------------------------------------
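
A side note on the reading loop quoted in the 2009-11-06 16:58 comment: BufferedReader.readLine() returns each line without its line terminator, so appending the lines back to back joins them with nothing in between. The sketch below only illustrates that behaviour using the standard library. The class name ReadLineDemo, its method names, and the sample input are hypothetical, and nothing in the thread establishes that this is what causes the reported errors.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadLineDemo {

    // Mirrors the loop from the comment: readLine() strips the line
    // terminator, so consecutive lines end up glued together.
    static String readJoined(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line);
        }
        return sb.toString();
    }

    // Variant that puts a newline back after every line read.
    static String readKeepingLines(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input: a tag whose attribute starts on the next line.
        byte[] html = "<td\nwidth=\"14%\">".getBytes(StandardCharsets.UTF_8);
        System.out.println(readJoined(new ByteArrayInputStream(html)));
        System.out.println(readKeepingLines(new ByteArrayInputStream(html)));
    }
}
```

With an input where an attribute begins on a new line, the first method yields a tag name fused with the attribute name, while the second keeps them separated. Handing the stream from u.openStream() to the parser directly would avoid the intermediate string altogether, which seems to be in the spirit of the "too many steps" remark.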