&nbsp; in link results in space (0x20) rather than no-break space (0xC2 0xA0)
Some real-world HTML I've come across: a page has anchor links with no-break spaces in them, e.g.:
<a href='no&nbsp;break&nbsp;space.html'>no break space link</a>
When parsed, these come out as regular spaces rather than the no-break variety. I've created a test suite for this: all is well with testSpace() and testNbSpace(), but testHtmlNbSpace() fails (it passes if I expect SPACE rather than NB_SPACE):
package com.github.liamsharp;

import java.util.List;

import junit.framework.TestCase;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

public class SpaceTests extends TestCase
{
    // The HTML character entity reference, as written in the page source.
    private static final String HTML_NB_SPACE = "&nbsp;";
    // The literal characters: regular space and no-break space.
    private static final String SPACE = "\u0020";
    private static final String NB_SPACE = "\u00A0";

    public void testHtmlNbSpace()
    {
        runSpaceTest(HTML_NB_SPACE, NB_SPACE);
    }

    public void testSpace()
    {
        runSpaceTest(SPACE, SPACE);
    }

    public void testNbSpace()
    {
        runSpaceTest(NB_SPACE, NB_SPACE);
    }

    // Embeds inputSpace in an href, parses the page, and checks the decoded attribute value.
    private void runSpaceTest(
        final String inputSpace,
        final String expectedOutputSpace)
    {
        final String content =
            "<html>"
            + " <body>"
            + " <a href='before" + inputSpace + "after'>foo</a>"
            + " </body>"
            + "</html>";
        final Source source = new Source(content);
        source.fullSequentialParse();
        final List<Element> anchors = source.getAllElements("a");
        assertFalse(anchors.isEmpty());
        final Element anchor = anchors.get(0);
        final String href = anchor.getAttributeValue("href");
        assertEquals("before" + expectedOutputSpace + "after", href);
    }
}
The tests, in a Maven project, can be grabbed from here if needed:
https://github.com/liamsharp/jerichohtml-html-comments-in-css
In:
src/test/java/com/github/liamsharp/SpaceTests.java
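For reference, the 0x20 and 0xC2 0xA0 in the title are the UTF-8 encodings of the two characters the tests compare. A small standalone sketch that prints them (the class and method names are made up for illustration and are not part of Jericho or the test project):

```java
import java.nio.charset.StandardCharsets;

public class SpaceBytes {
    // Returns the UTF-8 bytes of s as space-separated hex values.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to an unsigned value so bytes above 0x7F print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("space:          " + utf8Hex("\u0020"));
        System.out.println("no-break space: " + utf8Hex("\u00A0"));
    }
}
```

Dumping an href returned by the parser through the same helper is a quick way to see which of the two characters actually came back.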
There is a static configuration variable to control this behaviour:
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Config.html#ConvertNonBreakingSpaces
This is mentioned in the documentation of the CharacterReference.decode method.
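A minimal sketch of how a caller might switch the conversion off, assuming Jericho is on the classpath and that Config.ConvertNonBreakingSpaces is a public static boolean as the linked Javadoc describes. The class and helper names are hypothetical. Because the setting is global, the sketch restores the previous value afterwards.

```java
import java.util.List;

import net.htmlparser.jericho.Config;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

public class NbSpaceConfigSketch {
    // Returns the decoded href of the first anchor in html, with the
    // no-break space conversion disabled for the duration of the call.
    static String hrefWithoutConversion(String html) {
        final boolean previous = Config.ConvertNonBreakingSpaces;
        Config.ConvertNonBreakingSpaces = false; // global setting, not per Source
        try {
            final Source source = new Source(html);
            source.fullSequentialParse();
            final List<Element> anchors = source.getAllElements("a");
            return anchors.get(0).getAttributeValue("href");
        } finally {
            // Put the setting back so other code sees the value it expects.
            Config.ConvertNonBreakingSpaces = previous;
        }
    }

    public static void main(String[] args) {
        final String href = hrefWithoutConversion("<a href='before&nbsp;after'>foo</a>");
        // Report whether the entity decoded to U+00A0 rather than U+0020.
        System.out.println(href.equals("before\u00A0after"));
    }
}
```

In a JUnit 3 test such as SpaceTests, the same save and restore could live in setUp() and tearDown() instead of a try/finally.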
Awesome, thanks Martin, much appreciated!