Hello,

 

There is a public web page from which I want to extract information using XQuery.

I have saved the source of this page both as a txt file and as an HTML document.

I have analysed the source and written my query accordingly, so that it extracts just the information I am looking for.

 

However, before I can run any XQuery, Kernow parses the original source and reports a series of errors, even though I assume the code is valid, since my browser renders the page without problems.

The errors point at code written in plain HTML, so it seems that the page I want to extract information from has been written in HTML *and* in XHTML, but I am not sure this is the real problem.

 

Here is what I have tried:

I took the complete original source of the page as it is and parsed it with Kernow.

I replaced the first two lines of code (the ones defining the environment and the head) with simply “ <?xml version="1.0" encoding="UTF-8"?> ” and I get the same error messages (plus a few about undeclared parameters).

 

The first lines of the page's source are:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

<title> - Recherche - page 8 - </title>

<meta name="description" content="" />

<meta name="keywords" content="" />

... </head>

Etc.

 

 

My question:

What can I do to make Kernow parse the source of the saved page without errors?

Does this mean that the schema declared in the page (the XHTML 1.0 Strict DTD in the DOCTYPE quoted above) is not the correct one for Kernow?

 

Thanks

 

Below is a list of the error messages I get, each with the code that triggers it (I get these errors with both headers, the original and the simplified one).

 

Error message:

Line 15, Col 3 The element type "meta" must be terminated by the matching end-tag "</meta>".

Code from line 9 to line 15 is:

<meta name="robots" content="noindex,follow">        <-- this is certainly the faulty one, but my browser has no problem with this code

<script type="text/javascript" src="/js/mootools.js"></script>

<script type="text/javascript" src="/recherche_antidot/js/Observer.js"></script>

<script type="text/javascript" src="/recherche_antidot/js/Autocompleter.js"></script>

<script type="text/javascript" src="/recherche_antidot/js/Autocompleter.Request.js"></script>

<link href="/css/vi.css" type="text/css" rel="stylesheet" />

<link href="/css/structur.css" type="text/css" rel="stylesheet" />

</head>
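
To illustrate this first error outside Kernow, here is a minimal sketch. It assumes Python's standard xml.etree.ElementTree parser as a stand-in (it is not the parser Kernow uses, so the wording of the message will differ), and the helper name xml_error is purely illustrative. It only shows that an XML parser rejects a meta element that is never closed, while the self-closed form is accepted.

```python
import xml.etree.ElementTree as ET


def xml_error(fragment):
    """Return None if the fragment is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(fragment)
        return None
    except ET.ParseError as exc:
        return str(exc)


# HTML-style void element: legal for a browser, not well-formed XML.
unclosed = '<head><meta name="robots" content="noindex,follow"></head>'

# XHTML-style self-closing element: well-formed XML.
closed = '<head><meta name="robots" content="noindex,follow" /></head>'

print("unclosed:", xml_error(unclosed))
print("closed:  ", xml_error(closed))
```

A browser uses a forgiving HTML parser, which would explain why the page displays normally even though an XML parser refuses it.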

 

 

Error message:

Line 39, Col 72 The reference to entity "f" must end with the ';' delimiter.

Code from line 37 to line 39 is:

<select name="crit" id="crit_vis" onchange="return loadAjaxData(this.value,'ajaxData_vis'); return false;">

                <option value="/recherche_antidot/recherche.phps=+&amp;f=Vis&amp;o=recette,ASC;afs:relevance,DESC;date_debut,ASC;date_fin,ASC;nom_eve,ASC;appellation,ASC;nom_complet,ASC;m_deg,DESC;couleur,ASC;annee,DESC;date,ASC;cat_article,ASC;parution,DESC;titre,ASC;dateline,ASC&amp;acces_libre=1">Par défaut</option>

             <option  value="/recherche_antidot/recherche.php?s=&f=Vis&o=appreciation,DESC&acces_libre=1">Note (décroissante)</option>

                                               

Error message :

Line 363, Col 16 The entity name must immediately follow the '&' in the entity reference.

Code from line 261 to line 363 is:

<a onclick="compteur(20114814)" href="le-guide/dopff-et-irion-tardives-2007-20114814.html"

title=" Dopff & Irion tardives  2007">
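
The two entity errors above both seem to come from a bare "&" inside an attribute value. The following sketch reproduces that, again assuming Python's standard library as a stand-in for the parser behind Kernow. The helper name is_well_formed and the shortened attribute values are illustrative, not taken verbatim from the page.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Bare ampersands, as in the title attribute and the second option value.
raw_title = '<a title=" Dopff & Irion tardives  2007">x</a>'
raw_option = '<option value="/recherche.php?s=&f=Vis&o=appreciation,DESC">x</option>'

# The same markup with every ampersand written as &amp;.
fixed_title = '<a title=" Dopff &amp; Irion tardives  2007">x</a>'
fixed_option = '<option value="/recherche.php?s=&amp;f=Vis&amp;o=appreciation,DESC">x</option>'

for fragment in (raw_title, raw_option, fixed_title, fixed_option):
    print(is_well_formed(fragment), fragment)

# escape() turns a raw ampersand into its entity form. It must only be applied
# to text that is not already escaped, otherwise &amp; becomes &amp;amp;.
print(escape("Dopff & Irion"))
```

Note that the first option element in the excerpt already uses &amp; and is not reported, which is consistent with this reading.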

 

Error message:

Line 721, Col 23 Element type "scr" must be followed by either attribute specifications, ">" or "/>".

Code at line 721 is:

<script language="JavaScript" type="text/javascript">

                                document.write('<scr'+'ipt id="jspub99029" language="JavaScript"

type="text/javascript"

src="http://fr.a2dfp.net/ad?s=99029&m=js&ncb='+a2dRandom+'"></scr'+'ipt>');

                </script>
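
For this last error, the "<scr" inside the JavaScript string is read by an XML parser as the start of an element. A minimal sketch, assuming Python's standard xml.etree.ElementTree parser rather than Kernow's own and using a shortened version of the script, shows that the same content is accepted once it is wrapped in a CDATA section:

```python
import xml.etree.ElementTree as ET

# Shortened version of the inline script: the "<scr" in the string literal
# is seen by an XML parser as the beginning of a tag.
bad = """<script type="text/javascript">document.write('<scr'+'ipt></scr'+'ipt>');</script>"""

# Same script wrapped in CDATA, so the parser treats it as plain text.
good = """<script type="text/javascript"><![CDATA[document.write('<scr'+'ipt></scr'+'ipt>');]]></script>"""

try:
    ET.fromstring(bad)
    bad_parses = True
except ET.ParseError as exc:
    bad_parses = False
    print("rejected:", exc)

element = ET.fromstring(good)
print("accepted, script text is:", element.text)
```

This only demonstrates the well-formedness rule; whether editing the saved page by hand is practical for a page of this size is a separate question.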

 

 

 

With my best regards,

Dr. Christian Dugast
tech2biz

www.tech2biz.eu

My Blog on Language Business


Moltkestr. 11
53604 Bad Honnef


T:   +49 2224 98 89 561
F:   +49 3222 64 50 867

M:   +49 151 22 33 34 32

 

http://de.linkedin.com/in/dugast

 


