Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 XMLExtractor (XML/RSS) - ID: 806831
Last Update: Comment added ( karl-ia )

RSS (whatever version) files are XML with URIs in
well-defined locations. We should have an RSS
extractor, or perhaps even a general opportunistic XML
extractor, which reads XML files and when it sees
URI-like fields, extracts them.


Submitted: Gordon Mohr ( gojomo ) - 2003-09-15 22:08
Priority: 7
Status: Closed
Resolution: None
Assigned: Gordon Mohr
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 10 )

Date: 2007-03-14 01:22
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-673 -- please add further
comments at that location.


Date: 2005-11-02 19:34
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Closing as implemented; further fixes/enhancements to basic
ExtractorXML can be reported as new issues.


Date: 2005-09-28 04:44
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Patterned after Igor's ExtractorCSS, I've created a
super-simple ExtractorXML. Commit comment:

Work towards [ 806831 ] XMLExtractor (XML/RSS)
* ExtractorXML.java
initial commit; based on ExtractorCSS;
extract HTTP URIs in XML attributes and simple elements
(with only the URI as non-whitespace content)

This seems enough to parse RSS, so this should probably be
added to the extractors in the default profiles.
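
As a rough illustration of the rule in the commit comment (HTTP URIs in
attributes, plus simple elements whose only non-whitespace content is the
URI), here is a minimal regex-based sketch. It is not the actual
ExtractorXML code; the class name XmlUriSketch, the pattern, and the
sample feed are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlUriSketch {
    // Hypothetical pattern, not taken from Heritrix.
    // Group 1: a quoted attribute value consisting of an http(s) URI.
    // Group 2: an http(s) URI that is the only non-whitespace text
    //          between a '>' and the next '<'.
    private static final Pattern XML_URI = Pattern.compile(
        "=\\s*[\"'](https?://[^\"'\\s<>]+)[\"']"
        + "|>\\s*(https?://[^\\s<>]+)\\s*(?=<)",
        Pattern.CASE_INSENSITIVE);

    public static List<String> extract(CharSequence xml) {
        List<String> uris = new ArrayList<>();
        Matcher m = XML_URI.matcher(xml);
        while (m.find()) {
            // Exactly one of the two groups is set for each match.
            uris.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return uris;
    }

    public static void main(String[] args) {
        String rss = "<rss><channel>"
            + "<link> http://example.org/ </link>"
            + "<enclosure url=\"http://example.org/a.mp3\"/>"
            + "<description>See http://example.org/skipped for more</description>"
            + "</channel></rss>";
        // The URI inside <description> is surrounded by other text,
        // so it is not extracted.
        System.out.println(extract(rss));
    }
}
```

A regex pass like this does no real XML parsing, so it ignores CDATA
sections, entity escapes such as &amp;amp;, and relative links.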


Date: 2005-04-15 16:29
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Here are rss parsers:

https://rome.dev.java.net/
http://jakarta.apache.org/commons/sandbox/feedparser/


Date: 2004-07-29 21:54
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Interesting article on why XML on the web has failed. Beyond
that, it has an interesting discussion of XML charset handling
(or the lack thereof) over HTTP:

http://www.xml.com/pub/a/2004/07/21/dive.html


Date: 2004-04-13 17:07
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Changed summary again. Most RSS feeds have an 'http' scheme.

The advantage of an XMLExtractor (which would handle RSS files
too, so there is probably no advantage to an exclusively RSS
extractor) would be the ability to exploit the encoding
declaration at the head of the XML file. That said, it would be
rare for an XML file to be in an encoding other than a
single-byte one or one that is single-byte amenable (UTF-8).

Another reason to consider an XMLExtractor would be if a
purpose-built extractor were more performant than the UE. (Would
it take much to make the processors callable standalone from
the command line? That would make the comparison easier.)
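
To make the point about the encoding declaration concrete, the following is
a minimal sketch of reading the encoding named at the head of an XML file.
It assumes the content is already in memory as bytes and that the
declaration is in an ASCII-compatible encoding. It does not handle byte
order marks or UTF-16, and the class and method names (XmlEncodingSketch,
sniff) are made up for this example rather than taken from Heritrix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlEncodingSketch {
    // Matches encoding="..." inside an XML declaration at the very start.
    private static final Pattern DECL = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    public static Charset sniff(byte[] content) {
        // Only the first bytes matter; ISO-8859-1 maps every byte to a
        // char, so decoding the head this way cannot fail.
        int len = Math.min(content.length, 200);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the default.
            }
        }
        // XML without a declared encoding (and without a BOM) is UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><rss/>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(doc));
    }
}
```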





Date: 2004-04-02 00:16
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

UniversalExtractor might be considered 'rude' by some sites.
In particular, it can discover URIs that are deliberate
crawler traps (in that they mark the offending crawler as
being too aggressive).

So there could be a good reason to split RSS/XML off
separately, to allow finer settings for aggressiveness.
For example, for RSS/XML, we probably want only URIs
delimited by ' or ", not URIs delimited by spaces.
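
The difference between the two delimiter policies can be sketched as
follows. Both patterns are assumptions written for this example and are not
the ones used by Heritrix; the class name DelimiterSketch and the sample
input are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimiterSketch {
    // Loose: any http(s) run ending at whitespace, a quote, or a bracket.
    static final Pattern LOOSE =
        Pattern.compile("https?://[^\\s\"'<>]+");
    // Strict: the URI must fill a value enclosed in matching quotes.
    static final Pattern STRICT =
        Pattern.compile("([\"'])(https?://[^\\s\"'<>]+)\\1");

    static List<String> find(Pattern p, int group, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<item link=\"http://example.org/story\">"
            + "<note>do not follow http://example.org/trap please</note>"
            + "</item>";
        System.out.println("loose:  " + find(LOOSE, 0, xml));
        System.out.println("strict: " + find(STRICT, 2, xml));
    }
}
```

On this input the loose pattern reports both URIs, while the strict one
reports only the quoted link attribute and skips the URI embedded in
running text.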


Date: 2004-04-01 00:12
Sender: ia_igor (Project Admin)

Logged In: YES
user_id=715474

I looked into RSS files, and without parsing the documents
we cannot distinguish them from any other XML files. That
parsing might be expensive if its only purpose is to identify
RSS files so that we can resolve possible relative URLs. In
general, the RSS specs are not clear on how relative URLs should
be resolved, and from what I have learned, relative URLs in RSS
files are to be avoided for now. So we might just go ahead and
parse absolute URLs from all XML files with the universal
extractor. An operator can simply add a *.xml filter to this
extractor, and that is it.



Date: 2004-02-10 19:23
Sender: kristinn_sig (Project Admin)

Logged In: YES
user_id=892643

The universal extractor (ExtractorUniversal) does an OK job
of getting absolute URLs from XML files. It is unlikely that
a dedicated XML extractor could do much better. A
specialised RSS extractor might, though, be able to get
relative URLs that the universal extractor cannot.
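
For illustration, here is a minimal sketch of what handling a relative URL
could look like, assuming it is resolved against the URI the feed itself
was fetched from. That base is only one possible choice, and nothing in
this thread settles which base RSS intends. The class name
RelativeLinkSketch is hypothetical; the resolution itself is standard
java.net.URI behaviour.

```java
import java.net.URI;

public class RelativeLinkSketch {
    // Resolves a link found in a feed against the feed's own URI.
    // Absolute links are returned unchanged. URI.create throws
    // IllegalArgumentException for strings that are not valid URIs.
    public static String resolve(String feedUri, String link) {
        return URI.create(feedUri).resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        String feed = "http://example.org/feeds/news.xml";
        System.out.println(resolve(feed, "item/42.html"));
        System.out.println(resolve(feed, "../img/logo.png"));
        System.out.println(resolve(feed, "http://other.example/x"));
    }
}
```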


Date: 2004-02-10 17:19
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Updated summary to talk of a FetchRSS and a (SAX)
ExtractorXML processor.


Attached File

No Files Currently Attached

Changes ( 11 )

Field              Old Value                                             Date              By
close_date         -                                                     2005-11-02 19:34  gojomo
assigned_to        ia_igor                                               2005-11-02 19:34  gojomo
status_id          Open                                                  2005-11-02 19:34  gojomo
artifact_group_id  None                                                  2005-09-23 20:53  gojomo
priority           6                                                     2005-09-23 19:45  gojomo
priority           7                                                     2004-10-21 19:44  gojomo
priority           6                                                     2004-09-01 23:00  gojomo
priority           5                                                     2004-09-01 21:51  gojomo
summary            parsing rss (xml?) (FetchRSS/ExtractorXML Processor)  2004-04-13 17:07  stack-sf
summary            parsing rss (xml?)                                    2004-02-10 17:19  stack-sf
assigned_to        nobody                                                2004-01-07 00:50  gojomo