Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 SURT needs facelift - ID: 1108520
Last Update: Comment added ( karl-ia )

This RFE goes into the usability/UI category.

Here are my notions on SURT. I'm guessing that others,
particularly users without the benefit of being able to
sit in an office beside webgroup members, have similar
reactions. This RFE is for collecting comments on
perceptions of SURT.

SURTs such as 'http://(org,archive,www,)' and
'http://(org,archive,' make for cognitive dissonance:
they look like URIs but are not, and they have weird
stuff like unclosed parens. That they are so
'odd'/'ugly' hampers their dissemination and the
understanding of what they are about.
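
For readers new to the form, here is a minimal Python sketch of how a plain URI maps to its SURT form. It is an illustration only, not Heritrix's own conversion code: the function name `to_surt` is hypothetical, and it ignores ports, userinfo and the other normalization a real implementation would need.

```python
from urllib.parse import urlsplit


def to_surt(uri: str) -> str:
    """Return a simplified SURT form of `uri` (hypothetical helper).

    The host labels are reversed, each followed by a comma, and the
    whole host part is wrapped in parentheses. Ports and userinfo are
    ignored in this sketch.
    """
    parts = urlsplit(uri)
    host = parts.hostname or ""          # urlsplit lowercases the hostname
    labels = host.split(".") if host else []
    reversed_host = "".join(label + "," for label in reversed(labels))
    surt = f"{parts.scheme}://({reversed_host})" + parts.path
    if parts.query:
        surt += "?" + parts.query
    return surt


print(to_surt("http://www.archive.org/"))  # http://(org,archive,www,)/
print(to_surt("http://b.com/images/"))     # http://(com,b,)/images/
```

Because the most significant host label comes first, URIs from the same domain share a common leading string, which is what makes prefix-based scoping possible.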

Suggestions:

+ Make them not look like URIs (Give them their own
scheme).
+ OR, make them look just like URIs with only
difference being reversed domain portion (Let the
context rule them SURTs rather than URIs)
+ Punt the parens, at least.

I suggest that we implement this RFE before Heritrix
1.4 is released; it will help the proliferation of SURT
scope.


Submitted: Michael Stack ( stack-sf ) - 2005-01-24 18:05
Priority: 7
Status: Closed
Resolution: None
Assigned: Gordon Mohr
Category: Usability/UI
Group: 1.6.0
Visibility: Public


Comments ( 7 )

Date: 2007-03-14 01:38
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-887 -- please add further
comments at that location.


Date: 2005-09-27 00:02
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Added doc of "+" syntax for SurtPrefixScope. Commit comment:

More doc for [ 1108520 ] SURT needs facelift
* user_manual.xml
added 2 paragraphs explaining '+' syntax for specifying
SURT prefixes in seeds box

With previous doc work and general broader
discussion/understanding of SURTs, closing this issue as
completed.


Date: 2005-03-29 00:01
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Taking Andy's suggestion. Commit comment:

Work for [ 1108520 ] SURT needs facelift : accept Andy's
suggestion of SURT-specification in seeds box
* SurtPrefixScope.java
interpret seeds list as source of seeds and directives;
also ensure earlier readPrefixes at initialization() time
* SurtPrefixSet.java
add method for reading mixed seeds & '+' directives from
file

SURT doc update forthcoming.



Date: 2005-03-23 18:38
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Response from Andy @ LOC on list:
Andy Boyko wrote:
> Gordon Mohr (Internet Archive) wrote:
>
>> We'd like feedback...
>> - Are SURT options understandable?
>> - How could they be made easier to use/understand/configure?
>
> It'd be nice to be able to supply auxiliary SURTs for scoping directly
> in the webapp UI, analogously to how seeds are supplied, without having
> to use the somewhat-awkward surts-source-file. In simplest form, just
> an auxiliary textarea, similar to the seed entry box, would do.
>
> A next step might be to augment the seed entry box so that, if you're
> converting seeds-as-surt-prefixes, to be able to flag certain supplied
> seed URLs to be used only as SURT filters rather than as seeds. For
> example, borrowing syntax I think I've seen IA use, a seed list:
> http://a.com/
> +http://b.com/images/
> would crawl starting from a.com, and include the b.com path as an
> additional allowed scope. Functionally, no different than putting
> http://a.com/ in the seeds and putting a SURT-ified
> http://(com,b,)/images/ in the surts-source-file, but obviously simpler
> to configure.
>
> That leads to the idea of an analogous "-" prefix, but then you start
> getting into precedence and all the other stuff that the NewScopingModel
> is for...
>
> -Andy
> aboy@loc.gov
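
A rough Python sketch of the seed-box idea Andy describes, for illustration: plain lines are seeds, and lines starting with '+' are scope-only directives. The function name `read_seeds_and_directives` is hypothetical, and the handling of blank lines and '#' comments is an assumption of this sketch rather than documented Heritrix behavior.

```python
def read_seeds_and_directives(lines):
    """Split seed-box lines into (seeds, scope_only) lists.

    Lines starting with '+' are treated as additional scope entries
    that are not crawled as seeds; every other non-empty line is a
    seed. Skipping '#' comment lines is an assumption of this sketch.
    """
    seeds, scope_only = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("+"):
            scope_only.append(line[1:].strip())
        else:
            seeds.append(line)
    return seeds, scope_only


seeds, scope_only = read_seeds_and_directives([
    "http://a.com/",
    "+http://b.com/images/",
])
print(seeds)       # ['http://a.com/']
print(scope_only)  # ['http://b.com/images/']
```

With seeds-as-surt-prefixes enabled, both lists would presumably contribute SURT prefixes to the scope, while only the first list would be crawled as starting points.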


Date: 2005-03-02 20:35
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Expand doc. Solicit input from list about understanding of
SURTs, and whether SurtPrefixScope (with auto
seed-to-surting) should become new default.




Date: 2005-03-02 19:18
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

If there were a surt scheme, would it be possible to do away
with the hard-to-find options 'surts-source-file' and
'seeds-as-surt-prefixes', simplifying surt config? (The notion
of mixing surt scheme and normal scheme in scope is an
interesting idea -- would it be hard to do?)

Another option, taken from reverse DNS, would be to add a
suffix to the domain part when it's a surt, as in:
http://org.archive.www.surt/ or http://com.surt-form/doc/. I
ain't sure I like this one too much -- it would have to be some
unique charsequence other than 'surt' or 'surt-form'.

For sure, you can add documentation to compensate for
awkward config (surt is lacking here), but this RFE is about
usability: making it so that resorting to doc. is minimized.


Date: 2005-02-09 21:51
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

I think making SURTs look any more like URIs would risk more
confusion and bugs: the 'cognitive dissonance' (and parse
errors) caused by commas and parentheses specifically
protect against misinterpreting SURTs and SURT prefixes.

(Which brings up an important clarification: the SURT form
of a URI will always have matched parentheses... it's only
the truncated prefix, used for scoping, that will have
fragments of hostnames and open parens. The terminology may
need cleanup.)
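
To make the distinction concrete, a small Python sketch, assuming scope membership is a plain string-prefix test on the SURT form. The helper names `in_scope` and `parens_balanced` are hypothetical and are not taken from the Heritrix code.

```python
def in_scope(surt, prefixes):
    """True if the SURT-form URI starts with any of the SURT prefixes."""
    return any(surt.startswith(prefix) for prefix in prefixes)


def parens_balanced(text):
    """True if the text has as many '(' as ')'."""
    return text.count("(") == text.count(")")


prefix = "http://(org,archive,"                # truncated prefix, open paren
full = "http://(org,archive,www,)/index.html"  # complete SURT form

print(parens_balanced(full))     # True
print(parens_balanced(prefix))   # False
print(in_scope(full, [prefix]))                       # True
print(in_scope("http://(org,archive,)/", [prefix]))   # True
print(in_scope("http://(org,archivex,)/", [prefix]))  # False
```

The trailing comma in the prefix acts as a label boundary, so 'archive' does not accidentally match 'archivex'.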

A special scheme, like: 'surt:http://org.archive.www./'
might work, but still has a higher risk of
misinterpretation, and could encourage the idea that regular
URIs and SURT-form URIs can be mixed in the same lists...
which while an interesting possibility, opens other issues
we'd have to tackle.

I prefer to fix any problems here with more documentation,
and clarifying the terminology/uses.


Attached File

No Files Currently Attached

Changes ( 5 )

Field              Old Value  Date              By
status_id          Open       2005-09-27 00:02  gojomo
close_date         -          2005-09-27 00:02  gojomo
artifact_group_id  None       2005-09-23 20:53  gojomo
assigned_to        nobody     2005-03-02 20:35  gojomo
priority           5          2005-02-10 00:34  stack-sf