From: Tod T. <tt...@ch...> - 2001-10-18 12:26:36
|
Quim Sanmarti wrote:
> > My theory was that most new market trends (worth paying attention to) are
> > usually already, or quickly will be, reflected in the open source
> > development community.
>
> Entering deep water -- opinion unreliable from this point :)
> My perception is eventually the inverse. Open-source has been traditionally
> being bound to research and innovation. It's now being used by companies as
> an innovation channel, so that market trends emerge later from there...

Sorry, I wasn't clear on this. I agree completely.

Tod
From: Jamie A. <jam...@sl...> - 2001-10-18 01:54:09
|
Here's another quickie patch to 3.2.x. It's support for a content-type alias
attribute for htdig which sorts out servers that get the content-type wrong in
responses. While getting the server correctly configured is probably a better
solution, this works if the server is maintained by someone else.

usage is something like:

    content_type_aliases: text/plain=text/html

and can be set for specific servers in the server block. Also included is the
patch as an attachment for when the c+p gets mangled. Is the config file entry
in the right format? I've not seen a definitive description of what should be
where, so it's a bit of a guess.

=======================================
diff -rup htdig/htcommon/defaults.cc htdig-patch3/htcommon/defaults.cc
--- htdig/htcommon/defaults.cc  Thu Aug 30 03:43:38 2001
+++ htdig-patch3/htcommon/defaults.cc   Thu Oct 18 14:49:27 2001
@@ -271,6 +271,13 @@ http://www.htdig.org/", " \
         compile time. \
 </p> \
 " }, \
+{ "content_type_aliases", "", \
+        "string list", "htdig", "server", "SLI-special", "Indexing:Where", \
+        "content_type_aliases: text/plain=text/html", " \
+        This attribute tells htdig to use a different parser to that indicated by the content-type \
+        returned by the server. This is occasionally useful for mis-configured servers who server \
+        up dynamic content but don't set the content-type correctly. \
+" }, \
 { "create_image_list", "false", \
         "boolean", "htdig", "", "all", "Extra Output", "create_image_list: yes", " \
         If set to true, a file with all the image URLs that \
@@ -2545,6 +2552,8 @@ form during indexing and translated for
         "string", "all", "", "3.2.0b1", "Extra Output", "wordlist_monitor_output: myfile", " \
         Print monitoring output on file instead of the default stderr. \
 " },
+
+  {0, 0, 0, 0, 0, 0, 0, 0, 0}
 };
Only in htdig-patch3/htcommon: defaults.cc~
diff -rup htdig/htdig/Document.cc htdig-patch3/htdig/Document.cc
--- htdig/htdig/Document.cc     Thu May 17 04:36:44 2001
+++ htdig-patch3/htdig/Document.cc      Thu Oct 18 14:27:37 2001
@@ -629,7 +629,7 @@ Document::RetrieveLocal(HtDateTime date,
 // parsers are external programs that will be used.
 //
 Parsable *
-Document::getParsable()
+Document::getParsable( const String& serverName )
 {
     static HTML *html = 0;
     static Plaintext *plaintext = 0;
@@ -637,6 +637,8 @@ Document::getParsable()
 
     Parsable *parsable = 0;
 
+    ContentTypeAlias( serverName );
+
     if (ExternalParser::canParse(contentType))
     {
         if (externalParser)
@@ -701,4 +703,51 @@ int Document::ShouldWeRetry(Transport::D
         return 1;
 
     return 0;
+}
+
+
+
+void
+Document::ContentTypeAlias( const String& serverName )
+{
+    HtConfiguration* config= HtConfiguration::config();
+    Dictionary content_type_aliases;
+
+    String l;
+    if ( serverName.length() > 0 )
+        l = config->Find("server", serverName, "content_type_aliases");
+    else
+        l = config->Find( "content_type_aliases");
+
+    if ( l.length() == 0 )
+        return;
+
+    String from, *to;
+    char *p = strtok(l, " \t");
+    char *ct_alias= NULL;
+    while (p)
+    {
+        ct_alias = strchr(p, '=');
+        if (! ct_alias )
+        {
+            p = strtok(0, " \t");
+            continue;
+        }
+        *ct_alias++= '\0';
+        from = p;
+        to= new String( ct_alias );
+        content_type_aliases.Add(from.get(), to);
+        // fprintf (stderr, "Alias: %s->%s\n", from.get(), to->get());
+        p = strtok(0, " \t");
+    }
+
+
+    String* new_ct = 0;
+    if ( (new_ct = (String*) content_type_aliases.Find( contentType )) )
+    {
+        if ( debug > 1 )
+            cout << "Translating content type '" << contentType << "' to '" << *new_ct << "'\n";
+        contentType = *new_ct;
+    }
+}
diff -rup htdig/htdig/Retriever.cc htdig-patch3/htdig/Retriever.cc
--- htdig/htdig/Retriever.cc    Tue Oct 16 15:27:16 2001
+++ htdig-patch3/htdig/Retriever.cc     Thu Oct 18 14:28:29 2001
@@ -802,7 +802,7 @@ Retriever::RetrievedDocument(Document &d
         // routines.
         // This will generate the Parsable object as a specific parser
         //
-        Parsable *parsable = doc.getParsable();
+        Parsable *parsable = doc.getParsable( base->host() );
         if (parsable)
             parsable->parse(*this, *base);
         else
=======================================

Jamie Anstice
Search Engineer
S.L.I. Systems
jam...@sl...
ph: 64 961 3262   mobile: 64 21 264 9347
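For anyone who wants the alias logic in isolation, here is a minimal standalone
sketch of what the patch's ContentTypeAlias() does, using std::map and
std::string in place of ht://Dig's Dictionary and String classes; the names and
structure below are illustrative only, not part of the patch.

// Standalone sketch of the alias lookup the patch performs: parse a
// "from=to" list (as in content_type_aliases) and translate a content
// type before a parser is chosen.  Not the patch itself.
#include <iostream>
#include <map>
#include <sstream>
#include <string>

static std::string aliasContentType(const std::string &aliases,
                                    const std::string &contentType)
{
    std::map<std::string, std::string> table;
    std::istringstream in(aliases);
    std::string pair;
    while (in >> pair)                         // entries are whitespace-separated
    {
        std::string::size_type eq = pair.find('=');
        if (eq == std::string::npos)
            continue;                          // malformed entry, skip it
        table[pair.substr(0, eq)] = pair.substr(eq + 1);
    }
    std::map<std::string, std::string>::const_iterator it = table.find(contentType);
    return it != table.end() ? it->second : contentType;
}

int main()
{
    // "text/plain=text/html" asks the indexer to run the HTML parser on
    // documents a misconfigured server labels as plain text.
    std::cout << aliasContentType("text/plain=text/html", "text/plain") << "\n";
    return 0;
}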
From: Geoff H. <ghu...@ws...> - 2001-10-17 21:02:21
|
On Wed, 17 Oct 2001, Gilles Detillieux wrote: > > Are there any particular dos and donts concerning the access of your searcher > > class? Can I use one searcher incarnation for multiple searches? > > > > Any comments appreciated. > > I haven't seen any responses to your question, but I think perhaps either Strange, I thought I included the list on that. So it goes. > Michael Haggerty or Geoff Hutchison should be able to answer. No that's not quite right. Michael was doing some restructuring of htdig, not htsearch. > classes in other programs. With the 3.2.0b4 code as it stands now, I don't > think it's that easy to just strip out what you need. This is correct, though if I actually find "free time," I'm going to try to finish up that htsearch work. -- -Geoff Hutchison Williams Students Online http://wso.williams.edu/ |
From: Geoff H. <ghu...@ws...> - 2001-10-17 21:00:13
|
On Wed, 17 Oct 2001, Gilles Detillieux wrote: > > I've just tagged the files with htdig-3-2-0-b4 > > OK, I'm not that versed in CVS, but doesn't this mean that your commits > will only be seen by people who check out the htdig tree using that tag. Alas, you're both confusing tags and branches slightly. Tagging the files with htdig-3-2-0-b4 isn't good because it's strictly reserved as a label for the 3.2.0b4 RELEASE. Since that hasn't happened yet, you might still change the file before then. On the other hand, tagging files is like sticking a bright yellow label on them in the file cabinet--it doesn't change that they're in the wrong drawer (branch). I think I'll try to move the 3-2-x branch over to the mainline this weekend. -- -Geoff Hutchison Williams Students Online http://wso.williams.edu/ |
From: Gilles D. <gr...@sc...> - 2001-10-17 20:36:03
|
According to Joe R. Jah: > On Wed, 3 Oct 2001, Gilles Detillieux wrote: > > Date: Wed, 3 Oct 2001 09:51:03 -0500 (CDT) > > From: Gilles Detillieux <gr...@sc...> > > To: Joe R. Jah <jj...@cl...> > > Cc: htd...@li... > > Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots > > > > > > > > If you get a chance to run old and new snapshots of htdig with -vvv and > > > > > > compare the outputs, you may be able to track down the source of the > > > > > > different URLs that are parsed in both cases. To do this in a meaningful > > > > > > way, though, you'll need to try a static site, or perhaps a snapshot of > > > > > > your site, so you don't get thrown off in your comparisons by updates > > > > > > to the site between digs. > > > > > > > > > > Yes, I have kept that snapshot for a happy occasion like that;) > > > > > > > > Keep me posted if you get a chance to run this test with both snapshots. > > > > I can't think of any changes to 3.1.6 that would cause it to lose valid > > > > URLs, but it would be good to confirm without a doubt that the lost URLs > > > > on your system are all indeed URLs that should not have been indexed. > > > > > > In the happy hour;))) > > > > It might be best if you're sober when you do this test. ;-) > > The happy hour turned into a couple of unhappy weeks:( > > -r--r--r-- 1 jjah www 24621528 Oct 2 13:20 rundig_vvv.082901 > -r--r--r-- 1 jjah www 20266702 Oct 2 14:15 rundig_vvv.093001 > > I found 82 links from one document with META ROBOT: Noindex tag;) I could > not find an efficient way of hunting down the other 138 links that were > unaccounted for in two 20 meg+ files; however, I must assume that they are > some sort of duplicates;-/ Hmm. Too bad we couldn't get something more definitive. I'm fairly confident that the changes to the HTML parser didn't break anything, but I'd feel much more comfortable if we could explain the missing files you discovered rather than just assuming it's OK. If I recall, there were 88 URLs with doubled slashes that were eliminated in an earlier test, but that still leaves around 50 URLs unaccounted for. If there's any way you can take a snapshot of your site, or a few major subdirectories, and duplicate them somewhere else where they won't get modified, it would be a big help in getting conclusive results. If you index the exact same files with 3.1.5 and 3.1.6, you should be able to diff the output of htdig -vvv from both, and pinpoint exactly where the differences are happening. I know this is asking a lot, but it would be a shame to release 3.1.6 after all the work that's gone into it, only to discover afterward that it introduced a serious bug. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Gilles D. <gr...@sc...> - 2001-10-17 19:58:47
|
According to Dan Langille: > On 12 Oct 2001 at 14:53, Gilles Detillieux wrote: > > According to Dan Langille: > > > I have just committed a fix to the php-wrapper. This may or may not > > > have been a potential exploit. The fix prevents people from including > > > arbitrary HTML or PHP code in their search string. The fix > > > strips such tags from the input string. > > > > OK, but I thought I should point out that you're still committing to > > the main branch of the htdig source tree, which hasn't seen any activity > > since Feb 2000. If you want your wrapper to be included in the 3.2.0b4 > > development, you have to check out the "htdig-3-2-x" branch of the tree > > and commit your changes to that. > > I've just tagged the files with htdig-3-2-0-b4 OK, I'm not that versed in CVS, but doesn't this mean that your commits will only be seen by people who check out the htdig tree using that tag. As far as I know, that tag hasn't been used by anyone else. Is this correct, Geoff, or am I confusing tags and branches? -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Gilles D. <gr...@sc...> - 2001-10-17 19:54:41
|
According to Wolfgang Mueller: > Hi, > I am currently looking into writing a htdig plugin for the GIFT. > > I want people to configure htdig the usual way and (if necessary) to point to > a htdig configuration within the usual GIFT configuration file. > > What I would like to write, is just something short which does not use html > for querying htdig, but which interacts with your internal search functions, > something like a stripped version of the main of htsearch. > > Are there any particular dos and donts concerning the access of your searcher > class? Can I use one searcher incarnation for multiple searches? > > Any comments appreciated. I haven't seen any responses to your question, but I think perhaps either Michael Haggerty or Geoff Hutchison should be able to answer. Both of them are working on, or planning to work on, some pretty major restructuring of the htsearch code, at least in part to make it easier to use the searcher classes in other programs. With the 3.2.0b4 code as it stands now, I don't think it's that easy to just strip out what you need. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Gilles D. <gr...@sc...> - 2001-10-17 19:45:11
|
According to Andrew Daviel: > On Sun, 30 Sep 2001, Geoff Hutchison wrote: > re. logging broken links > > > > I don't really see this as a needed feature to ht://Dig itself since > > you can do this as a script running on top of the htdig output using > > the -s flag. See > > <http://www.htdig.org/files/contrib/scripts/showdead.pl> > > <http://www.htdig.org/files/contrib/scripts/report_missing_pages.pl> > > I played with this briefly. > Htdig listed a broken link (404) on the same server, and a 404 link to > another server. > > It didn't list a 502 (connection refused) or 500 (unknown host). > It also doesn't grab any mail information or owner metadata, so > you'd either have to have a database of sites/URL fragments against > authors, or re-spider the list of referring documents to gather author > information, if you wanted to mail authors or sort the broken list > by author. > My robot produces e.g. http://www.triumf.ca/trsearch/errors2.html > (but I'm too lazy to fix my links - ad...@tr...) Neither htdig 3.1.x nor 3.2.x deal with 500 and 502 error codes right now. Are you indexing through a proxy server? Normally, htdig will detect these two error conditions itself, and set the appropriate internal error status codes to generate the error messages for missing pages, but I don't know what happens when a proxy server detects these error conditions and reports them back to htdig. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Geoff H. <ghu...@ws...> - 2001-10-17 15:11:46
|
On Wed, 17 Oct 2001, Quim Sanmarti wrote: > > That's when I began to wonder if anybody in the htdig > > development community had looked into implementing 'bayesian' > > searching, or if htdig could do 'it', hence my vague post. > > The htdig databases are postitively not prepared to deal with such > techniques, IMHO. They are not intended to. The power of htdig is based in > 'classical' boolean queries. I don't know that I'd call them "classical" anymore, but I'd agree that I'd design a different word database backend if I wanted to do Bayesian queries. But I think you can get pretty high quality results without this--AFAIK, Google doesn't use them and most people see them as the target. > > My theory was that most new market trends (worth paying > > attention to) are usually already, or quickly will be, reflected in the open > > source development community. > My perception is eventually the inverse. Open-source has been traditionally > being bound to research and innovation. It's now being used by companies as > an innovation channel, so that market trends emerge later from there... This depends a lot on the development effort. Certainly gcc has some true innovation and is a great example of getting truly fantastic people together--I doubt you could ever afford to pay for all the development on gcc. In this case, I think parts of ht://Dig could be used for research purposes and I think there are several research-grade algorithms that could be implemented without too much effort (n-gram fuzzy algorithms come to mind). On the other hand, the number of active contributors to the project right now is extremely low and so I think we'd need an infusion of "fresh brains" before this could happen. As Quim pointed out as well, there are other packages which attempted to tackle Bayesian searches. -- -Geoff Hutchison Williams Students Online http://wso.williams.edu/ |
From: Quim S. <qs...@gt...> - 2001-10-17 14:29:42
|
> -----Original Message-----
> From: htd...@li... [mailto:htd...@li...] On behalf of Tod Thomas
> Sent: Wednesday, 17 October 2001 15:23
> To: htd...@li...
> Subject: [htdig-dev] Bayesian Algorithm Part II

[snip]

> A common marketing thread for a number of the vendors was the
> utilization of the 'bayesian algorithm'.

Some vendors, like Autonomy, do use this kind of technique in their search
engines. 'Bayesian', or probabilistic, models are computed from the indexed
documents. The longer the query, the better the results. As with vector-space
similarity measures, this is particularly useful when searching for 'documents
similar to this one' or when classifying documents. Results tend to be poorer
with short queries.

> I have to admit that those products tended to outperform the others,
> particularly when it came to language specific searches. It was almost as
> if the search could understand the concept behind the request, not just
> the words.

In fact, what is evaluated is the 'blend' of the words, i.e. the more or less
roughly estimated probability of finding a given set of words in each
document.

> That's when I began to wonder if anybody in the htdig development
> community had looked into implementing 'bayesian' searching, or if htdig
> could do 'it', hence my vague post.

The htdig databases are positively not prepared to deal with such techniques,
IMHO. They are not intended to. The power of htdig is based on 'classical'
boolean queries.

> My theory was that most new market trends (worth paying attention to) are
> usually already, or quickly will be, reflected in the open source
> development community.

Entering deep water -- opinion unreliable from this point :)
My perception is eventually the inverse. Open source has traditionally been
bound to research and innovation. It's now being used by companies as an
innovation channel, so that market trends emerge later from there...

> I suspected this might be the case with htdig too. Given the individual
> response that I got there is at least some interest so by elaborating a
> little on my original post maybe I can learn more. Comments?

Closely related to the probabilistic stuff is Xapian, a.k.a. Omseek, a.k.a.
Omsee, a.k.a. Open Muscat, an open-source project intended as a probabilistic
search-engine framework. Initially financed by Brightstation, it was left on
its own some time ago and now lives on SourceForge. More on the research side,
there's libbow/rainbow by Andrew McCallum et al. from CMU, including Bayesian
classifiers, vector-space algorithms, and other nice artifacts.

Here at gtd, we're experimenting internally with some new vector-space based
search and classification algorithms. What we have does look quite promising,
but AFAIK it's not going to be open-sourced -- for now.

-- Quim
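As a toy illustration of the probabilistic ranking described above (scoring a
document by the estimated probability that it produces the query words), here
is a small query-likelihood sketch with add-one smoothing. It is not how
Autonomy, Xapian or ht://Dig work internally; every name and constant in it is
an assumption made purely for the example.

// Score a document by the smoothed probability of generating each query
// word; higher (less negative) log-probability means a better match.
#include <cmath>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

static double queryLikelihood(const std::string &doc,
                              const std::vector<std::string> &query)
{
    std::map<std::string, int> tf;             // term frequencies in the document
    int total = 0;
    std::istringstream in(doc);
    std::string w;
    while (in >> w) { ++tf[w]; ++total; }

    double logp = 0.0;
    for (size_t i = 0; i < query.size(); ++i)
    {
        // Add-one smoothing so unseen words don't zero out the whole score;
        // 1000.0 stands in for an assumed vocabulary size.
        double p = (tf[query[i]] + 1.0) / (total + 1000.0);
        logp += std::log(p);
    }
    return logp;
}

int main()
{
    std::vector<std::string> q;
    q.push_back("pizza"); q.push_back("restaurant");
    std::cout << queryLikelihood("pizza restaurant downtown pizza menu", q) << "\n";
    std::cout << queryLikelihood("bridge repair schedule", q) << "\n";
    return 0;
}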
From: Tod T. <tt...@ch...> - 2001-10-17 13:14:38
|
At a helpful list member's prompting, it might help if I present the reason for my original post. My original query was intended to see if anybody was looking into implementing this into htdig. We've just gone through a pretty extensive vendor selection process to find the 'best' search engine to couple with an internal knowledge management initiative. A common marketing thread for a number of the vendors was the utilization of the 'bayesian algorithm'. I have to admit that those products tended to outperform the others, particularly when it came to language specific searches. It was almost as if the search could understand the concept behind the request, not just the words. That's when I began to wonder if anybody in the htdig development community had looked into implementing 'bayesian' searching, or if htdig could do 'it', hence my vague post. My theory was that most new market trends (worth paying attention to) are usually already, or quickly will be, reflected in the open source development community. I suspected this might be the case with htdig too. Given the individual response that I got there is at least some interest so by elaborating a little on my original post maybe I can learn more. Comments? Thanks - Tod |
From: Andrew D. <an...@da...> - 2001-10-16 19:34:44
|
(I've been trying to write this for weeks and keep getting distracted..)

I have been working on a geographic search engine, and as I mentioned in my
earlier "features" message, I am getting fed up with my Perl robot, and am
trying to use htdig. I have been somewhat successful and have a version
without map navigation at http://geotags.com/htdig/

My question is really how much support, if any, I might get from the htdig
community for this, either as mainstream htdig or from anyone else interested
in this kind of thing.

The basic idea is that, given a set of Web pages describing physical objects
such as restaurants, bridges, parks etc., the search engine is capable of
finding the nearest one. This requires metadata on the page explicitly giving
the position being described (though it can sometimes be guessed from things
like US zipcodes in the text), and a search algorithm that can score according
to geographic distance. The search algorithm is not a true geographic one (as
for example "show me all objects completely or partially inside this polygon")
but rather a modification of a text search, e.g. "pizza AND restaurant SORTBY
distance".

The changes required are (so far):

* A database element to store a position for each page (I actually store a
  region code and placename too but don't use them)
* An addition to the HTML parser to get the metadata
* An addition to the CGI parser to get a requested position (map click)
* A weighting algorithm to calculate geographic distance

I also have a config item to essentially force a ROBOTS NONE if there is no
geographic tag on a page, so that I can refrain from indexing untagged pages.
I am also trying to add support for position passed in an experimental HTTP
header, which allows one to dispense with the map and potentially generate
requests based on current position automatically, e.g. using GPS.

This is essentially what is online at http://geotags.com/htdig/, using a fixed
map. I will probably create custom templates and then try to run this version
of htsearch using the original Perl as a wrapper, to enable the map zoom/pan
features, which requires scaling the map clicks according to the currently
displayed map. I could either do the map-click scaling externally, passing
pure Lat/Long to htsearch, or pass the current map extents to htsearch and do
it internally. Doing the whole thing including the map manipulations and
graphical markup is somewhat more complicated, and might require hardwiring
some map-dependent code, unless it could be done in templates. I haven't
really looked at that yet.

-- Andrew Daviel geotags.com etc.
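As a sketch of the distance-weighting step listed above, the following combines
a great-circle (haversine) distance between the query position and a page's geo
tag with a text score. The haversine formula is standard, but the score mapping
and all names here are illustrative guesses, not the actual geotags.com code.

// Great-circle distance plus a simple "nearer is better" score weighting.
#include <cmath>
#include <iostream>

static double haversineKm(double lat1, double lon1, double lat2, double lon2)
{
    const double R = 6371.0;                     // mean Earth radius in km
    const double PI = 3.14159265358979323846;
    const double rad = PI / 180.0;
    double dlat = (lat2 - lat1) * rad;
    double dlon = (lon2 - lon1) * rad;
    double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
               std::cos(lat1 * rad) * std::cos(lat2 * rad) *
               std::sin(dlon / 2) * std::sin(dlon / 2);
    return 2.0 * R * std::asin(std::sqrt(a));
}

// Fold distance into a text-relevance score: halve the score every 50 km
// (the 50 km half-life is an arbitrary choice for the example).
static double geoWeight(double textScore, double distanceKm)
{
    return textScore * std::pow(0.5, distanceKm / 50.0);
}

int main()
{
    double d = haversineKm(49.25, -123.10, 48.43, -123.37);  // Vancouver -> Victoria
    std::cout << d << " km, weighted score " << geoWeight(100.0, d) << "\n";
    return 0;
}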
From: Florian H. <fl...@ha...> - 2001-10-15 16:46:04
|
I just sent this to bugtraq:

On Fri, Oct 12, 2001 at 12:59:13PM -0600, Dave Ahmad wrote:
> On Thu, 11 Oct 2001, bugtraq wrote:
> > http://www.perl.com/search/index.ncsp?sp-q=%3C%69%6D%67%20%73%72%63%3D%68%74%74%70%3A%2F%2F%31%39%39%2E%31%32%35%2E%38%35%2E%34%36%2F%74%69%6D%65%2E%6A%70%67%3E
> Does anyone know which search engine software this is?

I don't know which engine perl.com uses, but if you have the template
parameter WORDS in your templates, htdig 3.1.5 puts the unquoted img-tag into
the result page. Funnily enough, the htdig 3.1.5 on htdig.org encodes the
offending string:

<input type="text" size="30" name="words" value="&lt;img src=http://199.125.85.46/time.jpg&gt;">

while the distributed htdig 3.1.5 (here the Debian version 3.1.5-2) doesn't:

<input type="text" size="30" name="words" value="<img src=http://199.125.85.46/time.jpg>">

(And there is neither a security section on htdig.org nor an email address for
bug reports... so I am crossposting this to htdig-general)

Yours, Florian Hars.
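The difference Florian observes comes down to entity-encoding the search words
before they are substituted into the $(WORDS) template variable. A minimal
sketch of that encoding step, assuming nothing about the actual htdig.org setup
or the eventual patch:

// Entity-encode user input before echoing it into an HTML attribute.
#include <iostream>
#include <string>

static std::string htmlEncode(const std::string &in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        switch (in[i])
        {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            default:   out += in[i];    break;
        }
    }
    return out;
}

int main()
{
    std::string words = "<img src=http://199.125.85.46/time.jpg>";
    // Safe: the browser shows the tag as text instead of fetching the image.
    std::cout << "<input type=\"text\" size=\"30\" name=\"words\" value=\""
              << htmlEncode(words) << "\">\n";
    return 0;
}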
From: Tod T. <tt...@ch...> - 2001-10-15 11:55:16
|
Has anybody heard of this in relation to search technology and would like to provide some input on the topic? Thanks - Tod |
From: Joe R. J. <jj...@cl...> - 2001-10-14 08:26:39
|
On Wed, 3 Oct 2001, Gilles Detillieux wrote:

> Date: Wed, 3 Oct 2001 09:51:03 -0500 (CDT)
> From: Gilles Detillieux <gr...@sc...>
> To: Joe R. Jah <jj...@cl...>
> Cc: htd...@li...
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
>
> > > > > If you get a chance to run old and new snapshots of htdig with -vvv and
> > > > > compare the outputs, you may be able to track down the source of the
> > > > > different URLs that are parsed in both cases. To do this in a meaningful
> > > > > way, though, you'll need to try a static site, or perhaps a snapshot of
> > > > > your site, so you don't get thrown off in your comparisons by updates
> > > > > to the site between digs.
> > > >
> > > > Yes, I have kept that snapshot for a happy occasion like that;)
> > >
> > > Keep me posted if you get a chance to run this test with both snapshots.
> > > I can't think of any changes to 3.1.6 that would cause it to lose valid
> > > URLs, but it would be good to confirm without a doubt that the lost URLs
> > > on your system are all indeed URLs that should not have been indexed.
> >
> > In the happy hour;)))
>
> It might be best if you're sober when you do this test. ;-)

The happy hour turned into a couple of unhappy weeks:(

-r--r--r--  1 jjah  www  24621528 Oct  2 13:20 rundig_vvv.082901
-r--r--r--  1 jjah  www  20266702 Oct  2 14:15 rundig_vvv.093001

I found 82 links from one document with META ROBOT: Noindex tag;) I could not
find an efficient way of hunting down the other 138 links that were
unaccounted for in two 20 meg+ files; however, I must assume that they are
some sort of duplicates;-/

Regards,

Joe
--
jj...@cl...
From: Geoff H. <ghu...@us...> - 2001-10-14 07:13:46
|
STATUS of ht://Dig branch 3-2-x

RELEASES:
   3.2.0b4: In progress
   3.2.0b3: Released: 22 Feb 2001.
   3.2.0b2: Released: 11 Apr 2000.
   3.2.0b1: Released: 4 Feb 2000.

SHOWSTOPPERS:

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with wordlist_compress
  set but work fine without wordlist_compress. (the date is definitely stored
  correctly, even with compression on so this must be some sort of weird
  htsearch bug)
* Not all htsearch input parameters are handled properly: PR#648. Use a
  consistant mapping of input -> config -> template for all inputs where it
  makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
  correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can we fix
  this?)
* META descriptions are somehow added to the database as FLAG_TITLE, not
  FLAG_DESCRIPTION. (PR#859)

PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up. (Should really
  only attempt to use SQL for doc_db and related, not word_db)

NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz

TESTING:
* httools programs: (htload a test file, check a few characteristics, htdump
  and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
  argument handling for parser/converter, allowing binary output from an
  external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior (including
  '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
  to template variables. (Relates to PR#648.) Also make sure these config
  attributes are all documented in defaults.cc, even if they're only set by
  input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space requirements
  of 3.2.x (e.g. phrase searching, regex matching, external parsers and
  transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.

OTHER ISSUES:
* Can htsearch actually search while an index is being created? (Does Loic's
  new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems) (It
  should probably just set everything to empty) This relates to PR#348.
From: Dan L. <da...@la...> - 2001-10-13 02:14:16
|
On 12 Oct 2001 at 14:53, Gilles Detillieux wrote: > According to Dan Langille: > > I have just committed a fix to the php-wrapper. This may or may not > > have been a potential exploit. The fix prevents people from including > > arbitrary HTML or PHP code in their search string. The fix > > strips such tags from the input string. > > OK, but I thought I should point out that you're still committing to > the main branch of the htdig source tree, which hasn't seen any activity > since Feb 2000. If you want your wrapper to be included in the 3.2.0b4 > development, you have to check out the "htdig-3-2-x" branch of the tree > and commit your changes to that. I've just tagged the files with htdig-3-2-0-b4 -- Dan Langille The FreeBSD Diary - http://freebsddiary.org/ - practical examples |
From: Gilles D. <gr...@sc...> - 2001-10-12 19:53:10
|
According to Dan Langille: > I have just committed a fix to the php-wrapper. This may or may not > have been a potential exploit. The fix prevents people from including > arbitrary HTML or PHP code in their search string. The fix > strips such tags from the input string. OK, but I thought I should point out that you're still committing to the main branch of the htdig source tree, which hasn't seen any activity since Feb 2000. If you want your wrapper to be included in the 3.2.0b4 development, you have to check out the "htdig-3-2-x" branch of the tree and commit your changes to that. At some point, Geoff is going to fold this branch back into the main branch, but this hasn't happened yet. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Dan L. <da...@la...> - 2001-10-12 18:01:48
|
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I have just committed a fix to the php-wrapper. This may or may not have been
a potential exploit. The fix prevents people from including arbitrary HTML or
PHP code in their search string. The fix strips such tags from the input
string.

To test the exploit, try entering an IMG html tag into your search field, such
as <img src=http://www.htdig.org/htdig_big.gif>. If you see:

    There were no matches for [IMAGE] found on the website.

where [IMAGE] is the htDig image, then you have not patched your system.

- --
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples

-----BEGIN PGP SIGNATURE-----
Version: PGP 6.5.8 -- QDPGP 2.61c
Comment: http://community.wow.net/grt/qdpgp.html

iQA/AwUBO8cv+QoLFxTP+508EQLRdQCg4+FE7xo/NxM+TpvS/0gyT9LYYTYAoOCM
bV1/W/eESdonK1V4rIfoebth
=m89W
-----END PGP SIGNATURE-----
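For illustration, the tag-stripping idea behind the wrapper fix can be sketched
as follows. The committed fix lives in the PHP wrapper, so this C++ stand-in is
only a model of the behaviour, not the actual change:

// Drop everything between '<' and '>' from the search string before it is
// echoed back into the results page.
#include <iostream>
#include <string>

static std::string stripTags(const std::string &in)
{
    std::string out;
    bool inTag = false;
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        if (in[i] == '<')       inTag = true;
        else if (in[i] == '>')  inTag = false;
        else if (!inTag)        out += in[i];
    }
    return out;
}

int main()
{
    // Prints "pizza  recipes" -- the img tag is gone before it can be rendered.
    std::cout << stripTags("pizza <img src=http://www.htdig.org/htdig_big.gif> recipes")
              << "\n";
    return 0;
}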
From: Wolfgang M. <Wol...@cu...> - 2001-10-11 12:27:03
|
Hi, I am currently looking into writing a htdig plugin for the GIFT. I want people to configure htdig the usual way and (if necessary) to point to a htdig configuration within the usual GIFT configuration file. What I would like to write, is just something short which does not use html for querying htdig, but which interacts with your internal search functions, something like a stripped version of the main of htsearch. Are there any particular dos and donts concerning the access of your searcher class? Can I use one searcher incarnation for multiple searches? Any comments appreciated. Wolfgang -- Dr. Wolfgang Müller, assistant == teaching assistant Personal |
From: Gilles D. <gr...@sc...> - 2001-10-09 21:29:14
|
According to Thomas Netousek:
> I am running htdig-3.2.0-0.b3.4 from the RedHat linux 7.1 distribution and
> I am indexing documents which have all types of funny characters like e.g.
> single quotes spelled as &#8217;
>
> I have seen other reports about the parser failing for &amp, so I am
> wondering if this could be sort of a similar problem ?
>
> Btw, I am also running htdig-3.1.5 on another machine with translate_...
> set to true and it works like a charm there.

I believe 3.2.0b3 will not translate numeric entities where the number is
larger than 255. 3.1.5 does, but it's a bug, because it only used 8 bit
characters internally, so it only keeps the bottom 8 bits of this number.

Because in 3.2.0b3 the numeric entity isn't converted, the "&#8217;" goes into
the excerpt literally, and its "&" is turned into an "&amp;" entity on output,
so it should display literally as "&#8217;" -- you will see the numeric
entity. Given the 8-bit character set limitations in both 3.1 and 3.2, I think
that 3.2's behaviour is the lesser of two evils when it comes to handling
numeric entities above 255.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
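A small sketch of the two behaviours contrasted above, for a decimal entity
such as &#8217; in an 8-bit build. The function and its names are purely
illustrative, not htdig's parser code:

// Decode "&#NNNN;" only when it fits in 8 bits; otherwise either truncate
// to the low 8 bits (the 3.1.5-style bug) or leave the entity text alone
// (the 3.2.0b3 behaviour, so the literal entity survives into the excerpt).
#include <cstdlib>
#include <iostream>
#include <string>

static std::string decodeEntity(const std::string &digits, bool keepLow8Bits)
{
    // Expects the digits of a decimal entity, e.g. "8217" from "&#8217;".
    long code = std::atol(digits.c_str());
    if (code <= 255)
        return std::string(1, static_cast<char>(code));
    if (keepLow8Bits)
        return std::string(1, static_cast<char>(code & 0xff));  // truncation bug
    return "&#" + digits + ";";                  // leave it untranslated
}

int main()
{
    std::cout << decodeEntity("8217", true)  << "\n";  // whatever 8217 & 0xff happens to be
    std::cout << decodeEntity("8217", false) << "\n";  // prints "&#8217;"
    return 0;
}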
From: Geoff H. <ghu...@ws...> - 2001-10-07 21:07:05
|
* Name: ht://Dig (htsearch CGI)

* Versions affected: 3.1.0b2 and more recent, including 3.1.5 and 3.2.0b3

* Vulnerability: (Potential remote exposure. Denial of Service.)

* Details: The htsearch CGI runs as both a CGI and as a command-line program.
  The command-line program accepts -c [filename] to read in an alternate
  configuration file. On the other hand, no filtering is done to stop the CGI
  program from taking command-line arguments, so a remote user can force the
  CGI to stall until it times out (resulting in a DoS) or read in a different
  configuration file. For a remote exposure, a specified configuration file
  would need to be readable via the webserver UID (e.g. via anonymous FTP with
  upload enabled, or Samba; world-readable log files are the possible targets)
  to potentially retrieve files readable by the webserver UID, e.g.

  nothing_found_file: /path/to/the/file/we/steal

* Potential exploit:
  http://your.host/cgi-bin/htsearch?-c/dev/zero
  http://your.host/cgi-bin/htsearch?-c/path/to/my.file

* Fix: Upgrade to current prerelease versions of 3.1.6 or 3.2.0b4, or apply
  attached patches. Prerelease versions are available from
  <http://www.htdig.org/files/snapshots/>
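In spirit, the fix is to stop honouring -c when htsearch runs as a CGI. A
minimal sketch of that idea, assuming CGI mode can be detected via the
REQUEST_METHOD environment variable (the actual patches may do this
differently, and the default path below is hypothetical):

// Ignore -c from the query string when invoked as a CGI; honour it on the
// command line.
#include <cstdlib>
#include <iostream>
#include <string>

static bool runningAsCgi()
{
    // Web servers set REQUEST_METHOD for CGI invocations.
    return std::getenv("REQUEST_METHOD") != 0;
}

int main(int argc, char **argv)
{
    std::string configFile = "/etc/htdig/htdig.conf";   // hypothetical default
    bool cgi = runningAsCgi();

    for (int i = 1; i < argc; ++i)
    {
        std::string arg = argv[i];
        if (arg.compare(0, 2, "-c") == 0)
        {
            if (cgi)
                continue;                        // CGI mode: silently ignore -c
            if (arg.size() > 2)
                configFile = arg.substr(2);      // "-c/path/to/file"
            else if (i + 1 < argc)
                configFile = argv[++i];          // "-c /path/to/file"
        }
    }
    std::cout << "using config: " << configFile << "\n";
    return 0;
}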
From: Geoff H. <ghu...@us...> - 2001-10-07 07:13:44
|
STATUS of ht://Dig branch 3-2-x

RELEASES:
   3.2.0b4: In progress
   3.2.0b3: Released: 22 Feb 2001.
   3.2.0b2: Released: 11 Apr 2000.
   3.2.0b1: Released: 4 Feb 2000.

SHOWSTOPPERS:

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with wordlist_compress
  set but work fine without wordlist_compress. (the date is definitely stored
  correctly, even with compression on so this must be some sort of weird
  htsearch bug)
* Not all htsearch input parameters are handled properly: PR#648. Use a
  consistant mapping of input -> config -> template for all inputs where it
  makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
  correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can we fix
  this?)
* META descriptions are somehow added to the database as FLAG_TITLE, not
  FLAG_DESCRIPTION. (PR#859)

PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up. (Should really
  only attempt to use SQL for doc_db and related, not word_db)

NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz

TESTING:
* httools programs: (htload a test file, check a few characteristics, htdump
  and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
  argument handling for parser/converter, allowing binary output from an
  external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior (including
  '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
  to template variables. (Relates to PR#648.) Also make sure these config
  attributes are all documented in defaults.cc, even if they're only set by
  input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space requirements
  of 3.2.x (e.g. phrase searching, regex matching, external parsers and
  transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.

OTHER ISSUES:
* Can htsearch actually search while an index is being created? (Does Loic's
  new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems) (It
  should probably just set everything to empty) This relates to PR#348.