|
From: Geoff H. <ghu...@ws...> - 2002-05-10 16:27:58
|
On Monday, May 6, 2002, at 08:51 PM, Lachlan Andrew wrote:

> Greetings,
>
> I'd like to help fix this problem ([413879] local files
> and external parsers), to get searching back into the KDE
> help centre.

I'm not sure what problem you're referring to. Could you point me to a
URL of the problem report? It's certainly not in the ht://Dig bug
tracker at SourceForge.

As far as contributing changes to a project, it's best to send a patch
to the active developers. For some guidelines (which aren't necessarily
specific to ht://Dig), see:
<http://sourceforge.net/docman/display_doc.php?docid=3430&group_id=4593>

> I've never contributed to something like this before.
> Could you please tell me how to go about it, so that my
> changes don't just become a "hack"?
>
> Thanks in advance,
> Lachlan

-Geoff
|
From: Geoff H. <ghu...@ws...> - 2002-05-10 22:24:24
|
Since there are several people on the htdig-dev list, I'd prefer to keep
discussion on list. This also prevents duplication of effort in case
someone else is thinking about doing similar work.

On Fri, 10 May 2002, Lachlan Andrew wrote:

> On Thu, May 09, 2002 at 11:13:29PM -0500, Geoff Hutchison wrote:
> > On Monday, May 6, 2002, at 08:51 PM, Lachlan Andrew wrote:
> > > I'd like to help fix this problem ([413879] local files
> > > and external parsers), to get searching back into the KDE
> > > help centre.
> >
> > I'm not sure what problem you're referring to. Could you point me to a
> > URL of the problem report? It's certainly not in the ht://Dig bug
> > tracker at SourceForge.
>
> It's in the Feature Requests:
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=413879&group_id=4593&atid=354593
>
> KDE help used to use ht://Dig to provide a search capability.
> They changed the format of their files from HTML to docbook (XML).
> For some reason, ht://Dig refuses to call the parser that one of
> the KDE developers wrote. The response was that it was not a bug,
> but a calculated feature, because ht://Dig didn't know that no server
> parsing was necessary.
>
> It sounds like the simplest hack is just to add "docbook" to the
> list of acceptable extensions (and get that into the CVS of course!).
> However, the response suggested some more elegant solutions, which
> I'll try...

The best strategy depends on the version of ht://Dig you're intending
to use. If you're only going to use 3.1.x versions for the near future,
then yes, the hack you've mentioned is the best approach--add ".docbook"
to the code in htdig/Document.cc::RetrieveLocal and the appropriate
MIME type.

For 3.2, the best approach is to either:
a) Index using file:// URLs, which should use the appropriate mime.types
   file: <http://www.htdig.org/dev/htdig-3.2/attrs.html#mime_types>
b) Code the RetrieveLocal method to produce temporary file:// URLs that
   are retrieved using the htnet/HtFile methods.
   (which again should use the appropriate mime.types file)

The feature request is for option (b) and we'd certainly appreciate work
in this direction. However, if you're looking for a workaround, in
retrospect, I'd suggest using file:// URLs with 3.2 anyway.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
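[Editor's note: both options above rely on a mime.types-style table mapping file extensions to MIME types. The following is a minimal, self-contained sketch of that lookup in modern C++ using `std::map`; it is not ht://Dig code (the real implementation uses the project's `Dictionary` and `StringList` classes), and the function name is invented for illustration.]

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse mime.types-style text ("type ext1 ext2 ..." per line, with
// '#' comments) into an extension -> MIME type map, roughly what the
// htnet/HtFile lookup described above does.
std::map<std::string, std::string>
parse_mime_types(const std::string &text)
{
    std::map<std::string, std::string> table;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::string::size_type cmt = line.find('#');
        if (cmt != std::string::npos)
            line.erase(cmt);              // strip comment to end of line
        std::istringstream fields(line);
        std::string type, ext;
        if (!(fields >> type))
            continue;                     // blank or comment-only line
        while (fields >> ext)
            table[ext] = type;            // each extension maps to the type
    }
    return table;
}
```

With a table like this, adding a new document type (e.g. DocBook) is a one-line data change rather than a code change, which is the point of option (a).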
|
From: Lachlan A. <lh...@ee...> - 2002-06-13 02:10:38
Attachments:
patch.mime
|
On Fri, May 10, 2002 at 06:19:11PM -0400, Geoff Hutchison wrote:
> On Fri, 10 May 2002, Lachlan Andrew wrote:
> > KDE help used to use ht://Dig to provide a search capability.
> > They changed the format of their files from HTML to docbook (XML).
> > For some reason, ht://Dig refuses to call the parser that one of
> > the KDE developers wrote. The response was that it was not a bug,
> > but a calculated feature, because ht://Dig didn't know that no server
> > parsing was necessary.
>
> For 3.2, the best approach is to either:
> a) Index using file:// URLs, which should use the appropriate mime.types
>    file: <http://www.htdig.org/dev/htdig-3.2/attrs.html#mime_types>
> b) Code the RetrieveLocal method to produce temporary file:// URLs that
>    are retrieved using the htnet/HtFile methods. (which again should use
>    the appropriate mime.types file)

I don't understand why there needs to be a temporary file:// URL.
I've attached a patch (against the latest beta, b3) in which
RetrieveLocal explicitly calls the method from HtFile which checks
the MIME type.

Please let me know if this patch is unsuitable, and if so how I can
fix it.  If it is OK, I'll go ahead and implement bad_local_ext etc.

Regards,
Lachlan

*** htdig/Document.cc	Wed Jun 12 22:48:25 2002
--- htdig/Document.cc.lha	Wed Jun 12 22:46:02 2002
***************
*** 494,507 ****
      char *ext = strrchr((char*)*filename, '.');
      if (ext == NULL)
        return Transport::Document_not_local;
!     if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
!       contentType = "text/html";
!     else if ((mystrcasecmp(ext, ".txt") == 0) || (mystrcasecmp(ext, ".asc") == 0))
!       contentType = "text/plain";
!     else if ((mystrcasecmp(ext, ".pdf") == 0))
!       contentType = "application/pdf";
!     else if ((mystrcasecmp(ext, ".ps") == 0) || (mystrcasecmp(ext, ".eps") == 0))
!       contentType = "application/postscript";
      else
        return Transport::Document_not_local;
--- 494,502 ----
      char *ext = strrchr((char*)*filename, '.');
      if (ext == NULL)
        return Transport::Document_not_local;
!     const String *type = HtFile::Ext2Mime (ext + 1);
!     if (type != NULL)
!       contentType = *type;
      else
        return Transport::Document_not_local;
*** htnet/HtFile.cc	Wed Jun 12 22:48:50 2002
--- htnet/HtFile.cc.lha	Wed Jun 12 22:46:15 2002
***************
*** 77,92 ****
  }
! ///////
! //    Manages the requesting process
! ///////
!
! HtFile::DocStatus HtFile::Request()
  {
    static Dictionary *mime_map = 0;
    if (!mime_map)
    {
      ifstream in(config["mime_types"].get());
      if (in)
      {
--- 77,92 ----
  }
! // Return mime type indicated by extension ext (which is assumed not
! // to contain the '.'), or NULL if ext is not a know mime type.
! const String *HtFile::Ext2Mime (const char *ext)
  {
    static Dictionary *mime_map = 0;
    if (!mime_map)
    {
+     if (debug > 2)
+       cout << "MIME types: " << config ["mime_types"].get() << endl;
      ifstream in(config["mime_types"].get());
      if (in)
      {
***************
*** 104,114 ****
--- 104,138 ----
          String mime_type = split_line[0];
          // Fill map with values.
          for (int i = 1; i < split_line.Count(); i++)
+         {
+           if (debug > 3)
+             cout << "MIME: " << split_line[i]
+                  << "\t-> " << mime_type << endl;
            mime_map->Add(split_line[i], new String(mime_type));
+         }
        }
      }
    }
+   if (debug > 4)
+     cout << "Checking extension: " << ext << endl;
+   if (mime_map)	// is this 'if' needed?
+   {
+     const String *mime_type = (String *)mime_map->Find(ext);
+     if (mime_type)
+       return mime_type;
+     else
+       return NULL;
+   }
+   else
+     return NULL;
+ }
+
+ ///////
+ //    Manages the requesting process
+ ///////
+ HtFile::DocStatus HtFile::Request()
+ {
    // Reset the response
    _response.Reset();
***************
*** 169,184 ****
    if (ext == NULL)
      return Transport::Document_not_local;
!   if (mime_map)
!   {
!     String *mime_type = (String *)mime_map->Find(ext + 1);
!     if (mime_type)
!       _response._content_type = *mime_type;
!     else
!       return Transport::Document_not_local;
!   }
    else
    {
      if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
        _response._content_type = "text/html";
      else if (mystrcasecmp(ext, ".txt") == 0)
--- 193,205 ----
    if (ext == NULL)
      return Transport::Document_not_local;
!   const String *mime_type = Ext2Mime (ext + 1);
!   if (mime_type)
!     // if (bad_local_ext (ext)) return Transport::Document_not_local; else
!     _response._content_type = *mime_type;
    else
    {
+     if (debug > 2) cout << "Extension " << ext+1 << " not found\n";
      if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
        _response._content_type = "text/html";
      else if (mystrcasecmp(ext, ".txt") == 0)
*** htnet/HtFile.h	Wed Jun 12 22:48:57 2002
--- htnet/HtFile.h.lha	Wed Jun 12 22:46:19 2002
***************
*** 63,68 ****
--- 63,72 ----
    // manages a Transport request (method inherited from Transport class)
    virtual DocStatus Request ();
+
+   // Determine Mime type of file
+   // (Does it belong here??)
+   static const String *Ext2Mime (const char *);
    ///////
    // Interface for resource retrieving
*** htsearch/Display.cc	Wed Jun 12 22:49:28 2002
--- htsearch/Display.cc.lha	Wed Jun 12 22:46:32 2002
***************
*** 35,40 ****
--- 35,41 ----
  #include <ctype.h>
  #include <syslog.h>
  #include <locale.h>
+ #include <float.h>	// for DBL_MAX on Mandrake 8.2, gcc 2.96
  #include <math.h>
  #if !defined(DBL_MAX) && defined(MAXFLOAT)
*** installdir/mime.types	Wed Jun 12 22:49:45 2002
--- installdir/mime.types.lha	Wed Jun 12 22:46:44 2002
***************
*** 264,269 ****
--- 264,270 ----
  text/vnd.latex-z
  text/x-setext			etx
  text/xml			xml
+ text/docbook			docbook
  video/mpeg			mpeg mpg mpe
  video/quicktime			qt mov
  video/vnd.motorola.video

--
Lachlan Andrew  lh...@ee...  Phone: +613 8344-3816  Fax: +613 8344-6678
Department of Electrical and Electronic Engineering     CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA                     00116K
|
From: Gilles D. <gr...@sc...> - 2002-06-14 23:03:23
|
According to Lachlan Andrew:
> On Fri, May 10, 2002 at 06:19:11PM -0400, Geoff Hutchison wrote:
> > On Fri, 10 May 2002, Lachlan Andrew wrote:
> > > KDE help used to use ht://Dig to provide a search capability.
> > > They changed the format of their files from HTML to docbook (XML).
> > > For some reason, ht://Dig refuses to call the parser that one of
> > > the KDE developers wrote. The response was that it was not a bug,
> > > but a calculated feature, because ht://Dig didn't know that no server
> > > parsing was necessary.
> >
> > For 3.2, the best approach is to either:
> > a) Index using file:// URLs, which should use the appropriate mime.types
> >    file: <http://www.htdig.org/dev/htdig-3.2/attrs.html#mime_types>
> > b) Code the RetrieveLocal method to produce temporary file:// URLs that
> >    are retrieved using the htnet/HtFile methods. (which again should use
> >    the appropriate mime.types file)
>
> I don't understand why there needs to be a temporary file:// URL.
> I've attached a patch (against the latest beta, b3) in which
> RetrieveLocal explicitly calls the method from HtFile which checks
> the MIME type.
>
> Please let me know if this patch is unsuitable, and if so how I can
> fix it.  If it is OK, I'll go ahead and implement bad_local_ext etc.

I think your approach is a simple and elegant solution to this problem.
I'd recommend two changes:

1) Grab the most recent 3.2.0b4 snapshot from
http://www.htdig.org/files/snapshots/ and adapt your code to that
version.  Some of the mime.types handling code in HtFile::Request() has
changed subtly between 3.2.0b3 and now, so it would be better to work
with the current code base.  Also, your patch to htsearch/Display.cc
isn't needed with 3.2.0b4.

2) The HtFile::Request() and Document::RetrieveLocal() methods both
have some hardcoded extensions, which should probably be kept in the
new HtFile::Ext2Mime() method.  HtFile::Request() currently falls back
on these when it can't open mime.types.

Thanks!

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
From: Lachlan A. <lh...@ee...> - 2002-06-24 01:54:32
|
On Fri, Jun 14, 2002 at 06:03:09PM -0500, Gilles Detillieux wrote:
> I'd recommend two changes:
> 1) Grab the most recent 3.2.0b4 snapshot
> 2) The HtFile::Request() and Document::RetrieveLocal() methods both
> have some hardcoded extensions, which should probably be kept in the
> new HtFile::Ext2Mime() method. HtFile::Request() currently falls back
> on these when it can't open mime.types.
Greetings,
Below is the patch against 3.2.0b4-20020616. This includes the
hardcoded types, and bad_local_extensions to allow .php etc. not
to be parsed locally. If bad_local_extensions is explicitly set
empty, do you think it would be good to allow *all* files to be parsed
locally (even those with no extensions)? Of course, ones for which no
MIME type is known would have to be treated as text/plain but it
would be good if a site has a lot of text files with no extensions.
Also, would there be any demand to index compressed files? If someone
has a lot of .ps.gz files, for example, it could be useful to include
them in the index.
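[Editor's note: indexing compressed files as suggested would mean peeling off the compression suffix before choosing a parser. The sketch below is a hypothetical standalone helper, not ht://Dig code; the suffix list is an assumption for illustration.]

```cpp
#include <string>
#include <utility>

// For a path like "paper.ps.gz", report the compression suffix ("gz")
// separately from the content extension ("ps"), so an indexer could
// decompress first and then pick the parser from the inner extension.
// Returns {extension, compression-suffix}; either may be empty.
std::pair<std::string, std::string>
split_extension(const std::string &path)
{
    std::string::size_type dot = path.rfind('.');
    if (dot == std::string::npos)
        return {"", ""};                  // no extension at all
    std::string last = path.substr(dot + 1);
    if (last == "gz" || last == "Z" || last == "bz2") {
        std::string::size_type dot2 =
            dot ? path.rfind('.', dot - 1) : std::string::npos;
        if (dot2 == std::string::npos)
            return {"", last};            // e.g. "archive.gz" alone
        return {path.substr(dot2 + 1, dot - dot2 - 1), last};
    }
    return {last, ""};                    // ordinary single extension
}
```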
Finally, I think someone has been editing the files with a tab size other
than 8...  Is there a policy on that?
Cheers,
Lachlan
*** htdig/Document.cc Sun Jan 13 19:13:13 2002
--- htdig/Document.cc.lha Mon Jun 24 01:06:48 2002
***************
*** 72,78 ****
FileConnect = 0;
NNTPConnect = 0;
externalConnect = 0;
! HtConfiguration* config= HtConfiguration::config();
// We probably need to move assignment of max_doc_size, according
// to a server or url configuration value. The same is valid for
--- 72,78 ----
FileConnect = 0;
NNTPConnect = 0;
externalConnect = 0;
! HtConfiguration* config= HtConfiguration::config();
// We probably need to move assignment of max_doc_size, according
// to a server or url configuration value. The same is valid for
***************
*** 549,555 ****
Transport::DocStatus
Document::RetrieveLocal(HtDateTime date, StringList *filenames)
{
! HtConfiguration* config= HtConfiguration::config();
struct stat stat_buf;
String *filename;
--- 549,555 ----
Transport::DocStatus
Document::RetrieveLocal(HtDateTime date, StringList *filenames)
{
! HtConfiguration* config= HtConfiguration::config();
struct stat stat_buf;
String *filename;
***************
*** 558,564 ****
// Loop through list of potential filenames until the list is exhausted
// or a suitable file is found to exist as a regular file.
while ((filename = (String *)filenames->Get_Next()) &&
! ((stat((char*)*filename, &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode)))
if (debug > 1)
cout << " tried local file " << *filename << endl;
--- 558,564 ----
// Loop through list of potential filenames until the list is exhausted
// or a suitable file is found to exist as a regular file.
while ((filename = (String *)filenames->Get_Next()) &&
! ((stat((char*)*filename, &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode)))
if (debug > 1)
cout << " tried local file " << *filename << endl;
***************
*** 572,593 ****
if (modtime <= date)
return Transport::Document_not_changed;
- // Process only HTML files (this could be changed if we read
- // the server's mime.types file).
- // (...and handle a select few other types for now... this should
- // eventually be handled by the "file://..." handler, which uses
- // mime.types to determine the file type.) -- FIXME!!
char *ext = strrchr((char*)*filename, '.');
if (ext == NULL)
return Transport::Document_not_local;
! if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
! contentType = "text/html";
! else if ((mystrcasecmp(ext, ".txt") == 0) || (mystrcasecmp(ext, ".asc") == 0))
! contentType = "text/plain";
! else if ((mystrcasecmp(ext, ".pdf") == 0))
! contentType = "application/pdf";
! else if ((mystrcasecmp(ext, ".ps") == 0) || (mystrcasecmp(ext, ".eps") == 0))
! contentType = "application/postscript";
else
return Transport::Document_not_local;
--- 572,585 ----
if (modtime <= date)
return Transport::Document_not_changed;
char *ext = strrchr((char*)*filename, '.');
+ if (ext && strchr(ext,'/')) // Ignore a dot if it's not in the
+ ext = NULL; // final component of the path.
if (ext == NULL)
return Transport::Document_not_local;
! const String *type = HtFile::Ext2Mime (ext + 1);
! if (type != NULL)
! contentType = *type;
else
return Transport::Document_not_local;
*** htnet/HtFile.h Mon Jun 24 01:02:42 2002
--- htnet/HtFile.h.lha Mon Jun 24 01:02:51 2002
***************
*** 64,69 ****
--- 64,73 ----
// manages a Transport request (method inherited from Transport class)
virtual DocStatus Request ();
+ // Determine Mime type of file
+ // (Does it belong here??)
+ static const String *Ext2Mime (const char *);
+
///////
// Interface for resource retrieving
///////
*** htnet/HtFile.cc Sun Dec 23 19:13:14 2001
--- htnet/HtFile.cc.lha Mon Jun 24 00:48:34 2002
***************
*** 76,96 ****
}
! ///////
! // Manages the requesting process
! ///////
!
! HtFile::DocStatus HtFile::Request()
{
- HtConfiguration* config= HtConfiguration::config();
static Dictionary *mime_map = 0;
if (!mime_map)
{
mime_map = new Dictionary();
ifstream in(config->Find("mime_types").get());
if (in)
{
String line;
while (in >> line)
{
--- 76,110 ----
}
! // Return mime type indicated by extension ext (which is assumed not
! // to contain the '.'), or NULL if ext is not a know mime type, or
! // is listed in bad_local_extensions.
! const String *HtFile::Ext2Mime (const char *ext)
{
static Dictionary *mime_map = 0;
if (!mime_map)
{
+ HtConfiguration* config= HtConfiguration::config();
mime_map = new Dictionary();
+ if (!mime_map)
+ return NULL;
+
+ if (debug > 2)
+ cout << "MIME types: " << config->Find("mime_types").get() << endl;
ifstream in(config->Find("mime_types").get());
if (in)
{
+ // Set up temporary dictionary of extensions not to parse locally
+ Dictionary bad_local_exts;
+ StringList split_exts(config->Find("bad_local_extensions"), "\t .");
+ for (int i = 0; i < split_exts.Count(); i++)
+ {
+ if (debug > 3)
+ cout << "Bad local extension: " << split_exts[i] << endl;
+ bad_local_exts.Add(split_exts[i], 0);
+ }
+
String line;
while (in >> line)
{
***************
*** 99,114 ****
if ((cmt = line.indexOf('#')) >= 0)
line = line.sub(0, cmt);
StringList split_line(line, "\t ");
! // Let's cache mime type to lesser the number of
! // operator [] callings
String mime_type = split_line[0];
// Fill map with values.
for (int i = 1; i < split_line.Count(); i++)
! mime_map->Add(split_line[i], new String(mime_type));
}
}
}
// Reset the response
_response.Reset();
--- 113,161 ----
if ((cmt = line.indexOf('#')) >= 0)
line = line.sub(0, cmt);
StringList split_line(line, "\t ");
! // cache mime type to lessen the number of operator [] callings
String mime_type = split_line[0];
// Fill map with values.
for (int i = 1; i < split_line.Count(); i++)
! {
! const char *ext = split_line [i];
! if (bad_local_exts.Exists(ext))
! {
! if (debug > 3)
! cout << "Bad local extension: " << ext << endl;
! continue;
! }
!
! if (debug > 3)
! cout << "MIME: " << ext << "\t-> " << mime_type << endl;
! mime_map->Add(ext, new String(mime_type));
! }
}
}
+ else
+ {
+ if (debug > 2)
+ cout << "MIME types file not found. Using default types.\n";
+ mime_map->Add(String("html"), new String("text/html"));
+ mime_map->Add(String("htm"), new String("text/html"));
+ mime_map->Add(String("txt"), new String("text/plain"));
+ mime_map->Add(String("asc"), new String("text/plain"));
+ mime_map->Add(String("pdf"), new String("application/pdf"));
+ mime_map->Add(String("ps"), new String("application/postscript"));
+ mime_map->Add(String("eps"), new String("application/postscript"));
+ }
}
+ // return MIME type, or NULL if not found
+ return (String *)mime_map->Find(ext);
+ }
+
+ ///////
+ // Manages the requesting process
+ ///////
+
+ HtFile::DocStatus HtFile::Request()
+ {
// Reset the response
_response.Reset();
***************
*** 166,191 ****
return Transport::Document_not_changed;
char *ext = strrchr(_url.path(), '.');
if (ext == NULL)
return Transport::Document_not_local;
! if (mime_map && mime_map->Count())
! {
! String *mime_type = (String *)mime_map->Find(ext + 1);
! if (mime_type)
! _response._content_type = *mime_type;
! else
! return Transport::Document_not_local;
! }
else
! {
! if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
! _response._content_type = "text/html";
! else if (mystrcasecmp(ext, ".txt") == 0)
! _response._content_type = "text/plain";
! else
! return Transport::Document_not_local;
! }
_response._modification_time = new HtDateTime(stat_buf.st_mtime);
--- 213,228 ----
return Transport::Document_not_changed;
char *ext = strrchr(_url.path(), '.');
+ if (ext && strchr(ext,'/')) // Ignore a dot if it's not in the
+ ext = NULL; // final component of the path.
if (ext == NULL)
return Transport::Document_not_local;
! const String *mime_type = Ext2Mime(ext + 1);
! if (mime_type)
! _response._content_type = *mime_type;
else
! return Transport::Document_not_local;
_response._modification_time = new HtDateTime(stat_buf.st_mtime);
*** htcommon/defaults.cc Sun Jun 23 23:55:41 2002
--- htcommon/defaults.cc.lha Mon Jun 24 01:01:09 2002
***************
*** 145,151 ****
documents as text while they are some binary format. \
If the list is empty, then all extensions are acceptable, \
provided they pass other criteria for acceptance or rejection. \
! See also <a href=\"#valid_extensions\">valid_extensions</a>. \
" }, \
{ "bad_querystr", "", \
"pattern list", "htdig", "URL", "3.1.0", "Indexing:Where", "bad_querystr: forum=private section=topsecret&passwd=required", " \
--- 145,165 ----
documents as text while they are some binary format. \
If the list is empty, then all extensions are acceptable, \
provided they pass other criteria for acceptance or rejection. \
! See also <a href=\"#valid_extensions\">valid_extensions</a> and \
! <a href=\"#bad_local_extensions\">bad_local_extensions</a>. \
! " }, \
! { "bad_local_extensions", ".php .shtml", \
! "string list", "htdig", "URL", "all", "Indexing:Where", "bad_local_extensions: .php .foo .bar", " \
! This is a list of extensions on URLs which are \
! considered active, that is, the content delivered by the web \
! server is not simply the text of the file, but is generated \
! on-the-fly. This list is used mainly to allow URLs on the local \
! machine to be read using the local filesystem, rather than \
! through HTTP. \
! If the list is empty, then all extensions are acceptable, \
! provided they pass other criteria for acceptance or rejection. \
! See also <a href=\"#valid_extensions\">valid_extensions</a> and \
! <a href=\"#bad_extensions\">bad_extensions</a>. \
" }, \
{ "bad_querystr", "", \
"pattern list", "htdig", "URL", "3.1.0", "Indexing:Where", "bad_querystr: forum=private section=topsecret&passwd=required", " \
--
Lachlan Andrew lh...@ee... Phone: +613 8344-3816 Fax: +613 8344-6678
Department of Electrical and Electronic Engineering CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
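[Editor's note: for readers following along, a configuration using the attribute proposed in this patch might look like the hypothetical fragment below. `bad_local_extensions` is the new attribute from the patch; the other attribute names and paths are illustrative assumptions.]

```
# htdig.conf fragment (hypothetical): read local files directly from
# the filesystem, but never parse server-generated pages (PHP, SSI)
# locally -- those must be fetched over HTTP so the server runs them.
local_urls:           http://www.example.com/=/var/www/html/
bad_local_extensions: .php .shtml
mime_types:           /etc/mime.types
```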
|
|
From: Lachlan A. <lh...@ee...> - 2002-10-02 00:57:32
|
Greetings,
I've almost finished some patches trying to address the
"htsearch input parameters issue" (below).
I've updated defaults.cc to list all variables used by
any of the programs (according to "grep config"), and
described them as best I can.  Where they are different,
the input parameters to htsearch are also listed, and
cross-referenced in both directions.  Since no information
is better than mis-information, there are some '??'s and
'TO BE COMPLETED's.  Some entries which were out of
alphabetical order have also been relocated.  This patch is
at
http://www.ee.mu.oz.au/staff/lha/pub/patch.defaults
A second patch at
http://www.ee.mu.oz.au/staff/lha/pub/patch.inputs
makes htsearch scan the existing config parameters, and
overwrites them if they are given on the command line.  It
also has the #ifdef option of checking that there are no
invalid (hence ignored) command line arguments.  (This
needs cgi.h to include "Dictionary.h", and I don't know
how the make procedure handles dependencies, so it is
disabled by default.)
Before I start testing, could you please confirm that these
are on the right track?
Finally, the "current status" emails refer to problem
numbers which don't match the SourceForge problem numbers.
Where can I find the original numbering?
Thanks :)
Lachlan

> * Not all htsearch input parameters are handled properly: PR#648.
>   Use a consistent mapping of input -> config -> template for all
>   inputs where it makes sense to do so (everything but "config" and
>   "words"?).
>
> * Document all of htsearch's mappings of input parameters to config
>   attributes to template variables. (Relates to PR#648.) Also make
>   sure these config attributes are all documented in defaults.cc,
>   even if they're only set by input parameters and never in the
>   config file.
--
Lachlan Andrew       Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg      CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA         00116K
|
|
From: Jessica B. <jes...@ya...> - 2002-10-02 13:15:29
|
I have two files on my test site being indexed.
fruit.html
pineapple.html
There's a word, "large" on fruit.html. "large" does
NOT appear anywhere within pineapple.html, however,
when I htsearch on the index, both documents show as a
match. In fact, the base_score is very high for the
query "large" on the document pineapple.html.
If I index pineapple.html alone, the query "large"
yields no results. So there is definitely some type
of relationship between the two documents and the word
"large". htdump revealed the relationship by
outputting this:
3 u:http://jtest/pineapple.htm t:pineapples a:0
m:1033563111 s:5824 H: Pineapple trees h:
l:1033563111 L:12 b:5 c:1 g:0 e:
n: S: d:large A:Stump^ATrunk
The key part is "d:large". According to the htdump
doc page online, the "d" element is defined as: "The
text of links pointing to this document. (e.g. <a
href="docURL">description</a>)"
fruit.html, the other HTML file indexed, contains a
hyperlink to pineapple.html that looks like this:
<a href="pineapple.html"
onMouseOver="MM_showHideLayers('Pine_apple','','show','large','','hide','apple','','hide','tomato','')"><img
border="0" src="images/pineapple.jpg" width="80"
height="60"></a>
What configuration attribute must be set to zero in order
to keep that extra anchor text from being indexed and factored
into pineapple.html's word list?  I've basically tried
setting all "documented" factors to zero (including
backlink).
I used the latest htdig-3.2.0b4-092902 as well as a
January 2002 release -- both behave the same way.
__________________________________________________
Do you Yahoo!?
New DSL Internet Access from SBC & Yahoo!
http://sbc.yahoo.com
|
|
From: Jessica B. <jes...@ya...> - 2002-10-02 14:34:45
|
Actually, to append to this problem I'm running into,
it doesn't appear to be the text WITHIN the anchor tag
-- rather, there is text between the beginning <A> and
the ending </A> of the anchor tag that contains the
word "large".
The question still is, though, how do I completely
wipe out any reference to text between these tags as
relevant to the linked document?
Thanks,
-Jes
--- Jessica Biola <jes...@ya...> wrote:
> I have two files on my test site being indexed.
>
> fruit.html
> pineapple.html
>
> There's a word, "large" on fruit.html. "large" does
> NOT appear anywhere within pineapple.html, however,
> when I htsearch on the index, both documents show as
> a
> match. In fact, the base_score is very high for the
> query "large" on the document pineapple.html.
>
> If I index pineapple.html alone, the query "large"
> yields no results. So there is definitely some type
> of relationship between the two documents and the
> word
> "large". htdump revealed the relationship by
> outputting this:
>
> 3 u:http://jtest/pineapple.htm t:pineapples a:0
> m:1033563111 s:5824 H: Pineapple trees h:
> l:1033563111 L:12 b:5 c:1 g:0 e:
>
> n: S: d:large A:Stump^ATrunk
>
> The key part is "d:large". According to the htdump
> doc page online, the "d" element is defined as:
> "The
> text of links pointing to this document. (e.g. <a
> href="docURL">description</a>)"
>
> fruit.html, the other HTML file indexed, contains a
> hyperlink to pineapple.html that looks like this:
>
> <a href="pineapple.html"
>
onMouseOver="MM_showHideLayers('Pine_apple','','show','large','','hide','apple','','hide','tomato','')"><img
> border="0" src="images/pineapple.jpg" width="80"
> height="60"></a>
>
> What configuration attribute must set to zero in
> order
> for that extra anchor text being indexed and
> factored
> into pineapple.html's word list? I've basically
> tried
> setting all "documented" factors to zero (including
> backlink).
>
> I used the latest htdig-3.2.0b4-092902 as well as a
> January 2002 release -- both behave the same way.
__________________________________________________
Do you Yahoo!?
New DSL Internet Access from SBC & Yahoo!
http://sbc.yahoo.com
|
|
From: Gilles D. <gr...@sc...> - 2002-10-02 16:04:01
|
According to Jessica Biola:
> Actually, to append to this problem I'm running into,
> it doesn't appear to be the text WITHIN the anchor tag
> -- rather, there is text between the beginning <A> and
> the ending </A> of the anchor tag that contains the
> word "large".

Ah!  Well that certainly explains why I was unable to reproduce the
problem earlier, based on your first message.  Yes, indexing text
between the <A ...> and </A> tags as link description text for the
referenced document is normal behaviour for htdig.

> The question still is, though, how do I completely
> wipe out any reference to text between these tags as
> relevant to the linked document?

As Geoff pointed out, description_factor is the one that controls the
score that will be applied to this text.  Unfortunately, the best you
can do is drop the score of these documents so they appear at the end
of the results instead of at the start.  In the future, there will be
some sort of option for removing matches with a 0 or small score.

> > There's a word, "large" on fruit.html.  "large" does
> > NOT appear anywhere within pineapple.html, however,
> > when I htsearch on the index, both documents show as a
> > match.  In fact, the base_score is very high for the
> > query "large" on the document pineapple.html.
> >
> > If I index pineapple.html alone, the query "large"
> > yields no results.  So there is definitely some type
> > of relationship between the two documents and the word
> > "large".  htdump revealed the relationship by
> > outputting this:
...
> > What configuration attribute must set to zero in order
> > for that extra anchor text being indexed and factored
> > into pineapple.html's word list?  I've basically tried
> > setting all "documented" factors to zero (including
> > backlink).
> >
> > I used the latest htdig-3.2.0b4-092902 as well as a
> > January 2002 release -- both behave the same way.

By the January 2002 release, do you mean the 3.1.6 release of Jan 31/02,
or one of the January snapshots of 3.2.0b4?  Either way, the behaviour
of description_factor will be similar, the main difference being that
with 3.1.x, you need to reindex after changing this factor, whereas with
3.2.x you don't.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
From: Jessica B. <jes...@ya...> - 2002-10-02 22:07:04
|
Thanks Gilles and Geoff for your responses. Indeed, description_factor
is the setting that I overlooked, confusing it with
meta_description_factor.

--- Gilles Detillieux <gr...@sc...> wrote:
> According to Jessica Biola:
> > Actually, to append to this problem I'm running into,
> > it doesn't appear to be the text WITHIN the anchor tag
> > -- rather, there is text between the beginning <A> and
> > the ending </A> of the anchor tag that contains the
> > word "large".
>
> Ah! Well that certainly explains why I was unable to reproduce the
> problem earlier, based on your first message. Yes, indexing text
> between the <A ...> and </A> tags as link description text for the
> referenced document is normal behaviour for htdig.
>
> > The question still is, though, how do I completely
> > wipe out any reference to text between these tags as
> > relevant to the linked document?
>
> As Geoff pointed out, description_factor is the one that controls the
> score that will be applied to this text. Unfortunately, the best you
> can do is drop the score of these documents so they appear at the end
> of the results instead of at the start. In the future, there will be
> some sort of option for removing matches with a 0 or small score.
>
> > > There's a word, "large", on fruit.html. "large" does
> > > NOT appear anywhere within pineapple.html; however,
> > > when I htsearch on the index, both documents show as a
> > > match. In fact, the base_score is very high for the
> > > query "large" on the document pineapple.html.
> > >
> > > If I index pineapple.html alone, the query "large"
> > > yields no results. So there is definitely some type
> > > of relationship between the two documents and the
> > > word "large". htdump revealed the relationship by
> > > outputting this:
> ...
> > > What configuration attribute must be set to zero to
> > > keep that extra anchor text from being indexed and
> > > factored into pineapple.html's word list? I've
> > > basically tried setting all "documented" factors to
> > > zero (including backlink).
> > >
> > > I used the latest htdig-3.2.0b4-092902 as well as a
> > > January 2002 release -- both behave the same way.
>
> By the January 2002 release, do you mean the 3.1.6 release of Jan 31/02,
> or one of the January snapshots of 3.2.0b4? Either way, the behaviour
> of description_factor will be similar, the main difference being that
> with 3.1.x you need to reindex after changing this factor, whereas with
> 3.2.x you don't.
>
> --
> Gilles R. Detillieux     E-mail: <gr...@sc...>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
> Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
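[Editor's note: the fix this thread converges on is a one-line config change. A minimal htdig.conf fragment, with the zero value as an assumption -- as Gilles notes, in current versions a small factor demotes rather than removes matches:]

```
# Stop text between <A ...> and </A> from boosting the score of
# the *referenced* document.  With 3.1.x you must reindex after
# changing this; with 3.2.x you don't.
description_factor: 0
```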
|
From: Brian W. <bw...@st...> - 2002-10-02 22:41:09
|
At 02:03 3/10/2002, Gilles Detillieux wrote:
>According to Jessica Biola:
> > Actually, to append to this problem I'm running into,
> > it doesn't appear to be the text WITHIN the anchor tag
> > -- rather, there is text between the beginning <A> and
> > the ending </A> of the anchor tag that contains the
> > word "large".
>
>Ah! Well that certainly explains why I was unable to reproduce the
>problem earlier, based on your first message. Yes, indexing text
>between the <A ...> and </A> tags as link description text for the
>referenced document is normal behaviour for htdig.
This is a common XML/SGML terminology problem - people get "tag" and
"element" confused.
"<a href='blah'>" is an "A" start *tag*
"</a>" is an "A" end *tag*
"<a href='blah'>LARGE</a>" is an "A" *element*, which is made up of a
start tag, content and end tag.
So: correct, the "A" tag does not contain "LARGE", but the "A" element does.
-------------------------
Brian White
Step Two Designs Pty Ltd
Knowledge Management Consultancy, SGML & XML
Phone: +612-93197901
Web: http://www.steptwo.com.au/
Email: bw...@st...
Content Management Requirements Toolkit
112 CMS requirements, ready to cut-and-paste
|
|
From: Gilles D. <gr...@sc...> - 2002-10-04 20:40:13
|
According to Lachlan Andrew:
> I've almost finished some patches trying to address the
> "htsearch input parameters issue" (below).
>
> I've updated defaults.cc to list all variables used by
> any of the programs (according to "grep config"), and
> described them as best I can. Where they are different,
> the input parameters to htsearch are also listed, and
> cross-referenced in both directions. Since no information
> is better than mis-information, there are some '??'s and
> 'TO BE COMPLETED's. Some entries which were out of
> alphabetical order have also been relocated. This patch is at
> http://www.ee.mu.oz.au/staff/lha/pub/patch.defaults

I agree with most of the changes in this patch. Good job! To answer
some of your questions, here are a few points to clarify things.

The distinction between "number" and "integer" attribute types is
supposed to be that an attribute labeled "number" can be floating point.
However, I think in practice a lot of these are actually supposed to be
integer-only. I think we'd need to check over how all attributes are
used and label them consistently.

The Block (Global, Server, URL) field indicates whether an attribute can
be set globally only, or if it can be overridden with a different value
in server blocks or URL blocks. See
http://www.htdig.org/dev/htdig-3.2/cf_blocks.html

The code support for author_factor, caps_factor, and url_text_factor is
not complete, so I assume this is why the attributes weren't in
defaults.cc. They're implemented in htsearch, but nothing in htdig tags
words with their corresponding flag values yet.

The remove_default_doc attribute should apply to https:// URLs as well
as http:// ones. If it doesn't right now, I'd consider that a bug.

The keywords and endday, startday et al. are config attributes that can
be overridden by CGI input parameters, so they're not really CGI input
only. All except keywords are documented for 3.1.6, in
http://www.htdig.org/attrs.html, if you want more complete descriptions.
(Support for negative numbers hasn't been added to 3.2 yet, but it will
be before 3.2.0b4 goes out.)

The format, matchesperpage, method and page CGI input parameters have
been around from the beginning, I think, but they are CGI input only,
not config attributes. The config CGI input parameter is most definitely
CGI only. It wouldn't make sense to specify the config file name in a
config file, would it?

All this raises the question of whether we should be listing CGI input
parameters in attrs.html (which is generated from defaults.cc). To me,
that would tend to blur the distinction between the two. I know that in
many cases, a CGI input parameter and a config attribute of the same
name exist (and those config attributes should be documented), but I
think it would confuse the issue if we listed CGI-only parameter names
here. CGI input parameters are listed in
http://www.htdig.org/hts_form.html

What do other developers think about this? I'll hold off on committing
this patch until this question is resolved. (Otherwise, you just know
that some Linux distribution will snatch up that snapshot and we'd be
hounded for a year with questions about why such and such an attribute
doesn't work in the config file. :-P )

> A second patch at
> http://www.ee.mu.oz.au/staff/lha/pub/patch.inputs
> makes htsearch scan the existing config parameters, and
> overwrites them if they are given on the command line. It
> also has the #ifdef option of checking that there are no
> invalid (hence ignored) command line arguments. (This
> needs cgi.h to include "Dictionary.h", and I don't know
> how the make procedure handles dependencies, so it is
> disabled by default.)
>
> Before I start testing, could you please confirm that these
> are on the right track?

This second patch is a pretty dangerous one! The whole reason for
allow_in_form is to let you define, in a controlled manner, which
attributes can be overridden by CGI input parameters (beyond those which
htsearch already does by default). If I read your patch correctly, it
will allow ANY config attribute to be overridden by a CGI input
parameter. E.g.:

http://my.victim.com/cgi-bin/htsearch?nothing_found_file=/etc/passwd

> Finally, the "current status" emails refer to problem
> numbers which don't match the SourceForge problem numbers.
> Where can I find the original numbering?

These PR# style bug numbers are from our old bug tracking database,
prior to our move to SourceForge, and I don't think that database is
accessible anywhere anymore. At the time of the move, I think Geoff
created new bug tracking entries for old bug reports that were still
open, so the STATUS file should be updated to reflect the new numbers.

> > * Not all htsearch input parameters are handled properly: PR#648.
> >   Use a consistent mapping of input -> config -> template for all
> >   inputs where it makes sense to do so (everything but "config" and
> >   "words"?).
> >
> > * Document all of htsearch's mappings of input parameters to config
> >   attributes to template variables. (Relates to PR#648.) Also make
> >   sure these config attributes are all documented in defaults.cc,
> >   even if they're only set by input parameters and never in the
> >   config file.

The original PR#648 referred to the keywords input parameter, which
couldn't be set to a default value by a config attribute prior to 3.1.4.
So, the original bug has been fixed (and likely closed), but in the bug
database comments I had suggested systematically going through all CGI
input parameters and making sure htsearch handles them all consistently
wherever appropriate. I.e., unless there's a reason not to, a
pre-defined CGI input parameter should have a corresponding attribute
that it overrides, and this attribute value should make its way into a
template variable. Also, any pre-defined CGI input parameter should be
processed by Display::createURL(). Likely the only ones that shouldn't
be done this way are page and config. I think we're mostly there now,
but there may be a few stragglers left, both in the code and the
documentation (definitely in the latter).

Note that I stress "pre-defined" CGI input parameters. You can't allow a
user to use any old attribute name as an input parameter, and have that
take precedence! Even allow_in_form must be used very carefully to avoid
opening up big security holes (see my.victim.com URL above). It
shouldn't be used for any attribute that defines part or all of a file
name. The config input parameter is checked for pathname components,
but none of the other input parameters are.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|
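[Editor's note: the nothing_found_file example can be made concrete with a short sketch. This is illustrative pseudo-htsearch, not the real code: htsearch sends the configured "nothing found" file back to the client verbatim, so letting CGI input override a filename attribute turns the search CGI into a file-read oracle for anything the web server's user can open.]

```cpp
// Sketch of the pattern Gilles warns about (illustrative only, not
// actual htsearch code).  The "nothing found" page is read from a
// config attribute and sent straight to the client.
#include <fstream>
#include <string>

// Return the whole contents of `filename` -- what a htsearch-like CGI
// would emit as the body of the "no matches" page.
std::string read_nothing_found_page(const std::string &filename)
{
    std::ifstream f(filename.c_str());
    std::string text, line;
    while (std::getline(f, line)) {
        text += line;
        text += '\n';
    }
    return text;   // empty if the file can't be opened
}

// With a blanket CGI override in place, the request
//   htsearch?nothing_found_file=/etc/passwd
// reaches this function with filename = "/etc/passwd", and the
// attacker receives that file's contents as the results page.
```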
|
From: Lachlan A. <lh...@ee...> - 2002-10-07 00:59:34
|
On Sat, 5 Oct 2002 06:39, Gilles Detillieux wrote:
> According to Lachlan Andrew:
> > I've updated defaults.cc to list all variables used
> > by any of the programs (according to "grep config"),
>
> The distinction between "number" and "integer" attribute
> ... I think we'd need to check over how all
> attributes are used and label them consistently.

I guessed as much. I'll go through and do that.

> The code support for author_factor, caps_factor, and
> url_text_factor is not complete, so I assume this is why
> the attributes weren't in defaults.cc.

OK. How about including them in defaults.cc with a comment
noting they're incomplete?

> I think it would confuse the issue if we listed
> CGI-only parameter names <in attrs.html>. CGI
> input parameters are listed in
> http://www.htdig.org/hts_form.html

Fair enough.

> This second patch is a pretty dangerous one! The whole
> reason for the allow_in_form is to let you define...

Yes, I realise it is a rather scatter-gun approach, but I
wasn't aware of allow_in_form. Once I've fixed defaults.cc,
I'll redo this patch in line with your suggestions. Thanks
for the clarifications :)

> allow_in_form must be used very carefully to avoid
> opening up big security holes (see my.victim.com URL
> above). It shouldn't be used for any attribute that
> defines part or all of a file name. The config input
> parameter is checked for pathname components, but none of
> the other input parameters are.

For "filename-related" attributes, would it be worth (a)
removing a leading "/" and (b) passing it through URL.cc's
code to eliminate excess "../"s? (I'll read again through
the checking of "config".) That way, even if the admin
*does* choose to list them in allow_in_form, the danger is
minimised.

Thank you both for your useful feedback :)
Lachlan
--
Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
|
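[Editor's note: Lachlan's proposed (a) and (b) amount to a path-normalisation step. A sketch of such a helper follows, assuming it runs on any CGI-supplied value destined for a filename attribute. This is hypothetical code, not URL.cc's actual normalisation, and on its own it does not close every escape route (symlinks, for instance).]

```cpp
// Hypothetical sanitizer for a CGI-supplied "filename-related" value:
// strip any leading '/' and collapse "." and ".." components so the
// result can only name something at or below the document root.
#include <string>

std::string sanitize_path(const std::string &in)
{
    std::string out;
    std::string::size_type i = 0;
    while (i < in.size()) {
        // isolate the next '/'-separated component
        std::string::size_type j = in.find('/', i);
        if (j == std::string::npos) j = in.size();
        std::string comp = in.substr(i, j - i);
        if (comp == "..") {
            // drop the previous component instead of escaping upward
            std::string::size_type k = out.rfind('/');
            out.erase(k == std::string::npos ? 0 : k);
        } else if (!comp.empty() && comp != ".") {
            out += '/';
            out += comp;
        }
        i = j + 1;
    }
    // strip the leading '/' so the result stays relative
    return out.empty() ? out : out.substr(1);
}
```

With this, `../../etc/passwd` collapses to a harmless relative path instead of walking out of the tree.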
|
From: Geoff H. <ghu...@ws...> - 2002-10-07 13:20:53
|
On Sunday, October 6, 2002, at 07:58 PM, Lachlan Andrew wrote:

>> The code support for author_factor, caps_factor, and
>> url_text_factor is not complete, so I assume this is why
>> the attributes weren't in defaults.cc.
>
> OK. How about including them in defaults.cc with a comment
> noting they're incomplete?

Sounds OK to me, provided we make an entry in the feature request
tracker (and the STATUS file) that these need to be implemented.

> For "filename-related" attributes, would it be worth (a)
> removing a leading "/" and (b) passing it through URL.cc's
> code to eliminate excess "../"s?

The catch is that you don't want *any* "../" or other techniques to
escape the effective chroot for the CGI. And this is different for the
CGI input than for the config file. It's OK if the config file itself
specifies arbitrary paths, because the administrator installed it
themselves. But it's not OK for the CGI input to attempt to specify
arbitrary paths.

I don't know if there are any good ways to handle this with
allow_in_form. I've basically seen allow_in_form as similar to punching
a hole through your firewall. The code will let you do anything,
including punch a hole in your foot. The key is whether you tell the
user up front about the danger.

-Geoff
|
|
From: Lachlan A. <lh...@ee...> - 2002-10-15 02:16:15
|
Greetings again,
A second try at the patch for defaults.cc (and now
hts_form.html and hts_template.html) is at
http://www.ee.mu.oz.au/staff/lha/pub/patch.docs
(relative to 20020922).
I've removed the CGI arguments which are synonyms of config
attributes, but left "config" and "words". As you point
out, it doesn't make sense for the config file to specify
"config", but it does make sense for there to be a
*default* value. (If defaults.cc is going to be
generated from XML, there could be a separate tag for
defaults which don't go into attrs.html.)  Feel free to
remove it...
The three files are hyperlinked together, although
hts_template.html doesn't have <a name=""> for entries
which are capitalisations of the CGI arguments.
Number now only refers to floating point arguments.
The text no longer points out that https:// URLs don't
have the default doc removed, but the patch doesn't fix the
bug (to make it a docs-only patch).
Cheers,
Lachlan
--
Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
|
|
From: Brian W. <bw...@st...> - 2002-10-15 03:18:21
|
I have been doing some stuff on defaults.xml in the background - I
should have something soon.

However - I noticed that defaults.cc only has an entry for
"heading_factor" and not "heading_factor_1", "heading_factor_2", etc.
I assume this is an error....

Brian
-------------------------
Brian White
Step Two Designs Pty Ltd
Knowledge Management Consultancy, SGML & XML
Phone: +612-93197901
Web: http://www.steptwo.com.au/
Email: bw...@st...
Content Management Requirements Toolkit
112 CMS requirements, ready to cut-and-paste
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-15 12:58:30
|
On Monday, October 14, 2002, at 10:15 PM, Brian White wrote:

> However - I noticed that defaults.cc only
> has an entry for "heading_factor" and not
> "heading_factor_1", "heading_factor_2" etc

No, this is not an error. Remember that in 3.2, each particular tag
would need a separate bit in the "flags" section. So it was decided
that the heading_factor attributes should be consolidated into one
flag. Yes, it's a loss of search precision, but it's a savings of space.

-Geoff
|
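[Editor's note: Geoff's point about flag bits can be sketched in a few lines. The names and factor values here are purely illustrative -- they are not the actual HtWordReference.h constants or htdig defaults -- but they show why each distinguishable context costs one bit in the per-word flags field, and hence why h1..h6 were folded into a single heading flag.]

```cpp
// Each context a word can appear in needs its own bit in the word
// database's "flags" field.  Folding all heading levels into one bit
// trades per-level precision for space.  Values are illustrative only.
enum WordContext {
    IN_TEXT        = 1 << 0,
    IN_TITLE       = 1 << 1,
    IN_HEADING     = 1 << 2,  // one bit covers h1..h6
    IN_KEYWORDS    = 1 << 3,
    IN_DESCRIPTION = 1 << 4   // META description
};

// A searcher can then weight a match with one factor per bit
// (factor values below are made up, not htdig's defaults):
double score_word(unsigned flags)
{
    double score = 0.0;
    if (flags & IN_TEXT)        score += 1.0;    // text_factor
    if (flags & IN_TITLE)       score += 100.0;  // title_factor
    if (flags & IN_HEADING)     score += 5.0;    // heading_factor
    if (flags & IN_KEYWORDS)    score += 10.0;   // keywords_factor
    if (flags & IN_DESCRIPTION) score += 50.0;   // meta_description_factor
    return score;
}
```

A separate heading_factor_1..heading_factor_6 scheme would need six bits here instead of one, for every word in the index.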
|
From: Brian W. <bw...@st...> - 2002-10-16 07:35:34
|
Well, it is close to ready - I now have it successfully generating

* htcommon/defaults.cc
* htdocs/cf_byprog.html
* htdocs/cf_byname.html
* 95% of htdocs/attrs.html

I still need to bundle up the changes - I was thinking of creating
a patch based on 3.2.0b4 and just posting that here.

At this stage, however, I have a particular question - what is the
status of defaults.cc? How much has to be merged in? Will they need
to exist in parallel in the CVS for a period? I have some code that
would help here ( bits of hacked-together C and Perl code ) - I just
want to know whether I need to include it in the bundle!

Regs
Brian
-------------------------
Brian White
Step Two Designs Pty Ltd
Knowledge Management Consultancy, SGML & XML
Phone: +612-93197901
Web: http://www.steptwo.com.au/
Email: bw...@st...
Content Management Requirements Toolkit
112 CMS requirements, ready to cut-and-paste
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-16 12:49:58
|
On Wednesday, October 16, 2002, at 02:27 AM, Brian White wrote:

> * 95% of htdocs/attrs.html

I guess I'm not clear on what "95%" means. Does this refer to the
markup that you mentioned before?

> I still need to bundle up the changes - I was thinking of creating
> a patch based on 3.2.0b4 and just posting that here.

Yes, that's probably a good idea.

> the status of defaults.cc? How much has to be merged in? Will
> they need to exist in parallel in the CVS for a period?

They can't really "exist in parallel" in the CVS--your code, after all,
generates defaults.cc. Certainly some amount of merging will be needed
for a while, but I don't think that barrier will be too high. But
certainly I think your patch will need to be checked fairly carefully
for possible "gotchas", and then we'll probably need to merge in
Lachlan's proposed fixes.

> I have some code that would help here ( bits of hacked-together C and
> Perl code ) - I just want to know whether I need to include it
> in the bundle!

I'm assuming you mean code to help with the merging?

-Geoff
|
|
From: Brian W. <bw...@st...> - 2002-10-17 02:08:20
|
At 22:49 16/10/2002, Geoff Hutchison wrote:
>On Wednesday, October 16, 2002, at 02:27 AM, Brian White wrote:
>> * 95% of htdocs/attrs.html
>
>I guess I'm not clear on what "95%" means. Does this refer to the markup
>that you mentioned before?

Basically means that at the time of writing that email, I hadn't quite
finished writing it.

>>I have some code that would help here ( bits of hacked-together C and
>>Perl code ) - I just want to know whether I need to include it
>>in the bundle!
>
>I'm assuming you mean code to help with the merging?

I basically have a tool that, given a copy of the current form of
defaults.cc, will produce a usable version of defaults.xml. It was
safer to do it programmatically than by hand. I can use that tool to
take a merged version of defaults.cc and produce a version of
defaults.xml. The problem is that a few of the descriptions will need
to be reworked quite heavily by hand to produce valid XML.

I will produce my patch, including some covering documentation, and
post that to this list. Should be sometime by the end of next week.

Brian
-------------------------
Brian White
Step Two Designs Pty Ltd
Knowledge Management Consultancy, SGML & XML
Phone: +612-93197901
Web: http://www.steptwo.com.au/
Email: bw...@st...
Content Management Requirements Toolkit
112 CMS requirements, ready to cut-and-paste
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-17 03:51:39
|
On Wednesday, October 16, 2002, at 07:41 PM, Brian White wrote:

> I can use that tool to take a merged version of defaults.cc and produce
> a version of defaults.xml. The problem is that a few of the descriptions
> will need to be reworked quite heavily by hand to produce valid XML.

OK, that makes sense of course. I had forgotten that you wrote a tool
to generate the defaults.xml file. I would guess with some care, we can
separate the "new" entries and only rework them if needed.

-Geoff
|
|
From: Gabriele B. <an...@ti...> - 2002-10-16 13:26:00
|
>Well, it is close to ready - I now have it successfully generating

Well, first and foremost, it is the first time I express my opinion
regarding this solution, and I think it is really efficient and
intelligent. Good on ya, mate Brian! :-)

Having said this, and also acknowledging that I don't know how the XML
file is structured, I want to raise the problem of 'translation' of the
attributes' descriptions, uses, etc., into different languages.

Any ideas?

Ciao and thanks,
-Gabriele
|
|
From: Brian W. <bw...@st...> - 2002-10-17 02:36:44
|
At 23:25 16/10/2002, Gabriele Bartolini wrote:
> >Well, it is close to ready - I now have it successfully generating
>
>Well, first and foremost, it is the first time I express my opinion regarding
>this solution and I think it is really efficient and intelligent. Good on
>ya, mate Brian! :-)
>
>Having said this, and also taking aknowledgement that I don't know how the
>XML file is structured, I want to raise the problem of 'translation' of
>the attributes' descriptions, uses, etc., in different languages.
>
>Any ideas?
Yes.
Let's start with the DTD as it stands:
<!ELEMENT HtdigAttributes ( attribute+ ) >
<!-- attribute:
name : Variable Name
type : Type of Variable
programs : Whitespace separated list of programs/modules
using this attribute
block : Configuration block this can be used in ( optional )
version : Version that introduced the attribute
category : Attribute category (to split documentation)
-->
<!ELEMENT attribute ( default, ( nodocs | ( example+, description ) ) ) >
<!ATTLIST attribute name CDATA #REQUIRED
    type (string|integer|boolean) "string"
programs CDATA #REQUIRED
block CDATA #IMPLIED
version CDATA #REQUIRED
category CDATA #REQUIRED
>
<!-- Default value of attribute - configmacro="true" would indicate the
value is actually a macro ( eg BIN_DIR )
-->
<!ELEMENT default (#PCDATA) >
<!ATTLIST default configmacro (true|false) "false" >
<!-- Basically a flag that suppresses documentation -->
<!ELEMENT nodocs EMPTY>
<!-- An example value that goes into the documentation -->
<!ELEMENT example (#PCDATA) >
<!ENTITY % paratext "#PCDATA|em|strong|a|ref" >
<!ENTITY % text "%paratext;|table|p|br|ol|ul|dl|codeblock" >
<!ELEMENT description (%paratext;) >
... + all the element for formatting the description
The first thing to do is then look at the items that might need
translation:
* description
* block
* category
* example
Analysis:
* description is the one that will always need it
* I think the values for "block" and "category" should be considered
as 'keys' rather than the actual values - they should be translated
by lookup table.
* examples will *sometimes* require translation
To this end, I would suggest changing the following
<!ELEMENT attribute ( default, ( nodocs | ( example+, description ) ) ) >
to
<!ELEMENT attribute ( default, ( nodocs | ( example*, docset+ ) ) ) >
<!-- lang would be the id of the language using a standard identifier, or
     set to "default" for the default language -->
<!ELEMENT docset ( example*, description ) >
<!ATTLIST docset lang CDATA #REQUIRED >
As an example:
<attribute name="no_title_text"
           type="string"
           programs="htsearch"
           version="3.1.0"
           category="Presentation:Text" >
    <default>filename</default>
    <example>!!!!!?</example>
    <docset lang="default" >
        <example>No Title Found</example>
        <description>This specifies the text to use in search results when no
            title is found in the document itself. If it is set to
            filename, htsearch will use the name of the file itself,
            enclosed in square brackets (e.g. [index.html]).
        </description>
    </docset>
    <docset lang="fr" >
        <example>Aucun titre retrouvé</example>
        <description>Ceci spécifie le texte à utiliser dans les résultats
            d'une recherche lorsque aucun titre se trouve dedans le document.
            Si on le règle à filename, htsearch se servira du nom du fichier
            lui-même, inclus entre crochets (p.ex. [index.html]).
        </description>
    </docset>
</attribute>
( And no - I don't speak French. I got a friend to do the translation
for me )
This would then put the capability into the XML file. I would need to
figure out how to do characters like é - possibly as &eacute;. As to
* how this might then be used to generate documentation
* how the translated versions will be maintained
are different issues altogether!
Note that it isn't a big change, but I think we should leave
it for version 2 defaults.xml.
>Ciao and thanks,
>-Gabriele
-------------------------
Brian White
Step Two Designs Pty Ltd
Knowledge Management Consultancy, SGML & XML
Phone: +612-93197901
Web: http://www.steptwo.com.au/
Email: bw...@st...
Content Management Requirements Toolkit
112 CMS requirements, ready to cut-and-paste
|
|
From: Gabriele B. <g.b...@co...> - 2002-10-17 07:04:02
|
Ciao Brian,

> Analysis:
> * description is the one that will always need it

Right.

> * I think the values for "block" and "category" should be considered
>   as 'keys' rather than the actual values - they should be translated
>   by lookup table.

I agree. Keys are better and, whenever there's another key, we just
need to modify the DTD.

> * examples will *sometimes* require translation

Are you planning to use just the value of the attribute in the example?
Usually an example is made up of just a pair (name of attribute:
value). I guess we could use the value, as long as the attribute comes
from the element itself. Sometimes we may need a description for the
example too; for instance:

accept_language: fr
description: set the HTTP language to be sent to the server to 'French'

What do you think?

> This would then put the capability into the XML file. I would need to
> figure out how to do characters like é - possibly as &eacute;. As to

Well ... I dunno. However, these characters are not permitted in XML,
if I am not wrong. XML supports Unicode and doesn't know anything about
SGML entities like '&eacute;'; we should use the exact hex
representation code (for latin1 the code goes from 80 to D7FF). But
this is an almost unknown world for me.

> * how this might then be used to generate documentation
> * how the translated versions will be maintained

We could make up translation groups. And ... if worse comes to worst,
we could just use the English one. I could maintain the Italian one.
No troubs.

> Note that it isn't a big change, but I think we should leave
> it for version 2 defaults.xml.

Is it a problem for you to enable translations already? Would it change
much for you?

Thanks a lot Brian,
-Gabriele
--
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ;
|
|
From: Gilles D. <gr...@sc...> - 2002-10-04 21:04:49
|
According to Lachlan Andrew:
> Finally, the "current status" emails refer to problem
> numbers which don't match the SourceForge problem numbers.
> Where can I find the original numbering?

Unless we can get the original bug reports from the old database, these
pieced-together descriptions from old e-mail references will have to
do...

PR#348 was fixed in 3.1.3. A missing or invalid port number will get
set correctly. I can't find an old e-mail giving more details, but I
think the bug is fixed in 3.2 as well, unless something similar was
reopened by all the changes to URL.cc.

PR#648 was fixed in 3.1.4. Its most relevant e-mail is this...

> According to Geoff Hutchison:
> > >At 8:41 AM -0700 9/20/99, Ber...@wd... wrote:
> > > >I hope it's not a big problem to include the keyword statement
> > > >into the pagelist variable.
> > >
> > >No, but it's definitely a bug. It won't make 3.1.3, which I'm
> > >releasing tonight. Sorry.
>
> There are a few other problems with inconsistent handling of input
> parameters, but this seems to be the most troublesome one. It's an
> ugly hack, but until this problem is fixed, I think you can make sure
> the keywords parameter gets propagated in followups by putting this
> attribute in your config file:
>
> allow_in_form: keywords
>
> It's really meant for allowing input parameters to override config
> attributes, but a nice side effect of it is that the listed input
> parameters get propagated.

I can't find any e-mail that includes the text of PR#650, but the
STATUS file has a synopsis: if exact isn't specified in
search_algorithms, $(WORDS) is not set correctly.

Subject: Re: Meta Description incorrectly flagged (PR#859)

> At 1:51 PM -0400 6/11/00, rw...@in... wrote:
> >In the db.worddump file (created with htdig -t), the value in the
> >"flags" column for words that appear in the meta description tags is
> >the same as that for those in the title.
> >
> >The flags column is "2". Shouldn't this be "16" as defined in
> >HtWordReference.h?
>
> Yes, this is correct. I'll have to hunt through and find out where it
> is set incorrectly. HtWordReference.h attempts to resolve the
> confusion in the code between META descriptions and ht://Dig's use of
> "description" to mean the link text.
>
> -Geoff

Hope this helps.
--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
|