From: Lachlan A. <lh...@ee...> - 2002-10-15 02:16:15
|
Greetings again,
A second try at the patch for defaults.cc (and now
hts_form.html and hts_template.html) is at
http://www.ee.mu.oz.au/staff/lha/pub/patch.docs
(relative to 20020922).
I've removed the CGI arguments which are synonyms of config
attributes, but left "config" and "words". As you point
out, it doesn't make sense for the config file to specify
"config", but it does make sense for there to be a
*default* value. (If defaults.cc is going to be
generated from XML, there could be a separate tag for
defaults which don't go into attr.html.) Feel free to
remove it...
The three files are hyperlinked together, although
hts_template.html doesn't have <a name=""> for entries
which are capitalisations of the CGI arguments.
Number now only refers to floating point arguments.
The text no longer points out that https:// URLs don't
have the default doc removed, but the patch doesn't fix the
bug (to make it a docs-only patch).
Cheers,
Lachlan
--
Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
|
|
From: Geoff H. <ghu...@us...> - 2002-10-13 07:13:58
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b4: In progress
(mifluz merge essentially finished, contact Geoff for patch to test)
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
(Please note that everything added here should have a tracker PR# so
we can be sure they're fixed. Geoff is currently trying to add PR#s for
what's currently here.)
SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295)
* The new mifluz merged code is slow.
(no PR, Geoff hasn't added mifluz-merge to CVS yet.)
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set but work fine without wordlist_compress.
(the date is definitely stored correctly, even with compression on
so this must be some sort of weird htsearch bug) PR#618737.
* Not all htsearch input parameters are handled properly: PR#405278. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
correctly: PR#405294. (The documentation for 3.2.0b1 was updated, but can
we fix this?)
(More importantly, do we ever want exact to /not/ be specified?)
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.
NEEDED FEATURES:
* Field-restricted searching. (e.g. PR#460833)
* Return all URLs. (PR#618743)
* Handle noindex_start & noindex_end as string lists.
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
defaults.cc, even if they're only set by input parameters and never
in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and
defaults.cc.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and
completions.
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
locations. PR#405715.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
(Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch. PR#405765.
* URL.cc tries to parse malformed URLs (which can cause further problems)
(It should probably just set everything to empty).
|
|
From: Joe R. J. <jj...@cl...> - 2002-10-13 06:38:41
|
On Sat, 12 Oct 2002, Jim Cole wrote:
> Date: Sat, 12 Oct 2002 19:06:08 -0600 (MDT)
> From: Jim Cole <gre...@yg...>
> To: Joe R. Jah <jj...@cl...>
> Cc: Geoff Hutchison <ghu...@ws...>,
>     htd...@li...
> Subject: Re: [htdig-dev] Mifluz merge 10-10-2002
>
> Joe R. Jah's bits of Fri, 11 Oct 2002 translated to:
>
> >mifluz-merge-20021010:
> >.../htword/.libs/libhtword.so: undefined reference to `libiconv_open'
> >.../htword/.libs/libhtword.so: undefined reference to `libiconv_close'
> >.../htword/.libs/libhtword.so: undefined reference to `libiconv'
> >gmake[1]: *** [htfuzzy] Error 1
> >gmake[1]: Leaving directory `/tmp/htdig/mifluz-merge-20021010/htfuzzy'
> >gmake: *** [all-recursive] Error 1
> >
> >Nothing to run;(
>
> Do you have the iconv library installed in a standard location?
I have installed it in /usr/local, the default location.
> The library base name is the same as the symbol prefixes above
> (i.e. libiconv). It appears that configure checks for iconv, but
> continues along without complaint if it is not found (actually it
> kicks out a message "suggesting" that you install the library).
Configure actually finds it, but it still complains about type conflicts.
> I had to use the --with-libiconv-prefix= option with configure,
> but the errors I ran into when iconv wasn't found were different
> than what you report above. In my case, the build failed when the
> iconv header file wasn't found.
In my case the header file was found, but it had type conflicts.
Here is an excerpt from config.log:
-------------------------------8<-------------------------------
configure:21399: checking for iconv
configure:21429: gcc -o conftest -pg -Wall -W -Wmissing-declarations -Wmissing-prototypes -I/usr/local/include -L/usr/local/lib conftest.c >&5
/var/tmp/cctBmmGZ.o: In function `main':
/var/tmp/cctBmmGZ.o(.text+0x22): undefined reference to `libiconv_open'
/var/tmp/cctBmmGZ.o(.text+0x3e): undefined reference to `libiconv'
/var/tmp/cctBmmGZ.o(.text+0x4d): undefined reference to `libiconv_close'
configure:21432: $? = 1
configure: failed program was:
#line 21407 "configure"
#include "confdefs.h"
#include <stdlib.h>
#include <iconv.h>
#ifdef F77_DUMMY_MAIN
# ifdef __cplusplus
extern "C"
# endif
int F77_DUMMY_MAIN() { return 1; }
#endif
int
main ()
{
iconv_t cd = iconv_open("","");
iconv(cd,NULL,NULL,NULL,NULL);
iconv_close(cd);
;
return 0;
}
configure:21471: gcc -o conftest -pg -Wall -W -Wmissing-declarations -Wmissing-prototypes -I/usr/local/include -L/usr/local/lib conftest.c -liconv >&5
configure:21474: $? = 0
configure:21477: test -s conftest
configure:21480: $? = 0
configure:21493: result: yes
configure:21501: checking for iconv declaration
configure:21538: gcc -c -pg -Wall -W -Wmissing-declarations -Wmissing-prototypes conftest.c >&5
configure:21516: conflicting types for `libiconv'
/usr/local/include/iconv.h:82: previous declaration of `libiconv'
-------------------------------8<-------------------------------
I am curious why previous mifluz-merge versions compiled, although
an identical message was in their config.log files. Any ideas?
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jj...@cl...
|
|
From: Jim C. <gre...@yg...> - 2002-10-13 01:44:56
|
Geoff Hutchison's bits of Fri, 11 Oct 2002 translated to:
>I posted another mifluz merge snapshot. Conveniently, it was posted
I was able to build all programs under OS X (10.2). There were a
number of warnings, but no errors. I logged the build and can provide
the warning messages if needed. It was necessary to use the
--disable-shared option with configure. I am using the current 10.2
Developer Tools with GCC 3 selected.
I tested on a small site and htdig runs to completion; however htpurge
crashes on a bus error. Run under gdb, htpurge terminates as follows.
WordDict::Unref() Unref on non existing word occurrence
htpurge: deletion of
Program received signal EXC_BAD_ACCESS, Could not access memory.
0x00074fc4 in String::get() const (this=0x97f4ac) at String.cc:244
244 Data[Length] = '\0'; // We always leave room for this.
The values of Length and Data are as follows.
#0 0x00074fc4 in String::get() const (this=0x97f4ac) at String.cc:244
244 Data[Length] = '\0'; // We always leave room for this.
(gdb) print Length
$1 = 2
(gdb) print Data
$2 = 0x0
The backtrace.
(gdb) bt
#0 0x00074fc4 in String::get() const (this=0x97f4ac) at String.cc:244
#1 0x00075444 in String::operator<<(String const&) (this=0xbffff5c0, s=@0x97f4ac) at htString.h:259
#2 0x000088b0 in HtWordReference::Key() const (this=0x97f4a8) at ../htlib/htString.h:247
#3 0x00003850 in delete_word(WordList*, WordDBCursor&, WordReference const*, Object&) (words=0xf05bc, cursor=@0x97f4ac, word_arg=0x97f4a8, data=@0xbffff5c0) at ../htlib/htString.h:48
#4 0x00014360 in WordCursorOne::WalkNextStep() (this=0x97f470) at WordCursorOne.cc:358
#5 0x00013d34 in WordCursorOne::WalkNext() (this=0x97f470) at WordCursorOne.cc:269
#6 0x000136ac in WordCursorOne::Walk() (this=0x97f470) at WordCursorOne.cc:158
#7 0x00003a68 in purgeWords(Dictionary*) (discard_list=0xf14d0) at htpurge.cc:350
#8 0x00002eec in main (ac=2, av=0xbffffa04) at htpurge.cc:151
#9 0x0000279c in _start (argc=2, argv=0xbffffa04, envp=0xbffffa10) at /SourceCache/Csu/Csu-45/crt.c:267
#10 0x0000261c in start () at /usr/include/gcc/darwin/3.1/g++-v3/streambuf:129
The problem is repeatable. I would be happy to perform further
debugging if needed.
Is it necessary that htpurge be run before searching? The
documentation seems to imply that it is not necessary. If this is the
case, then htsearch is also having problems on this platform. Both
"All" and "Any" searches result in no matches on things that should
definitely be in the database. Boolean queries appear to result in a
crash. The web server log reports the following.
*** malloc_zone_malloc[28390]: argument too large: -913191692
[Sat Oct 12 19:25:47 2002] [error] [client 10.0.1.217] Premature end of script headers: /Users/greyleaf/Sites/cgi-bin/htsearchm1010.cgi
I did skip htpurge altogether for this test, so the problem is not due
to an htpurge crash corrupting anything.
Jim
|
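The gdb session shows String::get() writing a terminator through Data while Data is 0x0 and Length is 2, which is enough on its own to explain the EXC_BAD_ACCESS. How the object got into that state is not visible in the trace. As a minimal sketch only, here is a guarded accessor on a hypothetical MiniString class; the class and its simulateStaleState() helper are made up for illustration and are not the htlib String code.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for htlib's String. Only the two members seen in
// the gdb session (Data, Length) are modelled.
class MiniString {
public:
    MiniString() : Data(nullptr), Length(0) {}

    explicit MiniString(const char* s) : Data(nullptr), Length(0) {
        if (s) {
            Length = std::strlen(s);
            Data = new char[Length + 1];   // always leave room for the terminator
            std::memcpy(Data, s, Length);
            Data[Length] = '\0';
        }
    }

    ~MiniString() { delete[] Data; }
    MiniString(const MiniString&) = delete;
    MiniString& operator=(const MiniString&) = delete;

    // Return a NUL-terminated view. Unlike the line in the backtrace, this
    // never writes through a null buffer, whatever Length claims.
    const char* get() const {
        if (Data == nullptr) return "";
        Data[Length] = '\0';
        return Data;
    }

    // Illustration only: reproduce the state printed by gdb
    // (Length == 2, Data == 0x0).
    void simulateStaleState() {
        delete[] Data;
        Data = nullptr;
        Length = 2;
    }

private:
    char*       Data;
    std::size_t Length;
};
```

A guard like this would only hide the symptom. A Length of 2 with no buffer suggests the object was already inconsistent before get() was called, so the caller chain in frames #1 to #3 is probably the better place to look.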
|
From: Jim C. <gre...@yg...> - 2002-10-13 01:06:12
|
Joe R. Jah's bits of Fri, 11 Oct 2002 translated to:
>mifluz-merge-20021010:
>.../htword/.libs/libhtword.so: undefined reference to `libiconv_open'
>.../htword/.libs/libhtword.so: undefined reference to `libiconv_close'
>.../htword/.libs/libhtword.so: undefined reference to `libiconv'
>gmake[1]: *** [htfuzzy] Error 1
>gmake[1]: Leaving directory `/tmp/htdig/mifluz-merge-20021010/htfuzzy'
>gmake: *** [all-recursive] Error 1
>
>Nothing to run;(
Do you have the iconv library installed in a standard location?
The library base name is the same as the symbol prefixes above
(i.e. libiconv). It appears that configure checks for iconv, but
continues along without complaint if it is not found (actually it
kicks out a message "suggesting" that you install the library).
I had to use the --with-libiconv-prefix= option with configure,
but the errors I ran into when iconv wasn't found were different
than what you report above. In my case, the build failed when the
iconv header file wasn't found.
Jim
|
|
From: Joe R. J. <jj...@cl...> - 2002-10-12 00:27:54
|
On Fri, 11 Oct 2002, Geoff Hutchison wrote:
> Date: Fri, 11 Oct 2002 13:06:32 -0400 (EDT)
> From: Geoff Hutchison <ghu...@ws...>
> To: htd...@li...
> Subject: [htdig-dev] Mifluz merge 10-10-2002
>
> I posted another mifluz merge snapshot. Conveniently, it was posted
> October 10th, so both Europeans and Americans can agree on the date
> format. ;-)
OS: BSD/OS-4.3
mifluz-merge-20020804:
Compiled without error; every program dumped core.
mifluz-merge-20020827:
htsearch.o: In function `StringList::~StringList(void)':
/usr/src/WWW/htdig/mifluz-merge-20020827/htsearch/../htlib/Object.h(.text+0x17cc): undefined reference to `Parser::~Parser(void)'
gmake[1]: *** [htsearch] Error 1
gmake[1]: Leaving directory `/tmp/htdig/mifluz-merge-20020827/htsearch'
gmake: *** [all-recursive] Error 1
htdig ran; every other program dumped core.
mifluz-merge-20021010:
../htword/.libs/libhtword.so: undefined reference to `libiconv_open'
../htword/.libs/libhtword.so: undefined reference to `libiconv_close'
../htword/.libs/libhtword.so: undefined reference to `libiconv'
gmake[1]: *** [htfuzzy] Error 1
gmake[1]: Leaving directory `/tmp/htdig/mifluz-merge-20021010/htfuzzy'
gmake: *** [all-recursive] Error 1
Nothing to run;(
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah jj...@cl...
|
|
From: Neal R. <ne...@ri...> - 2002-10-11 20:10:34
|
Hey,
I just wanted to make you aware of Valgrind. It's an open-source
memory-leak detector and memory debugger. Get it here:
http://developer.kde.org/~sewardj/
I have access to Purify and Insure++, which cost quite a bit of coin,
and Valgrind is pretty good in comparison. I'm using it on the htdig
binaries with success.
Thanks.
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Neal R. <ne...@ri...> - 2002-10-11 18:39:05
|
I've been working on a semi-independent merge.. you can get it here:
http://ai.rightnow.com/htdig/index.html
This is such a large code volume to change... the software engineer in
me is nervous ;-) The way the Linux Kernel group does things for large
changes is something to think about.
I stopped working on the merge temporarily when the destructors caused
problems in a library setting (libhtdig).
Thanks
On Fri, 11 Oct 2002, Geoff Hutchison wrote:
>
> I posted another mifluz merge snapshot. Conveniently, it was posted
> October 10th, so both Europeans and Americans can agree on the date
> format. ;-)
>
> This release fixes some compilation problems on RH 8.0 (i.e. gcc-3.2) and
> should be quite stable.
>
> OTOH, it's also still slower than the current 3.2.0b4 snapshot. Profiling
> data will come in another e-mail.
>
> Usual disclaimers apply. This merge requires some form of iconv call,
> either from libc or the libiconv (which seems fairly portable).
>
> <http://www.htdig.org/files/snapshots/>
>
> -Geoff
>
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> htdig-dev mailing list
> htd...@li...
> https://lists.sourceforge.net/lists/listinfo/htdig-dev
>
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-11 17:13:47
|
I ran the mifluz-merge "branch" through gprof with no optimization.
Honestly, the results fall in line with what I expected. The test was
run on 2500 files from my file:///usr/share/doc/ directory.
  %   cumulative   self               self     total
 time   seconds   seconds     calls  s/call   s/call  name
 6.89      6.95     6.95    3469150    0.00     0.00  CDB___bam_search
 5.12     12.11     5.16    4155826    0.00     0.00  __lock_get_internal
 5.04     17.19     5.08    7943668    0.00     0.00  CDB_memp_fget
 4.49     21.73     4.53   38744639    0.00     0.00  CDB___bam_defpfx
 4.33     26.10     4.37   53265077    0.00     0.00  CDB___bam_cmp
 4.05     30.18     4.08    7943668    0.00     0.00  CDB_memp_fput
 3.77     33.99     3.80   15503143    0.00     0.00  URL::~URL [in-charge]()
 2.43     36.43     2.45    7918156    0.00     0.00  __lock_put_internal
 2.25     38.70     2.27      31454    0.00     0.00  Server::Server[not-in-charge](URL, StringList*)
 2.18     40.91     2.20    7918156    0.00     0.00  CDB___lock_getobj
 2.01     42.93     2.03   15836047    0.00     0.00  CDB___lock_getlocker
 1.99     44.94     2.01    7310340    0.00     0.00  CDB___db_icursor
 1.88     46.83     1.89    7917891    0.00     0.00  __lock_checklocker
 1.74     48.59     1.76    1814632    0.00     0.00  CDB___bam_iitem
 1.74     50.35     1.75                              CDB___bam_defcmp
 1.70     52.06     1.71    7308751    0.00     0.00  CDB___db_c_close
 1.48     53.55     1.49    4155826    0.00     0.00  CDB_lock_vec
Unfortunately, I "upgraded" this machine to RH8.0, which seems to have
the usual RH X.0 bugs. So I'm working on getting comparable profiling
from the default 3.2.0b4 branch.
OTOH, it seems like the database code is just slower. I'm going to get
testing results from the native mifluz code itself this weekend--which
will tell me if the problem is from Loic's code getting slower or if
I've missed something in his API changes that's slowing us down.
-Geoff
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-11 17:06:29
|
I posted another mifluz merge snapshot. Conveniently, it was posted
October 10th, so both Europeans and Americans can agree on the date
format. ;-)
This release fixes some compilation problems on RH 8.0 (i.e. gcc-3.2) and
should be quite stable.
OTOH, it's also still slower than the current 3.2.0b4 snapshot. Profiling
data will come in another e-mail.
Usual disclaimers apply. This merge requires some form of iconv call,
either from libc or the libiconv (which seems fairly portable).
<http://www.htdig.org/files/snapshots/>
-Geoff
|
|
From: Neal R. <ne...@ri...> - 2002-10-10 19:11:29
|
Go ahead and post your questions here. We'll try to answer them.
Honestly, you'll get better answers by posting to the group. It gives
all of us the opportunity to correct any inaccuracies in answers a
single developer might give.
Thanks.
On Thu, 10 Oct 2002, Adrian A. Edwards wrote:
> Hello, I am sending this email for some assistance with htdig search
> engine. Is there a number to get in contact with one of the
> developers? My number is listed below.
>
> --
> Adrian A. Edwards
> Senior Consultant
> Office of Finance and Administration
> DOC/NOAA/OFA/ISMO
> P: 301.763.6300 x129; F: 301.763.6365
>
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Adrian A. E. <Adr...@no...> - 2002-10-10 17:56:20
|
Hello, I am sending this email for some assistance with htdig search
engine. Is there a number to get in contact with one of the
developers? My number is listed below.
--
Adrian A. Edwards
Senior Consultant
Office of Finance and Administration
DOC/NOAA/OFA/ISMO
P: 301.763.6300 x129; F: 301.763.6365
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-10 02:11:01
|
On Tue, 8 Oct 2002, Gilles Detillieux wrote:
> The latest attempt is at...
>
> http://www.htdig.org/files/snapshots/README.mifluz-merge
> http://www.htdig.org/files/snapshots/mifluz-merge-20020827.tar.gz
Actually I have a newer version. But I found an actual *break* when
compiling with gcc-3.2 (aside from the header warnings). So I'm hoping
I can fix that and upload the new version soon.
Still, I agree 100% with Gilles that the key here is to start with some
profiling. I'd like to know why it seems so much slower than the
current 3.2.0b4 version. I've done some, but more help and more
eyeballs will definitely help. I can confirm that the mifluz merge is
more solid, but I'd hate to have indexing slow down even more.
-Geoff
|
|
From: Gilles D. <gr...@sc...> - 2002-10-08 22:23:05
|
According to Greg Fenton:
> I am attempting to free up some time so that I can spend it on
> enhancing htdig-3.2.0.
>
> I have a need to add some functionality, but before I do that I'd like
> to focus attention on some problems I've been seeing during database
> rebuilds.
>
> Geoff, Gilles: do you have a suggestion as to where effort would be
> best focused? I'd hate to just start banging around and find that I'm
> working on unnecessary mods or stepping on others' toes.
The number 1 priority for htdig 3.2, which is supposed to address a
lot of the database reliability problems, is the mifluz code merge.
The latest attempt is at...
http://www.htdig.org/files/snapshots/README.mifluz-merge
http://www.htdig.org/files/snapshots/mifluz-merge-20020827.tar.gz
It needs testing, debugging, profiling and optimizing to get it up to
snuff. If you can help in that effort, it would be greatly appreciated.
I haven't been able to spare the time lately.
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
|
From: Greg F. <gre...@ya...> - 2002-10-08 16:05:53
|
I am attempting to free up some time so that I can spend it on
enhancing htdig-3.2.0.
I have a need to add some functionality, but before I do that I'd like
to focus attention on some problems I've been seeing during database
rebuilds.
Geoff, Gilles: do you have a suggestion as to where effort would be
best focused? I'd hate to just start banging around and find that I'm
working on unnecessary mods or stepping on others' toes.
Thanks in advance,
greg_fenton.
=====
Greg Fenton
gre...@ya...
|
|
From: Brian W. <bw...@st...> - 2002-10-08 02:44:34
|
Geoff,
I am still working on it, but it has gone on the
backburner a bit recently.
I must confess, I ended up generating my own
initial version, which I did using a C
for loop - I decided the C parser was the
best way to parse a C data structure!
I was thinking that
* defaults.xml and defaults.dtd
should go in htcommon
* the perl script for generating
defaults.cc should also go in
htcommon. This should be very
small and should assume that
defaults.xml is valid. This
then obviates the need for an
XML parser to be present when
rebuilding the code
* the code for generating the
HTML documentation pages should
include proper XML validation
The biggest issue I have at the moment is
- how to cope with the structure descriptions,
which are currently HTML snippets.
There are three options:
1) Store as HTML snippets, but translate <,> and &
to &lt;, &gt; and &amp; - this will work, but I can
tell you it is horrible to maintain by hand!
2) Store as HTML snippets, but escaped by CDATA sections
3) Include some capacity for markup in the
XML.
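Option (1) comes down to a three-character substitution applied to each description before it is written into the XML file. A minimal sketch, assuming a helper named escapeMarkup (the name is made up for illustration and does not come from the ht://Dig tree):

```cpp
#include <string>

// Escape the three characters that would otherwise be read as XML markup,
// so an HTML snippet can be stored as plain character data.
std::string escapeMarkup(const std::string& html) {
    std::string out;
    out.reserve(html.size());
    for (char c : html) {
        switch (c) {
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            case '&': out += "&amp;"; break;
            default:  out += c;       break;
        }
    }
    return out;
}
```

Applied to `<a href="#valid_extensions">valid_extensions</a>`, this yields `&lt;a href="#valid_extensions"&gt;valid_extensions&lt;/a&gt;`, which shows why the result is unpleasant to edit by hand.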
What I would like to do is include some abstraction
for links to other htdig elements such as attributes,
eg:
<ref type="attr">valid_extensions</ref>
instead of
<a href="#valid_extensions">valid_extensions</a>
which makes the data much more useful, but if I do that
it prevents option (2). This would make the obvious choice
option (3) - but then what markup do you allow?
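To make the trade-off concrete, here is a rough sketch of what an HTML generator might do with the proposed `<ref type="attr">` element: rewrite each one into the anchor form shown above. The function name expandAttrRefs is hypothetical, and it uses plain string matching rather than a real XML parser, so it only handles the exact spelling `<ref type="attr">name</ref>`.

```cpp
#include <string>

// Rewrite every <ref type="attr">NAME</ref> into <a href="#NAME">NAME</a>.
// Text outside the references is copied through unchanged.
std::string expandAttrRefs(const std::string& text) {
    const std::string open  = "<ref type=\"attr\">";
    const std::string close = "</ref>";
    std::string out;
    std::string::size_type pos = 0;
    while (true) {
        const std::string::size_type start = text.find(open, pos);
        if (start == std::string::npos) break;
        const std::string::size_type nameBegin = start + open.size();
        const std::string::size_type end = text.find(close, nameBegin);
        if (end == std::string::npos) break;   // unterminated: leave the rest as-is
        const std::string name = text.substr(nameBegin, end - nameBegin);
        out += text.substr(pos, start - pos);
        out += "<a href=\"#" + name + "\">" + name + "</a>";
        pos = end + close.size();
    }
    out += text.substr(pos);
    return out;
}
```

Keeping the reference abstract in the XML means the same data could later be rendered differently (for example as a man-page cross reference) without touching defaults.xml.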
Anyway - here is my proposed DTD, which uses a very
cut down HTML for markup. Any thoughts would be
appreciated.
<!DOCTYPE HtdigAttributes [
<!ELEMENT HtdigAttributes ( attribute+ ) >
<!-- ConfigVar element is data set for each configuration variable.
Attributes:
name : Variable Name
type : Type of Variable
programs : Whitespace separated list of programs/modules
using this attribute
block : Configuration block this can be used in ( optional )
version : Version that introduced the attribute
category : Attribute category (to split documentation)
-->
<!ELEMENT attribute ( default, ( nodocs | ( example, description ) ) ) >
<!ATTLIST attribute name CDATA #REQUIRED
type (string|integer|boolean) "string"
programs CDATA #REQUIRED
block CDATA #IMPLIED
version CDATA #REQUIRED
category CDATA #REQUIRED
>
<!ELEMENT default (#PCDATA) >
<!ATTLIST default configmacro (true|false) "false" >
<!-- Basically a flag that suppresses documentation -->
<!ELEMENT nodocs EMPTY>
<!ELEMENT example (#PCDATA) >
<!ENTITY % paratext "#PCDATA|em|strong|a|ref" >
<!ENTITY % text "%paratext;|table|p|br|ol|ul|dl|codeblock" >
<!ELEMENT description (%paratext;)* >
<!ELEMENT p (%paratext;)* >
<!ELEMENT br EMPTY >
<!ELEMENT ol (li+) >
<!ELEMENT ul (li+) >
<!ELEMENT li (%text;)* >
<!ELEMENT table (tr+) >
<!ELEMENT tr (td|th)+ >
<!ELEMENT td (%text;)* >
<!ELEMENT th (%text;)* >
<!ATTLIST table border CDATA #IMPLIED
width CDATA #IMPLIED >
<!ATTLIST td align (
<!ELEMENT dl (dt|dd)+ >
<!ELEMENT dt (%paratext;)* >
<!ELEMENT dd (%text;)* >
<!ELEMENT strong (%text;)* >
<!ELEMENT em (%text;)* >
<!ELEMENT a (%text;)* >
<!ATTLIST a href CDATA #REQUIRED >
<!ELEMENT ref (#PCDATA)* >
]>
What I was going to suggest was that the
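#include <stdio.h>
/* Assumes the defaults[] array from defaults.cc is in scope and that
   its last entry has a null name, which ends the loop. */
int main()
{
    for (int i = 0; defaults[i].name; i++)
    {
        /* One <ConfigVar> element per entry. Adjacent string literals
           are concatenated by the compiler. */
        printf( "<ConfigVar name=\"%s\" type=\"%s\" "
                "sinceversion=\"%s\" programs=\"%s\" \n"
                " configblock=\"%s\" \n"
                " category=\"%s\" >\n"
                " <default>%s</default>\n"
                " <example>%s</example>\n"
                " <description>%s </description>\n"
                "</ConfigVar>\n\n",
                defaults[i].name,
                defaults[i].type,
                defaults[i].version,
                defaults[i].programs,
                defaults[i].block,
                defaults[i].category,
                defaults[i].value,
                defaults[i].example,
                defaults[i].description );
    }
    return 0;
}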
It has gone a bit on the backburner recently, however
it is still in the pipeline. I would certainly not halt changes
to defaults.cc - we just come up with a final one at
some point.
There are several issues I have come across so far
1) Is there an XML parser available?
This is not something that should be assumed - what I
was going to suggest is a very basic script that generates
defaults.cc that assumes that defaults.xml is valid, with a
more full validating version that generates the
HTML pages.
2) Where do we put it in the hierarchy - it is used in both
the
3) How do we include structured documentation snippets?
There are roughly three ways to do it:
a) HTML snippets escaped by entities - ie, have
the documentation as a snippet of HTML, but
with every "<", ">" and "&" translated to
"&lt;", "&gt;" and "&amp;".
This will work, but is as ugly as sin to
try and maintain.
b) HTML snippets escaped by CDATA marked sections
c)
One issue I realised is that attempting to use
At 08:06 5/10/2002, Geoff Hutchison wrote:
>A long time ago, in a thread far away...
>http://www.geocrawler.com/archives/3/8825/2002/9/0/9520602/
>
>Brian said:
> > I was looking at defaults.cc and I was wondering if
> > it might be better managing the info as an XML file
> > and then using that as a basis for generating
> > defaults.cc and the HTML docs.
>
>And I said:
> > Formatting defaults.cc into defaults.xml isn't hard and I'd be glad to
> > do that with some emacs macros.
>
>OK, I saw that on my TODO list today and added a defaults.xml to the
>mainline CVS. It's also gzip'ed and attached to this e-mail. It's a first
>draft, so to speak and I'm not entirely sure it's all well-formed XML or
>that the DTD is set. So please make suggestions.
>
>However, I thought I'd pass this back to Brian--are you still willing to
>write a script that'll generate a minimal defaults.cc? If so, I'll work on
>the XML file a bit more--I need to make sure nothing was deleted by my
>emacs slicing.
>-Geoff
-------------------------
Brian White
Step Two Designs Pty Ltd
Knowledge Management Consultancy, SGML & XML
Phone: +612-93197901
Web: http://www.steptwo.com.au/
Email: bw...@st...
Content Management Requirements Toolkit
112 CMS requirements, ready to cut-and-paste
|
|
From: Neal R. <ne...@ri...> - 2002-10-08 01:07:51
|
Hello,
The latest snapshot of a library version of HtDig is here:
http://ai.rightnow.com/htdig/
* Produces libhtdig.so
A library version callable from external programs
* Produces libhtdigphp.so
A version callable directly from PHP scripts
Bug Fixes:
I've re-enabled the ability to use zlib compression for WordDB-page
compression. This is useful as a workaround for those folks who run
into the "FATAL ERROR:Compressor::get_vals invalid comptype" bug.
Use the new config setting 'wordlist_compress_zlib: true' in your
htdig.conf file. It is set to false by default.
There are also a few memory leak fixes hidden in the code.
This snapshot also supports building Win32 Native binaries with an
alternate set of Makefiles.
TODO:
Merge with current htdig-3.2.0b4 snapshot and retest on Win32, Linux & Solaris
I'll be updating this snapshot again in a few days. I will also try to
supply updated memory leak and profiling reports.
Thanks!
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-07 13:20:53
|
On Sunday, October 6, 2002, at 07:58 PM, Lachlan Andrew wrote:
>> The code support for author_factor, caps_factor, and
>> url_text_factor is not complete, so I assume this is why
>> the attributes weren't in defaults.cc.
>
> OK. How about including them in defaults.cc with a comment
> noting they're incomplete?
Sounds OK to me, provided we make an entry in the feature request
tracker (and the STATUS file) that these need to be implemented.
> For "filename-related" attributes, would it be worth (a)
> removing a leading "/" and (b) passing it through URL.cc's
> code to eliminate excess "../"s?
The catch is that you don't want *any* "../" or other techniques to
escape the effective chroot for the CGI. And this is different for the
CGI input than for the config file. It's OK if the config file itself
specifies arbitrary paths because the administrator installed it
themselves. But it's not OK for the CGI input to attempt to specify
arbitrary paths.
I don't know if there are any good ways to handle this with
allow_in_form. I've basically seen allow_in_form similar to punching a
hole through your firewall. The code will let you do anything,
including punch a hole in your foot. The key is whether you tell the
user up front about the danger.
-Geoff
|
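The distinction drawn above, rejecting any "../" in CGI input rather than trying to normalise it away, can be sketched as a simple predicate. This is an illustration under stated assumptions, not the check htsearch performs on the "config" parameter: the function name isSafeRelativePath is hypothetical, and it deliberately ignores backslashes, URL-encoded forms, and symlinks.

```cpp
#include <string>

// Accept only a non-empty relative path in which no '/'-separated
// component is exactly "..". Anything else is refused outright.
bool isSafeRelativePath(const std::string& path) {
    if (path.empty() || path[0] == '/') return false;   // empty or absolute
    std::string::size_type begin = 0;
    while (begin <= path.size()) {
        std::string::size_type end = path.find('/', begin);
        if (end == std::string::npos) end = path.size();
        // Compare the component [begin, end) against "..".
        if (path.compare(begin, end - begin, "..") == 0) return false;
        begin = end + 1;
    }
    return true;
}
```

Refusing the input is simpler to reason about than cleaning it: a value such as `templates/../../etc/passwd` is rejected, while `templates/long.html` and a file name like `a..b` pass. Values from the config file would not go through a check like this, since the administrator wrote them.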
|
From: Lachlan A. <lh...@ee...> - 2002-10-07 00:59:34
|
On Sat, 5 Oct 2002 06:39, Gilles Detillieux wrote:
> According to Lachlan Andrew:
> > I've updated defaults.cc to list all variables used
> > by any of the programs (according to "grep config"),
>
> The distinction between "number" and "integer" attribute
> I think we'd need to check over how all
> attributes are used and label them consistently.
I guessed as much. I'll go through and do that.
> The code support for author_factor, caps_factor, and
> url_text_factor is not complete, so I assume this is why
> the attributes weren't in defaults.cc.
OK. How about including them in defaults.cc with a comment
noting they're incomplete?
> I think it would confuse the issue if we listed
> CGI-only parameter names <in attr.html>. CGI
> input parameters are listed in
> http://www.htdig.org/hts_form.html
Fair enough.
> This second patch is a pretty dangerous one! The whole
> reason for the allow_in_form is to let you define...
Yes, I realise it is a rather scatter-gun approach, but I
wasn't aware of allow_in_form. Once I've fixed defaults.cc,
I'll redo this patch in line with your suggestions. Thanks
for the clarifications :)
> allow_in_form must be used very carefully to avoid
> opening up big security holes (see myvictim.com URL
> above). It shouldn't be used for any attribute that
> defines part or all of a file name. The config input
> parameter is checked for pathname components, but none of
> the other input parameters are.
For "filename-related" attributes, would it be worth (a)
removing a leading "/" and (b) passing it through URL.cc's
code to eliminate excess "../"s? (I'll read again through
the checking of "config".) That way, even if the admin
*does* choose to list them in allow_in_form, the danger is
minimised.
Thank you both for your useful feedback :)
Lachlan
--
Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
|
From: Geoff H. <ghu...@us...> - 2002-10-06 07:14:09
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b4: In progress
(mifluz merge essentially finished, contact Geoff for patch to test)
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
(Please note that everything added here should have a tracker PR# so
we can be sure they're fixed. Geoff is currently trying to add PR#s for
what's currently here.)
SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295)
* The new mifluz merged code is slow.
(no PR, Geoff hasn't added mifluz-merge to CVS yet.)
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set but work fine without wordlist_compress.
(the date is definitely stored correctly, even with compression on
so this must be some sort of weird htsearch bug) PR#618737.
* Not all htsearch input parameters are handled properly: PR#405278. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
correctly: PR#405294. (The documentation for 3.2.0b1 was updated, but can
we fix this?)
(More importantly, do we ever want exact to /not/ be specified?)
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.
NEEDED FEATURES:
* Field-restricted searching. (e.g. PR#460833)
* Return all URLs. (PR#618743)
* Handle noindex_start & noindex_end as string lists.
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
defaults.cc, even if they're only set by input parameters and never
in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and
defaults.cc.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and
completions.
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
locations. PR#405715.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
(Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch. PR#405765.
* URL.cc tries to parse malformed URLs (which can cause further problems)
(It should probably just set everything to empty).
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-04 23:15:16
|
On Fri, 4 Oct 2002, Gilles Detillieux wrote:
> I'll look at it next week. I guess this means any changes to defaults.cc
> by Lachlan's patches will eventually be lost, unless the same changes
> are made to defaults.xml too.
No, I'll be glad to merge them. It's easy enough to go from his patches
to the defaults.xml. I need to go through the defaults.xml thoroughly
anyway to make sure there wasn't anything dropped in the conversion.
-Geoff |
|
From: Gilles D. <gr...@sc...> - 2002-10-04 23:11:41
|
According to Geoff Hutchison:
> OK, I saw that on my TODO list today and added a defaults.xml to the
> mainline CVS. It's also gzip'ed and attached to this e-mail. It's a first
> draft, so to speak and I'm not entirely sure it's all well-formed XML or
> that the DTD is set. So please make suggestions.
>
> However, I thought I'd pass this back to Brian--are you still willing to
> write a script that'll generate a minimal defaults.cc? If so, I'll work on
> the XML file a bit more--I need to make sure nothing was deleted by my
> emacs slicing.
I'll look at it next week. I guess this means any changes to defaults.cc
by Lachlan's patches will eventually be lost, unless the same changes
are made to defaults.xml too.
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
|
From: Gilles D. <gr...@sc...> - 2002-10-04 22:56:29
|
According to Lachlan Andrew:
> Greetings again Gilles,
>
> Thanks for your feedback. I've fixed up some of the
> sloppiness in the patch (patch.kurl, relative to
> htdig-3.2.0b4-20020922) and done some fairly thorough
> testing on it. It now only removes leading slashes if
> there are at least the prescribed number. It also stores
> the slash count as '0'+slashcount rather than
> '\1'+slashcount. That retains the efficiency of using a
> single-character, but means that naive reading of the
> string gives the correct result (provided slashcount < 10).
Great! I think your new patch is bang on. I'd commit it right now
if I wasn't chicken about breaking something just before a snapshot.
I'll do it Monday, so we have a chance to test it a bit first (though
I assume you've done that already).
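The single-character slash count described in the quoted text can be sketched as follows. The function names are hypothetical and are not taken from patch.kurl; the sketch only shows why '0'+slashcount reads naturally while '\1'+slashcount does not, and that the scheme assumes a count below 10.

```cpp
// Hypothetical names for illustration; not the identifiers used in patch.kurl.

// Earlier scheme from the message: '\1' + count yields a control character,
// so a count of 2 is stored as the unprintable byte 0x03.
char encodeSlashCountOld(int count)
{
    return static_cast<char>('\1' + count);
}

// Revised scheme: '0' + count yields the printable digit for counts 0..9,
// so a naive read of the stored string shows the right number.
char encodeSlashCount(int count)
{
    return static_cast<char>('0' + count);
}

// Inverse of encodeSlashCount; only meaningful for the digits '0'..'9'.
int decodeSlashCount(char marker)
{
    return marker - '0';
}
```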
> I still haven't found out the "right" way to call test/url,
> but I have called it by hand, and the results match those
> of the original, with one exception. It now normalises the
> host name (e.g. resolves aliases) for *all* protocols with
> an IP host, rather than just http://. It still only
> removes "index.html" from "http://".
Yes, I think that was a bug before, and probably shouldn't be kept that
way. I don't see why URL.cc should treat http: URLs any differently
than https: URLs. Am I missing something, or shouldn't the
if (strcmp ((char*)_service, "http") == 0)
be changed to
if (strncmp ((char*)_service, "http", 4) == 0)
? Similarly, I see a couple lines in the URL::URL() constructor that
treat http: as a special case, which should likely be changed. As it
stands, any URL of the form service:relative/path will be parsed without
regard to the parent URL, unless service is http. I'm sure that's wrong
for https, and it may be wrong for other services too. Comments?
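The difference between the two comparisons above can be checked in isolation. In this sketch the function names are hypothetical and `service` stands in for the `_service` member; the third function is an assumed stricter alternative, not something proposed in the thread. Note that the 4-character prefix test accepts any service name that merely begins with "http".

```cpp
#include <cstring>

// Exact match, as in the current code: true for "http" only.
bool isHttpExact(const char* service)
{
    return std::strcmp(service, "http") == 0;
}

// Prefix match, as in the proposed change: true for "http", "https",
// and also for any other name starting with "http".
bool isHttpPrefix(const char* service)
{
    return std::strncmp(service, "http", 4) == 0;
}

// Assumed stricter alternative: accept exactly the two known services.
bool isHttpOrHttps(const char* service)
{
    return std::strcmp(service, "http") == 0 ||
           std::strcmp(service, "https") == 0;
}
```

Whether the looser prefix match matters in practice depends on which service names can reach this code, which the thread does not settle.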
> I have also expanded the tests in test/url.cc to test the
> new functionality, and to test alias resolution. The patch
> is in "patch.test". It would be good to have a file of the
> "correct" output of this file so that changes to URL.cc
> could be tested simply with diff, but I didn't know how
> to put that in the patch.
That's a good point. As far as I can tell right now, though, the
Makefile in the test subdirectory will build url.cc but it won't run it.
I think the thing to do would be to write a t_url script, similar to the
other t_* scripts, and put all the tests and diffing you want in there.
t_url would then need to be added to the TESTS list in Makefile.am,
and url.output (or whatever you call it) would be added to EXTRA_DIST.
I've never played around with stuff in the tests directory, so I can't
say for sure that I'm not missing something.
> In the process of testing, I noticed that double slashes
> ('//') are normalised out *after* resolving '../'. This
> means that '/foo//../' resolves to '/foo/', when UNIX would
> resolve it to '/'. Is the current behaviour deliberate, or
> an oversight?
I'd say almost certainly an oversight. I guess the test above would tell
you whether changing it breaks anything. If I recall, the double slash
removal was added as an afterthought, so I don't think a lot of thought
went into placing it exactly in the right spot in the order of processing.
As I mentioned before as well, this feature needs to be configurable,
as some non-filesystem URLs need double slashes to remain intact.
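The ordering issue with '/foo//../' can be reproduced with two small steps applied in either order. This is a simplified sketch, not the code in URL.cc: both function names are hypothetical, the path is assumed to be absolute, and "." segments are not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collapse every run of '/' into a single '/'.
std::string collapseSlashes(const std::string& path)
{
    std::string out;
    for (char c : path) {
        if (c == '/' && !out.empty() && out.back() == '/')
            continue;  // skip a slash that directly follows another
        out += c;
    }
    return out;
}

// Resolve ".." segments of an absolute path. Empty segments (from "//")
// are treated as real segments, which is what makes the order matter.
std::string resolveDotDot(const std::string& path)
{
    std::vector<std::string> segments;
    std::string current;
    for (std::size_t i = 1; i <= path.size(); ++i) {
        if (i == path.size() || path[i] == '/') {
            if (current == "..") {
                if (!segments.empty())
                    segments.pop_back();  // drop the parent segment
            } else {
                segments.push_back(current);
            }
            current.clear();
        } else {
            current += path[i];
        }
    }
    std::string result = "/";
    for (std::size_t k = 0; k < segments.size(); ++k) {
        if (k > 0)
            result += '/';
        result += segments[k];
    }
    return result;
}
```

With these definitions, `collapseSlashes(resolveDotDot("/foo//../"))` gives "/foo/" (the behaviour reported above), while `resolveDotDot(collapseSlashes("/foo//../"))` gives "/" (the UNIX result).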
> Finally, what is the schedule/criterion for the release of
> 3.2.0b4? The KDE team won't base their code on snapshots,
> and so I'd like to expedite this... Which of the items in
> Geoff's regular "current status" posts have to be addressed
> before the release?
There's no schedule as such. The standard line is "when it's ready."
The STATUS file hasn't been updated in quite some time, although Geoff
has begun updating it again today. The only items that MUST be addressed
before a beta release are the "showstoppers". However, Geoff has set
a few goals for 3.2.0b4, not listed in STATUS (yet), which will have to
be done before release. These are, at the very least, the mifluz code
update and merging in all 3.1.6 fixes and features that are applicable.
I think the new query parser may have been a goal too. I think we need
to get an "official" list of these, though, so we're all on the same page.
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
|
From: Geoff H. <ghu...@ws...> - 2002-10-04 22:06:40
|
A long time ago, in a thread far away...
http://www.geocrawler.com/archives/3/8825/2002/9/0/9520602/
Brian said:
> I was looking at defaults.cc and I was wondering if
> it might be better managing the info as an XML file
> and then using that as a basis for generating
> defaults.cc and the HTML docs.
And I said:
> Formatting defaults.cc into defaults.xml isn't hard and I'd be glad to
> do that with some emacs macros.
OK, I saw that on my TODO list today and added a defaults.xml to the
mainline CVS. It's also gzip'ed and attached to this e-mail. It's a first
draft, so to speak and I'm not entirely sure it's all well-formed XML or
that the DTD is set. So please make suggestions.
However, I thought I'd pass this back to Brian--are you still willing to
write a script that'll generate a minimal defaults.cc? If so, I'll work on
the XML file a bit more--I need to make sure nothing was deleted by my
emacs slicing.
-Geoff |
|
From: Gilles D. <gr...@sc...> - 2002-10-04 21:31:21
|
According to Geoff Hutchison:
> On Fri, 4 Oct 2002, Gilles Detillieux wrote:
> > However, I think in practice a lot of these are actually supposed to
> > be integer-only. I think we'd need to check over how all attributes
> > are used and label them consistently.
>
> Good point.
So I guess the question is, who will do it? :-)
> > All this raises the question of whether we should be listing CGI input
> > parameters in attrs.html (which is generated from defaults.cc). To me,
>
> Hmm. I thought one reason for doing this was to allow config-file
> overrides for most, if not all CGI input. Yes, we should probably enhance
> the hts_form.html file too, if not make it auto-generated.
Sure, any CGI input parameter that overrides its corresponding config
file attribute should be documented in attrs.html, because then they
are attributes. However, for the ones where the name is different
(e.g. format->template_name, matchesperpage->matches_per_page,
method->match_method) we should only have an entry in attrs.html for
the attribute name, not the CGI parameter name (although the description
should mention the CGI name). Also, for input parameters like config,
page, and currently "words", because these are not attributes, they
probably shouldn't be listed as such. Mind you, I can't think of a single
good reason why "words" shouldn't be a config attribute (other than the
fact it would mean changing several places where it's currently read from
"input" only).
On a related note, in Display::setVariables(), SELECTED_FORMAT
and FORMAT are set from input->get("format") rather than
config->Find("template_name"), which is another inconsistency. If we do
this properly, then the only CGI input parameters that are read from the
"input" object would be config and page, except in Display::createURL()
and the section of main() that transfers them to config.
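The naming rule discussed above (document the attribute name, and treat a few inputs as CGI-only) can be sketched as a lookup table. The function names are hypothetical and this is not how htsearch is actually organised; only the three renamed pairs and the CGI-only names come from the message.

```cpp
#include <map>
#include <string>

// Hypothetical helper: return the config attribute that a CGI input
// parameter maps to. Parameters without a different attribute name
// are assumed to map to an attribute of the same name.
std::string attributeForCgiParam(const std::string& param)
{
    static const std::map<std::string, std::string> renamed = {
        {"format",         "template_name"},
        {"matchesperpage", "matches_per_page"},
        {"method",         "match_method"},
    };
    auto it = renamed.find(param);
    return it != renamed.end() ? it->second : param;
}

// Inputs the message lists as not being attributes ("words" only
// currently), so they would not get entries in attrs.html.
bool isCgiOnlyParam(const std::string& param)
{
    return param == "config" || param == "page" || param == "words";
}
```

Keeping the renames in one table would give the documentation and the code that copies input into config a single place to agree on.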
> > to avoid opening up big security holes (see myvictim.com URL above).
> > It shouldn't be used for any attribute that defines part or all of a
> > file name. The config input parameter is checked for pathname components,
> > but none of the other input parameters are.
>
> Yes, I'm not even sure the current documentation for allow_in_form is good
> enough for this yet. Perhaps it should give an explicit example of bad
> behavior with a note explaining how you can shoot yourself in the foot.
Good point.
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|