From: Gilles D. <gr...@sc...> - 2001-11-01 21:19:19
|
According to Wolfgang Mueller:
> On Tuesday 30 October 2001 19:06, Geoff Hutchison wrote:
> > On Tue, 30 Oct 2001, Wolfgang Mueller wrote:
> > > E.g. are you using GNU Emacs for indenting?
> >
> > Yes, but there's no set policy. I try to uphold indenting near the code
> > itself, or use the standard Emacs indenting for C++ mode.
> >
> > -Geoff
>
> Hi,
> I just found out that for what I want to do (a htdig plugin for the GIFT) I
> have to do very very little because of the great configurability of htdig
> (congrats and thanks to you):
>
> So in fact, what I am going to do is to call a very slightly modified version
> of htsearch (stripped of the CGI stuff, slightly stripped display routines)
> with a config file that produces MRML (the GIFT's communication protocol,
> XML-based), which I will popen, and parse into XML to integrate it with the
> GIFT's other query results.

Before you go stripping out code from htsearch and end up having to maintain another search program in parallel to htsearch, be sure you fully investigate the template capabilities of htsearch. There's almost nothing in the output that you can't change through some template file and/or config attribute. In fact, I recently committed to the 3.1.6 development code a number of new attributes that take care of the last few remaining areas that weren't configurable. See the new attribute search_results_contenttype in this coming Sunday's 3.1.6 snapshot, as well as the existing add_anchors_to_excerpt attribute.

> What I would like to know: the summary strings you generate, are they
> well-formed XML (in particular is there for each opening tag also a closing
> tag?)

You probably want to turn off anchors in excerpts (although they are well-formed HTML/XHTML), and all the rest is under control of the template files. I think the distributed templates are all well-formed with closing tags. Exceptions to this right now are the <br>, <hr>, <img>, <input> and <option> tags in the header, footer, nomatch, syntax and wrapper HTML files, but you're going to change them anyway. Oh, there's also a non-self-terminating <br> tag in long.html and short.html that you may want to change to <br/> or remove.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)     Fax: (204)789-3930 |
From: Gilles D. <gr...@sc...> - 2001-10-31 20:23:51
|
According to Andrew Daviel: > On Fri, 19 Oct 2001, Gilles Detillieux wrote: > > My question would be how much support do you need. Right now, all the > > developers are pretty strapped for time, so you probably won't get a lot > > of development work done for you. However, if you want to make changes > > to the C++ code yourself, I'm sure Geoff, myself, or a few others can > > suggest the best approaches and where to put in the changes. If they're > > done in a general enough way, they could even be incorporated into the > > 3.2 development code. > > What is the best way for me to do this ? Send you a patch file ? Put > my code tree somewhere you can look at it ? > I was using 3.2.0b3 I think but can use a later snapshot if that is > better. > I haven't looked at it for a couple of weeks and have forgotten exactly > where I got to; I had it working on a Mercator world map but need > to check the scaling for zoomed-in maps for it to be useful. For small to medium sized changes, patches against the latest snapshot have the best chance of being applied successfully to the current CVS tree. I should point out that anything that goes into the distributed source should be fairly general and configurable, and not something that would only work at one site. > > I don't know what that would involve, but it might be possible with a > > front-end wrapper script for htsearch. > > It's not so bad - map clicks are passed as form elements map.x and map.y > in the URL as usual, but I need to scale them. Actually, it's a > philosophical issue I guess - do I pass pure latitude and longitude to > htsearch, thus always requiring a wrapper, or do I pass click x,y plus > map extents, and map size in pixels in the config file, allowing me to > do the transformation of click point into position inside htsearch. > I have currently done the latter. Sounds reasonable. If it requires htsearch changes anyway, may as well make it self-contained if you can. -- Gilles R. 
Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
From: Wolfgang M. <Wol...@cu...> - 2001-10-31 19:47:12
|
On Tuesday 30 October 2001 19:06, Geoff Hutchison wrote: > On Tue, 30 Oct 2001, Wolfgang Mueller wrote: > > E.g. are you using GNU Emacs for indenting? > > Yes, but there's no set policy. I try to uphold indenting near the code > itself, or use the standard Emacs indenting for C++ mode. > > -Geoff Hi, I just found out that for what I want to do (a htdig plugin for the GIFT) I have to do very very little because of the great configurability of htdig (congrats and thanks to you): So in fact, what I am going to do is to call a very slightly modified version of htsearch (stripped of the CGI stuff, slightly stripped display routines) with a config file that produces MRML (the GIFT's communication protocol, XML-based), which I will popen, and parse into XML to integrate it with the GIFT's other query results. What I would like to know: the summary strings you generate, are they well-formed XML (in particular is there for each opening tag also a closing tag? ) Cheers, and thanks for your fast answers, Wolfgang -- Dr. Wolfgang Müller, assistant == teaching assistant Personal page: http://cui.unige.ch/~vision/members/WolfgangMueller.html Maintainer, GNU Image Finding Tool (http://www.gnu.org/software/gift) |
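Wolfgang's plan above — popen a stripped htsearch whose templates emit XML, then parse the result — can be sketched roughly as follows. This is a hypothetical sketch, not htdig's actual interface: the binary path, the `-c` config flag, the `words=...` input convention, and the `<results>`/`<result>` element names are all placeholders to be adapted to a real installation.

```python
import subprocess
import xml.etree.ElementTree as ET

def run_htsearch(words, config="mrml.conf"):
    """Invoke htsearch and capture its output. The path, flag, and input
    convention here are illustrative only; check your build for the real
    command-line interface."""
    proc = subprocess.run(
        ["./htsearch", "-c", config],
        input=f"words={words}", capture_output=True, text=True, check=True,
    )
    return proc.stdout

def parse_results(xml_text):
    """Parse XML produced by the templates. This only works if every tag
    is closed -- the well-formedness question Wolfgang asks above. The
    element and attribute names are hypothetical."""
    root = ET.fromstring(xml_text)
    return [(r.get("url"), r.get("score")) for r in root.iter("result")]
```

Note that a single unclosed `<br>` in a template (as Gilles warns in his reply) would make `ET.fromstring` raise a `ParseError`, which is exactly why the closing-tag question matters for this approach.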
From: Geoff H. <ghu...@ws...> - 2001-10-30 18:10:26
|
On Tue, 30 Oct 2001, Wolfgang Mueller wrote: > E.g. are you using GNU Emacs for indenting? Yes, but there's no set policy. I try to uphold indenting near the code itself, or use the standard Emacs indenting for C++ mode. -Geoff |
From: Wolfgang M. <Wol...@cu...> - 2001-10-30 13:45:05
|
Hi, I am just hacking htdig a bit in order to be able to use a htdig plugin within the GIFT. For being able to send patches, I would like to know: what are your indentation standards in htdig? Is there some indent call or something equivalent I can perform before diffing? E.g. are you using GNU Emacs for indenting? Cheers, Wolfgang -- Dr. Wolfgang Müller, assistant == teaching assistant Personal page: http://cui.unige.ch/~vision/members/WolfgangMueller.html Maintainer, GNU Image Finding Tool (http://www.gnu.org/software/gift) |
From: <uta...@we...> - 2001-10-29 13:45:28
|
Hi Jamie, thanks a lot for your help. I solved my problem with 'url_rewrite_rules' and I'm very happy :-). I could also remove the value of the parameter with a regular expression:

url_rewrite_rules: (.*)&_last=[0-9]*&(.*) \\1&\\2

-- Uta Becht

"Jamie Anstice" <jam...@sl...> wrote on 26.10.01:

Won't this just reject the whole URL? If I understand the problem, Uta wants to throw out the session parameter but leave the rest intact, so the page is fetched once only. I've come across this issue before, and I've got a patch for 3.2.x (which I ported from our 3.1.6 version) but I think you could also use url_rewrite_rules too. My patch is somewhat more specific than url_rewrite_rules, in that it just removes unwanted parameters + value from the URL. I'll post it along with my patch for ignoring the alt text from images (which is driven from a config option) in a few days - I'm currently experimenting with tweaking the scoring to make an alternate 'or' behaviour which scores up results which contain more than one search term.

Short explanation by way of example: say I'm indexing a university website. The chemistry department has a whole bunch of pages with the word chemistry all through them, and one page telling students where to buy replacement lab glassware. An 'and' search for 'chemistry glassware' finds this page and nothing else, an 'and' search for 'chemistry glassware sales' finds nothing. An 'or' search for 'chemistry glassware sales' is swamped by the occurrence of 'chemistry', and 'glassware' is lost in the noise. What I'm doing is factoring in the number of distinct words from the search phrase found in the result to bump up the score for pages with more than one search term. Initial results look quite promising, but will need a bit of tuning to improve search speed.

Jamie Anstice
Search Engineer
S.L.I. Systems
jam...@sl...
ph: 64 961 3262  mobile: 64 21 264 9347

Geoff Hutchison <ghu...@ws...> Sent by: htd...@li... 26/10/01 01:49
To: "Uta Becht" <uta...@we...>
cc: "htdig" <htd...@li...>
Subject: Re: [htdig-dev] Problemes with bad query-string

At 11:12 AM +0200 10/25/01, Uta Becht wrote:
>Can someone give me an idea at which position of htdig I should
>eliminate this bad query_parameter ??

Why not use bad_querystr: <http://www.htdig.org/attrs.html#bad_querystr>

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

_______________________________________________
htdig-dev mailing list
htd...@li...
https://lists.sourceforge.net/lists/listinfo/htdig-dev |
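Uta's rewrite rule is a plain regex substitution, so a quick way to sanity-check such a rule before committing it to a config file is to try it in a Python session; Python's regex flavour is close enough for this pattern (the `\\1` doubling in the config file is just string quoting and drops out):

```python
import re

# The rule from the message: drop an "_last=<digits>" parameter that sits
# between two '&' separators, keeping everything on either side of it.
RULE = re.compile(r"(.*)&_last=[0-9]*&(.*)")

def rewrite(url):
    return RULE.sub(r"\1&\2", url)

print(rewrite("http://example.com/page?a=1&_last=42&b=2"))
# → http://example.com/page?a=1&b=2
```

Because the pattern requires a '&' on both sides, a trailing `_last=42` at the very end of a query string is left untouched; a real deployment would probably pair this with a second rule covering that case.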
From: Uta B. <ub...@ci...> - 2001-10-29 13:38:41
|
Hi Jamie, thanks a lot for your help. I solved my problem with 'url_rewrite_rules' and I'm very happy :-). I could also remove the value of the parameter with a regular expression:

url_rewrite_rules: (.*)&_last=[0-9]*&(.*) \\1&\\2

-- Uta Becht

-----Original Message-----
From: Jamie Anstice [mailto:jam...@sl...]
Sent: Thursday, 25 October 2001 23:48
To: htd...@li...
Subject: Re: [htdig-dev] Problemes with bad query-string

Won't this just reject the whole URL? If I understand the problem, Uta wants to throw out the session parameter but leave the rest intact, so the page is fetched once only. I've come across this issue before, and I've got a patch for 3.2.x (which I ported from our 3.1.6 version) but I think you could also use url_rewrite_rules too. My patch is somewhat more specific than url_rewrite_rules, in that it just removes unwanted parameters + value from the URL. I'll post it along with my patch for ignoring the alt text from images (which is driven from a config option) in a few days - I'm currently experimenting with tweaking the scoring to make an alternate 'or' behaviour which scores up results which contain more than one search term.

Short explanation by way of example: say I'm indexing a university website. The chemistry department has a whole bunch of pages with the word chemistry all through them, and one page telling students where to buy replacement lab glassware. An 'and' search for 'chemistry glassware' finds this page and nothing else, an 'and' search for 'chemistry glassware sales' finds nothing. An 'or' search for 'chemistry glassware sales' is swamped by the occurrence of 'chemistry', and 'glassware' is lost in the noise. What I'm doing is factoring in the number of distinct words from the search phrase found in the result to bump up the score for pages with more than one search term. Initial results look quite promising, but will need a bit of tuning to improve search speed.

Jamie Anstice
Search Engineer
S.L.I. Systems
jam...@sl...
ph: 64 961 3262  mobile: 64 21 264 9347

Geoff Hutchison <ghu...@ws...> Sent by: htd...@li... 26/10/01 01:49
To: "Uta Becht" <uta...@we...>
cc: "htdig" <htd...@li...>
Subject: Re: [htdig-dev] Problemes with bad query-string

At 11:12 AM +0200 10/25/01, Uta Becht wrote:
>Can someone give me an idea at which position of htdig I should
>eliminate this bad query_parameter ??

Why not use bad_querystr: <http://www.htdig.org/attrs.html#bad_querystr>

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

_______________________________________________
htdig-dev mailing list
htd...@li...
https://lists.sourceforge.net/lists/listinfo/htdig-dev |
From: Geoff H. <ghu...@us...> - 2001-10-28 07:13:52
|
STATUS of ht://Dig branch 3-2-x

RELEASES:
3.2.0b4: In progress
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.

SHOWSTOPPERS:

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with wordlist_compress set but work fine without wordlist_compress. (The date is definitely stored correctly, even with compression on, so this must be some sort of weird htsearch bug.)
* Not all htsearch input parameters are handled properly: PR#648. Use a consistent mapping of input -> config -> template for all inputs where it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE, not FLAG_DESCRIPTION. (PR#859)

PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up. (Should really only attempt to use SQL for doc_db and related, not word_db.)

NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz

TESTING:
* httools programs: (htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen, argument handling for parser/converter, allowing binary output from an external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior (including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes to template variables. (Relates to PR#648.) Also make sure these config attributes are all documented in defaults.cc, even if they're only set by input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space requirements of 3.2.x (e.g. phrase searching, regex matching, external parsers and transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.

OTHER ISSUES:
* Can htsearch actually search while an index is being created? (Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems) (It should probably just set everything to empty.) This relates to PR#348. |
From: Jamie A. <jam...@sl...> - 2001-10-25 21:50:11
|
Won't this just reject the whole URL? If I understand the problem, Uta wants to throw out the session parameter but leave the rest intact, so the page is fetched once only. I've come across this issue before, and I've got a patch for 3.2.x (which I ported from our 3.1.6 version) but I think you could also use url_rewrite_rules too. My patch is somewhat more specific than url_rewrite_rules, in that it just removes unwanted parameters + value from the URL. I'll post it along with my patch for ignoring the alt text from images (which is driven from a config option) in a few days - I'm currently experimenting with tweaking the scoring to make an alternate 'or' behaviour which scores up results which contain more than one search term.

Short explanation by way of example: say I'm indexing a university website. The chemistry department has a whole bunch of pages with the word chemistry all through them, and one page telling students where to buy replacement lab glassware. An 'and' search for 'chemistry glassware' finds this page and nothing else, an 'and' search for 'chemistry glassware sales' finds nothing. An 'or' search for 'chemistry glassware sales' is swamped by the occurrence of 'chemistry', and 'glassware' is lost in the noise. What I'm doing is factoring in the number of distinct words from the search phrase found in the result to bump up the score for pages with more than one search term. Initial results look quite promising, but will need a bit of tuning to improve search speed.

Jamie Anstice
Search Engineer
S.L.I. Systems
jam...@sl...
ph: 64 961 3262  mobile: 64 21 264 9347

Geoff Hutchison <ghu...@ws...> Sent by: htd...@li... 26/10/01 01:49
To: "Uta Becht" <uta...@we...>
cc: "htdig" <htd...@li...>
Subject: Re: [htdig-dev] Problemes with bad query-string

At 11:12 AM +0200 10/25/01, Uta Becht wrote:
>Can someone give me an idea at which position of htdig I should
>eliminate this bad query_parameter ??

Why not use bad_querystr: <http://www.htdig.org/attrs.html#bad_querystr>

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

_______________________________________________
htdig-dev mailing list
htd...@li...
https://lists.sourceforge.net/lists/listinfo/htdig-dev |
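Jamie's distinct-term boost can be modelled in a few lines. This is only a sketch of the idea, not his patch: the function name, the multiplicative form, and the `boost` factor are all made up for illustration.

```python
def boosted_score(base_score, query_terms, doc_terms, boost=2.0):
    """Scale a page's base score by how many distinct query terms it
    contains, so multi-term matches rise above single-term noise."""
    distinct = len(set(query_terms) & set(doc_terms))
    return base_score * (1.0 + boost * max(distinct - 1, 0))

# The chemistry/glassware example from the message: the glassware-sales
# page matches three distinct terms, a generic chemistry page only one.
query = ["chemistry", "glassware", "sales"]
glassware_page = boosted_score(10, query, {"chemistry", "glassware", "sales"})
chemistry_page = boosted_score(40, query, {"chemistry"})
```

With these numbers the glassware page (base 10, three terms, boosted to 50) now outranks the chemistry page (base 40, one term, unchanged at 40) in an 'or' search, which is the behaviour Jamie is after.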
From: Gilles D. <gr...@sc...> - 2001-10-25 19:02:27
|
According to Florian Hars:
> On Wed, Oct 24, 2001 at 05:25:23PM -0500, Gilles Detillieux wrote:
> > According to Florian Hars:
> > > Things like STARSLEFT are totally different, they do not use client
> > > supplied information and so are not vulnerable to cross site scripting
> > > attacks. WORDS is.
> >
> > This is the main point I was trying to get across.
>
> Well, actually, no. Otherwise you wouldn't suggest to treat client-supplied
> and server-supplied information in the same way:

Well, I certainly don't recall ever suggesting such a thing. The fact that htsearch 3.1.4 and older did treat them the same way was indeed a problem. I was among those who first recognised it as a problem and I am the one who extended htsearch's template handling so that you could treat them in different ways. I also fixed all the default templates to use the new syntax where appropriate.

I would add further, though, that while client-supplied information may be tainted, it's also possible for server-supplied information to contain characters that need to be SGML-encoded. Examples of this are the URL and TITLE variables. Some server-supplied information might even be deliberately tainted, like the new METADESCRIPTION variable. (This is only an issue if you index untrusted sites, but some do.) However, the fact still remains that some internally-generated template variables cannot be SGML-encoded. Different template variables should be handled differently. That's been my point throughout this thread, whether I expressed it clearly enough before or not. We're not in disagreement on this point. The point on which we differ is how we should respond to the problem of users who still use old templates that don't handle variables differently or appropriately, or perhaps worse yet, base new templates on old, insecure templates without bothering to inform themselves about the risks.

The two approaches suggested so far are:

1) (yours) Force the issue by making subsequent releases of htsearch always SGML-encode any template variable when the $(var) syntax is used. Presumably this would also involve the addition of a new syntax element for getting an unencoded variable out. The problem with this approach, as I pointed out, is...

> > If we changed the behaviour of $(var) to SGML encode everything,
> > it MIGHT make every existing template out there more secure, but it
> > would almost CERTAINLY make them all unusable.

I have a pretty good feel from experience about the volume of mail on the list this would generate, to say nothing of all the mail/pleas for help/flames Geoff and I would receive privately, if we adopted this approach.

2) (mine) Maintain the status quo. htsearch is now secure as distributed, so the problem only affects users who don't update their template files when updating htsearch. This would leave a lot of insecure existing htsearch implementations out there, but then there are still a lot of pre-3.1.5 htsearch implementations out there, after over a year and a half, which are far, far more insecure. In order to encourage users to update their templates, we can follow your very good suggestion...

> The easiest fix would probably be to document the current behaviour
> appropriately, i.e. put a warning into the description of every template
> variable that might contain tainted client-supplied information and should
> never be used unencoded.

This would certainly help the minority who actually reads the documentation (please pardon my cynicism) by stating the risks more clearly than they are now.

> This will mostly be WORDS (LOGICAL_WORDS and KEYWORDS might already be
> sanitized, I haven't looked at the source to verify this), and depending
> on whether you can trust the sites you are indexing the variables that
> display part of the indexed pages (but these look like they are already
> transformed to pure text, and so not vulnerable).

I think LOGICAL_WORDS is somewhat sanitized, but still it's not hurt by SGML-encoding. Some data from indexed pages should also be encoded. My "hit list" of template variables which should be SGML-encoded is: CONFIG, EXCLUDE, RESTRICT, WORDS, LOGICAL_WORDS, KEYWORDS, TITLE, URL, ANCHOR, METADESCRIPTION, DESCRIPTIONS, DESCRIPTION, SELECTED_FORMAT, SELECTED_METHOD, SELECTED_SORT.

After giving it more thought, it occurred to me that as long as we're putting together a hit list like this for the documentation, we could do one better and put the hit list right in htsearch. This would lead to a third approach...

3) (the compromise kludge) Make htsearch check template variable names against the hit list above (using StringMatch) to force it to SGML-encode these variables even if the $(var) syntax is used. We'd probably still want to introduce a new syntax element to force unencoded output. (Suggestions?) I know this hit list approach is kludgy, but it does fairly neatly solve the security problems in old templates without breaking them altogether. I realize that any template variable generated by allow_in_form would potentially also be tainted, so it would also have to go on the hit list. The alternative is to make a "safe list" of variables that can't be SGML-encoded by default, but then we'd have to add any template variable generated by build_select_lists to that list.

So, I put this before anyone on the developer's list who's still paying attention. What do you think? Should we take this third approach? Can you think of anything it might break? Is the hit list complete enough or are there some I missed? What should the syntax be for forcing an unencoded variable output?

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)     Fax: (204)789-3930 |
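The "hit list" approach in Gilles's option 3 is straightforward to prototype. The sketch below stands in for htsearch's internals: it uses Python's fnmatch and html.escape in place of htdig's StringMatch and SGML encoder, and collapses the DESCRIPTION/DESCRIPTIONS and SELECTED_* names from the message's list into glob patterns.

```python
from fnmatch import fnmatch
from html import escape

# Variable names from the hit list in the message; anything matching one
# of these patterns gets SGML-encoded even under the plain $(var) syntax.
HIT_LIST = [
    "CONFIG", "EXCLUDE", "RESTRICT", "WORDS", "LOGICAL_WORDS", "KEYWORDS",
    "TITLE", "URL", "ANCHOR", "METADESCRIPTION", "DESCRIPTION*",
    "SELECTED_*",
]

def expand_variable(name, value):
    """Return the substitution text for $(name): encoded when the name is
    on the hit list, verbatim otherwise."""
    if any(fnmatch(name, pat) for pat in HIT_LIST):
        return escape(value, quote=True)
    return value

print(expand_variable("WORDS", "<script>alert(1)</script>"))
# → &lt;script&gt;alert(1)&lt;/script&gt;
```

Variables like STARSLEFT, whose values are themselves HTML, pass through untouched, which is precisely why a blanket always-encode rule would break existing templates while the hit list does not.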
From: Geoff H. <ghu...@ws...> - 2001-10-25 13:05:44
|
At 11:12 AM +0200 10/25/01, Uta Becht wrote:
>Can someone give me an idea at which position of htdig I should
>eliminate this bad query_parameter ??

Why not use bad_querystr: <http://www.htdig.org/attrs.html#bad_querystr>

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/ |
From: Florian H. <fl...@ha...> - 2001-10-25 07:21:19
|
On Wed, Oct 24, 2001 at 05:25:23PM -0500, Gilles Detillieux wrote:
> According to Florian Hars:
> > Things like STARSLEFT are totally different, they do not use client
> > supplied information and so are not vulnerable to cross site scripting
> > attacks. WORDS is.
>
> This is the main point I was trying to get across.

Well, actually, no. Otherwise you wouldn't suggest to treat client-supplied and server-supplied information in the same way:

> If we changed the behaviour of $(var) to SGML encode everything,
> it MIGHT make every existing template out there more secure, but it
> would almost CERTAINLY make them all unusable.

The easiest fix would probably be to document the current behaviour appropriately, i.e. put a warning into the description of every template variable that might contain tainted client-supplied information and should never be used unencoded. This will mostly be WORDS (LOGICAL_WORDS and KEYWORDS might already be sanitized, I haven't looked at the source to verify this), and depending on whether you can trust the sites you are indexing, the variables that display part of the indexed pages (but these look like they are already transformed to pure text, and so not vulnerable).

Yours, Florian. |
From: Gilles D. <gr...@sc...> - 2001-10-24 22:25:35
|
According to Florian Hars:
> On Fri, Oct 19, 2001 at 09:22:00AM -0500, Gilles Detillieux wrote:
> > I have to disagree with you on this point. Whether the default syntax
> > is insecure or not depends totally on the context in which the template
> > variable is used, and how that template variable is generated.
>
> No, it doesn't depend on the context. The default syntax passes client
> supplied data unchanged and untested to the result page. This is something
> that should never happen, under no circumstance.

OK, so assuming you use what you call the default syntax (i.e. $(var)) on a template variable that contains untested client data, does that mean that client data could compromise security in any context? Perhaps, although the exploit might have to be adapted to the particular context. So, maybe context is irrelevant as far as security is concerned. However, it's not irrelevant as far as choice of encoding is concerned. E.g. you don't use hex encoding in the middle of HTML text, but you can use it in a URL inside an <a ...> tag.

> Things like STARSLEFT are totally different, they do not use client
> supplied information and so are not vulnerable to cross site scripting
> attacks. WORDS is.

This is the main point I was trying to get across. Different variables have to be used in different ways! If we changed the behaviour of $(var) to SGML encode everything, it MIGHT make every existing template out there more secure, but it would almost CERTAINLY make them all unusable. I don't see this as the lesser of two evils. If we were to make a design decision of deliberately breaking any old template files out there to force users to adopt a more secure configuration, why not be honest about it and adopt an entirely new syntax for all template variable substitutions? As someone who ends up answering over 50% of the support requests on this list (many from people who never so much as glance at the FAQ), I'm not about to add to my workload by taking such a radical step.

The default template files all use the correct syntax for the variables they use and the context in which they're used. So, for a fresh 3.1.5 or later installation, without old template files, you should be safe from cross-site scripting attacks. As distributed, htsearch is secure by default, regardless of what we choose to call the "default syntax". For sites that don't want to update their templates, that's their choice. Given the number of users I've seen on this list that are still using pre-3.1.5 versions, which are far more blatantly insecure than 3.1.5 with old templates, it's pretty clear that a lot of users don't see security issues as a big problem for their sites. I'm not about to make that my problem. If you feel so strongly about this issue, that under no circumstance should we release htsearch as it is, I suggest you volunteer your time as a developer, put the issue to a vote among the developers, implement it as you see fit if the vote passes, and stick around to deal with the fallout. As I often say, this isn't a one-man show.

--
Gilles R. Detillieux     E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)     Fax: (204)789-3930 |
From: Florian H. <ha...@bi...> - 2001-10-24 12:52:18
|
On Fri, Oct 19, 2001 at 09:22:00AM -0500, Gilles Detillieux wrote:
> I have to disagree with you on this point. Whether the default syntax
> is insecure or not depends totally on the context in which the template
> variable is used, and how that template variable is generated.

No, it doesn't depend on the context. The default syntax passes client supplied data unchanged and untested to the result page. This is something that should never happen, under no circumstance. Things like STARSLEFT are totally different, they do not use client supplied information and so are not vulnerable to cross site scripting attacks. WORDS is.

Yours, Florian. |
From: Geoff H. <ghu...@ws...> - 2001-10-22 12:27:11
|
At 3:18 PM +1000 10/22/01, Bramfit, Richard wrote:
>I have installed Htdig 3.2.0b3 on Solaris 2.6. Unfortunately when I run
>htdig, I get the following message:
>ht://dig Start Time: Mon Oct 22 14:52:03 2001
>Arithmetic Exception

This is a known bug in 3.2.0b3. Try a snapshot of 3.2.0b4: <http://www.htdig.org/files/snapshots/>

>I have used the "stable version" of htdig on Solaris 2.6, but I have found
>that the process, for some unknown reason, dies after an hour or so (the
>search results come back with the 500: Apache Internal Server Error). Hence,

Could you take a look at the error log and tell us what the error messages are? Any time Apache reports an ISE, it also gives a longer error message in the Apache error log.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/ |
From: Bramfit, R. <Ric...@au...> - 2001-10-22 05:18:45
|
I have installed Htdig 3.2.0b3 on Solaris 2.6. Unfortunately when I run htdig, I get the following message: ht://dig Start Time: Mon Oct 22 14:52:03 2001 Arithmetic Exception I have used the "stable version" of htdig on Solaris 2.6, but I have found that the process, for some unknown reason, dies after an hour or so (the search results come back with the 500: Apache Internal Server Error). Hence, I've installed the latest version, to see if this problem has disappeared. I have also used htdig with Mandrake 8.1 (linux) and Suse 6 (linux), both of which work fine. I'll try and give you more info on the Solaris 2.6 set up if you require it. Richard Bramfit |
From: Geoff H. <ghu...@us...> - 2001-10-21 07:13:50
|
STATUS of ht://Dig branch 3-2-x

RELEASES:
3.2.0b4: In progress
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.

SHOWSTOPPERS:

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with wordlist_compress set, but working fine without wordlist_compress. (The date is definitely stored correctly, even with compression on, so this must be some sort of weird htsearch bug.)
* Not all htsearch input parameters are handled properly: PR#648. Use a consistent mapping of input -> config -> template for all inputs where it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE, not FLAG_DESCRIPTION. (PR#859)

PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up. (Should really only attempt to use SQL for doc_db and related, not word_db.)

NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz.

TESTING:
* httools programs: (htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen, argument handling for parser/converter, allowing binary output from an external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior (including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes to template variables. (Relates to PR#648.) Also make sure these config attributes are all documented in defaults.cc, even if they're only set by input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space requirements of 3.2.x (e.g. phrase searching, regex matching, external parsers and transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.

OTHER ISSUES:
* Can htsearch actually search while an index is being created? (Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems) (It should probably just set everything to empty) This relates to PR#348.
|
From: Gilles D. <gr...@sc...> - 2001-10-19 16:31:42
|
According to Jamie Anstice: > Here's another quickie patch to 3.2.x. It's support for > a content-type alias attribute for htdig which sorts out > servers that get the content-type wrong in responses. > While getting the server correctly configured is probably > a better solution, this works if the server is maintained > by someone else. > > usage is something like: > content_type_aliases: text/plain=text/html > and can be set for specific servers in the server block. > > Also included is the patch as an attachment for when the c+p gets mangled. > > Is the config file entry in the right format? I've not seen a definitive > description of what should be where, so it's a bit of a guess. We don't really have a definitive reference for the format of entries in htcommon/defaults.cc. The test is to cd to htdoc and run cf_generate.pl, and if that works, you probably have the right format. It appears you followed the example of other entries faithfully. The "SLI-special" would have to be changed to the current version number if this were committed to CVS. By the way, the patch is missing the changes needed in htdig/Document.h, to correspond to the Document.cc changes. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
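For anyone wondering what the parsing side of such an attribute might involve: a value like "text/plain=text/html" is just a whitespace-separated list of from=to pairs. The sketch below is a guess at that parsing step, not code from Jamie's patch; the function name and structure are illustrative assumptions.

```cpp
#include <map>
#include <sstream>
#include <string>

// Illustrative parser for a config value such as:
//   content_type_aliases: text/plain=text/html application/x-foo=text/plain
// Each whitespace-separated entry maps a (wrong) server-reported
// content-type to the type the indexer should treat it as.
std::map<std::string, std::string>
parse_content_type_aliases(const std::string &value)
{
    std::map<std::string, std::string> aliases;
    std::istringstream in(value);
    std::string entry;
    while (in >> entry)                       // entries are whitespace-separated
    {
        std::string::size_type eq = entry.find('=');
        if (eq == std::string::npos || eq == 0)
            continue;                         // skip malformed entries
        aliases[entry.substr(0, eq)] = entry.substr(eq + 1);
    }
    return aliases;
}
```

At lookup time, the document-fetching code would consult this map once per response and substitute the aliased type before choosing a parser.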
From: Gilles D. <gr...@sc...> - 2001-10-19 15:29:08
|
According to Andrew Daviel: > I have been working on a geographic search engine, and as I mentioned in > my earlier "features" message, I am getting fed up with my Perl robot, and > am trying to use htdig. > > I have been somewhat successful and have a version without map > navigation at http://geotags.com/htdig/ > > My question is really how much support, if any, I might get from the htdig > community for this, either as mainstream htdig or from anyone else > interested in this kind of thing. My question would be: how much support do you need? Right now, all the developers are pretty strapped for time, so you probably won't get a lot of development work done for you. However, if you want to make changes to the C++ code yourself, I'm sure Geoff, myself, or a few others can suggest the best approaches and where to put in the changes. If they're done in a general enough way, they could even be incorporated into the 3.2 development code. > The changes required are (so far) > > A database element to store a position for each page > (I actually store a region code and placename too but don't use them) This is one of the more involved changes, but certainly feasible. It would require extending the DocumentRef and DocumentDB classes in htcommon, to handle the new field. To maintain compatibility, you'd want to add the new field code to the end of the enumeration in DocumentRef, to avoid shifting over the other codes. > An addition to the HTML parser to get the metadata Should be quite easy. This would likely involve both the HTML and Retriever classes in htdig. > An addition to the CGI parser to get a requested position (map click) I don't know what that would involve, but it might be possible with a front-end wrapper script for htsearch. > A weighting algorithm to calculate geographic distance You'd need to work out the specifics of the calculations. 
Right now, the scoring is done in Display::buildMatchList() in htsearch, but this code may get reorganized in the next month or two (in the 3.2 betas). > I also have a config item to essentially force a ROBOTS NONE if there > is no geographic tag on a page, so that I can refrain from indexing > untagged pages. Code to support this can go in the HTML class. Adding config attributes is pretty easy in 3.2, as everything for defining and documenting them goes into htcommon/defaults.cc. (Lots of examples to choose from in there!) > I am also trying to add support for position passed in an experimental > HTTP header, which allows one to dispense with the map and potentially > generate requests based on current position automatically, e.g. > using GPS. This would affect the HtHTTP and Document classes in htdig, as well as maybe the Retriever class. Sounds like an interesting project. I hope you have a C++ programmer to help you get the changes into the code. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
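As for the distance calculation itself, the usual starting point is the great-circle (haversine) distance between a page's stored position and the query position; a scoring routine can then convert distance into a weight. The sketch below is generic and illustrative, not htdig code, and the function name is invented.

```cpp
#include <cmath>

// Great-circle distance between two lat/lon points, in kilometres,
// using the haversine formula. Inputs are in degrees.
double haversine_km(double lat1, double lon1, double lat2, double lon2)
{
    const double R = 6371.0;                 // mean Earth radius, km
    const double rad = M_PI / 180.0;         // degrees -> radians
    double dlat = (lat2 - lat1) * rad;
    double dlon = (lon2 - lon1) * rad;
    double a = std::sin(dlat / 2) * std::sin(dlat / 2)
             + std::cos(lat1 * rad) * std::cos(lat2 * rad)
               * std::sin(dlon / 2) * std::sin(dlon / 2);
    return 2.0 * R * std::asin(std::sqrt(a));
}
```

A weighting scheme might then score pages as, say, 1 / (1 + distance), so that nearby pages rank higher; that choice is entirely up to the application.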
From: Gilles D. <gr...@sc...> - 2001-10-19 14:22:08
|
According to Florian Hars: > On Thu, Oct 18, 2001 at 01:13:38PM -0500, Gilles Detillieux wrote: > > The added "&" after the "$" in 3.1.5 template files causes the template > > variable to be SGML-encoded. I suspect that the debian release of > > htdig didn't bother updating the template files it installs, but instead > > installs something they customized from an earlier version of htdig. > > No, the files in debian contain properly escaped substitutions, it was just > my templates that had the problem. Still, I think it is a design error to > make the default syntax for variable substitution (which is the same every > other program uses) insecure. > You should have to take additional steps if you want insecure behaviour, > not if you want secure behaviour. I have to disagree with you on this point. Whether the default syntax is insecure or not depends totally on the context in which the template variable is used, and how that template variable is generated. To change the default syntax so that it SGML-encodes the variable by default would seriously break just about any existing template file, because a great number of template variables can't be encoded this way - they're supposed to contain HTML tags that go straight through to the results page. Just a few of these variables are: METHOD, FORMAT, SORT, PAGEHEADER, PREVPAGE, PAGELIST, NEXTPAGE, EXCERPT, STARSLEFT and STARSRIGHT. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |
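For context, the $&(VAR) template syntax discussed in this thread makes htsearch SGML-encode the variable's value before substitution, which is what defuses the WORDS injection. The following is a minimal sketch of that encoding step; it is an illustration, not htsearch's actual implementation.

```cpp
#include <string>

// Encode SGML/HTML-special characters so that client-supplied text
// (e.g. the WORDS template variable) cannot inject markup into the
// results page. Mirrors the effect of the "$&(VAR)" template syntax,
// but this is a sketch, not code taken from htsearch.
std::string sgml_encode(const std::string &in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        switch (in[i])
        {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '"':  out += "&quot;"; break;
        default:   out += in[i];    break;
        }
    }
    return out;
}
```

With this, a query containing an img tag becomes inert entity text in the value attribute of the follow-up form instead of live markup. Variables like EXCERPT or PAGELIST, which are supposed to carry HTML, must of course bypass such encoding, which is exactly the point about context made above.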
From: Florian H. <fl...@ha...> - 2001-10-19 07:16:12
|
On Thu, Oct 18, 2001 at 01:13:38PM -0500, Gilles Detillieux wrote: > The added "&" after the "$" in 3.1.5 template files causes the template > variable to be SGML-encoded. I suspect that the debian release of > htdig didn't bother updating the template files it installs, but instead > installs something they customized from an earlier version of htdig. No, the files in debian contain properly escaped substitutions, it was just my templates that had the problem. Still, I think it is a design error to make the default syntax for variable substitution (which is the same every other program uses) insecure. You should have to take additional steps if you want insecure behaviour, not if you want secure behaviour. Yours, Florian Hars. |
From: Gilles D. <gr...@sc...> - 2001-10-18 18:13:46
|
According to Florian Hars: > I just sent this to bugtraq: > > On Fri, Oct 12, 2001 at 12:59:13PM -0600, Dave Ahmad wrote: > > On Thu, 11 Oct 2001, bugtraq wrote: > > > http://www.perl.com/search/index.ncsp?sp-q=%3C%69%6D%67%20%73%72%63%3D%68%74%74%70%3A%2F%2F%31%39%39%2E%31%32%35%2E%38%35%2E%34%36%2F%74%69%6D%65%2E%6A%70%67%3E > > > Does anyone know which search engine software this is? Doesn't LOOK like ht://Dig, but it can be hard to tell with the wrappers some people use. In any case, it would seem they resolved the problem on their site. > I don't know which engine perl.com uses, but if you have the template > parameter WORDS in your templates, htdig 3.1.5 puts the unquoted img-tag > into the result page. > > Funnily enough, the htdig 3.1.5 on htdig.org encodes the offending string > in > <input type="text" size="30" name="words" value="&lt;img src=http://199.125.85.46/time.jpg&gt;"> > > while the distributed htdig 3.1.5 (here the debian-version 3.1.5-2) doesn't: > > <input type="text" size="30" name="words" value="<img src=http://199.125.85.46/time.jpg>"> It all depends on whether the "words" input field in your followup search forms (template files header.html, nomatch.html, ...) use: <input type="text" size="30" name="words" value="$&(WORDS)"> or the older (pre-3.1.5) syntax: <input type="text" size="30" name="words" value="$(WORDS)"> The added "&" after the "$" in 3.1.5 template files causes the template variable to be SGML-encoded. I suspect that the debian release of htdig didn't bother updating the template files it installs, but instead installs something they customized from an earlier version of htdig. That's out of our hands, so you should report this to the Debian folks. > (And there is neither a security section on htdig.org nor an email address > for bug reports... so I am crossposting this to htdig-general) Yes, we had talked about adding a security section, but no one stepped forward to help write it. 
E-mailing bug reports to htdig-general is just fine by me, because most of the "bugs" reported on ht://Dig's SourceForge bug tracking system end up being configuration problems or things that have been fixed a long time ago. Both of these are easier to discuss on the mailing list. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 |