Thread: [htmltmpl] RFC: Template Tag Attributes
From: Timm M. <tm...@ag...> - 2004-06-02 13:56:10
I have a project which involves changing our current pages from SSI
#includes to an HTML::Template-based solution. The question of how to
search these templates came up, and I've suggested using ht://Dig with
a filter program that will convert the templates into text for
indexing.

Inevitably, there will be certain pages using TMPL_INCLUDE tags. I
imagine that most of these will contain data that should not turn up
in searches, such as footers, and therefore my filter program can
simply ignore them. However, I don't feel safe in making the blanket
assumption that /all/ included files don't need to be searchable.

Is it possible to add a flexible attribute system to the current
HTML::Template design? I'm thinking of something like this:

    <TMPL_INCLUDE NAME="file.txt" SEARCHABLE="no">

The additional attribute(s) would be ignored by HTML::Template, but
hold meaning for external programs that need additional information
about the template tag.

I'm sure it can be done with a filter, but frankly, I think that's a
cop-out solution, since the current filter mechanism has no way to
hook into HTML::Template's parser.
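As a rough illustration, an indexing filter honoring such an attribute
could look something like this. This is only a sketch: the SEARCHABLE
attribute is the proposal above, not something HTML::Template
currently understands, and the tag-matching regexes are simplified.

    #!/usr/bin/perl
    # Sketch: inline TMPL_INCLUDEs explicitly marked SEARCHABLE="yes",
    # drop the rest, then strip all remaining TMPL_* tags so only plain
    # text reaches the indexer.
    use strict;
    use warnings;

    sub to_text {
        my $file = shift;
        open my $fh, '<', $file or die "can't open $file: $!";
        my $text = do { local $/; <$fh> };
        close $fh;

        # Recurse into an include only when it is marked searchable.
        $text =~ s{<TMPL_INCLUDE\s+NAME="([^"]+)"([^>]*)>}{
            my ($inc, $attrs) = ($1, $2);
            $attrs =~ /SEARCHABLE="yes"/i ? to_text($inc) : '';
        }gie;

        # Strip every other template tag; what's left is indexable text.
        $text =~ s{</?TMPL_\w+[^>]*>}{}gi;
        return $text;
    }

    print to_text(shift @ARGV);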
From: Luke V. <lu...@ms...> - 2004-06-02 14:47:57
I use attributes in my template tags without using filters and without
messing around with HTML::Template's internals. Here is the format I
use:

    <TMPL_VAR NAME="name.arg1=val1.arg2=val2">

In my Perl script I use a few different routines to manage my
templates. I set up most of the subroutines so that they are called
when their namesake is used as a TMPL_VAR name in a template. I
created a function that binds subroutines to template tags. When a tag
is present in the template, the corresponding subroutine is executed
with its arguments passed to it as a hash reference. Here is some code
I use for this very purpose:

    sub parseParameters {
        %parameter_hash = ();
        my @tags = $t->param();   # Get a list of tags present in the template
        foreach $tag (@tags) {
            $tag =~ /^([^.]*)\.?(.*)?$/;   # Split the tag names into 'name'
            $function   = $1;              # and 'argument' parts
            $parameters = $2;
            if (exists $tagFunctions{$function}) {  # If a function exists for this tag...
                foreach $pair (split(/\./, $parameters)) {
                    $pair =~ /^([^=]*)=(.*)$/;
                    $parameter_hash{$1} = $2;       # Add its parameters to a hash
                }
                my $value = $tagFunctions{$function}->(\%parameter_hash);  # Call the function
                $t->param($tag => $value);          # Insert the return value into the tag
            }
        }
    }

    # This subroutine works well for loops too.
    sub regTags {
        foreach $tag (@_) {
            eval "\$tagFunctions{" . $tag . "} = \\&" . $tag . ";";
        }
    }

    regTags(qw(remoteInclude setVar otherFunction));
    parseParameters();

Using code like this has drastically improved script creation time, as
the only code I need to write when I create a new script is the code
specific to that application. This works very well for loops too. The
only drawback to this method is that you have to create your own
function for including files so that you can pass your own arguments
to it.

I even went as far as to create a tag function that allows me to set
the value of another tag function. This helps a lot for managing
headers and footers; I can set the title of the page in just one spot.
I believe I also created a function to do a remote include for the
same reason, so that I can choose which sub-menu file I want to
include into the header file from the main template file.

Luke
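To make the convention concrete: a template line and a matching tag
function might look like the following. The remoteInclude below is a
guess at what is described above (a hypothetical reconstruction, not
the actual code).

    <TMPL_VAR NAME="remoteInclude.file=submenu.tmpl">

    # Receives { file => 'submenu.tmpl' } from parseParameters() and
    # returns the file's text, which is then inserted into the tag.
    sub remoteInclude {
        my $args = shift;
        open my $fh, '<', $args->{file} or return '';
        my $text = do { local $/; <$fh> };
        close $fh;
        return $text;
    }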
From: Sam T. <sa...@tr...> - 2004-06-02 17:12:44
On Wed, 2 Jun 2004, Luke Valenty wrote:

> I use attributes in my template tags without using filters and
> without messing around with HTML::Template's internals. Here is the
> format I use:
>
>     <TMPL_VAR NAME="name.arg1=val1.arg2=val2">

My eyes! They burn! Aarrrrgh.

Seriously, that's a terrible idea. Why are you using HTML::Template if
you want to drive your program from the template? There are many
systems better suited to this use: Mason, Template Toolkit, Embperl,
etc. HTML::Template was designed to support a one-way data flow from
Perl code to HTML template.

-sam
From: Luke V. <lu...@ms...> - 2004-06-02 18:20:42
I wasn't planning on doing that when I started :/

It used to be that the script would grab all of the tags in a template
file and fill them in with the appropriate data. I decided it would be
easiest to just write a subroutine for key tags that would
automagically fill in the information. I did this to avoid having a
different function for each page; it's a lot more modular this way.

Then I wanted to be able to hold some information inside of a template
so that the header could be filled with the title, so I needed a way
to pass some text back to that include, and this did the trick. That
way the includes for the footer and header are organized the same way
they are for the new website, which means that I can practically copy
the header and footer straight from the web include dir to the
template global directory. It's all about centralization.

Maybe something else would have worked better, but I rarely pass
arguments to the script from the templates, and when I do, it's
specifically for simple appearance adjustments. It actually works very
well and allows for clear and concise code. It does abuse your module
a little, though, but I like it.
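A minimal sketch of the set-a-value mechanism described here would be
a pair of tag functions sharing a stash. The names and details below
are hypothetical, and note that with the dotted argument format the
values themselves cannot contain '.':

    # Hypothetical setVar/getVar pair: the content template sets the
    # title, the header include reads it back, so the title lives in
    # one spot.
    my %stash;

    # e.g. <TMPL_VAR NAME="setVar.name=title.value=Products">
    sub setVar {
        my $args = shift;
        $stash{ $args->{name} } = $args->{value};
        return '';    # emits nothing into the output
    }

    # e.g. <TMPL_VAR NAME="getVar.name=title"> in header.tmpl
    sub getVar {
        my $args = shift;
        return defined $stash{ $args->{name} }
             ? $stash{ $args->{name} } : '';
    }

This assumes the setVar tags are processed before the header's getVar;
in practice the ordering of tag processing would need care.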
From: Philip T. <phi...@gm...> - 2004-06-02 15:09:29
Sometime Today, Timm Murray assembled some asciibets to say:

> I have a project which involves changing our current pages from SSI
> #includes to an HTML::Template-based solution. The question of how
> to search these templates came up, and I've suggested using ht://Dig
> with a filter program that will convert the templates into text for
> indexing.

Not sure exactly what you're looking for, but one doesn't generally
index templates for searching. One indexes content, and that's
generally stored in a database.

--
"Remember, extremism in the nondefense of moderation is not a virtue."
  -- Peter Neumann, about usenet
From: Timm M. <tm...@ag...> - 2004-06-02 16:04:34
At 08:39 PM 6/2/04 +0530, Philip Tellis wrote:

> Not sure exactly what you're looking for, but one doesn't generally
> index templates for searching. One indexes content, and that's
> generally stored in a database.

In this specific case, the "templates" are actually the content,
stored in individual files that need to be able to include something
from another file. In some cases, the include needs to be searchable,
but often not.

The system being used is similar to the one documented at
http://www.perlmonks.org/index.pl?node_id=357248, which (as the
comments on that page bring out) is something like Apache::Template,
but with HTML::Template as the backend.
From: Alex K. <ka...@ra...> - 2004-06-02 15:25:11
I'd suggest supplying your filter program with a list of exclusions
(regexps or whatever). If a particular file your program is run
against is in the list, then don't index it. This is much easier than
tagging all <TMPL_INCLUDE> constructs, and is flexible too. Think that
you could just exclude everything matching qr/_header\.tmpl$/.

And there's of course the question of whether you really want to index
templates, which are metadata of your website, rather than its
content.

--
Alex Kapranoff.
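In code, the exclusion check is only a few lines. A sketch, with
example patterns:

    # Skip files whose names match an exclusion pattern; index the rest.
    my @exclude = (
        qr/_header\.tmpl$/,
        qr/_footer\.tmpl$/,
    );

    sub should_index {
        my $file = shift;
        return !grep { $file =~ $_ } @exclude;
    }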
From: Timm M. <tm...@ag...> - 2004-06-02 16:06:31
At 07:23 PM 6/2/04 +0400, Alex Kapranoff wrote:

> I'd suggest supplying your filter program with a list of exclusions
> (regexps or whatever). If a particular file your program is run
> against is in the list, then don't index it. This is much easier
> than tagging all <TMPL_INCLUDE> constructs, and is flexible too.

We wouldn't need to tag all the TMPL_INCLUDE tags. A reasonable
default would be "no search". It's also possible someone will want a
given include indexed on one page but not another.

> And there's of course the question of whether you really want to
> index templates, which are metadata of your website, rather than its
> content.

See my reply to Philip.
From: Puneet K. <pk...@ei...> - 2004-06-02 15:40:52
On Jun 2, 2004, at 8:55 AM, Timm Murray wrote:

> I have a project which involves changing our current pages from SSI
> #includes to an HTML::Template-based solution. The question of how
> to search these templates came up, and I've suggested using ht://Dig
> with a filter program that will convert the templates into text for
> indexing.
>
> Inevitably, there will be certain pages using TMPL_INCLUDE tags.

Timm,

This is a bit confusing. Templates, by general definition, either
don't contain any content, or contain content not worth a hoot (logo,
byline, and other such junk). Essentially, templates are code. Why
would you want to search them? Ostensibly you are getting data from
some other source that you are "pouring" into the templates... index
and search that data, and then use another template to display the
results.
From: Luke V. <lu...@ms...> - 2004-06-02 15:45:56
As I understand it, it looks like he's converting their existing web
pages from SSI includes to use HTML::Template instead. Maybe there is
some dynamic content they want in their pages that Apache just doesn't
want to do. I'll bet there is a better way to do this, though.

Luke
From: Bill M. <mo...@ha...> - 2004-06-02 16:01:25
On Wed, Jun 02, 2004 at 08:55:43AM -0500, Timm Murray wrote:

> I have a project which involves changing our current pages from SSI
> #includes to an HTML::Template-based solution. The question of how
> to search these templates came up, and I've suggested using ht://Dig
> with a filter program that will convert the templates into text for
> indexing.

Spider your content instead. Does htdig have anything like
<!-- noindex -->? If not, you can use swish-e (http://swish-e.org) and
then wrap your headers, footers and menus in:

    <!-- noindex --> .... <!-- index -->

--
Bill Moseley
mo...@ha...
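That is, the page template would bracket its boilerplate includes
something like this (file names illustrative):

    <!-- noindex -->
    <TMPL_INCLUDE NAME="header.tmpl">
    <!-- index -->

    ... page content ...

    <!-- noindex -->
    <TMPL_INCLUDE NAME="footer.tmpl">
    <!-- index -->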
From: Timm M. <tm...@ag...> - 2004-06-02 16:08:34
At 09:01 AM 6/2/04 -0700, Bill Moseley wrote:

> Spider your content instead. Does htdig have anything like
> <!-- noindex -->?

We discussed spidering and my boss is against the idea. I personally
would prefer not to do it if we can get away with it.
From: Bill M. <mo...@ha...> - 2004-06-02 16:22:49
On Wed, Jun 02, 2004 at 11:08:25AM -0500, Timm Murray wrote:

> We discussed spidering and my boss is against the idea. I personally
> would prefer not to do it if we can get away with it.

Can you explain why? I'm just curious why spidering would not be OK,
but scanning the file system would be.

--
Bill Moseley
mo...@ha...
From: Timm M. <tm...@ag...> - 2004-06-02 16:29:51
At 09:22 AM 6/2/04 -0700, Bill Moseley wrote:

> Can you explain why? I'm just curious why spidering would not be OK,
> but scanning the file system would be.

Extra work on the server itself. Scanning the file system is easier on
the system than making a request to Apache, even on localhost.

I don't think either solution is particularly difficult to implement,
but scanning the content files directly also lets us have an easier
time analyzing the structure of the document. In the case I'm
suggesting, the content files would be re-processed by HTML::Template
for any TMPL_INCLUDE tags. (However, not much has been created under
this system besides some examples, so now is a good time to make
incompatible changes if need be.)
From: Bill M. <mo...@ha...> - 2004-06-02 16:46:00
On Wed, Jun 02, 2004 at 11:29:47AM -0500, Timm Murray wrote:

> Extra work on the server itself. Scanning the file system is easier
> on the system than making a request to Apache, even on localhost.

Set a delay on the spider, or do what I've done: run a separate httpd
with keep-alives set way high and a low number of max clients, and use
a spider that supports keep-alives.

If you spider, then you are indexing content that people can actually
find, and you don't have to worry about rewrites or aliases. You also
don't need a separate system to generate the content just for
spidering.

> I don't think either solution is particularly difficult to
> implement, but scanning the content files directly also lets us have
> an easier time analyzing the structure of the document.

All the server does is supply the content. Analyzing the content
happens after that, regardless of whether you use the server or the
file system. Spidering lets you index the content as people see it in
their browser.

Spidering isn't as expensive as people tend to think. If you are
running dynamic content as plain CGI, then there are your costs. It
sounds like you have mostly static content, so that shouldn't be an
issue. If you can avoid spidering (say your content is in a database)
then, yes, I'd always just index the content.

> In the case I'm suggesting, the content files would be re-processed
> by HTML::Template for any TMPL_INCLUDE tags. (However, not much has
> been created under this system besides some examples, so now is a
> good time to make incompatible changes if need be.)

    <TMPL_INCLUDE NAME="file.txt" SEARCHABLE="no">

So the point is to have some other program, not based on
HTML::Template, parse that? In that case why not just look for:

    <!-- don't index this section -->

And then use HTML::Template to generate the output for indexing.

--
Bill Moseley
mo...@ha...
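A minimal keep-alive spider along these lines, sketched with LWP; the
start URL, delay, and index_page() hook are placeholders:

    #!/usr/bin/perl
    # Sketch: persistent connection, polite delay, same-site links only.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;

    # Stub: replace with a real hook into your indexer.
    sub index_page {
        my ($url, $html) = @_;
        print "indexed $url\n";
    }

    my $start = 'http://localhost:8080/';    # the separate httpd
    my $ua    = LWP::UserAgent->new(keep_alive => 1);
    my @queue = ($start);
    my %seen;

    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $res = $ua->get($url);
        next unless $res->is_success;

        index_page($url, $res->content);

        # Queue same-site links for crawling.
        my $extor = HTML::LinkExtor->new(undef, $url);
        $extor->parse($res->content);
        for my $link ($extor->links) {
            my ($tag, %attr) = @$link;
            next unless $tag eq 'a' and $attr{href};
            push @queue, "$attr{href}" if index("$attr{href}", $start) == 0;
        }
        sleep 1;    # polite delay between requests
    }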
From: Timm M. <tm...@ag...> - 2004-06-02 17:43:06
At 09:45 AM 6/2/04 -0700, Bill Moseley wrote:

> All the server does is supply the content. Analyzing the content
> happens after that, regardless of whether you use the server or the
> file system. Spidering lets you index the content as people see it
> in their browser.

Take a look at the system being used:
http://www.perlmonks.org/index.pl?node_id=357248, particularly the
'Documents' subsection.

Essentially, the actual content is stored in a POD-like format. The
'=NAME' part specifies the TMPL_VAR or TMPL_IF that the given section
should be placed into. By having a section '=content' or some such in
every content file which holds the regular content (this being defined
as part of our coding standards), we can have the indexer focus only
on that. There is no need to strip away the additional information
(headers, footers, etc.) that shouldn't be searched, because that
information simply won't be in the text in the first place. With
spidering, I either get all the content or none of it, which means
more processing.

Now, the system allows TMPL_INCLUDE tags in the content files
(actually, it's implemented by passing the content through
HTML::Template a second time, so any TMPL_* tag will be processed, but
this might change). Included files occasionally need to be part of the
search, but most likely won't, and I don't feel I can make that
assumption in all cases. So I need some way of saying which ones
should be searched if we ever need that functionality (but default to
not searching).

> Spidering isn't as expensive as people tend to think. If you are
> running dynamic content as plain CGI, then there are your costs. It
> sounds like you have mostly static content, so that shouldn't be an
> issue.

With this system, all "HTML" files are being run through an Apache
mod_perl handler.
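For what it's worth, pulling just the '=content' section out of such a
file is simple. A sketch, with the '=name' section syntax inferred
from the write-up above:

    # Return only the '=content' section of a POD-like content file, so
    # headers, footers, etc. never reach the indexer at all.
    sub content_section {
        my $file = shift;
        open my $fh, '<', $file or die "can't open $file: $!";
        my ($in_content, @text) = (0);
        while (my $line = <$fh>) {
            if ($line =~ /^=(\w+)/) {
                $in_content = ($1 eq 'content');
                next;
            }
            push @text, $line if $in_content;
        }
        close $fh;
        return join '', @text;
    }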
From: Bill M. <mo...@ha...> - 2004-06-02 18:34:34
On Wed, Jun 02, 2004 at 12:43:00PM -0500, Timm Murray wrote:

> Take a look at the system being used:
> http://www.perlmonks.org/index.pl?node_id=357248, particularly the
> 'Documents' subsection.

Seems like you could outgrow that one. Also seems like something
that's been done already in many forms: you request the .inc file and
it gets transformed by the .tmpl file. XML + XSLT? SSI? I think you
can do better.

You probably already did this, but you might want to review other
CMSes if you're redesigning the site from scratch. Here are a few
lists:

    http://www.cmsmatrix.org/
    http://www.oscom.org/matrix/

There's also PHP.

> Now, the system allows TMPL_INCLUDE tags in the content files
> (actually, it's implemented by passing the content through
> HTML::Template a second time, so any TMPL_* tag will be processed,
> but this might change). Included files occasionally need to be part
> of the search, but most likely won't.

And you also need a system to process your template files like
Apache::QuickCMS does so you can index. Give spidering a try; you may
find it's not as inefficient as you think. libxml2 is damn fast.

--
Bill Moseley
mo...@ha...
From: Timm M. <tm...@ag...> - 2004-06-02 19:19:11
At 11:34 AM 6/2/04 -0700, Bill Moseley wrote:

> Seems like you could outgrow that one. Also seems like something
> that's been done already in many forms: you request the .inc file
> and it gets transformed by the .tmpl file. XML + XSLT? SSI? I think
> you can do better.

XML/XSLT is a mess, as is SSI. In fact, replacing SSI was exactly the
goal with this system. I implemented a small site with SSI and then
did the same site with Apache::QuickCMS. The result was a sharp
reduction in space: the SSI site was 112k and 409 lines, while the
Apache::QuickCMS version was 32k and 190 lines. (I can provide the
tarball of the sites as implemented off-list if anyone wants to take a
look.)

I actually started with the content files in an XML format. However, I
ran into problems coercing the XML parser into treating the embedded
HTML as a string (so it can be put into the template parameter), not
as more XML to be processed. While looking at how to solve this, I
thought up the POD-like solution. I recoded it without any of the
problems XML gave me, and it's probably faster, too.

I wasn't happy with having to allow <TMPL_INCLUDE> tags inside the
content, as I fear it could be easily abused in naive ways. It also
slows things down quite a bit, since it requires two passes through
HTML::Template (at least, that's how it's currently implemented).
However, for some of our data, I found we simply didn't have another
choice.

> You probably already did this, but you might want to review other
> CMSes if you're redesigning the site from scratch.

I've looked at a lot of CMS systems; it's one of those massively
over-implemented genres, much like templating systems :) I suspect the
reason is that people look at other CMSes, decide they do almost but
not quite exactly what they want, and end up implementing their own.
So I decided I would add to the mix :) Yes, it's simple; that's
intentional. I hope (possibly in vain) that I can avoid the feeping
creaturism that tends to plague other CMSes.

> There's also PHP.

PHP has other problems.

> And you also need a system to process your template files like
> Apache::QuickCMS does so you can index. Give spidering a try; you
> may find it's not as inefficient as you think. libxml2 is damn fast.

The processing stage isn't hard (remember, simplicity is a goal of
Apache::QuickCMS), and in any case, I think I can modify
Apache::QuickCMS quite easily so that the processing stage could be
used directly by another program. So I wouldn't need to write a
separate processor to run before the indexer.

I'm not really concerned with spidering being inefficient. The worst
case I can imagine is that I set it to run before I leave one day and
it's done when I come in the next morning. It just seems to me that
it's a clumsier solution to this problem.

In fact, I do have a spider which runs through our entire site and
jots down which pages link to which other pages. It takes 5-10 minutes
to run. The resulting report is dumped in YAML format to be processed
by other programs to generate various reports, such as which pages
link to documents that don't exist. That saved us a lot of time,
because the boss wanted our current site mapped by hand before we got
to the redesign. Now we have a report which is useful for stuff beyond
redesigns, the least of which is the printed version (1500 pages,
double-sided), which is handy for my boss to walk into meetings with
to justify why we need a redesign :)
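For reference, the two-pass arrangement described above amounts to
something like this. A sketch: the file names and options are
illustrative, not Apache::QuickCMS's actual code.

    use HTML::Template;

    # First pass: fill the page template from the content file.
    my $page = HTML::Template->new(filename => 'page.tmpl');
    my $content_section = '... text of the =content section ...';
    $page->param(content => $content_section);
    my $first_pass = $page->output;

    # Second pass: run the result through HTML::Template again so that
    # TMPL_INCLUDE (and any other TMPL_*) tags inside the content are
    # processed too.
    my $second = HTML::Template->new(
        scalarref         => \$first_pass,
        die_on_bad_params => 0,
    );
    print $second->output;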
From: Sam T. <sa...@tr...> - 2004-06-02 17:10:10
On Wed, 2 Jun 2004, Timm Murray wrote:

> I have a project which involves changing our current pages from SSI
> #includes to an HTML::Template-based solution. The question of how
> to search these templates came up, and I've suggested using ht://Dig
> with a filter program that will convert the templates into text for
> indexing.

I suggest you use a web spider to index your site, which would of
course run HTML::Template to produce output. That way you index
everything the end user sees, not just the static parts of the page.
I don't know for sure, but I bet ht://Dig comes with a web spider. If
not, 'wget -r' is handy in a pinch!

> Inevitably, there will be certain pages using TMPL_INCLUDE tags. I
> imagine that most of these will contain data that should not turn up
> in searches, such as footers, and therefore my filter program can
> simply ignore them. However, I don't feel safe in making the blanket
> assumption that /all/ included files don't need to be searchable.

Now you've lost me. There's lots of stuff in an HTML page that
shouldn't be searched for. Stuff like headers and footers in includes
is just the tip of the iceberg. Why obsess over this?

> Is it possible to add a flexible attribute system to the current
> HTML::Template design? I'm thinking of something like this:
>
>     <TMPL_INCLUDE NAME="file.txt" SEARCHABLE="no">

I'd write that:

    <TMPL_INCLUDE NAME="file.txt"><!-- SEARCHABLE="no" -->

That way HTML::Template does its job, same as always, and your code
can key on the special comment.

> I'm sure it can be done with a filter, but frankly, I think that's a
> cop-out solution, since the current filter mechanism has no way to
> hook into HTML::Template's parser.

Why would you need to hook into the parser to implement this?

-sam
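Keying on that comment from an external indexer is then a one-regex
affair. A sketch, assuming the comment immediately follows the
include; read_file() stands in for however the filter slurps a file
(e.g. the to_text() from the earlier sketch):

    # Drop includes marked not-searchable, then expand the rest, before
    # stripping the remaining tags and indexing.
    $text =~ s{<TMPL_INCLUDE\s+NAME="[^"]+"\s*>\s*
               <!--\s*SEARCHABLE="no"\s*-->}{}gix;
    $text =~ s{<TMPL_INCLUDE\s+NAME="([^"]+)"\s*>}{read_file($1)}gie;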
From: Pete P. <pet...@cy...> - 2004-06-02 21:32:54
Sam Tregar wrote:

> I suggest you use a web spider to index your site, which would of
> course run HTML::Template to produce output. That way you index
> everything the end user sees, not just the static parts of the page.
> I don't know for sure, but I bet ht://Dig comes with a web spider.
> If not, 'wget -r' is handy in a pinch!

ht://Dig is good, as is mnoGoSearch:

    http://www.htdig.org/
    http://search.mnogo.ru/

They both spider your site and create an index, and I think they both
have options to specify parts of your page that should not be indexed.

Pete
From: Timm M. <tm...@ag...> - 2004-06-02 17:47:00
At 01:10 PM 6/2/04 -0400, Sam Tregar wrote:

> I suggest you use a web spider to index your site, which would of
> course run HTML::Template to produce output.

See other responses.

> Now you've lost me. There's lots of stuff in an HTML page that
> shouldn't be searched for. Stuff like headers and footers in
> includes is just the tip of the iceberg. Why obsess over this?

I want to give content authors more control over what portions of a
document are searched.

> I'd write that:
>
>     <TMPL_INCLUDE NAME="file.txt"><!-- SEARCHABLE="no" -->
>
> That way HTML::Template does its job, same as always, and your code
> can key on the special comment.

Yes, I think this is the way to go.

> Why would you need to hook into the parser to implement this?

For the attribute syntax I originally suggested, the filter needs to
be notified of the extra attributes (which I think would be useful for
more than just this instance). More generally, filters should hook
into the parser directly instead of merely being fed the content to be
processed.
From: Sam T. <sa...@tr...> - 2004-06-02 17:51:35
On Wed, 2 Jun 2004, Timm Murray wrote:

> For the attribute syntax I originally suggested, the filter needs to
> be notified of the extra attributes (which I think would be useful
> for more than just this instance).

Take a look at HTML::Template::Expr. I implemented that module, which
adds new attributes to existing HTML::Template tags, without hooking
into the parser.

> More generally, filters should hook into the parser directly instead
> of merely being fed the content to be processed.

I've got plans for something like this for HTML::Template v3, but
they're pretty nebulous.

-sam
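For example (syntax per the module's documentation; check the docs for
the full list of supported functions):

    # HTML::Template::Expr adds an EXPR attribute to existing tags,
    # no parser hooks required.
    use HTML::Template::Expr;

    my $text = <<'END';
    <TMPL_IF EXPR="item_count > 10">Lots of items!</TMPL_IF>
    Price: <TMPL_VAR EXPR="sprintf('%.2f', price)">
    END

    my $t = HTML::Template::Expr->new(scalarref => \$text);
    $t->param(item_count => 12, price => 9.5);
    print $t->output;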
From: Timm M. <tm...@ag...> - 2004-06-03 13:30:31
At 09:38 AM 6/3/04 +1000, Mathew Robertson wrote:

> > I want to give content authors more control over what portions of
> > a document are searched.
>
> This is a common mistake that information creators think is 'a good
> thing'... The web got popular for a number of reasons, one of them
> being full-text indexing of all content (including headers, footers,
> etc.).

Why? There is no useful information in headers/footers. By the nature
of using a templating system, they are the same on every page in a
given section. Including them in search results only increases the
noise and the amount of information that needs to be indexed.

> The point is that it is the user of the system who wants to find the
> information, not the author telling you what you can and can't
> search for. Classic example: books used to have (and still do) an
> index in the last couple of pages, yet the user could never find
> what they were looking for, until the book made it onto CD-ROM, at
> which point full-text searching was possible.

Almost every piece of a book is useful to search. But what good would
it be to search for a chapter heading? That information is already
given to you in the table of contents.

> -> Full-text searching is a _much better_ solution to search
> problems than indexing on what YOU think is the information they
> want.
>
> Mathew
>
> PS. This means: use a spider, or even better, use Google via a
> "site:..." search.

Google PageRank is very good at searching a broad sample of sites.
It's not so good for individual sites.
From: Pete P. <pet...@cy...> - 2004-06-03 17:33:28
Timm Murray wrote:

> At 09:38 AM 6/3/04 +1000, Mathew Robertson wrote:
>
> > This is a common mistake that information creators think is 'a
> > good thing'... The web got popular for a number of reasons, one of
> > them being full-text indexing of all content (including headers,
> > footers, etc.).
>
> Why? There is no useful information in headers/footers. By the
> nature of using a templating system, they are the same on every page
> in a given section. Including them in search results only increases
> the noise and the amount of information that needs to be indexed.

I've seen sites where I could read a word on a page, input it into the
'site search' box, and get no results. This tells me that the word
does not exist on the site (even though I can see it), or more likely
that this is not really a 'site search' but perhaps a 'content search'
or 'article search' or whatever.

I think people are used to using Google, Yahoo, etc. and getting
results that come from *every* piece of text on the page. Whether that
is a good thing or a bad thing can depend on user expectations, among
other things.

While you say that there is no useful information in headers/footers,
that may be by your definition and decision of what goes in a header
and footer. This can vary from person to person.

Pete
From: Bill M. <mo...@ha...> - 2004-06-03 21:23:14
On Thu, Jun 03, 2004 at 12:33:18PM -0500, Pete Prodoehl wrote:

> I've seen sites where I could read a word on a page, input it into
> the 'site search' box, and get no results. This tells me that the
> word does not exist on the site (even though I can see it), or more
> likely that this is not really a 'site search' but perhaps a
> 'content search' or 'article search' or whatever.

Of course, if the word is on every page of the site (a global header
or footer), the word is useless as a search term for helping someone
find something on the site (unless they are looking for everything ;)

There's also the rare chance that indexing the common text makes it
harder to find content about something specific, when that word might
also be in the header/footer. Ranking should take care of that,
though. Excluding menus is also slightly helpful, since common menu
items might end up as search terms.

But I would agree that in most cases it's better to index the content
and let the user worry about using good queries. Anyway, how much
common content do you want on every page of the site?

--
Bill Moseley
mo...@ha...