Thread: Re: [htmltmpl] RFC: Template Tag Attributes
From: Pete P. <pet...@cy...> - 2004-06-04 12:30:21
Bill Moseley wrote:

> On Thu, Jun 03, 2004 at 12:33:18PM -0500, Pete Prodoehl wrote:
>
>> I've seen sites where I could read a word on a page, input it into the
>> 'site search' box, and get no results. This tells me that the word does
>> not exist on the site (even though I can see it) or, more likely, that
>> this is not really a 'site search' but perhaps a 'content search' or
>> 'article search' or whatever...
>
> Of course, if the word is on every page on the site (a global header or
> footer) the word is useless as a search term for helping someone find
> something on the site. Unless they are looking for everything. ;)

Well, with proper weighting, titles, <h1>, etc. being of more importance,
this shouldn't be the case...

> There's also the rare chance that indexing the common text makes it
> harder to find content about something specific -- when that word might
> also be in the header/footer. Ranking should take care of that, though.

Yup... ;)

> Excluding menus is also slightly helpful when common menu items might end
> up as search terms.

Yup...

> But, I would agree that in most cases it's better to index the content,
> and let the user worry about using good queries. Anyway, how much
> common content do you want on every page on the site?

How much do *I* want on every page, or how much do the "powers that be"
want on every page? ;)

Pete
From: Timm M. <tm...@ag...> - 2004-06-04 12:31:45
At 09:02 AM 6/4/04 +1000, Mathew Robertson wrote:
<>
> > Google PageRank is very good at searching a broad sample of sites. It's
> > not so good for individual sites.
>
> You are kidding right? The algorithms Google uses already take into
> account common content on multiple pages. OpenOffice.org uses Google as
> its own site-specific search engine, as do a number of sites. The only
> real problem with using Google is that they only spider the web every few
> weeks, so if you update more frequently than that, you may have a problem.

No, I'm not kidding. It isn't the common content that makes Google less
than perfect for individual sites. The whole point behind PageRank is that
you get a lot of sites linking to each other and are given rank based on
the link count. With a smaller sampling, there simply isn't enough data to
feed into PageRank to produce a good ranking. Is it still useful? Sure it
is, but it's also not as good as other solutions for this particular
problem.
From: Mathew R. <mat...@re...> - 2004-06-07 00:24:19
> > > Google PageRank is very good at searching a broad sample of sites. It's
> > > not so good for individual sites.
> >
> > You are kidding right? The algorithms Google uses already take into
> > account common content on multiple pages. OpenOffice.org uses Google as
> > its own site-specific search engine, as do a number of sites. The only
> > real problem with using Google is that they only spider the web every
> > few weeks, so if you update more frequently than that, you may have a
> > problem.
>
> No, I'm not kidding. It isn't the common content that makes Google less
> than perfect for individual sites. The whole point behind PageRank is
> that you get a lot of sites linking to each other and are given rank
> based on the link count. With a smaller sampling, there simply isn't
> enough data to feed into PageRank to produce a good ranking. Is it still
> useful? Sure it is, but it's also not as good as other solutions for this
> particular problem.

PageRank is not the only ranking algorithm in use during Google searches ->
this is particularly the case when a 'site:...' search is done.

As with most other search engines, you can submit your site/URL to Google
so that it is spidered the next time Google updates its database -> you
don't _need_ to be linked to in order to get ranked (although it helps).

Mathew
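PS. As a rough sketch of what the 'site:...' handoff can look like in
practice - the field name, domain and query format here are only my
assumptions, not anything the thread prescribes:

    #!/usr/bin/perl
    # Sketch only: hand a local "site search" box off to Google with a
    # site: restriction. The field name 'q' and the domain are placeholders.
    use strict;
    use warnings;
    use CGI qw(:standard);
    use URI::Escape qw(uri_escape);

    my $query  = param('q') || '';
    my $domain = 'www.example.com';    # your site here

    print redirect('http://www.google.com/search?q='
        . uri_escape("site:$domain $query"));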
From: Mathew R. <mat...@re...> - 2004-06-02 23:38:56
> > > Inevitably, there will be certain pages using TMPL_INCLUDE tags. I
> > > imagine that most of these will contain data that will not want to be
> > > searched for, such as footers, and therefore my filter program can
> > > simply ignore them. However, I don't feel safe in making the blanket
> > > assumption that /all/ included files don't need to be searchable.
> >
> > Now you've lost me. There's lots of stuff in an HTML page that
> > shouldn't be searched for. Stuff like headers and footers in includes
> > is just the tip of the iceberg. Why obsess over this?
>
> I want to give content authors more control over what portions of a
> document are searched for.

This is a common mistake that information creators think 'is a good
thing'... The web got popular for a number of reasons - one of them being
"full text indexing of all content" (including headers/footers/etc).

The point is that it is the user of the system who wants to find the
information - not the author telling you what you can and can't search
for. Classic example -> books used to have (and still do) an index in the
last couple of pages of the book, yet the user could never find what they
were looking for; until the book made it onto CD-ROM, at which point
full-text searching was possible.

-> Full text searching is a _much better_ solution to search problems than
indexing only what YOU think is the information they want.

Mathew

PS. This means: use a spider... or even better, use Google via a
"site:..." search.
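A spider along those lines doesn't need to be elaborate either - a rough
single-host sketch using LWP and HTML::LinkExtor, where the start URL is a
placeholder and index_page() is a stub for whatever full-text indexer you
actually use:

    #!/usr/bin/perl
    # Rough sketch: crawl one host, hand every HTML page to an indexer stub.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $start = URI->new('http://www.example.com/');   # placeholder
    my $ua    = LWP::UserAgent->new;
    my %seen;
    my @queue = ($start);

    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $res = $ua->get($url);
        next unless $res->is_success && $res->content_type eq 'text/html';

        index_page($url, $res->content);   # your full-text indexer here

        # Collect same-host links, resolved against the page's own URL.
        my $extor = HTML::LinkExtor->new(undef, $url);
        $extor->parse($res->content);
        for my $link (grep { defined } map { $_->[2] } $extor->links) {
            my $u = URI->new($link);
            push @queue, $u
                if ($u->scheme || '') eq 'http' && $u->host eq $start->host;
        }
    }

    sub index_page { }   # stub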
From: Mark F. <mar...@ea...> - 2004-06-03 18:30:12
From: "Pete Prodoehl" <pet...@cy...> > I've seen sites where I could read a word on a page, input it into the > 'site search' box, and get no results. Some search/indexing tools exclude common and site-defined words. I used a really nice local indexing tool called WebGlimpse[1] and it had this ability (as well as fuzzy word matches). > While you say that there is no useful information in headers/footers, > that may be by your definition and decision of what goes in a header and > footer. This can vary from person to person. A good indexer would add weight to words that appear in headers/footers which also appear in title, meta keyword/description, headings, and expository content (paragraphs, list items, data terms and definitions). Conversely, reduce weight for words appearing in headers/footers (or any part of the document) that doesn't appear in outline/structural elements. What I really hate is all words presented to me without any weighting of relevance to the topic of the page (determined from its structual elements). If it were me, I would index the entire site and work on the relevancy algorithem. For example, perhaps the footer should be enclosed in a <div class=footer> and the indexer programmed to omit such classes of divisions. FWIW: I don't think Tim's attempt to get more flexible/enabling features will go very far. I can't even get an attribute indicating that linefeeds only for the readbility of <tmpl> tags not remain when the tags are removed by h::t. It's like a festering wound for me. :) [1] http://webglimpse.net/ Mark |
From: Timm M. <tm...@ag...> - 2004-06-03 19:43:46
At 11:30 AM 6/3/04 -0700, Mark Fuller wrote:
<>
> A good indexer would add weight to words that appear in headers/footers
> which also appear in the title, meta keywords/description, headings, and
> expository content (paragraphs, list items, data terms and definitions).
> Conversely, it would reduce the weight of words appearing in
> headers/footers (or any part of the document) that don't appear in
> outline/structural elements. What I really hate is all words presented
> to me without any weighting of relevance to the topic of the page
> (determined from its structural elements).
<>

We're pretty set on using ht://Dig, since we already use it for a small
subsection of our site. We discussed using Google site:, but (as I
mentioned in another e-mail) Google's algorithm isn't very good for
searching a specific site.

What I had planned on doing with Apache::QuickCMS is to specify in our
site coding standards that the template must provide certain variables for
the content: specifically title, meta-description, meta-keyword, and
content. Each would go in the obvious place in the HTML document. This is
specifically to help the search indexer do the Right Thing.
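The skeleton might look something like the following - the variable and
include-file names are only placeholders, not anything Apache::QuickCMS
itself mandates:

    <html>
      <head>
        <title><TMPL_VAR NAME="title"></title>
        <meta name="description" content="<TMPL_VAR NAME="meta_description">">
        <meta name="keywords" content="<TMPL_VAR NAME="meta_keywords">">
      </head>
      <body>
        <TMPL_INCLUDE NAME="header.tmpl">
        <TMPL_VAR NAME="content">
        <TMPL_INCLUDE NAME="footer.tmpl">
      </body>
    </html>

That way the indexer (or a filter in front of it) can pull the title,
description, and keywords straight out of the rendered <head> and weight
them accordingly.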
From: Mathew R. <mat...@re...> - 2004-06-03 23:03:17
> > This is a common mistake that information creators think 'is a good
> > thing'... The web got popular for a number of reasons - one of them
> > being "full text indexing of all content" (including
> > headers/footers/etc).
>
> Why? There is no useful information in headers/footers. By nature of
> using a templating system, they are the same on every page in a given
> section. Including them in search results only increases the noise and
> the amount of information that needs to be indexed.

So says you - a user of your site may find the headers and footers to be
very useful.

> > The point is that it is the user of the system who wants to find the
> > information - not the author telling you what you can and can't search
> > for. Classic example -> books used to have (and still do) an index in
> > the last couple of pages of the book, yet the user could never find
> > what they were looking for; until the book made it onto CD-ROM, at
> > which point full-text searching was possible.
>
> Almost every piece of a book is useful to search. But what good would it
> be to search for a chapter heading? That information is already given to
> you in the table of contents.

An index is a whole different beast to a table of contents. And as you
say, every piece of a book is useful to search -> so why bother only
indexing content that the author considers valid? Why not just index
_everything_, then let the user decide...

> > -> Full text searching is a _much better_ solution to search problems
> > than indexing only what YOU think is the information they want.
> >
> > Mathew
> >
> > PS. This means: use a spider... or even better, use Google via a
> > "site:..." search.
>
> Google PageRank is very good at searching a broad sample of sites. It's
> not so good for individual sites.

You are kidding right? The algorithms Google uses already take into
account common content on multiple pages. OpenOffice.org uses Google as
its own site-specific search engine, as do a number of sites. The only
real problem with using Google is that they only spider the web every few
weeks, so if you update more frequently than that, you may have a problem.

Mathew