At 11:34 AM 6/2/04 -0700, Bill Moseley wrote:
>On Wed, Jun 02, 2004 at 12:43:00PM -0500, Timm Murray wrote:
> > At 09:45 AM 6/2/04 -0700, Bill Moseley wrote:
> > >> I don't think either solution is particularly difficult to implement,
> > >> but scanning the content files directly also lets us have an easier
> > >> time analyzing the structure of the document.
> > >
> > >All the server does is supply the content. Analyzing the content
> > >happens after that, regardless of using the server or the file system.
> > >Spidering lets you index the content as people see it on their browser.
> > Take a look at the system being
> > used: http://www.perlmonks.org/index.pl?node_id=357248, particularly the
> > 'Documents' subsection.
>Seems like you could outgrow that one. Also seems like something
>that's been done already in many forms. You request the .inc file and
>it gets transformed by the .tmpl file. XML + XSLT? SSI? I think you can do
XML/XSLT is a mess, as is SSI. In fact, replacing SSI was exactly the goal
with this system.
I implemented a small site with SSI and then did the same site with
Apache::QuickCMS. The result was a sharp reduction in size: the SSI site
was 112k and 409 lines, while the Apache::QuickCMS version was 32k and 190
lines. (I can provide a tarball of the sites as implemented off-list if
anyone wants to take a look.)
I actually started with the content files in an XML format. However, I ran
into problems coercing the XML parser into treating the embedded HTML as a
string (so it could be put into the template parameter) rather than as more
XML to be parsed. While looking for a way around this, I thought up the
POD-like solution. I recoded it without any of the problems XML gave me,
and it's probably faster, too.
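A minimal sketch of what such a POD-like content parser might look like
(the `=title`/`=content` directive names are my invention, not necessarily
Apache::QuickCMS's actual format) -- the point being that embedded HTML is
only ever appended as a plain string, never handed to a markup parser:

```perl
#!/usr/bin/perl
# Toy POD-like content parser: each "=name value" line starts a new
# template parameter; subsequent lines are appended to it verbatim.
# Directive names here are hypothetical, not the real QuickCMS format.
use strict;
use warnings;

sub parse_content {
    my ($text) = @_;
    my %params;
    my $current;
    for my $line (split /\n/, $text) {
        if ($line =~ /^=(\w+)\s*(.*)$/) {
            $current = $1;
            $params{$current} = $2;
        }
        elsif (defined $current) {
            # Embedded HTML is just string data -- nothing tries to
            # interpret it, which was exactly the problem with XML.
            $params{$current} .=
                ($params{$current} ne '' ? "\n" : '') . $line;
        }
    }
    return \%params;
}

my $doc = <<'END';
=title About Us
=content
<p>Plain <b>HTML</b> passed through untouched.</p>
END

my $params = parse_content($doc);
print "$params->{title}\n";
```

Each key of the returned hash maps straight onto an HTML::Template
parameter, so the content file never needs to be well-formed markup.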
I wasn't happy about having to allow <TMPL_INCLUDE> tags inside the
content, as I fear they could easily be abused in naive ways. It also slows
things down quite a bit, since it requires two passes through
HTML::Template (at least, that's how it's currently implemented). However,
for some of our data, I found we simply didn't have another choice.
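For illustration, here's a toy two-pass expansion in the same spirit, using
a regex stand-in rather than HTML::Template itself: a template engine
doesn't re-scan interpolated parameter text, so any TMPL_INCLUDE tag that
arrives via the content parameter survives the first pass and needs a
second one to resolve.

```perl
#!/usr/bin/perl
# Simplified illustration of the two-pass idea. fill_template() stands in
# for HTML::Template: it substitutes TMPL_VAR tags but, like a real
# engine, never re-scans the parameter values it interpolates.
use strict;
use warnings;

my %files = (
    'footer.inc' => '<hr>Copyright 2004',
);

sub fill_template {
    my ($tmpl, %param) = @_;
    $tmpl =~ s{<TMPL_VAR\s+NAME="(\w+)">}{$param{$1} // ''}ge;
    return $tmpl;
}

sub expand_includes {
    my ($text) = @_;
    $text =~ s{<TMPL_INCLUDE\s+NAME="([^"]+)">}{$files{$1} // ''}ge;
    return $text;
}

my $template = '<html><body><TMPL_VAR NAME="content"></body></html>';
my $content  = 'Body text <TMPL_INCLUDE NAME="footer.inc">';

# Pass 1: content goes in as a parameter; its include tag survives.
my $pass1 = fill_template($template, content => $content);
# Pass 2: resolve the include tags the content itself carried.
my $page  = expand_includes($pass1);
print "$page\n";
```

The cost is obvious here: the whole page is walked twice, which is why
allowing includes in content slows things down.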
>You probably already did this, but you might want to review other CMS if
>redesigning the site from scratch. Here's a few lists:
I've looked at a lot of CMS systems--it's one of those massively
over-implemented genres, much like templating systems :) I suspect the
reason is that people look at other CMSes, decide they do almost but not
quite exactly what they want, and end up implementing their own. So I
decided I would add to the mix :) Yes, it's simple; that's intentional. I
hope (possibly in vain) that I can avoid the feeping creaturism that tends
to plague other CMSes.
>There's also PHP.
PHP has other problems.
> > Now, the system allows TMPL_INCLUDE tags in the content files (actually,
> > it's implemented by passing it through HTML::Template a second time, so
> > TMPL_* tags will be processed, but this might change). Included files
> > occasionally need to be part of the search, but most likely won't. But I
> > don't feel I can make that assumption in all cases. So I need some way of
> > saying which ones should be searched if we should ever need that
> > functionality (but default to not searching).
>And you also need a system to process your template files like
>Apache::QuickCMS does so you can index. Give spidering a try, you may
>find it's not as inefficient as you think. libxml2 is damn fast.
The processing stage isn't hard (remember, simplicity is a goal of
Apache::QuickCMS), and in any case, I think I can modify Apache::QuickCMS
quite easily so that the processing stage can be used directly by another
program. So I wouldn't need to write a separate processor to run before
the indexer.
I'm not really concerned with spidering being inefficient. The worst case I
can imagine is that I set it running before I leave one day and it's done
when I come in the next morning. It just seems to me a clumsier solution to
this problem.
(In fact, I do have a spider which runs through our entire site and jots
down which pages link to which other pages. It takes 5-10 minutes to run.
The resulting report is dumped in YAML format to be processed by other
programs into various reports, such as which pages link to documents that
don't exist. That saved us a lot of time, because the boss wanted our
current site mapped by hand before we got to the redesign. Now we have a
report that's useful for things beyond redesigns, not least the printed
version (1500 pages, double-sided), which is handy for my boss to carry
into meetings to justify why we need a
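The dead-link report generated from such a link map could be sketched like
this (the pages and data structure here are made up for illustration; the
real spider dumps YAML for other programs to consume):

```perl
#!/usr/bin/perl
# Given a map of page => [links on that page], report every link whose
# target was never crawled. Illustrative data only.
use strict;
use warnings;

my %links_from = (
    '/index.html' => [ '/about.html', '/old/gone.html' ],
    '/about.html' => [ '/index.html' ],
);

# A page "exists" if the spider crawled it, i.e. it's a key in the map.
my %exists = map { $_ => 1 } keys %links_from;

my @missing;
for my $page (sort keys %links_from) {
    for my $target (@{ $links_from{$page} }) {
        push @missing, "$page -> $target" unless $exists{$target};
    }
}
print "$_\n" for @missing;
```

Once the link map exists, each new report is just another short loop over
the same hash, which is why it stays useful beyond the redesign.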