[Edocs-development] Re: Document Search

First, hello everyone.  I've joined the group to help develop the Microsoft
Office integration components, or at least get them kick-started.  :-)

"Ashwini Kumar" wrote on 2002-10-02,
> I think we need to decide how complicated we would like the search
> for eDocs to get. Maybe for the time being we should simply stick to
> the RDF framework and make sure that the document creator
> associates a metadata with the document.

I agree with your proposal of RDF for the time being, Ashwini.

WHY RDF?

RDF is a mature standard and better-understood than ontologies.  It's
easier (from a software development standpoint) to employ a simpler
metadata approach.

I'd like to see a first working release of eDocs as soon as possible,
versus a full-featured release that bogs down in complexity and which
might not "get out the proverbial door."   That's why I'd support a minimal
feature set for Version 1, so that Version 2, and 3, etc., can evolve into
the eDocs Document Management System Sergio has envisioned with
help from a supportive user community.

THE MICROSOFT OFFICE INTEGRATION VIEW:
A METADATA SEARCH IS BETTER

A metadata search approach is easier to implement, less demanding
of network bandwidth (smaller serialized SOAP requests), and
would be much more reliable across the breadth of Microsoft Office
"document" types that business users will store in eDocs.

Speaking from my investigation of a Microsoft Word Add-in, to store
and retrieve a Word document, it would be easiest to provide meta-
data (the fields a user optionally fills in as the Document's Properties
in Word: title, keywords, author, manager, version, etc.) and then
a BASE64 octet-stream that could be stored on the server as an
opaque package ("black box").  This wouldn't facilitate full-text
search though, and serializing the Word Document's Object Model
would be extraordinarily heavyweight (also brittle to different releases
of Word, and labor-intensive to code and maintain pre-.NET).
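
A minimal sketch of that packaging, in Python rather than the Add-in's
own language. The names (StoredDocument, package_document,
search_metadata) and the field names are hypothetical, not an eDocs
API; the point is only that the server searches the metadata and never
looks inside the BASE64 payload.

```python
import base64
from dataclasses import dataclass


@dataclass
class StoredDocument:
    """Hypothetical server-side record: searchable metadata plus an opaque payload."""
    metadata: dict
    payload_b64: str  # BASE64 text of the original file bytes; never parsed


def package_document(raw_bytes: bytes, metadata: dict) -> StoredDocument:
    """Encode the file as BASE64 and attach the user-supplied properties."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return StoredDocument(metadata=dict(metadata), payload_b64=encoded)


def unpack_document(doc: StoredDocument) -> bytes:
    """Recover the exact bytes that were checked in."""
    return base64.b64decode(doc.payload_b64)


def search_metadata(docs, field_name: str, term: str):
    """Case-insensitive substring match on one metadata field only."""
    term = term.lower()
    return [d for d in docs if term in str(d.metadata.get(field_name, "")).lower()]
```

A search for "budget" in the keywords field would find a document
tagged that way, but the same word inside the document body would go
unnoticed, which is the full-text limitation described above.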

I can foresee sending metadata, the bare text of the Word
document as a (very long) xsd:string element so eDocs can
search the bare, unadorned text, and an octet-stream (which
would still be stored as a black box).  This may not be as
applicable to other Microsoft Office applications, like Visio
diagrams [e.g., I want to search for all UML Class Diagrams
in eDocs concerning class name "*LexicalHandler".]
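
As a rough sketch of what a "*LexicalHandler" search could look like
on the server, assuming it receives a keyword list plus an optional
bare-text string: the function names here are made up for
illustration, and the wildcard is treated as a shell-style glob
matched against whole identifiers.

```python
import fnmatch
import re


def identifiers(text: str):
    """Split bare text into identifier-like tokens (letters, digits, underscore)."""
    return re.findall(r"[A-Za-z_]\w*", text)


def matches(pattern: str, keywords, body_text: str = "") -> bool:
    """True if any keyword or body token matches the glob pattern (case-sensitive)."""
    candidates = list(keywords) + identifiers(body_text)
    return any(fnmatch.fnmatchcase(c, pattern) for c in candidates)
```

With only metadata available, the pattern can match only what the user
typed in as keywords; with the bare text also supplied, class names
mentioned in the body become findable too.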

Actually, Visio 2000 doesn't expose that in its Object Model
(I spent a weekend trying to get at it to write a code-gen Add-in)
though it may be possible to get at it after the internal COM
object it uses to handle UML in Visio serializes itself into
the .VSD file format.  In that case, I wouldn't want the
octet-stream to be opaque to the repository.  :-)   Or, I
might use the VBA FileSystemObject in the Add-in to
save the diagram to disk in a TMP folder, scan it for text,
and forward that text separately to be searchable (then
the octet-stream could be opaque; I suppose we want
to shoot for consistency there...)
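
The "save to a TMP folder and scan it for text" idea might look
something like the following. This is a sketch in Python, not Add-in
code, and it assumes the text is stored in the saved file as plain
ASCII or UTF-16LE runs; I have not confirmed that the .VSD format
keeps UML class names in a form this would recover.

```python
import re


def extract_text_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return printable runs of at least min_len characters found in binary data.

    Looks for two encodings: single-byte ASCII, and UTF-16LE where each
    printable character is followed by a zero byte.
    """
    ascii_pattern = rb"[\x20-\x7e]{%d,}" % min_len
    utf16_pattern = rb"(?:[\x20-\x7e]\x00){%d,}" % min_len
    runs = [m.decode("ascii") for m in re.findall(ascii_pattern, data)]
    runs += [m.decode("utf-16-le") for m in re.findall(utf16_pattern, data)]
    return runs
```

The extracted runs would be forwarded as searchable text alongside the
metadata, and the file itself would still go up as an opaque
octet-stream.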

In any event, I definitely see issues with searching
Visio diagrams for text.  Support would be incomplete,
at best.  (My recommended practice: a user with a UML
Class Diagram in Visio 2000 must list all applicable
Class Names as metadata keywords to be able to search
for them reliably.  Visio XP, I think, can export UML
to XMI; it might need a Microsoft patch to do it, but
that could work better.)

PROBLEM WITH ADDING FULL-TEXT SEARCH LATER

On the other hand (isn't it awful, having two hands? ;-) ).

The one problem with adding the sophisticated, full-text search
capability later is Migration for upgrading users.  We would need
either:

    1. A Migration Plan for going from a keyword-oriented search
    facility to a full text-oriented search facility in the future.  This
    would probably involve something tantamount to checking-out
    and checking-in all revisions (or having smart differencing) of
    all (the latest) documents.  :-(

    2. No formal support for migration.  Perhaps allow existing
    documents in a company's repository catalogued with keyword
    search to remain available for keyword searches, but be excluded
    from newer comprehensive text searches.  For a pre-existing
    document revision to be subject to comprehensive text search
    in an upgrading organization, the user would have to check it
    out and check it in again under the new version.
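
Option 2 can be sketched as follows, assuming each stored revision
either carries extracted text or does not. The Revision record and
the function names are hypothetical; nothing here reflects an actual
eDocs schema.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    doc_id: str
    keywords: List[str]
    full_text: Optional[str] = None  # None: checked in before full-text support


def keyword_search(revisions, term: str):
    """Works for every revision, old or new."""
    term = term.lower()
    return [r for r in revisions if any(term == k.lower() for k in r.keywords)]


def full_text_search(revisions, term: str):
    """Silently skips legacy revisions that have no extracted text."""
    term = term.lower()
    return [r for r in revisions if r.full_text is not None and term in r.full_text.lower()]


def check_in_again(revision: Revision, extracted_text: str) -> Revision:
    """Model of the manual check-out/check-in that opts a revision into full-text search."""
    return replace(revision, full_text=extracted_text)
```

A legacy revision stays findable by keyword, and only shows up in
full-text results after someone checks it in again under the new
version.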

So, early adopters could have a problem going from simpler
metadata to more comprehensive metadata in the future.
Administrators will be unhappy if it's not easy to migrate.

CONCLUSION

I'd still choose the first hand: RDF and searching on metadata
about repository documents to begin with.  It increases the
likelihood eDocs 1.0 happens (if eDocs 1.0 doesn't happen,
there'll be no first release to migrate from and so nobody has
that problem).

Derek Harmon
sto...@us...