Re: [Edocs-development] Re: Document Search

> First, hello everyone.  I've joined the group to help develop the
> Microsoft Office integration components, or at least get them
> kick-started.  :-)
>
> "Ashwini Kumar" wrote on 2002-10-02,
> > I think we need to decide how complicated we would like the search
> > for eDocs to get. Maybe for the time being we should simply stick to
> > the RDF framework and make sure that the document creator
> > associates a metadata with the document.
>
> I agree with your proposal of RDF for the time being, Ashwini.
>
>
> WHY RDF?
>
> RDF is a mature standard and better-understood than ontologies.  It's
> easier (from a software development standpoint) to employ a simpler
> metadata approach.

I agree with following this approach for our document search strategy.

> I'd like to see a first working release of eDocs as soon as possible,
> versus a full-featured release that bogs down in complexity and which
> might not "get out the proverbial door."   That's why I'd support a
> minimal feature set for Version 1, so that Version 2, and 3, etc., can
> evolve into the eDocs Document Management System Sergio has envisioned
> with help from a supportive user community.

We hope to have it soon. For now, though, I would like to concentrate on
analysis and design, and I absolutely want to finish that phase by the
end of this month. After that we can think about implementing a prototype.

> A metadata search approach is easier to implement, less demanding
> of network bandwidth in the size of serialized SOAP requests, and
> would be much more reliable across the breadth of "document" types
> business users will use eDocs with from the Microsoft Office suite
> of products.

Right

>
> Speaking from my investigation of a Microsoft Word Add-in, to store
> and retrieve a Word document, it would be easiest to provide meta-
> data (the fields a user optionally fills in as the Document's Properties
> in Word: title, keywords, author, manager, version, etc.) and then
> a BASE64 octet-stream that could be stored on the server as an
> opaque package ("black box").  This wouldn't facilitate full-text
> search though, and serializing the Word Document's Object Model
> would be extraordinarily heavyweight (also brittle to different releases
> of Word, and labor-intensive to code and maintain pre dot NET.)
>
> I can foresee sending metadata, the bare text of the Word
> document as a (very long) xsd:string element so eDocs can
> search the bare, unadorned text, and an octet-stream (which
> would still be stored as a black box).  This may not be as
> applicable to other Microsoft Office applications, like Visio
> diagrams [eg, I want to search for all UML Class Diagrams
> in eDocs concerning class name "*LexicalHandler".]

Right, we need to think about the ability to store every type of
document.
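
To make the three-part submission described above concrete (metadata,
optional bare text, and an opaque BASE64 octet-stream), here is a rough
Python sketch. It is only an illustration: the names DocumentSubmission,
build_submission and restore_bytes are hypothetical and are not part of
any eDocs design, and the real transport would presumably be SOAP rather
than a Python object.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentSubmission:
    """Hypothetical check-in payload; none of these names come from eDocs."""
    metadata: dict                    # e.g. title, keywords, author, version
    content_b64: str                  # original file, BASE64-encoded, kept as a black box
    plain_text: Optional[str] = None  # bare text, only when the client can extract it


def build_submission(raw_bytes, metadata, plain_text=None):
    """Package a document of any type; the server never parses content_b64."""
    return DocumentSubmission(
        metadata=dict(metadata),
        content_b64=base64.b64encode(raw_bytes).decode("ascii"),
        plain_text=plain_text,
    )


def restore_bytes(submission):
    """Recover the original file bytes on check-out."""
    return base64.b64decode(submission.content_b64)
```

Because plain_text is optional, a Visio diagram (or any other type we
cannot scan for text) can still be stored with metadata alone, while a
Word document can carry its text along for later searching.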

> Actually, Visio 2000 doesn't expose that in its Object Model
> (spent a weekend trying to get at it to write a code-gen Addin)
> though it may be possible to get at it after the internal COM
> object it uses to handle UML in Visio serializes itself into
> the .VSD file format.  In that case, I wouldn't want the
> octet-stream to be opaque to the repository.  :-)   Or, I
> might use the VBA FileSystemObject in the Add-in to
> save the diagram to disk in a TMP folder, scan it for text,
> and forward that text separately to be searchable (then
> the octet-stream could be opaque, I suppose we want
> to shoot for consistency there... )
>
> In any event, I definitely see issues with searching
> Visio diagrams for text.  Support would be incomplete,
> at best. (My recommended practice, if a user has a UML
> Class Diagram in Visio 2000, he or she must list all
> applicable Class Names as keywords for metadata to
> search for them reliably.  Visio XP, I think, can export
> UML to XMI, it might need a Microsoft patch to do it,
> but that could work better.)
>
>
> PROBLEM WITH ADDING FULL-TEXT SEARCH LATER
>
> On the other hand (isn't it awful, having two hands? ;-) ).
>
> The one problem with adding the sophisticated, full-text search
> capability later is Migration for upgrading users.  We would need
> either:
>
>     1. A Migration Plan for going from a keyword-oriented search
>     facility to a full text-oriented search facility in the future.  This
>     would probably involve something tantamount to checking-out
>     and checking-in all revisions (or having smart differencing) of
>     all (the latest) documents.  :-(
>
>     2. Not support migration formally.  Perhaps allow existing
>     documents in a company's repository catalogued with keyword
>     search to remain available for keyword searches, but be excluded
>     from newer comprehensive text searches.  But for a pre-existing
>     document revision to be subject to comprehensive text search
>     in an upgrading organization, the user would have to check-out
>     and check-in under the new version.
>
> So, there could be a problem going from a simpler metadata to
> a more comprehensive metadata in the future for early-adopters.
> Administrators will be unhappy if it's not easy to migrate.
>
>
> CONCLUSION
>
> I'd still choose the first hand: RDF and searching on metadata
> about repository documents to begin with.  It increases the
> likelihood eDocs 1.0 happens (if eDocs 1.0 doesn't happen,
> there'll be no first release to migrate from and so nobody has
> that problem).

I think that for now we need to use metadata to enable document
searching, so that people can use the system with any type of document
and so that things are easier for us to maintain.
Maybe in a later release we can think about full-text search for the
particular set of documents that need this functionality.
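
As a rough illustration of how the two kinds of search could coexist
(the second migration option above), here is a small in-memory sketch in
Python. The Repository class and its methods are hypothetical, not an
eDocs API. Every document is reachable by keyword, while documents that
were checked in without text are simply skipped by the text search.

```python
from fnmatch import fnmatchcase


class Repository:
    """Hypothetical in-memory catalogue, for illustration only."""

    def __init__(self):
        self._docs = {}

    def check_in(self, doc_id, keywords, plain_text=None):
        # Documents checked in without text remain keyword-searchable only.
        self._docs[doc_id] = {
            "keywords": [k.lower() for k in keywords],
            "text": plain_text.lower() if plain_text else None,
        }

    def search_keywords(self, pattern):
        """Case-insensitive wildcard match against keyword metadata."""
        pattern = pattern.lower()
        return sorted(
            doc_id
            for doc_id, doc in self._docs.items()
            if any(fnmatchcase(k, pattern) for k in doc["keywords"])
        )

    def search_text(self, phrase):
        """Substring search; excludes documents that carry no text."""
        phrase = phrase.lower()
        return sorted(
            doc_id
            for doc_id, doc in self._docs.items()
            if doc["text"] is not None and phrase in doc["text"]
        )


repo = Repository()
repo.check_in("uml-1", ["XmlLexicalHandler", "Parser"])
repo.check_in("memo-2", ["budget"], plain_text="Notes on the LexicalHandler class.")
print(repo.search_keywords("*LexicalHandler"))
print(repo.search_text("lexicalhandler"))
```

Under this scheme an older document becomes text-searchable only once it
is checked in again with its text attached.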

Ashwini is working on defining a strategy for implementing the RDF
framework in our product. We hope to see his proposal soon.
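
Just to give an idea of what associating metadata with a document might
look like, here is a minimal Python sketch that writes an RDF/XML
description. It is not Ashwini's proposal. The use of Dublin Core
properties (title, creator, subject), the function name
describe_document and the example URI are all assumptions made for
illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core, assumed vocabulary

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def describe_document(doc_uri, title, creator, subjects):
    """Return an RDF/XML string describing one repository document."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": doc_uri}
    )
    ET.SubElement(desc, f"{{{DC_NS}}}title").text = title
    ET.SubElement(desc, f"{{{DC_NS}}}creator").text = creator
    for subject in subjects:  # one dc:subject element per keyword
        ET.SubElement(desc, f"{{{DC_NS}}}subject").text = subject
    return ET.tostring(root, encoding="unicode")


# Hypothetical document URI and values.
print(describe_document("urn:example:edocs:doc-1", "Design Notes",
                        "Sergio", ["RDF", "search"]))
```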

Sergio