Re: [Exist-open] Finding duplicates, looking for a unqiue solution.


Joe,

My goal was to remove the duplicates: any node that had another node with
the same @signiture on its preceding axis was removed.

So as the document was processed, I just omitted nodes that satisfied this
condition:

if ($n/preceding::finiteStateMachine[@signiture = $n/@signiture]) then ()
else $n

That leaves only the first node in each set of duplicates.
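
To show the condition in context, here is a minimal sketch of a complete
query. The inline sample document, the "machines" wrapper element and the
"id" attribute are made up for illustration; only finiteStateMachine and
@signiture come from my data.

```xquery
xquery version "1.0";

(: Hypothetical sample data; only finiteStateMachine and @signiture
   are names from the real document. :)
let $doc :=
    <machines>
        <finiteStateMachine id="a" signiture="x1"/>
        <finiteStateMachine id="b" signiture="y2"/>
        <finiteStateMachine id="c" signiture="x1"/>
    </machines>
return
    <machines>{
        for $n in $doc/finiteStateMachine
        return
            (: Omit $n when an earlier node carries the same signature. :)
            if ($n/preceding::finiteStateMachine[@signiture = $n/@signiture]) then ()
            else $n
    }</machines>
```

With this sample the nodes "a" and "b" should be kept and "c" omitted.
Note that the preceding axis looks at every earlier node for each $n, so
the cost grows roughly quadratically with the number of nodes.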

The next problem I am going to try to tackle is finding common blocks of
child content in a document. Parts of a document may repeat in an
irregular way: the child content of two nodes is not identical as a whole,
but subsections of the child content of multiple nodes are identical.
The goal is to isolate the largest subsections of shared content and
move them out into a "shared" object. This problem is much more complex
because you are not only finding duplicate content; the content may
start and end differently, and you have to find the largest
common blocks between sets of nodes.
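
As a starting point only, here is a brute-force sketch for the simplest
case: the longest contiguous run of child elements shared by two nodes.
It assumes that "identical" means deep-equal() and it compares just one
pair of nodes, not sets of nodes. The function names and the sample
elements are hypothetical.

```xquery
xquery version "1.0";

(: Length of the run of pairwise deep-equal items that starts at
   $a[$i] and $b[$j]; 0 if those two items differ or are out of range. :)
declare function local:run-length(
    $a as node()*, $b as node()*,
    $i as xs:integer, $j as xs:integer
) as xs:integer {
    if ($i le count($a) and $j le count($b) and deep-equal($a[$i], $b[$j]))
    then 1 + local:run-length($a, $b, $i + 1, $j + 1)
    else 0
};

(: Longest contiguous block of child elements shared by $x and $y.
   Tries every pair of start positions, so it is only suitable for
   nodes with a modest number of children. :)
declare function local:longest-common-block(
    $x as element(), $y as element()
) as element()* {
    let $a := $x/*
    let $b := $y/*
    let $best :=
        (for $i in 1 to count($a), $j in 1 to count($b)
         let $len := local:run-length($a, $b, $i, $j)
         where $len gt 0
         order by $len descending, $i, $j
         return <hit start="{$i}" length="{$len}"/>)[1]
    return
        if (empty($best)) then ()
        else subsequence($a, xs:integer($best/@start), xs:integer($best/@length))
};

(: Hypothetical sample: the two nodes share the block b, c. :)
let $x := <state><a/><b/><c/><d/></state>
let $y := <state><z/><b/><c/><e/></state>
return local:longest-common-block($x, $y)
```

For the sample above this should return the <b/> and <c/> children of
$x. Extending it to sets of nodes, and to moving the shared block out
into a separate object, is the hard part and is not attempted here.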

Cheers,

Casey

On Sat, Dec 17, 2011 at 9:32 PM, Joe Wicentowski <jo...@gm...> wrote:

> Hi Casey,
>
> Very cool.  It occurs to me that in cases where it's not
> possible/practical to modify the source document itself, you could
> write the hash to a separate file and include a reference to the
> original document and node using util:node-id() and util:get-node.
> (Dan's wikibook uses a separate file approach - though his article is
> about looking for duplicate files, rather than what you were doing -
> finding duplicate nodes within a single document.)
>
> On a related note, I see in 1.5dev trunk there's a convenient (albeit
> apparently experimental) set of functions -
> util:absolute-resource-id() and util:get-resource-by-absolute-id():
> http://exist.svn.sourceforge.net/viewvc/exist?view=revision&revision=15552
> .
>
> Just curious - was your duplicate detector similar to the one I
> suggested, or is there something more efficient that you figured out?
>
> Cheers,
> Joe
>
> p.s. Dan - Nice find with util:serialize().  I had only seen the
> "serialize to file" variant of the function -
> http://demo.exist-db.org/exist/functions/util/serialize.  And great
> wikibook article - especially the bit about the importance of defining
> what "sameness" is.
>

-- 
Casey Jordan
easyDITA a product of Jorsek LLC
"CaseyDJordan" on LinkedIn, Twitter & Facebook
(585) 348 7399
easydita.com

