From: Casey J. <cas...@jo...> - 2011-12-18 03:25:19
Joe,

My goal was to remove the duplicates, so a node was removed if another node with the same @signiture appeared on its preceding axis. As the document was processed, I just omitted nodes that satisfied this condition:

    if ($n/preceding::finiteStateMachine[@signiture = $n/@signiture]) then () else $n

This leaves only the first node in each set of duplicates.

The next problem I am going to try to tackle is finding common blocks of child content in a document. You may have parts of a document that repeat in an irregular way: the child content of several nodes is not identical as a whole, but subsections of that child content are. The goal is to isolate the largest subsections of shared content and move them out into a "shared" object.

This problem is much more complex, because you are not only finding duplicate content but content that may start and end differently, and you have to find the largest common blocks between sets of nodes.

Cheers,
Casey

On Sat, Dec 17, 2011 at 9:32 PM, Joe Wicentowski <jo...@gm...> wrote:
> Hi Casey,
>
> Very cool. It occurs to me that in cases where it's not
> possible/practical to modify the source document itself, you could
> write the hash to a separate file and include a reference to the
> original document and node using util:node-id() and util:get-node.
> (Dan's wikibook uses a separate file approach - though his article is
> about looking for duplicate files, rather than what you were doing -
> finding duplicate nodes within a single document.)
>
> On a related note, I see in 1.5dev trunk there's a convenient (albeit
> apparently experimental) set of functions -
> util:absolute-resource-id() and util:get-resource-by-absolute-id():
> http://exist.svn.sourceforge.net/viewvc/exist?view=revision&revision=15552
> .
>
> Just curious - was your duplicate detector similar to the one I
> suggested, or is there something more efficient that you figured out?
>
> Cheers,
> Joe
>
> p.s. Dan - Nice find with util:serialize(). I had only seen the
> "serialize to file" variant of the function -
> http://demo.exist-db.org/exist/functions/util/serialize. And great
> wikibook article - especially the bit about the importance of defining
> what "sameness" is.
>
> --

--
Casey Jordan
easyDITA, a product of Jorsek LLC
"CaseyDJordan" on LinkedIn, Twitter & Facebook
(585) 348 7399
easydita.com
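
For reference, the duplicate filter described in the message above can be sketched as a self-contained query. This is only an illustration under stated assumptions: the sample document, the `machines` wrapper element, the `id` attributes, and the function name `local:first-of-each` are all hypothetical; only the `finiteStateMachine` element, its `@signiture` attribute, and the `if ... then () else $n` test come from the message.

```xquery
xquery version "1.0";

(: Hypothetical sample data. Only finiteStateMachine and @signiture are
   taken from the message; the wrapper element and @id are made up so
   that the surviving nodes are easy to identify. :)
declare variable $doc :=
    document {
        <machines>
            <finiteStateMachine signiture="a1" id="1"/>
            <finiteStateMachine signiture="b2" id="2"/>
            <finiteStateMachine signiture="a1" id="3"/>
            <finiteStateMachine signiture="b2" id="4"/>
            <finiteStateMachine signiture="c3" id="5"/>
        </machines>
    };

(: Keep a node only if no earlier node in document order carries the
   same @signiture, i.e. keep the first node of each duplicate set. :)
declare function local:first-of-each(
    $nodes as element(finiteStateMachine)*
) as element(finiteStateMachine)* {
    for $n in $nodes
    return
        if ($n/preceding::finiteStateMachine[@signiture = $n/@signiture])
        then ()
        else $n
};

local:first-of-each($doc//finiteStateMachine)
```

With this sample data the query should keep the elements with id 1, 2 and 5. Note that each node is compared against everything before it in the document, so the check may become slow on large inputs; grouping on a precomputed signature value is one possible alternative, though nothing in the thread says which approach performs better in eXist.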