Re: [Exist-open] Backup of collections with more than 1000000 sub-collections.


Hi Xiaodong,

Here is what I would do.

Information that you want to search for should be stored inside the record. It is not a good idea to store this information in the file name or the collection name, since they cannot be indexed. If you have the information you want inside the record, you can give the files and collections arbitrary names and do not have to worry so much about the database structure.

I would store the person id and the document type (OutpatientSummary, OutpatientPrescription, and so on) as attributes inside each record and give the files a UUID as a name. I think some of your files already store the person id, and the document type is given in xsi:schemaLocation. Some of the files also have a UUID already, and you can use these UUIDs to create the file names. When the same person has several records of the same type, a date or timestamp can distinguish between them.
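
A rough sketch of that idea in Python, only to make it concrete. The attribute names personId and docType and the function tag_record are made up for this illustration; they are not part of eXist or of your schema, and a real record might keep this information in elements instead.

```python
import uuid
import xml.etree.ElementTree as ET


def tag_record(xml_text, person_id, doc_type, record_uuid=None):
    """Return (file_name, xml_text) for a record that carries its own metadata.

    personId and docType are hypothetical attribute names chosen for this
    sketch. Note that ElementTree may rewrite namespace prefixes when it
    serializes the document again.
    """
    # Reuse the UUID the record already has, or create a random one.
    record_uuid = record_uuid or str(uuid.uuid4())
    root = ET.fromstring(xml_text)
    # The searchable information goes inside the record, not into its path.
    root.set("personId", person_id)
    root.set("docType", doc_type)
    return f"{record_uuid}.xml", ET.tostring(root, encoding="unicode")
```

With the person id and document type inside the record, the file name carries no meaning and can be any unique string.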

Since you have one million files, I would divide them into 256 collections (00 to FF) and store the files in them according to the first two hexadecimal characters of their UUID (about 3,900 files in each collection). Then you only have to worry about 256 __contents__.xml files!
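
The split into 256 collections could be computed along these lines. This is a sketch, not a tested setup: the root collection /db/records is a placeholder I made up, and the even spread assumes the UUIDs are random (version 4).

```python
import uuid


def collection_for(record_uuid, root="/db/records"):
    """Return the collection (00 to FF) that a record with this UUID goes into.

    "/db/records" is a hypothetical root collection used for illustration.
    """
    # uuid.UUID() validates the string; .hex is the 32 hex digits, no dashes.
    hex_digits = uuid.UUID(record_uuid).hex
    return f"{root}/{hex_digits[:2].upper()}"


def resource_path(record_uuid, root="/db/records"):
    """Return the full path of the record, using the UUID as the file name."""
    return f"{collection_for(record_uuid, root)}/{record_uuid}.xml"


# 1,000,000 files spread over 256 collections is 3906.25 files per
# collection on average, hence fewer than 4000 in each.
AVERAGE_PER_COLLECTION = 1_000_000 / 256
```

For example, resource_path("3f2b8c1e-0000-4000-8000-000000000000") gives /db/records/3F/3f2b8c1e-0000-4000-8000-000000000000.xml.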

It looks like a good idea to use the person id to organize files and collections. Perhaps this is what we would do on our desktop or in a relational database, but the structure of your database becomes needlessly complicated in this way.

I am sure that there are other ways to do this and I am not an expert in optimization or database design. 

Jens

On Jun 29, 2013, at 6:03 PM, easy  <li...@12...> wrote:

> Hi Jens,
> 
>    Thanks.
>  
>  I have described my application scenario before. I plan to manage residents' electronic health records (EHR) for a city with more than 1,000,000 people. Because the number of people is large and everyone has more than 100 files, I had no good way to organize the database structure, and I found this approach to be an option.
>  I plan to map each person's ID to a collection path. For example, for a man with ID 510210195502043434, I put all his EHR files into /db/510210/195502/043434. Then, when I want to query this man's records by ID, I only have to query collection('/db/510210/195502/043434'), so I think the performance is good. There will also not be too many files in any one collection (if there are 1,000,000 files in a collection, what happens to queries and reindexing?).
>   So ...
> 
>   Because the data is important, I need to back it up every day. Because it is large, I create a full backup first and then an incremental backup every day, but I found that the incremental backup is more than 1 GB even when there are few updates to the database.
> 
> 
> 
> 
> --
> Regards
> 
>  
> Do not worry that the road ahead holds no friends; who under heaven does not know you?
> 
> At 2013-06-29 23:22:07,"Jens Østergaard Petersen" <oe...@gm...> wrote:
> Hi Xiaodong,
> 
> I think everyone can see your problem, but can you describe a restore process that does not have __contents__.xml when collections or resources have not been changed? I am not saying that this is impossible, but a restore process like that will be a lot more complicated than the present restore process. Perhaps things as important as backup and restore have to be kept simple, even if the backups take up a lot of space?
> 
> I do not know the structure of your data, but perhaps it is not necessary for one person to have one collection?
> 
> One thing that is odd is that (as can be seen below) for each collection and resource listed in __contents__.xml both a "name" and a "filename" are given. These are not identical, since "name" is percent-escaped and "filename" is not. Having only one of these will not solve your problem, but I don't see why this information has to be duplicated.
> 
> Jens
> 
> On Jun 29, 2013, at 9:13 AM, easy <li...@12...> wrote:
> 
>> 
>> This is an example __contents__.xml:
>> 
>> <collection xmlns="http://exist.sourceforge.net/NS/exist" name="/db/XMLDB/130635/19780718/1468" version="1" owner="admin" group="dba" mode="755" created="2013-04-10T08:00:02.15+08:00">
>>     <acl entries="0" version="1"/>
>>     <subcollection name="A0201" filename="A0201"/>
>>     <subcollection name="A0202" filename="A0202"/>
>>     <subcollection name="A0203" filename="A0203"/>
>>     <subcollection name="A0204" filename="A0204"/>
>>     <subcollection name="A0103" filename="A0103"/>
>>     <subcollection name="B0006" filename="B0006"/>
>>     <subcollection name="B0001" filename="B0001"/>
>>     <subcollection name="B0002" filename="B0002"/>
>>     <subcollection name="A0402" filename="A0402"/>
>>     <subcollection name="B0004" filename="B0004"/>
>>     <subcollection name="B0003" filename="B0003"/>
>>     <subcollection name="A0401" filename="A0401"/>
>>     <subcollection name="A0301" filename="A0301"/>
>>     <subcollection name="A0302" filename="A0302"/>
>> </collection>
>> --------------------
>>  It contains only the creation time and version of the collection "/db/XMLDB/130635/19780718/1468", with no update time. If the incremental backup runs every hour and this collection has not been updated, each incremental backup still includes this same, unchanged file. Is that needed?
>> 
>> 
>> --
>> Regards
>> 
>>   easy
>> 
>> Do not worry that the road ahead holds no friends; who under heaven does not know you?
>> 
>> At 2013-06-29 15:05:47,easy <li...@12...> wrote:
>> 
>> I don't understand why a __contents__.xml has to be added for every collection, even when there is no update in the collection or its child collections.
>>  For a database with a large number of collections, this puts a large number of empty __contents__.xml files (just the list of child collection names) into the incremental backup. Moreover, is it necessary to create every collection in the backup, even those that have not been updated?
>> 
>> So I think the important thing is the method for checking whether anything differs from the last backup. The current implementation wastes a lot of disk space when there is a large number of collections.
>> 
>> 
>> 
>> --
>> Regards
>> 
>> easy
>> 
>> Do not worry that the road ahead holds no friends; who under heaven does not know you?
>> 
>> At 2013-06-28 20:37:37,"Dmitriy Shabanov" <sha...@gm...> wrote:
>> I didn't manage to check, but I think this __contents__.xml must contain this information because the next time you run a backup, it has to check whether anything differs from the last one.
>> 
>> As an alternative I see only an "archive" flag ... any other ideas?
>> 
>> On Fri, Jun 28, 2013 at 2:59 PM, easy <li...@12...> wrote:
>>   I have found a problem with backup. I have an eXist-db database with more than 1,000,000 collections (how else to store everyone's EHR in a large city?). I set up a backup job with incremental backup = true, but I found that the backup includes a __contents__.xml for every collection, even when there has been no update in the collection, so the incremental backup is always more than 100 MB (just listing the collections in __contents__.xml).
>>   So the backup method needs to consider how to deal with a database or collection with a large number of child collections. Is that right?
>> 
>> -- 
>> Dmitriy Shabanov
>> 
>> 
>> 
>> 
>> Exist-open mailing list
>> Exi...@li...
>> https://lists.sourceforge.net/lists/listinfo/exist-open
> 
> 
> 
