From: Margarita R. <rui...@ya...> - 2007-10-03 20:36:26
Hello GusDev members,
I need to load a CD-HIT results file.
If anybody has already done something similar, I would appreciate any help.
My idea is as follows.
The input is a CD-HIT results file in this format:
>Cluster 0
0 17392aa, >AAZ14281.1... *
1 17392aa, >XP_843163.1... at 100%
2 17392aa, >XP_843163.1... at 100%
>Cluster 1
0 10589aa, >AAN35571.1... *
1 10589aa, >XP_001347658.1... at 100%
2 10589aa, >XP_001347658.1... at 100%
>Cluster 2
0 10287aa, >XP_966264.1... *
1 10287aa, >XP_966264.1... at 100%
>Cluster 3
0 10061aa, >CAD51479.1... *
1 10061aa, >XP_001351672.1... at 100%
2 10061aa, >XP_001351672.1... at 100%
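A minimal Python sketch of how a file in this format might be read into clusters. It assumes only the two line shapes shown above (a ">Cluster N" header, then member lines ending in "*" for the representative or "at N%" for the others). The names Member and parse_clstr are hypothetical, and the patterns use search so that stray leading characters on a line are tolerated.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# Patterns cover only the line shapes shown in the sample above.
CLUSTER_RE = re.compile(r">Cluster\s+(\d+)\s*$")
MEMBER_RE = re.compile(
    r"(\d+)\s+(\d+)aa,\s+>(.+?)\.\.\.\s+(?:(\*)|at\s+([\d.]+)%)\s*$"
)


@dataclass
class Member:
    index: int                         # position within the cluster
    length: int                        # sequence length in amino acids
    accession: str                     # identifier as printed by CD-HIT
    is_representative: bool            # True for the line marked "*"
    percent_identity: Optional[float]  # None for the representative


def parse_clstr(lines: Iterable[str]) -> Dict[int, List[Member]]:
    """Group member lines under the most recent '>Cluster N' header."""
    clusters: Dict[int, List[Member]] = {}
    current: Optional[int] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        header = CLUSTER_RE.search(line)
        if header:
            current = int(header.group(1))
            clusters[current] = []
            continue
        member = MEMBER_RE.search(line)
        if member and current is not None:
            index, length, accession, star, pct = member.groups()
            clusters[current].append(
                Member(
                    index=int(index),
                    length=int(length),
                    accession=accession,
                    is_representative=star is not None,
                    percent_identity=float(pct) if pct else None,
                )
            )
    return clusters


SAMPLE = """\
>Cluster 0
0 17392aa, >AAZ14281.1... *
1 17392aa, >XP_843163.1... at 100%
2 17392aa, >XP_843163.1... at 100%
>Cluster 1
0 10589aa, >AAN35571.1... *
1 10589aa, >XP_001347658.1... at 100%
"""

clusters = parse_clstr(SAMPLE.splitlines())
for cluster_id, members in clusters.items():
    print(cluster_id, [m.accession for m in members])
```

For a real file, the same function could be given an open file handle instead of SAMPLE.splitlines().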
I plan to use three tables for this:
Dots.SequenceSequenceGroup,
Dots.SequenceGroup and
Dots.SeqGroupExperiment.
In Dots.SeqGroupExperiment, I'll store the description of the CD-HIT run. For example:
Dots.SeqGroupExperiment.description = "tcruzi vs tcruzi 100%"
Dots.SeqGroupExperiment.sequence_source = "tcruzi"
Dots.SeqGroupExperiment.percent_identity = "1"
For the groups, I'll use Dots.SequenceGroup. For example:
Dots.SequenceGroup.number_of_members = 3
Dots.SequenceGroup.number_of_taxa = 1
Dots.SequenceGroup.min_percent_match = 1
Dots.SequenceGroup.max_percent_match = 1
In Dots.SequenceSequenceGroup, I'll store the sequences that belong to each group. For example:
sequence_id = the aa_sequence_id of the sequence
sequence_group_id = the identifier of the group
source_table_id = in this case, the identifier of the Dots.TranslatedAASequence table
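The mapping described above could be sketched in Python as plain dictionaries, one per intended row. This is only an illustration under several assumptions: the column names are the ones listed in this message and have not been checked against the actual schema, the function name build_rows is hypothetical, the numeric ids (101, 102, 999) are made-up placeholders, the CD-HIT cluster number stands in for the group identifier that the database would normally assign, and the representative sequence is counted as a match of 1 with itself.

```python
from typing import Dict, List, Tuple


def build_rows(
    clusters: Dict[int, List[Tuple[str, float]]],
    aa_sequence_ids: Dict[str, int],
    source_table_id: int,
):
    """Turn clusters of (accession, identity fraction) into row dictionaries."""
    group_rows = []
    member_rows = []
    for cluster_id, members in sorted(clusters.items()):
        identities = [identity for _, identity in members]
        group_rows.append(
            {
                # Placeholder: a real load would use a database-assigned id.
                "sequence_group_id": cluster_id,
                "number_of_members": len(members),
                # Assumption: a single-organism run, as in the example.
                "number_of_taxa": 1,
                "min_percent_match": min(identities),
                "max_percent_match": max(identities),
            }
        )
        for accession, _ in members:
            member_rows.append(
                {
                    "sequence_id": aa_sequence_ids[accession],
                    "sequence_group_id": cluster_id,
                    "source_table_id": source_table_id,
                }
            )
    return group_rows, member_rows


# One row describing the CD-HIT run, using the values from the example.
experiment_row = {
    "description": "tcruzi vs tcruzi 100%",
    "sequence_source": "tcruzi",
    "percent_identity": 1,
}

# Cluster 0 from the sample, with identities expressed as fractions.
example_clusters = {
    0: [("AAZ14281.1", 1.0), ("XP_843163.1", 1.0), ("XP_843163.1", 1.0)],
}
# Hypothetical accession -> aa_sequence_id lookup and table id.
example_ids = {"AAZ14281.1": 101, "XP_843163.1": 102}

group_rows, member_rows = build_rows(example_clusters, example_ids, source_table_id=999)
```

Keeping the rows as dictionaries leaves the actual insert step open, so the same structure could feed whatever loading mechanism is used.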
Thanks for your help,
Margarita Ruiz
Oswaldo Cruz Institute
Rio de Janeiro, Brazil