[GUSDEV] Inserts CD-HIT

Hello GusDev members,

I need to load a CD-HIT results file into GUS.
If somebody has already done something similar, it would help me a lot.

My idea is as follows:

The input file (CD-HIT cluster results) has this format:
>Cluster 0
0       17392aa, >AAZ14281.1... *
1       17392aa, >XP_843163.1... at 100%
2       17392aa, >XP_843163.1... at 100%
>Cluster 1
0       10589aa, >AAN35571.1... *
1       10589aa, >XP_001347658.1... at 100%
2       10589aa, >XP_001347658.1... at 100%
>Cluster 2
0       10287aa, >XP_966264.1... *
1       10287aa, >XP_966264.1... at 100%
>Cluster 3
0       10061aa, >CAD51479.1... *
1       10061aa, >XP_001351672.1... at 100%
2       10061aa, >XP_001351672.1... at 100%
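To make the parsing step concrete, here is a minimal sketch of reading this cluster format. It is written in Python only for illustration and is not an existing GUS plugin. The names `Member`, `Cluster` and `parse_clstr` are made up for this example, and it assumes the protein form of the output shown above: a ">Cluster N" header line followed by member lines that end in "*" (the representative) or "at NN%".

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List, Optional

# ">Cluster 0"
CLUSTER_RE = re.compile(r"^>Cluster\s+(\d+)")
# "1       17392aa, >XP_843163.1... at 100%"  or  "0  17392aa, >AAZ14281.1... *"
MEMBER_RE = re.compile(
    r"^(\d+)\s+(\d+)aa,\s+>(.+?)\.\.\.\s+(?:(\*)|at\s+([\d.]+)%)"
)


@dataclass
class Member:
    index: int                         # position of the member inside its cluster
    length: int                        # sequence length in amino acids
    accession: str                     # identifier as written in the file
    is_representative: bool            # True for the line marked with "*"
    percent_identity: Optional[float]  # None for the representative


@dataclass
class Cluster:
    number: int
    members: List[Member] = field(default_factory=list)


def parse_clstr(lines: Iterable[str]) -> List[Cluster]:
    """Group member lines under the ">Cluster N" header that precedes them."""
    clusters: List[Cluster] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        header = CLUSTER_RE.match(line)
        if header:
            clusters.append(Cluster(number=int(header.group(1))))
            continue
        member = MEMBER_RE.match(line)
        if member is None or not clusters:
            raise ValueError(f"unexpected line: {raw!r}")
        index, length, accession, star, percent = member.groups()
        clusters[-1].members.append(
            Member(
                index=int(index),
                length=int(length),
                accession=accession,
                is_representative=star is not None,
                percent_identity=None if star else float(percent),
            )
        )
    return clusters
```

Calling `parse_clstr(open(path))` on a file like the one above would give one `Cluster` per header, each holding its members in file order. Lines that match neither pattern raise an error instead of being skipped, so an unexpected variant of the format is noticed early.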

I plan to use three tables for this:
Dots.SequenceSequenceGroup,
Dots.SequenceGroup, and
Dots.SeqGroupExperiment

In the Dots.SeqGroupExperiment table, I'll put the description of the CD-HIT run. For example:
Dots.SeqGroupExperiment.description = "tcruzi vs tcruzi 100%"
Dots.SeqGroupExperiment.sequence_source = "tcruzi"
Dots.SeqGroupExperiment.percent_identity = "1"

For the groups, I'll use Dots.SequenceGroup. For example:
Dots.SequenceGroup.number_of_members = 3
Dots.SequenceGroup.number_of_taxa = 1
Dots.SequenceGroup.min_percent_match = 1
Dots.SequenceGroup.max_percent_match = 1

In Dots.SequenceSequenceGroup, I'll put the sequences of each group. For example:
sequence_id = the aa_sequence_id of the sequence
sequence_group_id = the identifier of the group
source_table_id = in this case, the identifier of the Dots.TranslatedAASequence table
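The mapping described above could be sketched roughly as follows. This is only an illustration in Python of how the values for the three tables might be derived from the clusters; it does not use the GUS object layer or any real plugin API. The function `build_rows`, the `group_key` placeholder (standing in for the sequence_group_id that the database would assign), the accession-to-aa_sequence_id lookup and all numeric ids are assumptions made up for the example. It also assumes that percentages are stored as fractions (100% becomes 1), as in the values above, and that number_of_taxa is always 1 for a single-organism run.

```python
from typing import Dict, List, Tuple

# One cluster = list of (accession, percent identity to the representative).
# The representative itself is given 100.0 here (an assumption of this sketch).
ClusterData = List[Tuple[str, float]]


def build_rows(
    clusters: List[ClusterData],
    aa_sequence_ids: Dict[str, int],  # hypothetical accession -> aa_sequence_id lookup
    source_table_id: int,             # id of Dots.TranslatedAASequence in the table registry
    description: str,
    sequence_source: str,
    percent_identity: float,
):
    """Return (experiment, groups, links) as plain dicts, one dict per row."""
    experiment = {
        "description": description,
        "sequence_source": sequence_source,
        "percent_identity": percent_identity,
    }
    groups = []
    links = []
    for group_key, members in enumerate(clusters):
        # Convert CD-HIT percentages to fractions, e.g. 100% -> 1.0
        fractions = [percent / 100.0 for _, percent in members]
        groups.append({
            "group_key": group_key,  # placeholder for sequence_group_id
            "number_of_members": len(members),
            "number_of_taxa": 1,
            "min_percent_match": min(fractions),
            "max_percent_match": max(fractions),
        })
        for accession, _ in members:
            links.append({
                "group_key": group_key,
                "sequence_id": aa_sequence_ids[accession],
                "source_table_id": source_table_id,
            })
    return experiment, groups, links
```

For the first cluster in the example file, this would produce one group row with number_of_members = 3 and both percent-match values equal to 1, plus three link rows, one per member line. An accession that is missing from the lookup raises a KeyError, which seems safer than silently loading an incomplete group.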

Thanks for your help,

Margarita Ruiz
Oswaldo Cruz Institute
Rio de Janeiro, Brazil
