Hi,
I'm interested in using SDhaP for phasing several hundred diploid and tetraploid individuals. Do you have suggestions or perhaps scripts for generating the input files for SDhaP from BAM files? I'm also generally confused regarding the "Input file format" in the README file. Do you need a separate input file for each chromosome? Is number of reads the total number of reads for a chromosome? What does number of columns mean? Could you provide some more detail on what is meant by the different entries following number of columns (e.g. number of contiguous segments, SNP segments, why multiple entries for continuous bases in read, why does diploid example show two SNPs, but 0 continuous bases in reads?, etc...).
Thanks,
Patrick Monnahan
SDhaP requires the input file to be in read-snp format. Please find the explanation for various quantities below:
Number of reads: The total number of sequenced fragments with at least one SNP.
Number of columns: The total number of SNPs covered by the reads in that data file.
The following quantities are defined with respect to a given fragment (per line of the input file):
Number of contiguous segments: The number of contiguous (adjacent) groups of SNPs that the fragment covers.
Read identifier: A unique ID for the fragment.
Position of the first SNP segment: SNP index of the first segment for the fragment.
Continuous bases in read: Sequenced alleles in the first segment for the fragment.
For each segment of the fragment following the first one, these quantities are defined:
Position of the next SNP segment
Continuous bases in read
Finally,
Quality scores (in fastq format): Q scores for the alleles from all segments of the fragment, one character per allele, in the same order as the alleles. For example:
4500
1000
2 chr3_1 1 24214 37 2243 IIIIIIIII
2 chr3_2 1 2421432 46 221413 IIIIIIIIIIIII
2 chr3_3 1 4331112 44 4112 IIIIIIIIIII
2 chr3_4 1 433111 40 41134 IIIIIIIIIII
means that the data file contains 4500 fragments and 1000 SNPs in total. The first fragment, with ID chr3_1 (line #3 of the file), has 2 segments. The first segment starts at SNP index 1 (the first SNP) and covers SNP indices 1-5, with alleles 2,4,2,1,4. Similarly, the second segment starts at SNP index 37 and covers SNP indices 37-40, with alleles 2,2,4,3. Lastly, the Q scores for all 9 alleles of this fragment are 'IIIIIIIII'.
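To make the layout concrete, here is a minimal Python sketch that reads one fragment line into per-SNP calls and writes it back out. It is not part of SDhaP: the function names `parse_fragment` and `format_fragment` are made up for illustration, and the sketch assumes whitespace-separated fields, 1-based SNP indices, one allele character per SNP, and at most one call per SNP per fragment.

```python
from itertools import groupby


def parse_fragment(line):
    """Split one fragment line into (read_id, [(snp_index, allele, qual), ...]).

    Hypothetical helper, assuming the field order described above:
    n_segments, read_id, then (start, alleles) per segment, then qualities.
    """
    fields = line.split()
    n_segments = int(fields[0])
    if len(fields) != 2 * n_segments + 3:
        raise ValueError("field count does not match the number of segments")
    read_id = fields[1]
    quals = fields[-1]

    calls = []
    for k in range(n_segments):
        start = int(fields[2 + 2 * k])   # SNP index where this segment starts
        alleles = fields[3 + 2 * k]      # one character per consecutive SNP
        for offset, allele in enumerate(alleles):
            calls.append((start + offset, allele))

    if len(calls) != len(quals):
        raise ValueError("expected one quality character per allele")
    return read_id, [(idx, a, q) for (idx, a), q in zip(calls, quals)]


def format_fragment(read_id, calls):
    """Build a fragment line from (snp_index, allele, qual) calls.

    Runs of consecutive SNP indices are merged into one segment.
    Assumes each SNP index appears at most once in `calls`.
    """
    calls = sorted(calls)
    segments = []
    # Within a run of consecutive indices, (index - list position) is constant.
    for _, run in groupby(enumerate(calls), key=lambda t: t[1][0] - t[0]):
        run = [call for _, call in run]
        segments.append((run[0][0], "".join(a for _, a, _ in run)))

    parts = [str(len(segments)), read_id]
    for start, alleles in segments:
        parts += [str(start), alleles]
    parts.append("".join(q for _, _, q in calls))
    return " ".join(parts)


read_id, calls = parse_fragment("2 chr3_1 1 24214 37 2243 IIIIIIIII")
print(read_id, [idx for idx, _, _ in calls])
print(format_fragment(read_id, calls))
```

If you are starting from BAM files, something like `format_fragment` could be fed with the SNP index, observed allele and base quality collected for each read at your variant sites. How you extract those from the alignments is up to your own pipeline and is not shown here.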
Hope this helps.
Last edit: shreepriya das 2016-08-23