Menu

Help

John Archer

CView has six top-level menu options. Four of these determine what subset of the aligned sequences the methods within are applied to. These four options, as well as the sequence subsets that they apply to, are:

(1) All Sequences: contains methods applied to the entire alignment.

(2) Title Search: contains methods applied to sequences that have a specified set of characters within their title, for example, a year “1990”, a geographical location “North America” or a viral subtype “subtype B”.

(3) Node Path: contains methods applied to sequences that pass through a user selected node on the associated network.

(4) User Group: contains methods applied to sequences defined by the user through a list of uploaded titles.

The final two top-level menu options are:

(5) Plug-ins: An area where plug-ins developed for specific sequence analysis objectives will be placed.

(6) About: A brief summary of the CView software, people involved and a brief help area.

Each of the four top-level menu options that apply to different sequence subsets offers multiple method choices. For each choice, with the exception of “Save Titles”, the user specifies the co-ordinates of the alignment region that the choice is applied to using the “From Site” and “To Site” textboxes in the associated pop-up window. The pre-filled default values for these co-ordinates are 1 and the length of the alignment, i.e. all sites. The method choices are:

(i) Load Fasta
Summary: loads a fasta-formatted alignment.
The user selects whether the residue characters within the alignment to be loaded represent nucleotides or amino acids. This selection is only used to speed up some of the available method options. For example, for large alignments, if this option is not specified, each sequence needs to be checked to make sure that all individual characters present are accounted for, even erroneous ones. By selecting “Nucleotides” or “Amino acids”, the appropriate predefined character library is selected immediately and only the top 20 sequences are subsequently checked for any “alternative” characters that may exist.

(ii) Save Region
Summary: saves a region of the alignment.
Saves the user-specified region of the alignment. The slider bar titled “Characters Per Line” sets the number of characters saved per line within the fasta-formatted output file. For example, if a nucleotide sequence is 500 nt in length and this slider is set to 100, then the sequence will be saved across six lines: the first line contains the title, prefixed by a ‘>’, and the next five lines contain 100 nucleotides each, i.e. fasta format.
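The line wrapping described above can be sketched as follows. This is a minimal illustration only, not CView's own code; the function name format_fasta_record and its parameters are hypothetical.

```python
def format_fasta_record(title, sequence, start=1, end=None, chars_per_line=100):
    """Return one fasta record covering sites start..end (1-based, inclusive)."""
    if end is None:
        end = len(sequence)
    region = sequence[start - 1:end]
    lines = [">" + title]  # title line, prefixed by '>'
    # Split the region into chunks of at most chars_per_line characters.
    for i in range(0, len(region), chars_per_line):
        lines.append(region[i:i + chars_per_line])
    return "\n".join(lines) + "\n"


# A 500 nt sequence saved with 100 characters per line gives six lines.
record = format_fasta_record("seq1", "ACGT" * 125, chars_per_line=100)
```

Here record holds one title line followed by five lines of 100 nucleotides each.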

(iii) Save Titles
Summary: saves the sequence titles.
Sequence titles will be saved as a list within a specified text file.

(iv) Save Unaligned
Summary: saves the sequences as unaligned.
Each sequence will have the ‘-’ symbol removed and be saved in fasta format as described in (ii). Note: unaligned sequences cannot be loaded into CView as it is an alignment viewer.

(v) Consensus Sequence
Summary: saves the consensus sequence.
The majority character of each site will be selected to produce a consensus sequence. This sequence will be saved in fasta format as described in (ii). For sites where there are majority characters of equal frequency, alternatives will be placed in a table below the consensus sequence within the file.
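A majority-rule consensus with tie reporting can be sketched as below. The function name consensus is hypothetical, and the alphabetical tie-breaking is an assumption made for the example; the text above does not say which of the tied characters CView places in the consensus itself.

```python
from collections import Counter


def consensus(sequences, start=1, end=None):
    """Return (consensus string, {site: alternative characters}) for sites start..end."""
    if end is None:
        end = len(sequences[0])
    chars = []
    ties = {}
    for site in range(start - 1, end):
        counts = Counter(seq[site] for seq in sequences)
        top = max(counts.values())
        # All characters sharing the highest count at this site.
        best = sorted(c for c, n in counts.items() if n == top)
        chars.append(best[0])  # assumption: alphabetically first wins a tie
        if len(best) > 1:
            ties[site + 1] = best[1:]  # alternatives, keyed by 1-based site
    return "".join(chars), ties


seq, alternatives = consensus(["ACGT", "ACGA", "TCGA", "TCGT"])
```

In this example sites 1 and 4 each have two equally frequent characters, so both appear in alternatives.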

(vi) Hamming Distances
Summary: saves a matrix of hamming distances between sequences.
Between each pair of sequences, within the user-specified region, the number of sites where the characters are not the same is counted. A matrix of all pairwise counts is written to the specified tab-delimited text file. Because all sequences in the alignment are of the same length, raw counts are saved rather than length-normalized values. Sequence titles are placed along the top of the file as well as in the first column (left-hand side) of the file.
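The pairwise counting and the matrix layout can be sketched as follows. The function names are hypothetical, and the empty top-left cell is an assumption about the file layout.

```python
def hamming(a, b, start=1, end=None):
    """Count sites in start..end (1-based, inclusive) where a and b differ."""
    if end is None:
        end = len(a)
    return sum(1 for x, y in zip(a[start - 1:end], b[start - 1:end]) if x != y)


def hamming_matrix_tsv(titles, sequences, start=1, end=None):
    """Return a tab-delimited matrix of raw hamming distances."""
    rows = ["\t".join([""] + list(titles))]  # header row of titles
    for title, seq in zip(titles, sequences):
        counts = [str(hamming(seq, other, start, end)) for other in sequences]
        rows.append("\t".join([title] + counts))  # title in the first column
    return "\n".join(rows) + "\n"


table = hamming_matrix_tsv(["a", "b", "c"], ["ACGT", "ACGA", "TCGA"])
```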

(vii) Cluster Sequences
Summary: clusters sequences.
Sequences are clustered according to their hamming distances. Clusters are sorted by size and placed in descending order in the specified output file. The number of sequences within each cluster is specified, and the individual sequences are placed below that number in fasta format, making use of a “Characters Per Line” slider similar to that described in (ii). Each cluster is created using an iterative approach. Initially a sequence is randomly selected to be the seed for a newly created empty cluster. All sequences related to that seed are then added to the cluster and themselves become seeds for the next iteration. This process continues until no more seeds are identified, at which point, if unclustered sequences remain, a new cluster is initiated. With a clustering threshold of 0.20, all sequences within a given cluster have at least one neighbour from which they are less than 20% divergent. Similarly, for a clustering threshold of 0.05, all sequences within the cluster have at least one neighbour from which they are less than 5% divergent. The clustering threshold is specified through the slider bar titled “Cluster Threshold” within the pop-up window for this menu option. This method of clustering means that any long branches between groups of sequences will result in a new cluster being formed, similar to identifying clades on a phylogenetic tree.
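The seed-expansion procedure can be sketched as below. This is an illustration under stated assumptions rather than CView's implementation: the function name cluster_sequences is hypothetical, and divergence is taken to be the hamming distance divided by the sequence length.

```python
import random


def cluster_sequences(sequences, threshold, rng=None):
    """Group sequence indices by iterative seed expansion; largest cluster first."""
    rng = rng or random.Random()
    length = len(sequences[0])

    def related(i, j):
        # Assumption: divergence = differing sites / alignment length.
        diffs = sum(1 for x, y in zip(sequences[i], sequences[j]) if x != y)
        return diffs / length < threshold

    unclustered = set(range(len(sequences)))
    clusters = []
    while unclustered:
        seed = rng.choice(sorted(unclustered))  # random seed for a new cluster
        unclustered.remove(seed)
        cluster, seeds = [seed], [seed]
        while seeds:  # stop when an iteration finds no new members
            next_seeds = []
            for s in seeds:
                hits = [j for j in sorted(unclustered) if related(s, j)]
                unclustered.difference_update(hits)
                cluster.extend(hits)
                next_seeds.extend(hits)  # new members seed the next iteration
            seeds = next_seeds
        clusters.append(sorted(cluster))
    clusters.sort(key=len, reverse=True)
    return clusters


seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC", "CCCCCCCCCG"]
clusters = cluster_sequences(seqs, 0.20)
```

In this example the first and third sequences are exactly 20% divergent, so they are not directly related, yet they share a cluster because each is less than 20% divergent from the second. The choice of seed changes the order in which members are found but not the final membership.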

(viii) Residue Frequencies
Summary: saves per site residue frequencies.
A matrix of per-site residue frequencies spanning the specified region of the alignment will be saved. Such a matrix is useful when looking for things like polymorphisms associated with drug resistance, creating position-specific scoring matrices, or characterizing the differences between two groups of aligned sequences. For example, within a single alignment containing viruses isolated in different years, a matrix associated with each specific year can be created using the “Title Search” –> “Residue Frequencies” option. These matrices can then be compared to identify polymorphisms associated with individual years. A similar scenario could be described for geographic locations.
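A per-site frequency matrix of this kind can be sketched as follows. The function name residue_frequencies is hypothetical, and reporting proportions rather than raw counts is an assumption made for the example.

```python
from collections import Counter


def residue_frequencies(sequences, start=1, end=None):
    """Return {site: {residue: proportion}} for sites start..end (1-based, inclusive)."""
    if end is None:
        end = len(sequences[0])
    total = len(sequences)
    matrix = {}
    for site in range(start - 1, end):
        counts = Counter(seq[site] for seq in sequences)
        # Assumption: frequencies are expressed as proportions of all sequences.
        matrix[site + 1] = {res: n / total for res, n in counts.items()}
    return matrix


matrix = residue_frequencies(["ACGT", "ACGA", "TCGA", "TCGT"])
```

Running the same function on two subsets of an alignment, for example sequences from two different years, gives two matrices that can be compared site by site.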

(ix) Per Site Kmers
Summary: saves per site kmer frequencies.
The user specifies the kmer length using the “Kmer Length” slider. Then, starting at each site of the alignment, all kmers of this length are extracted from all sequences at that site. For each site, the unique kmers are written out along with their frequencies. Kmers are another way to characterize variation within an alignment. They are also the short sequence fragments used as seeds in many bioinformatics algorithms, such as those involving mapping, de novo assembly and alignment generation itself, and as such are of implicit interest to many areas of computational biology.
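Per-site kmer counting can be sketched as below. The function name per_site_kmers is hypothetical. Two assumptions are made for the example: gap characters are kept inside kmers as-is, and sites too close to the end of the region to hold a full kmer are skipped.

```python
from collections import Counter


def per_site_kmers(sequences, k, start=1, end=None):
    """Return {site: Counter of kmers starting at that site} within start..end."""
    if end is None:
        end = len(sequences[0])
    result = {}
    # Only sites where a full kmer of length k fits inside the region.
    for site in range(start - 1, end - k + 1):
        result[site + 1] = Counter(seq[site:site + k] for seq in sequences)
    return result


kmers = per_site_kmers(["ACGT", "ACGA", "TCGA"], k=2)
```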

(x) Global Kmers
Summary: saves global kmer frequencies.
Similar to (ix) but kmers are identified across the whole alignment in a site independent manner.

(xi) Variant Frequencies
Summary: saves variant frequencies.
Across a user-defined region, all sequences are condensed into unique forms and the frequency of occurrence of each is recorded. The title of each variant follows this example: >VARIANT_1_FREQUENCY_32, which means that there were 32 identical sequences across the user-specified region. Variants are sorted in descending order of frequency. The sequence corresponding to each variant is written in fasta format. Once all variants have been listed in this manner, the next part of the file lists the sequence titles that were associated with each one. For example, for the above >VARIANT_1_FREQUENCY_32 there would be 32 titles listed, each belonging to one of the sequences that were identical in the specified region. Finally, below these titles the variant sequences are repeated, but this time any character that is identical to the most frequent variant at the top of the list is replaced with a ‘|’ character. These ‘|’ characters can easily be removed and replaced with a blank space. When the text is printed in an equal-width font such as Courier, the individual characters that differ from the most common variant are then easily identifiable by eye.
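The condensing and masking steps can be sketched as follows. The function names are hypothetical. Leaving the most frequent variant unmasked and keeping first-seen order among variants of equal frequency are assumptions made for the example.

```python
from collections import defaultdict


def collapse_variants(titles, sequences, start=1, end=None):
    """Return [(header, variant sequence, member titles)], most frequent first."""
    if end is None:
        end = len(sequences[0])
    groups = defaultdict(list)
    for title, seq in zip(titles, sequences):
        groups[seq[start - 1:end]].append(title)
    # Stable sort: variants of equal frequency keep first-seen order (assumption).
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(">VARIANT_%d_FREQUENCY_%d" % (i, len(members)), seq, members)
            for i, (seq, members) in enumerate(ranked, 1)]


def mask_against_top(variants):
    """Replace characters identical to the top variant with '|'."""
    top = variants[0][1]
    masked = [top]  # assumption: the top variant itself is left unmasked
    for _, seq, _ in variants[1:]:
        masked.append("".join("|" if c == t else c for c, t in zip(seq, top)))
    return masked


variants = collapse_variants(["s1", "s2", "s3", "s4"], ["ACGT", "ACGT", "ACGA", "ACGT"])
masked = mask_against_top(variants)
```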

The last two top-level menu items provide information “about” the software and a pointer to this help document, as well as an area to place developed plug-ins. So far, as a demonstration within the plug-in section, we have implemented a method titled “variant scan”.

In brief, this method allows the user to select two groups from the alignment using two different files containing the titles of the sequences within each group. Other top-level menu options can aid in this, such as “Save Titles” in the “Title Search” menu, where titles containing specific tags can be exported. Once the two groups have been defined, the user selects a minimum residue frequency for the first group (A) and a threshold frequency for the second group (B). All sites where a residue is above the minimum residue frequency in (A) but the same residue is below the threshold frequency in (B) will be written out along with the residue frequencies. This can be used, for example, to identify pre-defined shifts in residue frequencies between years within a viral population.
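The comparison performed by “variant scan” can be sketched as below. This is a hedged illustration only: the function name variant_scan, its parameters and the returned tuple layout are hypothetical, and both comparisons are taken to be strict, following the words “above” and “below”.

```python
from collections import Counter


def variant_scan(group_a, group_b, min_freq_a, threshold_b):
    """Return [(site, residue, freq in A, freq in B)] for residues common in A, rare in B."""
    hits = []
    for site in range(len(group_a[0])):
        counts_a = Counter(seq[site] for seq in group_a)
        counts_b = Counter(seq[site] for seq in group_b)
        for residue, n in sorted(counts_a.items()):
            freq_a = n / len(group_a)
            freq_b = counts_b[residue] / len(group_b)  # 0.0 if absent from B
            # Strict comparisons: above the minimum in A, below the threshold in B.
            if freq_a > min_freq_a and freq_b < threshold_b:
                hits.append((site + 1, residue, freq_a, freq_b))
    return hits


group_a = ["ACGT", "ACGT", "ACGA"]
group_b = ["ACGA", "ACGA", "TCGA"]
hits = variant_scan(group_a, group_b, min_freq_a=0.5, threshold_b=0.2)
```

In this example only site 4 is reported: T is present in two of the three sequences in group A and absent from group B.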

Methods we are planning for the near future include position-specific scoring tests involving any specified region of the alignment, as well as drug resistance tests, where the user can input a file of known drug resistance mutations for a specific virus for which they have aligned sequences, and the plug-in will detect which resistance mutations are present.

