Damian Gessler


This example is a walk-through that was fully functional as of the last iPlant/Cartogratree funding cycle. The SSWAP server is still servicing requests, but changes to the underlying web services and sites mean that some steps may no longer work or may return different page views.

This page documents how SSWAP semantically integrates data and services across sites.

From TreeGenes to High Performance Computing

TreeGenes provisions a public repository for forest tree genetic data recorded from high-throughput genomics projects and has maintained valuable legacy Sanger data for over 15 years. The database provides a foundation for sequence data including genotypes, annotations, sequence assemblies, and genetic maps. TreeGenes houses information for over 1,200 forest tree species, 900,000 sequences, 24,000,000 genotypes, and 19,000 phenotypes.

XSEDE is the National Science Foundation funded Extreme Science and Engineering Discovery Environment.

Here, we show how we start with IDs of contiguous lengths of DNA sequences ("contig" IDs) associated with "copper ion binding" from TreeGenes, gather data from those contigs back to their underlying tree samples, view and select on the samples geospatially, refine a selection of their genomes to those areas functionally associated with copper ion binding, then align those sequences to view their evolutionary relatedness.

For best results, use a high-speed broadband connection and a modern browser. You will need JavaScript enabled.

1. Visit http://dendrome.ucdavis.edu/DiversiTree (right-click and select from the menu to open in a new window).

2. In the left column, open +Contig, search on "copper ion binding" as a GO (Gene Ontology) Term, and press Display. Select a set of records.

3. Press sswap.info. The data is packaged, sent to sswap.info, and a semantic pipeline is created. Your data appears as a TreeGenes Contig icon, and services that can process that data are displayed. Service selection is done by a reasoner that examines the incoming data, examines a knowledge base of services, and presents you with services that operate on that data.
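The service selection described in step 3 can be pictured with a small sketch. This is a toy illustration only, not SSWAP's reasoner: the data type names, the type hierarchy, and the `Service` class are all assumptions made for the example. It shows the general idea of keeping only those services whose accepted input types match the incoming data or one of its more general types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    accepts: frozenset  # data types the service can take as input


# Hypothetical type hierarchy (child -> parent); not SSWAP's real ontology.
SUPERTYPE = {"TreeGenesContig": "Contig", "Contig": "Sequence"}


def with_supertypes(data_type):
    """Return the data type followed by each of its supertypes."""
    chain = [data_type]
    while chain[-1] in SUPERTYPE:
        chain.append(SUPERTYPE[chain[-1]])
    return chain


def candidate_services(data_type, knowledge_base):
    """Keep only services that accept the type or one of its supertypes."""
    types = set(with_supertypes(data_type))
    return [s.name for s in knowledge_base if types & s.accepts]


# Service names come from the walk-through; their input types are assumed.
KNOWLEDGE_BASE = [
    Service("TreeGenes Tree Sample", frozenset({"Contig"})),
    Service("TreeViz", frozenset({"PhylogeneticTree"})),
]

print(candidate_services("TreeGenesContig", KNOWLEDGE_BASE))
```

With these assumed types, only TreeGenes Tree Sample is offered for contig data; TreeViz is filtered out because nothing in the contig's type chain matches its input.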

4. Drag-n-drop the TreeGenes Tree Sample service into the pipeline. Press Play. Be patient; retrieving the data from the Web service may take a few minutes.

5. The data is sent to the TreeGenes Tree Sample service, where the contig IDs are mapped onto their originating tree samples. Using the Output Data Set menu, you may view the data: double-click the icon, or select View from the pull-down menu.

6. Browse the data in the Data tree. The Data tree shows how the primary data objects (e.g., urn:treeGenes:treesample:id:01011) are related to other data objects (e.g., http://purl.org/taxa/ncbi/3352) by ontological properties (e.g., taxa:hasTaxa). The columns show values for properties of the various data objects.
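One way to think about the Data tree in step 6 is as a set of subject, property, object statements grouped by subject. The sketch below is an assumption about the shape of the data, not the viewer's actual implementation. The first statement uses the identifiers quoted in step 6; the second is a hypothetical placeholder that only shows how a second column would appear.

```python
from collections import defaultdict

# The first triple uses the identifiers quoted in step 6.
# The second is a hypothetical placeholder added only to show a second column.
TRIPLES = [
    ("urn:treeGenes:treesample:id:01011", "taxa:hasTaxa",
     "http://purl.org/taxa/ncbi/3352"),
    ("urn:treeGenes:treesample:id:01011", "example:label", "placeholder value"),
]


def build_data_tree(triples):
    """Group (subject, property, object) triples by subject, then property."""
    tree = defaultdict(lambda: defaultdict(list))
    for subject, prop, obj in triples:
        tree[subject][prop].append(obj)
    return tree


tree = build_data_tree(TRIPLES)
for subject, properties in tree.items():
    print(subject)
    for prop, values in properties.items():
        print(f"  {prop}: {', '.join(values)}")
```

Each subject corresponds to a row in the tree and each property to a column, which is roughly how the viewer relates primary data objects to other objects.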

7. SSWAP allows service providers to associate viewers with their services. Click Open Viewer to launch the data into the CartograTree viewer. If a viewer is set for the service, you may also launch it directly from the Output Data Set of the pipeline by clicking on the Display Data link.

You can jump to this step directly by clicking this link: CartograTree RESTful data (right-click to open in a new window).

8. Zoom in to the Western United States. Using either individual tree selection or polygon selection, choose the two trees in California and Mexico. The selected trees will appear in the table. Select All.

9. From the pull-down menu at the right of the table, choose Get Common Amplicon and press Run. This invokes an algorithm that scans across the selected tree samples and finds amplified segments of DNA (sequences) common across the selected samples (and possibly other samples too, because an amplicon may be common to many tree samples). Homologous sequences can be aligned, and thus used to build phylogenetic trees. You can choose amplicons based on their Gene Ontology classifications or other criteria. In this example, we filter on the column 'Amplicon ID' = UMN_CL228Contig1_03. Select the amplicon. Choose Send to sswap.info from the menu and press Proceed. Confirm on the next dialog and press the sswap.info button. The data is sent to sswap.info for Web Discovery.
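The walk-through does not describe how Get Common Amplicon is implemented. As a minimal sketch of the idea, assuming each tree sample simply maps to a list of amplicon IDs, the common amplicons are the intersection of those lists. The sample IDs and all amplicon IDs other than UMN_CL228Contig1_03 are hypothetical.

```python
def common_amplicons(amplicons_by_sample, selected_samples):
    """Return the amplicon IDs present in every selected sample."""
    sets = [set(amplicons_by_sample[s]) for s in selected_samples]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical sample IDs and amplicon lists; only UMN_CL228Contig1_03
# appears in the walk-through.
AMPLICONS_BY_SAMPLE = {
    "sample_california": ["UMN_CL228Contig1_03", "hypothetical_amplicon_A"],
    "sample_mexico": ["UMN_CL228Contig1_03", "hypothetical_amplicon_B"],
}

shared = common_amplicons(
    AMPLICONS_BY_SAMPLE, ["sample_california", "sample_mexico"]
)
print(sorted(shared))
```

Filtering the result on a single amplicon ID, as the step does with the 'Amplicon ID' column, would then be an ordinary membership test on the returned set.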

10. For Web Discovery the data is bundled, and sent to sswap.info. You are presented with a new pipeline, and with services that can operate on the data. On the backend, a reasoner has examined the data and presented to you those services, and only those services, that can operate on that data.

11. Let's get the FASTA sequences associated with the amplicons that are common to the two tree samples. Drag-n-drop the TreeGenes MultiFasta service to retrieve the sequences. Press Play.

12. When the pipeline finishes, we get direct access to the sequence data. Select the output MyData icon, choose View from the Output Data Set menu, and click the link in the Data tree table. The data is RESTfully served.
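If you save the served sequence data locally, a multi-FASTA file is simple to read. The sketch below is a minimal parser, assuming the standard FASTA layout of a `>` header line followed by one or more sequence lines. The record names and sequences in the sample are made up for illustration and are not TreeGenes data.

```python
def parse_fasta(text):
    """Parse multi-FASTA text into a dict of {record ID: sequence}."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up records, for illustration only.
SAMPLE = """>example_seq_1 first made-up record
ACGTACGT
ACGT
>example_seq_2
TTGACA
"""

for name, sequence in parse_fasta(SAMPLE).items():
    print(name, len(sequence))
```

The record ID is taken as the first whitespace-delimited token of the header line, which is a common convention but still an assumption about how the served headers are formatted.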

13. We are now going to invoke High Performance Computing (HPC) resources at the Texas Advanced Computing Center (TACC). With the retrieved sequences, let's align them, build a phylogenetic tree, and visualize it. The sequences are at TreeGenes in California (from the preceding step), the alignment and phylogenetic tree building services are on Lonestar, TACC's 22,656-core, 44 TB HPC cluster, and the visualization service is at the University of Arizona. You will just drag-n-drop services to build the pipeline; the platform will take care of everything else.

HPC resources are free for you to use for research, but they are not unlimited, so you must first register and log in.
Select Login at the top right, register for an account if needed, and log in.

Drag-n-drop the following services. As each service is added, a reasoner runs on-demand to find those services that can operate at the cursor position.

  1. Select the output MyData icon in the pipeline. Choose 'Create new pipeline with this data' from the Output Data Set menu. (Alternatively, you can append to this pipeline and rerun it from the beginning. To do so, select the output MyData icon and begin to drag it down. A trash can will appear; put the data in the trash). Either way, new services will now appear that can operate at this step.
  2. Add MUSCLE to align the sequences
  3. Add FastTree to build a phylogenetic tree. Notice the gear on the FastTree icon. This means it has parameters that can be set.
    1. Double-click the icon (or choose 'Set parameters' from the FastTree service pull-down menu).
    2. SSWAP supports third-party custom web interfaces for services; choose 'Nucleotide' as the Sequence Type and submit.
    3. If you now select 'Edit configuration' from the FastTree service pull-down menu and open the 'optional parameters' line, you will see that the -nt flag has been set in the 'arguments' parameter string.
    4. Press Cancel in the 'Edit configuration' dialog box.
  4. Add TreeViz to visualize the tree
  5. Double-click the pipeline title (or use the menu) and give it a memorable name such as "Amplicon UMN_CL228Contig1_03 alignment, phylogenetic tree, and visualization"
  • You can also add services by selecting the service icon and choosing 'Add to pipeline' from the menu.
  • You can move the insertion cursor within the pipeline; at each move, the reasoner will recalculate candidate services based on the up- and down-stream services.
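The FastTree parameter step above amounts to translating a form choice into an arguments string. The sketch below illustrates that mapping under stated assumptions: 'Nucleotide' is the choice named in the walk-through, while the 'Protein' label, the `aligned.fasta` file name, and the helper function are hypothetical. FastTree treats its input as protein unless `-nt` is given, which is why the other choice maps to an empty string. How SSWAP actually assembles the command on the cluster is not described here.

```python
import shlex

# "Nucleotide" is the form choice named in the walk-through; "Protein" is an
# assumed label for the alternative. FastTree reads protein alignments unless
# the -nt flag is supplied.
ARGUMENTS_BY_SEQUENCE_TYPE = {"Nucleotide": "-nt", "Protein": ""}


def fasttree_command(sequence_type, alignment_path):
    """Build an argument list for a local FastTree run (hypothetical helper)."""
    arguments = ARGUMENTS_BY_SEQUENCE_TYPE[sequence_type]
    return ["FastTree", *shlex.split(arguments), alignment_path]


# aligned.fasta is a placeholder name for the alignment produced upstream.
print(fasttree_command("Nucleotide", "aligned.fasta"))
```

Keeping the flag in a plain arguments string mirrors what the 'Edit configuration' dialog shows, so a custom form and a hand-edited configuration end up in the same place.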

14. What if you wanted to align sequences and build a tree but did not know how? Wouldn't it be useful if someone who did know could do it and then publish their pipeline? You can do that: you can publish any pipeline. Publishing retains parameter settings (like the -nt setting for FastTree nucleotide sequences above), so users do not need to be familiar with the minutiae of service settings. When running a published pipeline, users always run their own copy, with their own data, so they can independently reset parameters as they choose, or even remove some services and add others (for example, to change a nucleotide sequence pipeline into an amino acid sequence pipeline). We've published the above pipeline, MUSCLE -> FastTree -> TreeViz, with the -nt parameter set for nucleotides.

Semantic reasoning occurs over pipelines (their input and their final output) just as for services, so you can drag-n-drop the Phylogenetic Tree Building – nucleotide (MUSCLE -> FastTree -> TreeViz) pipeline icon and add it in one step. You will also discover, and can use, other pipelines, such as pipelines using MAFFT or ClustalW2 for multiple sequence alignment. You will not see these pipelines as search results if you have already built the pipeline manually as above, because the reasoner knows that the output of TreeViz does not feed into the input of the Phylogenetic Tree Building pipeline. To see them, make a copy of your pipeline (use the pull-down menu next to its title and check Retain input and output data), then remove the services: starting with TreeViz and proceeding in reverse order, click and drag down on each service icon, and a trash can will automatically appear. You will then find the Phylogenetic Tree Building pipeline as a candidate for placement after TreeGenes MultiFasta. When a pipeline is a component of a super-pipeline, you can double-click its icon to open a window to examine, change, and even independently run the subpipeline.
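The idea that a published pipeline is matched by its input and final output, just like a single service, can be sketched as follows. This is a simplified model with single-valued, exact-match types; the type names are hypothetical and the real reasoner works over ontologies rather than string equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    consumes: str  # hypothetical input type
    produces: str  # hypothetical output type


def compose(name, steps):
    """Collapse a chain of steps into one step, checking each hand-off."""
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.produces != downstream.consumes:
            raise ValueError(f"{upstream.name} cannot feed {downstream.name}")
    return Step(name, steps[0].consumes, steps[-1].produces)


def can_follow(upstream, candidate):
    """A candidate fits if it consumes what the upstream step produces."""
    return upstream.produces == candidate.consumes


multifasta = Step("TreeGenes MultiFasta", "Amplicons", "Sequences")
muscle = Step("MUSCLE", "Sequences", "Alignment")
fasttree = Step("FastTree", "Alignment", "Tree")
treeviz = Step("TreeViz", "Tree", "TreeLink")

tree_building = compose("Phylogenetic Tree Building", [muscle, fasttree, treeviz])

print(can_follow(multifasta, tree_building))  # candidate after MultiFasta
print(can_follow(treeviz, tree_building))     # not a candidate after TreeViz
```

Under these assumptions the composed pipeline is offered after TreeGenes MultiFasta but not after TreeViz, which matches the behavior described in the paragraph above.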

15. HPC resources run quickly, but there may be a wait in the queue, and actual wall-clock runtime could be minutes to hours. As the HPC service status changes from STAGING to PENDING to QUEUED to RUNNING to ARCHIVING_FINISHED (with other states in between), you can monitor the progress in the service details panel.
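The status progression in step 15 is a typical pattern for long-running jobs: poll until a terminal state is reached. The sketch below assumes a caller-supplied `get_status` function, which is hypothetical; the walk-through exposes status only through the service details panel and does not document a status API. The state names are the ones listed above, and unknown intermediate states are tolerated because other states may occur in between.

```python
import time

TERMINAL_STATES = {"ARCHIVING_FINISHED"}


def wait_for_completion(get_status, poll_seconds=30.0, max_polls=1000,
                        sleep=time.sleep):
    """Poll get_status() until a terminal state; return the states seen."""
    seen = []
    for _ in range(max_polls):
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record each change of state once
        if status in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)
    raise TimeoutError(f"no terminal state after {max_polls} polls")


# Simulated job for demonstration; a real caller would query the service.
simulated = iter(["STAGING", "PENDING", "PENDING", "QUEUED", "RUNNING",
                  "ARCHIVING_FINISHED"])
history = wait_for_completion(lambda: next(simulated), sleep=lambda _: None)
print(" -> ".join(history))
```

Passing `sleep` as a parameter keeps the demonstration instant while leaving the default behavior suitable for a queue wait of minutes to hours.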

Bookmark the pipeline in your browser. Close your browser and return at any time. Load the page and you will see the current state. If you forget to bookmark a pipeline, no problem: use your browser history, or just log in and search on user:login (with your login name) to retrieve all your pipelines, published and unpublished. You can also use user:login to search on any user name, but for users other than yourself you will discover only their published pipelines.

When finished, your browser should look like this:

Stuff happens
In the real world, stuff happens: arbitrary sequence data may fail to align; network conditions may lose connections. After all, this is a web of interconnected, independent data and services. SSWAP is robust to these situations and reports error conditions based on what it gets from the underlying services. If your pipeline does not run to completion, examine it, or sometimes just run it again; you can't break anything.
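The advice to "just run it again" is the familiar retry pattern for transient failures. The sketch below shows the general idea under stated assumptions: `run_pipeline` is a hypothetical callable standing in for whatever launches a run, and failures are modeled as `RuntimeError`. It is not part of SSWAP, where rerunning is done from the browser.

```python
def run_with_retries(run_pipeline, attempts=3):
    """Call run_pipeline() up to `attempts` times; return (result, errors)."""
    errors = []
    for _ in range(attempts):
        try:
            return run_pipeline(), errors
        except RuntimeError as exc:
            errors.append(str(exc))  # keep the report from each failed run
    raise RuntimeError(f"failed after {attempts} attempts: {errors}")


# Simulated flaky run: fails once with a connection error, then succeeds.
outcomes = iter([RuntimeError("connection lost"), "tree ready"])


def flaky_run():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result, errors = run_with_retries(flaky_run)
print(result, errors)
```

Collecting the error messages rather than discarding them reflects the point above: the reports from the underlying services are what you examine when a run does not complete.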

16. The TreeViz visualization service accepts tree data and builds a tree offline, asynchronously. It then returns a link to the tree for visualization. The platform handles these types of services too. Select the MyData output data icon, and click on the Display Data link to view the tree in a new window. Like CartograTree, this shows how SSWAP can coordinate with third-party, independently developed presentation managers and renderers.