Karl Miller
18 December 2012

GEO Profiles Search Engine for Housekeeping Genes

Abstract

The GEO Profiles Search Engine for Housekeeping Genes began as a software project in the undergraduate Bioinformatics Course taught at SUNY Oswego in Spring 2011 by Prof. Elaine Wenderholm.

The NCBI GEO Profiles database does not provide a direct way to find good housekeeping genes. Instead, biologists must make several queries, manually read through many web pages of information (easily hundreds), and analyze and sort the data by hand. This project provided a web-based tool to automate this task.

Information is retrieved from the NCBI GEO Profiles database. The retrieved data is parsed and analyzed, and a web page is constructed and served to the user. The web page contains the GEO Profiles graphics, with links to the GEO Profiles data pages for each gene expression profile.

biology-online.org defines housekeeping genes as “genes that are always expressed because they code for proteins that are constantly required by the cell, hence, they are essential to a cell and always present under any conditions. It is assumed that their expression is unaffected by experimental conditions. The proteins they code are generally involved in the basic functions necessary for the sustenance or maintenance of the cell.”

We analyze the gene expression profile data to determine which profiles tend to have the least variation. The gene profiles are then sorted so that the "best" housekeeping genes come first.

This tool implements the server-side software in Java using both the standard (SE) and enterprise (EE) environment libraries. The presentation is implemented with a combination of servlets, HTML, and AJAX.

This project has evolved beyond the course requirements. It now retrieves all results, analyzes those results outside the browser session, and notifies the user via email upon completion. User accounts are created to store and manage previous searches.

Introduction

This document assumes the reader has some familiarity with the terms and concepts of HTTP (and its request methods), Java (SE/EE), JavaScript, Perl, PHP, AJAX, ANT, JPA, Hypersonic, Hibernate, MySQL, XML, HSQL, and client-side and server-side programming.

Methods of Data Retrieval

NCBI provides eUtilities, a set of web-based services that allow easy database searching using GET requests (http://www.ncbi.nlm.nih.gov/books/NBK25501/).

The usage policy for these services
(http://www.ncbi.nlm.nih.gov/books/NBK25497/) places limits on each query
(http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4): eSearch is limited to 100,000 UIDs and eSummary is limited to 10,000.

These limits are mitigated by incrementing the index of the start of the query, as described in the documentation given above. To retrieve the data pertaining to all samples for a particular experiment, a sequence of queries must be made to identify the data:

1. An eSearch request is made to retrieve all the UIDs (through separate queries if necessary) for a particular search term.
2. An eSummary request is made to retrieve several objects from the summary results. This query returns a string of data, which is used to build an image URL that represents the sample data and a string-based ID that is used to retrieve the sample data itself.
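The pagination in step 1 can be sketched as follows. This is a minimal illustration, not the project's actual code: the `db`, `term`, `retstart`, and `retmax` parameters come from the eUtilities documentation, while the class and method names here are hypothetical, and the caller is assumed to URL-encode the search term.

```java
import java.util.ArrayList;
import java.util.List;

public class ESearchPager {
    static final String BASE =
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    // Build one eSearch URL per page of results, stepping retstart by
    // pageSize so that no single query exceeds the UID limit.
    public static List<String> pagedUrls(String term, int total, int pageSize) {
        List<String> urls = new ArrayList<>();
        for (int start = 0; start < total; start += pageSize) {
            urls.add(BASE + "?db=geoprofiles&term=" + term
                     + "&retstart=" + start + "&retmax=" + pageSize);
        }
        return urls;
    }
}
```

Each returned URL is then fetched in turn until all UIDs for the search term have been collected.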

Access to the sample data is not provided by eUtilities. Instead, we need to "scrub" the returned HTML pages. This must be done in order to:

limit the amount of data sent from the GEO Profiles database to our own server; and,
limit the number of cycles needed to parse the document.

The first implementation accomplished this simply by using a CGI call within a detailed data page for an experiment. The CGI retrieved only the data pertaining to the samples themselves.

Since then the NCBI interface has changed so that we must grab the initial page itself. This page is not significantly larger than the original CGI response, but we now must retrieve data twice the size of the original implementation's, several thousand times over. This adds considerable overhead.

Once we retrieve the sample data we calculate the standard deviation and the median absolute deviation. These values are used to sort the data.
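The two dispersion measures can be sketched as below; the class and method names are illustrative, not taken from the project source.

```java
import java.util.Arrays;

public class Deviations {

    // Population standard deviation of the sample values.
    public static double standardDeviation(double[] values) {
        double mean = 0.0;
        for (double v : values) mean += v;
        mean /= values.length;
        double sumSq = 0.0;
        for (double v : values) sumSq += (v - mean) * (v - mean);
        return Math.sqrt(sumSq / values.length);
    }

    // Median absolute deviation: the median of |x_i - median(x)|.
    public static double medianAbsoluteDeviation(double[] values) {
        double med = median(values);
        double[] devs = new double[values.length];
        for (int i = 0; i < values.length; i++) devs[i] = Math.abs(values[i] - med);
        return median(devs);
    }

    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2]
                            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

Profiles whose samples produce small values for both measures are the strongest housekeeping-gene candidates.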

Data Handling

JavaEE was chosen for implementation because of its efficiency at managing web applications and because of my familiarity with JavaSE.

Client-side presentation simply involved choosing a popular client-side scripting language; I chose JavaScript. I chose JBoss version 6.0.0.Final as the application server (AS), again because of my familiarity with the software.

I chose not to use an integrated development environment (IDE). IDEs such as NetBeans and Eclipse tend to generate the deployment descriptors in ways not conducive to learning. Also, at the time of choosing these technologies, there was no support for JBoss version 6 in Eclipse (at the time the highest supported version was 5). It was in my best interest to use the latest version of JBoss and attempt to write the deployment descriptors by hand. Therefore, I built and deployed the web service using ANT scripts.

The Web Service Tool

The user must perform a two-step process to obtain gene data using this Web Service tool.
1. The user constructs a query for the gene of interest using standard NCBI search rules.
2. The user provides an email address. The web service notifies the user via email when the query results are available. The user logs in with their email address to obtain the results.
This setup follows the model-view-controller (MVC) architecture.

There are four HTML pages, each with a client-side JavaScript. All pages issue POST requests to server-side servlets.

The server attempts to determine whether there is an active session for a particular browser and/or user currently logged into the server. If so, the server redirects the user away from these four HTML pages.

Following the request from the client's browser, the server receives different requests via servlets. All servlets return XML documents. Clients that do not have an active session receive an XML document containing only a <noSession>...</noSession> node, with a message describing the error as its value.

Similarly, most exceptions are propagated up to the web tier by printing the exception to a log file and either propagating the exception itself or returning only its message as a String to the previous stack frame. It is eventually sent to the client as an <error>...</error> node with the error message itself as the value. If the exception is fatal, it is caught or thrown in such a way as to interrupt that particular request.
The intent is to give the client some limited exposure to why their request failed. Most exceptions returned this way have descriptive messages such as "You cannot perform X while not logged in." The servlets related to functionality that a logged-in user would perform check whether the session is active before proceeding, and handle the session (or lack thereof) accordingly.
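A servlet might wrap such a message in the <error>...</error> node along these lines. This is a sketch only: the helper class and the escaping routine are assumptions, not the project's actual code.

```java
public class ErrorXml {

    // Wrap an error message in the <error> node returned to the client.
    public static String errorNode(String message) {
        return "<error>" + escapeXml(message) + "</error>";
    }

    // Minimal XML escaping so the message cannot break the document.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```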

Following the web tier is the business tier. The entry point to this tier is always a UserBean. The UserBean is a stateful EJB with an internally managed state machine. The UserBean itself is a shell that attaches to a session and provides calls to functions inside the internal state machine.

There are two states in the state machine. In the initial state the user may create an account or log in. In the second (logged-in) state the user may create a query, return all the search terms that user has ever used, and return the results of a particular query. Any attempt at an action not allowed in the user's current state throws an IllegalStateException, and the error is reported as described above. We use several JNDI lookups inside the state machine to handle calls to other EJBs.
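The two-state machine could look roughly like this. The class, state, and method names are illustrative stand-ins for the project's internals; only the IllegalStateException behaviour mirrors the description above.

```java
public class UserState {
    enum State { INITIAL, LOGGED_IN }

    private State state = State.INITIAL;

    // Allowed only in the initial state.
    public void logIn(String email) {
        require(State.INITIAL);
        // ... JNDI lookups and credential checks would go here ...
        state = State.LOGGED_IN;
    }

    // Allowed only once logged in.
    public void createQuery(String term) {
        require(State.LOGGED_IN);
        // ... hand the query off to the search machinery ...
    }

    // Reject any action not permitted in the current state.
    private void require(State expected) {
        if (state != expected)
            throw new IllegalStateException("Action not allowed in state " + state);
    }
}
```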

Upon receiving a request for a particular search term and determining that the data does not currently exist on the server, the server makes the initial setup for the query to the GEO Profiles database, spins off a new Thread to perform the query, and immediately notifies the user with a response. This new Thread is necessary since many queries can take hours to complete.
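Detaching the query onto its own Thread can be sketched as below; the launcher class is a hypothetical stand-in (the project's worker is csc457.nonee.SearchRunnable), and the daemon flag is an assumption about shutdown behaviour, not something stated in the text.

```java
public class QueryLauncher {

    // Run the long query on its own Thread so the servlet can respond
    // to the client immediately instead of blocking for hours.
    public static Thread launch(Runnable searchRunnable) {
        Thread worker = new Thread(searchRunnable);
        worker.setDaemon(true); // assumption: don't block AS shutdown
        worker.start();
        return worker; // caller returns an HTTP response without joining
    }
}
```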

When a query is completed, the server sends off an email to the address supplied by the user during account creation to notify the user that their query has been completed.

While the query is still executing, the data retrieval happens as described above using the ENTREZ documentation. After the data is retrieved, the deviations are calculated. The calculated values are persisted using the same IDs as GEO Profiles, with only the deviations stored. All of the sample data is discarded because the overhead of maintaining that information is too large, and persisting it to our own datasource is unnecessary once we have computed our deviations: the sample data for an experiment will not change in NCBI GEO Profiles.

Two many-to-many relationships are maintained after retrieving this data. The first relationship is between the data itself and a search term, since a set of data may have many different queries associated with it (the sample data is not pairwise disjoint from all other search terms) and a particular search term may have many different experiments associated with it.

The second relationship is between a user and a search term. A user may have many different queries, and a particular query may have been requested by many different users. This many-to-many relationship is maintained in an attempt to avoid redundant searches.
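The user-to-search-term relationship could be mapped with JPA roughly as follows. The entity and field names are assumptions about the schema, not the project's actual classes, and each entity would normally live in its own file.

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.ManyToMany;
import java.util.Set;

@Entity
class AppUser {
    @Id String email;

    // A user may have many search terms...
    @ManyToMany
    Set<SearchTerm> searchTerms;
}

@Entity
class SearchTerm {
    @Id String term;

    // ...and a search term may belong to many users, so the same query
    // is stored once and shared rather than re-run for each user.
    @ManyToMany(mappedBy = "searchTerms")
    Set<AppUser> users;
}
```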

Following a completed query, a user is permitted to retrieve results from any query that they have completed. The data is retrieved using simple queries to the datasource constructed using the Java Persistence API (JPA). The data that is returned is for a particular query, provided that the user has made that query themselves. This shouldn't be an issue for most users, as they will access the service through the web interface, which only displays queries they have made previously.

Since we are looking for data that represents good housekeeping genes, the results are displayed in ascending order according to the deviation of choice.

Installation

A large benefit of using JBoss is that it is basically runnable "out of the box". Additional configuration is necessary only for things such as switching to an alternate datasource away from Hypersonic.

The version I used here was JBoss 6.0.0.Final, and by default JBoss binds to port 8080.

To change this port, from the base JBoss directory edit the ./server/default/deploy/jbossweb.sar file. In the tag that specifies the connector with the protocol attribute "HTTP/1.1", change the port attribute to whatever port you desire.
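The connector element in question looks roughly like this; attribute values other than protocol and port may differ in your install, so treat this as a sketch rather than a drop-in replacement:

```xml
<Connector protocol="HTTP/1.1" port="8080"
           address="${jboss.bind.address}"
           connectionTimeout="20000"
           redirectPort="8443" />
```

Changing port="8080" to the desired value and restarting the AS is all that is required.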

The other immediate concern for JBoss is the administration console. It is available at your base URL, without supplying further direction to your web service. The default user name and password are 'admin' and 'admin'. These values can be changed in "./server/default/conf/props/jmx-console-users.properties".

When starting the AS itself, use run.sh (assuming, for example, a Linux environment) with the option -b 0.0.0.0 to run the AS on all interfaces.

The server also depends on a mailer daemon running on the server. The current EJB that sends notification emails depends on the mailer daemon being on the local machine, as it spoofs the sender address to ensure the receiver knows the address is not real (noreply@server.com). This is only possible if the email originates from the same machine the mailer daemon sits on.

From here you should be able to just start up and run your web service. If you decide to use an IDE instead of ANT and a text editor, you must import all the classes and make all the modifications necessary to allow NetBeans to make a successful deployment.

If you decide to use an IDE but would like to keep the current deployment descriptors, using the IDE merely for code completion and compile-on-the-fly functionality, then all that needs to be done is to tell NetBeans not to use its own build.xml files. This is done by creating a project and merging the newly created NetBeans project directory with the existing project directory. NetBeans automatically generates a build.xml file inside the base directory of your project upon project creation, and dynamically updates a build-impl.xml file inside the nbproject directory. The build.xml file in the base directory tells NetBeans to use the build-impl.xml file, so by overwriting the build.xml file in the base directory with our own, NetBeans now does what is directed in our XML file. The disadvantage is that the content of the file is no longer dynamically generated, so any modifications to the AS location, or sub-package additions or removals, will need to be manually updated within the build.xml file.

Fortunately, the build.xml file I created is much simpler than the one NetBeans creates by default, so it should be fairly easy to make those additions. To be clear, aside from my supporting functions for NetBeans, my ANT script builds everything in the build directory, with all the class files placed in the build/src directory, the war file created in the war directory, and so on, until the final contents of the ear directory are placed into the ear file, which is then placed in the root project directory and in the deploy folder of the AS specified in the build.xml file.

There are some fields within the Java classes that need to be modified. They were used during development to quickly find directories or URLs. They are:

1. csc457.beans.SendEmailBean:39 – the email address used in the "from" field. This address is spoofed, assuming that the mailer daemon is on the same machine as the AS.

2. csc457.nonee.SearchRunnable:71 – the custom email message meant to notify the user that the query they asked for has been completed.

3. Any call to the EntityManagerBean or EntityManager will require modifications if the tables are changed in any way; for example, line 20 of csc457.nonee.UserBeanState.

4. csc457.servlet.ReceiveRequestServlet:(28, 93) – a static subject and static text representing the body of the email message intended to remind a user that they started a query, respectively.

A simple fix for all of these statics is to include an external configuration file that the web service references.

From Here

Security. None of the session data or passwords are encrypted; everything sent between the server and client is sent in plain text. The information sent between client and server should be encrypted over Secure Sockets Layer (SSL), and the datasource itself should be secured.

The traffic directed at the GEO Profiles databases. There are explicit policies that regulate how much traffic may be directed at the servers, and this web service does not abide by those policies. This was not a concern during development; however, if this web service is deployed for public use and sees a decent amount of traffic, it is likely that NCBI will blacklist the address of whoever is hosting the web service. This concern is therefore paramount. Please see http://www.ncbi.nlm.nih.gov/books/NBK25497/ for more information.

The front-end AJAX. There are a few display bugs from improperly formed HTML that prevent some text from being sized or colored properly. Beyond that, the interface is very plain, and could use more aesthetic colors and graphics, as well as some text explaining the service and its functions. This would provide a more pleasing environment for the user.

Coding. There are too many static variables throughout the code. This is an easy fix: just reference an external configuration file in which these values can be changed at any point without a re-build and re-deploy of the web service. There are TODO statements all over the project. By searching for these, you will see several smaller issues that need addressing, such as determining the value of a particular object or the need to modify a method.

Conclusions

This has been an incredible learning experience for me. On a personal note, I have come to realize that JavaEE is where I would like to direct my career. Throughout the duration of this project I have learned much more about JavaEE than most students have an opportunity to be exposed to. JavaEE is a very complicated set of libraries that, while well documented, does not have as many resources to learn from as JavaSE does. As an example, there are many more tutorials on the Internet about how arrays work than about why a NoSuchEJBException might be thrown in the context of retrieving an EJB from an HttpSession object.

Having as much free maneuvering room to build this web service as I saw fit, and to test and play with different configurations, classes, and implementations, was invaluable experience that I don't foresee having again. This project gave me very useful exposure to the transmission of data between a server and client, and to what is most efficient. Overall, this was a positive experience, and exposure to another field in the sciences helped me get a feel for what I can expect in my future career.