Jillion Git

Brought to you by: bishbrian, dkatzel, jchriste-jcvi, pamedeo

Tree [1a6059] master /

History

HTTPS access

File	Date	Author	Commit
.settings	2017-07-07	Dan Katzel	[ee0ac9] added missing LGPL copyright headersfixed eclip...
doc	2013-01-15	Dan Katzel	[6964ac] moved primer and align packages to new jillion ...
examples	2017-07-07	Dan Katzel	[ee0ac9] added missing LGPL copyright headersfixed eclip...
src	2017-07-09	Dan Katzel	[1a6059] Added more unit tests fixed bug in FastqWriterA...
test	2017-07-09	Dan Katzel	[1a6059] Added more unit tests fixed bug in FastqWriterA...
.classpath	2017-07-07	Dan Katzel	[9ff3ac] removed deleted folders from eclipse classpath
.project	2017-07-07	Dan Katzel	[ee0ac9] added missing LGPL copyright headersfixed eclip...
COPYING.txt	2015-09-08	Dan Katzel	[5eeb54] updated copyright to include year 2015
README.md	2016-08-29	Dan Katzel	[90162d] updated change log
README.txt	2015-10-26	Dan Katzel	[dc4c02] updated README with Maven instructions and java 8
change_log.txt	2017-07-09	Dan Katzel	[1a6059] Added more unit tests fixed bug in FastqWriterA...
checkstyle_checks.modified.xml	2017-07-03	James Christensen	[31bb06] Remove ant build and associated files. Move che...
jillion-pmd-ruleset.xml	2013-01-10	Dan Katzel	[ab11ca] renamed pmd rulesset
license.txt	2015-09-19	Dan Katzel	[881ec0] updated Jillion 5 to use LGPL 2.1 instead of GPL 3
pom.xml	2017-07-07	Dan Katzel	[9ff3ac] removed deleted folders from eclipse classpath

Read Me

Jillion 5 requires Java 8 and uses new Java 8 language features such as default methods, lambda expressions and the new Stream and Collector API.
if you are still stuck on Java 7, then you must use Jilion 4.

How to install

Using downloadable jar

Just include the downloadable jar on your classpath. Jillion does not require any other dependencies (other than the JVM)

Building from source

Jillion 5+ uses Maven to build and package a jar file. From the root folder where the pom.xml file is type on the command line

% mvn clean install

This will build jillion, run all the unit and integration tests and install it in your local repository.

Jillion is now ready to use.

5.2 Release Notes

New Features

Added new method to fastq writer to automatically trim given a Range.
This saves users the trouble of creating SequenceBuilders and trimming themselves.
Added new method to FastqRecord to get the average Quality of the quality sequence.
The default implementation calls getQualitySequence().getAvgQuality() but some implementations
use a more efficient version.
Added new QualityTrimmer SlidingWindowQualityTrimmer which acts like Trimmomatic's SLIDINGWINDOW option.
Added new convenience methods to NucleotideTrimmer and QualityTrimmer that take Builders. This is really useful
when performing multiple trimming operations in serial since some trimmers may be able to save CPU cycles
and work directly from the builders.
Added new TrimmerPipeline and TrimmerPipelineBuilder classes which can take multiple NucleotideTrimmers
and QualityTrimmers and combine the trimming results for you.
Added SamFileDataStore and SamFileDataStoreBuilder to finally provide a higher level API for
working with sam and bam files without needing to use a low level Visitor.
Added Optional<file> getFile() to FastqParser and refactored CasParser
implementations to make it easier to extend cas file parsing.</file>
Add lambda hook to CasFileTransformationService to override how FastqDataStore is generated so
users could provide their own implementation.
Added new ConsensusCollectors class that can take Streams of various sequence inputs and compute a consensus.
Added new TraceDirPhdDataStoreBuilder class that can make a PhdDataStore implementation from a folder of sanger trace files.
AbiChromatogramParser - Added support for ABI 3500 abi files.

API Changes

Added Trace.getLength()
Added default methods to Rangeable for getLength() getBegin(), getEnd() and isEmpty() since
that is used the most don't have to always build a new Range object.
Added Range.Builder intersect methods
Changed TrimmerPipeline methods to be faster by making fewer Range objects and working off of Range.Builders instead.
Added new Range.toString() methods that take lambda expressions so users can make their
own toString implementations. There are several overloaded versions:
- toString(RangeToStringFunction)
- toString(RangeToStringFunction, CoordinateSystem)
- toString(RangeAndCoordinateSystemToStringFunction)
- toString(RangeAndCoordinateSystemToStringFunction, CoordinateSystem)
to let users convert to different coordinate systems and to
include that coordinate system in the lambda expression or not.
Added toGappedRange( Range) and toUngappedRange( Range) to ResidueSequence
with default implementations and more efficient implementation when the codec
knows it doesn't have gaps. Changed AssemblyUtil to use that instead of its own implementation.
Added toUngappedRange( Range) to NucleotideSequenceBuilder
DataStoreException now extends IOException - This is a Breaking Change if you had code that
caught only DataStoreException and not IOException of had code that used a multi-catch to catch
both an IOException and a DataStore Exception will now cause a compiler error if left unchanged.
Added new StreamingIterator.empty() method

Bug Fixes

BlastParser - fixed bug in XML Blast Parser when it sometimes accidentally set percent identity to be (1 - percent identity).

5.1 Release Notes

New Features

Added new methods to FastaDataStore getSequence( id) which gets just the sequence
and is equivalent to get(id).getSequence().
Added new methods FastaDataStore.getSubSequence( id, offset) which gets just the sequence
starting from the given offset.
Added new methods to FastaDataStore.getSubSequence( id, range) which gets just the sequence
that intersects the given range.
Added support for Fasta Index Files (.fai) files to NucleotideFastaDataStore.
The NucleotideFastaFileDataStoreBuilder object can now be given an fai file
or auto-detect one and use that to make a more efficient implementation
to be used with the new getSequence() or getSubSequence() methods.
Added support for writing Fasta Index Files (.fai) files to NucleotideFastaWriter using
the createIndex(true) method. This will make an additional file named $outputFasta.fai.
Supports normal, zipped and non-redundant fasta files.
Added new class FaiNucleotideWriterBuilder that can create new Fasta Index Files (.fai) for
existing fasta files. The builder object supports fully configuration of the fai to be written
including the output path, the end of line character, and the Charset.
Improved JavaDoc
BlosumMatrices class added support for Blosum30 and 40.

API Changes

FastqFileParser.canAccept() renamed to canParse() to match the other parsers.
Created new abstract class AbstractReadCasVisitor which is now the parent class of AbstractAlignedReadCasVisitor. The new class handles iterating over the input read files to link cas alignments to their read names, sequences and qualities.
Now you can extend that class if you want that extra information without realigning to gapped references.
To Fix OSGi issues, Some classes that were in jillion.internal were moved to jillion.shared since all internal classes can't be exported by OSGI. These classes should not be considered part of the public API and should only be for internal use.
Moved FastaUtil to internal package since it should not be used outside of Jillion classes. Heavily refactored it.

Bug Fixes

PositionSequence - sanger PositionSequence.iterator(Range)
had off by 1 bug that did not include the last base in the range.
StreamingIterator - abstract class that many StreamingIterators extend to use background thread
to populate iterator has been improved to fix occasional dead lock issues if the background thread throws exceptions.

5.0 Release Notes

Jillion 5 License change.

Jillion 5 is now LGPL 2.1. Previous versions of Jillion were GPL 3.
This change follows similar bioinformatics libraries such as BioJava which should allow
users to switch their code to use Jillion instead without any worries about license issues.

Jillion 5 is now OSGI compliant module.

All classes except for those under org.jcvi.jillion.internal.* are exported.

This release notes do not cover all changes, there are too many to list.
Only some of the most important changes are listed.

For a complete list, please consult the change_log.txt file.

API changes:

Java 8 Support - Java 8 Lambda Support to many APIs. Most notably in
many of the filter() methods on various Builder objects.
Various Jillion Filter interfaces such as DataStoreFilter, ReadFilter and SliceElementFilter
now extend the new Java 8 Predicate interface which allows client code to use simple Java 8
Lambda expressions to filter their data.

For example, to make a CoverageMap object for only forward reads of a contig you can now do this:
```
    new ContigCoverageMapBuilder<>(contig)
            .filter(read -> read.getDirection() == Direction.FORWARD)
            .build()
```
Many DataStoreBuilder objects have a new filterRecords( Prediate<t> ) method to only include
only records that match the Java 8 Predicate. DataStoreFilters and Java 8 Lambda Expressions are valid input.</t>

For example, to make a NucleotideFastaDataStore where all the sequences are > 1000bp :

         new NucleotideFastaFileDataStoreBuilder(fastaFile)
                    .filterRecords(record-> record.getLength() >1000)
                    .build();

FastqFileParser and FastaFileParser will now auto-detect zip and gzipped Files and handle the
decompression for you. Previously you had to provide a decompressed InputStream which could only
be parsed once. Now the Files can be parsed multiple times.
Sam / BAM API changes - a lot of work was done to improve SAM and BAM support, some improvements
required API changes.
Changed SamVisitor API to remove visitRecord(Callback, SamRecord) which was only
called when visiting SAM files. All records for both SAM and BAM files now use
```
visitRecord(Callback callback, SamRecord record, VirtualFileOffset start, VirtualFileOffset end)
```

where the offset values are either the offsets into the BAM for bam encoded files or null for sam files.
This removes a lot of confusion and duplicated code when dealing with parsing both formats.

SamParserFactory.create(File) will now check to see if there is an indexed bam file
using the samtools naming conventions, and if so, uses an optimized
SamParser object that can use the index to randomly access reads by reference or alignment region.
SamRecord.Builder is now pulled out into its own class SamRecordBuilder
SamRecord from a class to an interface. The old SamRecord class is now package private.
All API methods in sam package now use the new SamRecord interface instead of the old class.
Created new SamAttributed interface which has the methods hasAttribute(...) and getAttribute(...)
SamRecord and SamRecordBuilder now both implement this interface.
Added additional parameter to SamAttributeValidator to add a SamAttributed instance. This will be
the source that the attribute is from. This allows new validators to be written to check other attributes
from the same source.

New Methods Added:

Java 8 Support - added new methods to several classes that return Java 8 Streams. Including:
Contig#reads() which returns a Stream<assembledread></assembledread>
StreamingIterator#toStream() which converts a Jillion StreamingIterator into a Stream<t>.
Please remember to close the Stream when done.</t>
Added new default method SamAttributeValidator#thenComparing(SamAttributeValidator other)
which returns a new SamAttributeValidator that checks both validators in a chain and only
passes if both validators pass the attribute. Uses a similar construction to the new
Java 8 Comparator.thenComparing(...) methods.
FastaRecord and FastqRecord - Added new method getLength() to both FastaRecord and FastqRecord
which returns the length of the sequence. Some implementations may use an optimized way
to compute the length instead of querying the wrapped sequence object.
Sorting Fasta and Fastq Writers - Nucleotide, Protein, Quality and Position FastqWriterBuilder
and FastaWriterBuilders can now sort records using a Comparator.
Both in-memory only and using temp files to sort all the records are supported. Using the temp
files to help with sorting allows the writing very large sorted output files that would
not have been able to all fit in memory.
An additional overloaded sort() method takes a File object that is the directory to create the temp files in
(default directory is System temp).
Sam/Bam Parser - Added new methods that only parse alignments for specific reference names and regions.
Added SamParser.parse(String referenceName, SamVisitor visitor) which will only visit the SamRecords
in the file that map to the given reference. Some implementations may use the bam index to quickly
seek to the part of the bam file where the alignments for those references are stored.
Added SamParser.parse(String referenceName, Range alignmentRange, SamVisitor visitor) which
will only visit the SamRecords in the file that map to the given reference. Some
implementations may use the bam index to quickly seek to the part of the bam file
where the alignments for those references are stored.
Added new helper method SamRecord.getAlignmentRange() which returns a Range that the record aligned to the read.

New Classes Added:

SplitFastaWriter and SplitFastqWriter - Added new SplitFastaWriter and SplitFastqWriter classes
which have 3 factory methods to make different Writer implementations that split up
writing records to different files using different strategies :
roundRobin(), rollover() and deconvolve() each method takes a lambda function to create
the new individual writers and deconvolve() takes a second lambda which determines which
output file the record will go to.
GenomeStatistics - New utility class for computing different statistical measurements about genomes
(for example N50). It uses the new Java 8 Collector interface.

For example to compute the N50 of all the records in a Fasta file:

try(NucleotideFastaDataStore datastore = new NucleotideFastaFileDataStoreBuilder(fastaFile)
                                                    .hint(DataStoreProviderHint.ITERATION_ONLY)
                                                    .build();

    Stream<NucleotideFastaRecord> stream = datastore.iterator().toStream();
    ){
        OptionalInt n50Value = stream
                                    .map(fasta -> fasta.getLength())
                                    .collect(GenomeStatistics.n50Collector());

        //return value is optional because there might not be any records!
        if(n50Value.isPresent()){
            System.out.println("N50 = " + n50Value.getAsInt());
        }
    }

CoverageMapCollectors - New utility class for creating Java 8 Collector objects that create
CoverageMap objects.

For example, if you had a contig and wanted a coverage map of the alignment locations of
just the forward reads capped to a max of 200x coverage the code would look like this:

     CoverageMap<Range> forwardCoverageMap200x = contig.reads()
                    .filter(read -> read.getDirection() == Direction.FORWARD)
                    .map(AssembledRead::asRange)
                    .collect(CoverageMapCollectors.toCoverageMap(200));

LucyVectorSpliceTrimmer - Performs vector splice trimming using a simplified version
of the algorithm that the TIGR program Lucy used. Takes a NucleotideSequence object as input
and returns the Range that is vector free.

Performance Improvements

Fastq File Parsing and Writing - Previous versions of Jillion had terribly slow fastq parsing and writing
that was 3-5x slower than other libraries. A lot of effort was put into Jillion 5
to make it at least as fast as similar libraries. The end result is Jillion 5 is
now just as fast or faster than other libraries such as BioJava and Picard
when parsing fastq data for the most common use cases.
When not using Mementos or DataStoreProviderHint.RANDOM_ACCESS_OPTIMIZE_MEMORY (which uses mementos)
A new faster parsing implementation is used that doesn't need to keep track of file offsets. This improves
parsing time by 400 %
Improved FastqWriting - The most common use case of parsing a fastq file and writing out the FastqRecord instances as is
to a different writer. New internal classes are now used which don't convert the encoded quality strings
into QualitySequence objects unless getQualitySequence() is called. This takes up slightly more memory
per record. This usually isn't an issue because most of the time the files are streamed as ITERATION_ONLY
so the records will be GC'ed as soon as they are out of scope in the iterator.

When tested on large 25 million read fastq files from 1000genomes project, throughput improved by more than 25%.

ITERATION_ONLY data processing improvements - when using ITERATION_ONLY,
certain expensive memory optimizations are turned off. This improves runtime performance but up slightly more memory
per record.

Bug Fixes

Generate 454 Universal Accession number did not
generate valid id if the location x,y coordinates were very small.
Bug Fix in SAM and BAM header writer which incorrectly wrote out the MD5 values of the references as "MD5"
instead of the actual md5 hash value.
Bug Fix in SAM and BAM header writer which incorrectly wrote out the URI path to the reference file to be the md5 value
instead of the actual path.
Bug Fix in BAM writer which incorrectly computed BAM bin.
Bug Fixes in BAM index writer which incorrectly computed BAM bin and intervals.
AceFileParser - more lenient Consensus Tag timestamp parsers to support CLC Workbench ace output
which doesn't follow the ace file spec regarding timestamp resolution.

Jillion installation instructions

Use Download Jar:

Down load the latest Jillion jar file and then put it in your classpath.

To build from Source:

Jillion has both a Maven POM file as well as an Apache ANT file that can both be used
to build source and test files. So use which ever is easier for you to (Maven is recommended).

Jillion 5 requires Java 8 or higher to run.

To Build with Maven

Once Java and Maven are installed on your system,
from the root directory of a Jillion check-out type:

%mvn clean install

This will build jillion, run all the unit and integration tests and installs it in your local repository.
Jillion is now ready to use.

To Build with Ant

Once Java and Ant are installed on your system,
from the root directory of a Jillion check-out type:

%ant release

This will compile all source files and create a new file in the root directory
named "Jillion-${version}.jar"

Then put the build jar in your classpath.

Bug Reports:

Please report any bugs to the Bug Tracker on Jillion's sourceforge page:

https://sourceforge.net/p/jillion/bugs/

Please include the version and SVN revision number if you know it in any bug reports.

Thank you,

Danny Katzel

Jillion Git

Branches

Tags

Tree [1a6059] master /

History

Read Me

How to install

Using downloadable jar

Building from source

5.2 Release Notes

New Features

API Changes

Bug Fixes

5.1 Release Notes

New Features

API Changes

Bug Fixes

5.0 Release Notes

Jillion 5 License change.

Jillion 5 is now OSGI compliant module.

API changes:

New Methods Added:

New Classes Added:

Performance Improvements

Bug Fixes

Jillion installation instructions

Use Download Jar:

To build from Source:

To Build with Maven

To Build with Ant

Bug Reports:

Jillion Git

Branches

Tags

Tree [1a6059] master / Download Snapshot History

Read Me

How to install

Using downloadable jar

Building from source

5.2 Release Notes

New Features

API Changes

Bug Fixes

5.1 Release Notes

New Features

API Changes

Bug Fixes

5.0 Release Notes

Jillion 5 License change.

Jillion 5 is now OSGI compliant module.

API changes:

New Methods Added:

New Classes Added:

Performance Improvements

Bug Fixes

Jillion installation instructions

Use Download Jar:

To build from Source:

To Build with Maven

To Build with Ant

Bug Reports:

Tree [1a6059] master /

History