selectseq

A command-line utility for manipulating biological sequences in a FASTA or FASTQ file. Given a list of identifiers, it extracts only the matching subset of sequences, or the complement (the sequences NOT in the list). It can also extract sequence number N alone. Compressed sequence files are supported if zcat can read them.
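The selection logic described above can be illustrated with a minimal Python sketch. It is not the tool's own code (selectseq is written in Perl); the function names `read_fasta`, `select_records`, and `nth_record` are hypothetical, only FASTA input is handled, and the sequence ID is assumed to be the first word of the header line.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def select_records(records: Iterable[Record], wanted: Set[str],
                   complement: bool = False) -> Iterator[Record]:
    """Keep records whose ID is in `wanted`, or the others if complement."""
    for header, seq in records:
        words = header.split()
        seq_id = words[0] if words else ""  # assumption: ID = first word
        if (seq_id in wanted) != complement:
            yield header, seq


def nth_record(records: Iterable[Record], n: int) -> Optional[Record]:
    """Return record number n (1-based) regardless of its ID, else None."""
    for index, record in enumerate(records, start=1):
        if index == n:
            return record
    return None
```

Because the records are streamed through generators, a large file never has to be held in memory; only the set of wanted IDs does.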

Features

collect only selected sequences from a large FASTA or FASTQ file
get sequence number N only, regardless of its ID
complement mode: return all sequences that are NOT in the list of IDs
"matching" mode: choose which part of the ID (the parts are separated by | characters) should match
sequence names are provided one per line in a text file (the first word of each line is used, or whatever is given to the -k option)
the > and @ symbols are ignored if present at the beginning of IDs in the list (useful when the list holds FASTA or FASTQ identifiers)
if only one sequence is needed, its ID can be given directly to the -l option (no file needed)
add a suffix to IDs before searching (useful when the IDs come from proteins that have _1 in the ID, but the genes do not)
compressed sequence database files (-s) are supported
quiet mode: output only important warnings and errors
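
The ID-handling features in the list above (first word of each line, leading > or @ ignored, optional suffix, matching on one |-separated part) can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the helper names `normalize_id` and `id_field` are hypothetical, and the zero-based field index is an arbitrary choice made for this sketch.

```python
from typing import Optional


def normalize_id(line: str, suffix: str = "") -> Optional[str]:
    """Turn one line of an ID list into a search key.

    Uses the first word of the line, drops one leading '>' or '@',
    and appends an optional suffix. Returns None for blank lines.
    """
    words = line.split()
    if not words:
        return None
    word = words[0]
    if word[0] in ">@":
        word = word[1:]
    if not word:
        return None
    return word + suffix


def id_field(header: str, field: Optional[int] = None) -> Optional[str]:
    """Return the part of a sequence ID that should be matched.

    With field=None the whole ID (first word of the header) is used.
    Otherwise the ID is split on '|' and the part at the given
    zero-based index is returned (assumption of this sketch).
    """
    words = header.split()
    if not words:
        return None
    seq_id = words[0]
    if field is None:
        return seq_id
    parts = seq_id.split("|")
    return parts[field] if 0 <= field < len(parts) else None
```

For example, `normalize_id("gene5", suffix="_1")` yields a key that would match a protein record named `gene5_1`, and `id_field("gi|12345|ref|NP_000001.1| protein", 3)` picks out the accession part of a pipe-delimited ID.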

License

GNU General Public License version 3.0 (GPLv3)

Additional Project Details

Intended Audience

Science/Research

User Interface

Command-line

Programming Language

Perl

Related Categories

Perl Bio-Informatics Software

Registered

2011-05-20
