System Architecture

glycolab Dinko Soic

Pipeline Overview

The Oxonium Browser processes shotgun proteomics data through a sequential pipeline that starts from a pre-converted mzML file and ends with an interactive dashboard.

┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ SAGE Database │───▶ │     mzML      │───▶ │  Oxonium Ion  │───▶ │    Results    │───▶ │  Interactive  │
│    Search     │     │  Calibration  │     │   Detection   │     │  Processing   │     │   Dashboard   │
└───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘     └───────────────┘

Data Flow

Input Phase

The pipeline requires three input files:

  • mzML file — MS2 spectra, pre-converted from vendor RAW format (centroided, 64-bit encoding recommended)
  • FASTA file — protein sequences for the organism of interest
  • Excel file — sugar oxonium ion definitions with diagnostic mass pairs

Processing Phase

  1. SAGE database search identifies peptide spectra for exclusion
  2. Two-pass recalibration improves mass accuracy across all MS2 spectra
  3. Remaining (non-peptide) spectra are scanned for oxonium ion pairs
  4. Results are organized, scored, and prepared for visualization

Output Phase

  • Summary Excel file with detection metrics per oxonium ion
  • Detailed Excel file with per-scan detection data and retention times
  • Mass error calibration diagnostic plot
  • SAGE search results (TSV, PeptideShaker-compatible)
  • Interactive dashboard served at http://localhost:8051

Module Descriptions

1. Pre-processing (External)

Purpose: Convert vendor-specific RAW files to open-format mzML.

Recommended tool: ProteoWizard MSConvert. Use centroided mzML with 64-bit encoding (32-bit is acceptable for Astral data to reduce file size). The Docker container cannot process RAW files directly because reading them requires Windows-native vendor libraries.

2. SAGE Database Search (pysage_v6_scanner.py)

Purpose: Identify unmodified peptide spectra to exclude from glycopeptide analysis.

The module wraps the SAGE search engine (via sagepy) to perform fast peptide-spectrum matching. It uses a target-decoy approach for FDR control and returns scan numbers of identified peptides, which are then skipped during oxonium ion detection.
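To illustrate the target-decoy idea, here is a minimal sketch of score-ranked FDR filtering. It is not the sagepy API or the module's actual code; the function name `filter_by_fdr` and the `(scan_number, score, is_decoy)` tuple layout are assumptions made for this example.

```python
def filter_by_fdr(psms, fdr_threshold=0.01):
    """Return scan numbers of target PSMs that pass the FDR threshold.

    psms: iterable of (scan_number, score, is_decoy); higher score = better match.
    (Hypothetical layout, chosen only for illustration.)
    """
    ranked = sorted(psms, key=lambda p: p[1], reverse=True)

    # Running FDR estimate at each rank: decoys seen / targets seen.
    targets = decoys = 0
    fdrs = []
    for _, _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this rank or at any lower-scoring rank.
    q_values = fdrs[:]
    for i in range(len(q_values) - 2, -1, -1):
        q_values[i] = min(q_values[i], q_values[i + 1])

    return {
        scan
        for (scan, _, is_decoy), q in zip(ranked, q_values)
        if not is_decoy and q <= fdr_threshold
    }
```

The returned set plays the role of the "scan numbers of identified peptides" that later stages would skip.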

Key settings:

  • Enzyme: Trypsin (KR, not before P)
  • Static modifications: Carbamidomethylation (C)
  • Variable modifications: Oxidation (M)
  • Default FDR threshold: 1%
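The enzyme rule above (cleave after K or R, but not before P) can be sketched as an in-silico digest. This is an illustration of the rule only, not how SAGE implements digestion; `tryptic_digest` and its `missed_cleavages` parameter are hypothetical names.

```python
import re


def tryptic_digest(sequence, missed_cleavages=0):
    """Split a protein sequence after K or R unless the next residue is P."""
    # Zero-width split point: preceded by K/R and not followed by P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

    peptides = []
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` additional neighbouring pieces.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides
```

For example, in the sequence `AKRPBKC` the R is followed by P, so no cleavage occurs there and the fully cleaved peptides are `AK`, `RPBK`, and `C`.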

Results are cached — if a SAGE output file already exists for the input mzML, the search is skipped and cached results are reused.
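The caching behaviour could look roughly like the sketch below. The output file naming (`<stem>_sage_results.tsv`) and the `run_search` callback are assumptions for illustration; the actual script may name and locate its cache differently.

```python
from pathlib import Path


def cached_sage_path(mzml_path, output_dir):
    # Hypothetical naming scheme: one TSV per input mzML file.
    return Path(output_dir) / f"{Path(mzml_path).stem}_sage_results.tsv"


def run_or_load_search(mzml_path, output_dir, run_search):
    """Run the search only if no cached result exists.

    run_search: callable(mzml_path, output_path) that writes the results file.
    Returns (output_path, was_cached).
    """
    output_path = cached_sage_path(mzml_path, output_dir)
    if output_path.exists():
        return output_path, True
    run_search(mzml_path, output_path)
    return output_path, False
```

With a scheme like this, deleting the cached results file is enough to force a fresh search.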

3. mzML Recalibration (mzml_recalibration_v6.py)

Purpose: Read mzML files and improve mass accuracy through two-pass calibration.

Pass 1 — Global calibration:
Matches seven amino acid fragment reference peaks (147.113, 175.119, 201.123, 215.139, 228.134, 258.145, 292.129 m/z) across all spectra at 20 ppm tolerance. Requires a minimum of 500 matched spectra per reference peak. Fits a global linear calibration function using least squares regression and applies it to all spectra.

Pass 2 — Per-spectrum calibration:
After global calibration, each spectrum is individually recalibrated at tighter 10 ppm tolerance. Requires at least 3 matched reference peaks per spectrum for a stable linear fit. Spectra with insufficient matches retain the global calibration.
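The two passes can be sketched in plain Python as follows. This is a simplified illustration, not the module's code: it assumes the calibration function maps observed m/z to theoretical m/z as a straight line, matches each reference to the closest observed peak, and represents a spectrum as a bare list of m/z values. All function names are hypothetical.

```python
REFERENCE_MZ = (147.113, 175.119, 201.123, 215.139, 228.134, 258.145, 292.129)


def ppm_error(observed, theoretical):
    return (observed - theoretical) / theoretical * 1e6


def match_references(spectrum_mz, tolerance_ppm):
    """Return (observed, theoretical) pairs for reference peaks found in a spectrum."""
    pairs = []
    for ref in REFERENCE_MZ:
        closest = min(spectrum_mz, key=lambda mz: abs(mz - ref), default=None)
        if closest is not None and abs(ppm_error(closest, ref)) <= tolerance_ppm:
            pairs.append((closest, ref))
    return pairs


def fit_line(pairs):
    """Ordinary least squares: theoretical ~ slope * observed + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_pass_calibrate(spectra, min_spectra_per_peak=500, min_peaks_per_spectrum=3):
    # Pass 1: pool matches at 20 ppm, keeping only well-supported reference peaks.
    matches = [match_references(s, 20.0) for s in spectra]
    counts = {ref: 0 for ref in REFERENCE_MZ}
    for pairs in matches:
        for _, ref in pairs:
            counts[ref] += 1
    usable = {ref for ref, c in counts.items() if c >= min_spectra_per_peak}
    pooled = [(o, r) for pairs in matches for o, r in pairs if r in usable]
    # Fall back to the identity if fewer than two reference peaks qualify.
    slope, intercept = fit_line(pooled) if len(usable) >= 2 else (1.0, 0.0)
    globally = [[slope * mz + intercept for mz in s] for s in spectra]

    # Pass 2: refit each spectrum at 10 ppm; otherwise keep the global result.
    calibrated = []
    for s in globally:
        pairs = match_references(s, 10.0)
        if len(pairs) >= min_peaks_per_spectrum:
            a, b = fit_line(pairs)
            calibrated.append([a * mz + b for mz in s])
        else:
            calibrated.append(s)
    return calibrated
```

A spectrum containing only one reference peak still benefits from the global fit but skips the per-spectrum refit, which mirrors the fallback described above.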

A diagnostic plot (mass_error_two_pass_calibration.png) shows error distributions at each stage: original, after global calibration, and after per-spectrum calibration.

4. Oxonium Ion Detection (get_oxonium_scans_v5.py)

Purpose: Scan non-peptide MS2 spectra for diagnostic sugar oxonium ion pairs.

For each spectrum not identified by SAGE, the scanner checks whether both diagnostic masses (oxonium ion and its water loss fragment) are present within the defined mass error tolerance. If both are found and the average normalized intensity exceeds the threshold, the detection is recorded.
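A minimal sketch of the pair check for a single spectrum is shown below. The default tolerance and intensity threshold are placeholders rather than the tool's actual defaults, the function names are hypothetical, and "normalized intensity" is taken to mean peak intensity divided by the summed intensity of the spectrum.

```python
def find_peak(mz_values, intensities, target_mz, tolerance_ppm):
    """Return the intensity of the strongest peak within tolerance, or None."""
    best = None
    for mz, intensity in zip(mz_values, intensities):
        if abs(mz - target_mz) / target_mz * 1e6 <= tolerance_ppm:
            if best is None or intensity > best:
                best = intensity
    return best


def detect_pair(mz_values, intensities, ion_mz, water_loss_mz,
                tolerance_ppm=10.0, min_norm_intensity=0.01):
    """Return the average normalized intensity of the pair, or None if not detected."""
    total = sum(intensities)
    if total <= 0:
        return None
    ion = find_peak(mz_values, intensities, ion_mz, tolerance_ppm)
    loss = find_peak(mz_values, intensities, water_loss_mz, tolerance_ppm)
    if ion is None or loss is None:
        return None  # both diagnostic masses must be present
    normalized = (ion + loss) / 2 / total
    return normalized if normalized > min_norm_intensity else None
```

For HexNAc, for instance, the pair would be the oxonium ion at 204.0867 m/z and its water loss fragment at 186.0761 m/z.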

Detection metrics computed per oxonium ion:

  • Normalized presence — percentage of all spectra containing the ion pair
  • Count — total number of spectra with positive detection
  • Normalized intensity — average intensity relative to total spectrum intensity
  • Average intensity — raw average intensity across positive detections
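These four metrics could be computed per ion along the following lines. The input layout (one `(normalized_intensity, raw_intensity)` tuple per positive scan) and the function name are assumptions; the sketch also assumes that `total_spectra` is whatever spectrum count the tool uses as the denominator for normalized presence.

```python
def summarize_detections(detections, total_spectra):
    """Aggregate positive detections for one oxonium ion.

    detections: list of (normalized_intensity, raw_intensity), one per positive scan.
    total_spectra: number of spectra used as the denominator for presence.
    """
    count = len(detections)
    if count == 0 or total_spectra <= 0:
        return {"count": 0, "normalized_presence": 0.0,
                "normalized_intensity": 0.0, "average_intensity": 0.0}
    return {
        "count": count,
        # Percentage of spectra in which the ion pair was detected.
        "normalized_presence": 100.0 * count / total_spectra,
        # Mean of the per-scan normalized intensities.
        "normalized_intensity": sum(n for n, _ in detections) / count,
        # Mean of the per-scan raw intensities.
        "average_intensity": sum(r for _, r in detections) / count,
    }
```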

5. Results Processing (process_results.py)

Purpose: Organize raw detection results into structured datasets.

Separates test mass controls (names starting with Ox_test_) from real sugar detections, sorts results by normalized presence, and prepares DataFrames for Excel export and dashboard visualization.
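The separation and sorting step might be sketched as below, using plain dictionaries instead of DataFrames to keep the example dependency-free. The row keys `name` and `normalized_presence` and the descending sort order are assumptions.

```python
def split_and_sort(rows, control_prefix="Ox_test_"):
    """Split detection rows into (sugars, controls), each sorted by presence.

    rows: list of dicts with at least 'name' and 'normalized_presence' keys
    (hypothetical column names).
    """
    def by_presence(row):
        return row["normalized_presence"]

    controls = [r for r in rows if r["name"].startswith(control_prefix)]
    sugars = [r for r in rows if not r["name"].startswith(control_prefix)]
    # Highest normalized presence first (assumed ordering).
    return (sorted(sugars, key=by_presence, reverse=True),
            sorted(controls, key=by_presence, reverse=True))
```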

6. Interactive Dashboard (ox_scanner_dash_v24.py)

Purpose: Provide interactive visualization and exploration of results.

Built with Plotly Dash, the dashboard offers real-time filtering, multiple visualization types, and export functionality. See the Dashboard Guide for a full walkthrough.

Key components:

  • Threshold filter controls with auto-adjustment for chemspace mode
  • Database toggle (curated / chemspace / both) when chemspace search is enabled
  • Sortable match table grouped by ±18 Da water loss families
  • Clustered co-occurrence heatmap with Jaccard-based dendrogram
  • Retention time profile viewer with checklist selection
  • Mass error distribution plots from recalibration
  • Export buttons for table and selected ion data
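As an illustration of the Jaccard measure behind the co-occurrence heatmap, the sketch below computes pairwise similarities from the sets of scans in which each ion was detected. The `scans_by_ion` mapping is a hypothetical structure, and how the dashboard actually builds its matrix and dendrogram is not shown here.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(scans_by_ion):
    """Return (ion_names, square similarity matrix) for a {ion: set_of_scans} mapping."""
    names = sorted(scans_by_ion)
    matrix = [[jaccard(scans_by_ion[row], scans_by_ion[col]) for col in names]
              for row in names]
    return names, matrix
```

A distance of 1 minus the similarity could then be passed to a hierarchical clustering routine to order the heatmap rows and columns.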

7. Main Script Orchestration (main_script.py)

Purpose: Coordinate the entire workflow.

Reads environment variables for configuration, discovers input files, optionally merges the chemspace database with the user-provided curated list (deduplicating overlapping masses), runs each pipeline stage in sequence, exports results to Excel, and launches the dashboard.
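Two of these steps, reading configuration from the environment and merging the ion lists, are sketched below. Every environment variable name, default value, and the ppm-based overlap rule is an assumption made for illustration; the real script may use different names and a different deduplication criterion.

```python
import os


def load_config(env=None):
    """Read settings from environment variables (all names are hypothetical)."""
    env = os.environ if env is None else env
    return {
        "input_dir": env.get("INPUT_DIR", "/data"),
        "fdr_threshold": float(env.get("FDR_THRESHOLD", "0.01")),
        "use_chemspace": env.get("USE_CHEMSPACE", "false").lower() in ("1", "true", "yes"),
    }


def merge_ion_lists(curated, chemspace, tolerance_ppm=5.0):
    """Append chemspace entries whose mass does not overlap a curated entry.

    Each entry is a (name, mz) tuple; curated entries take precedence.
    """
    merged = list(curated)
    for name, mz in chemspace:
        overlaps = any(abs(mz - c_mz) / c_mz * 1e6 <= tolerance_ppm
                       for _, c_mz in curated)
        if not overlaps:
            merged.append((name, mz))
    return merged
```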

Important Considerations

Mass-Based Detection Limitations

The Oxonium Browser identifies sugars by their diagnostic masses and therefore cannot differentiate between isomeric sugars. When a hexose (Hex) is detected, additional experiments or literature review are needed to determine whether it is glucose, galactose, mannose, or another hexose isomer. The tool provides evidence of glycosylation and sugar mass; structural characterization requires complementary techniques.

File Format Requirements

The Docker version requires pre-converted mzML files. Direct RAW file analysis is not supported due to vendor library compatibility limitations in Linux containers.

Memory Requirements

The required memory is approximately equal to the mzML file size. For Astral data, ensure sufficient available RAM and consider using 32-bit encoding and spectral density reduction during conversion.
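A rough pre-flight check based on this rule of thumb could look like the sketch below. It is not part of the tool; the function names and the `headroom` factor are hypothetical, and the available-memory query relies on `os.sysconf`, which works on Linux (as in the Docker container) but not on every platform.

```python
import os


def available_memory_bytes():
    """Best-effort available physical memory in bytes; None if it cannot be determined."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_AVPHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None


def fits_in_memory(mzml_path, available_bytes, headroom=1.0):
    """Compare the mzML file size (times a safety factor) with available memory."""
    return os.path.getsize(mzml_path) * headroom <= available_bytes
```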
