
📢 Spark NLP 6.3.2: Scala 2.13 Support, Layout-Aware Images, and Enhanced LightPipeline Tracking

Spark NLP 6.3.2 is a foundational release that introduces official support for Scala 2.13, alongside important improvements in document layout understanding and lightweight inference workflows. This release improves long-term model portability through JSON-based serialization, enriches document image extraction with spatial metadata, and enhances LightPipeline with document ID tracking and output filtering.

🔥 Highlights

  • Official Scala 2.13 support.
  • Layout-aware image extraction with spatial coordinates added to Reader2Image for HTML, DOCX, and PPTX documents.
  • Enhanced LightPipeline with document ID propagation and output column filtering for better batch inference workflows.

🚀 New Features & Enhancements

Scala 2.13 Support

Spark NLP now supports Scala 2.13. This lets you run Spark NLP pipelines on Spark distributions built against Scala 2.13, such as those used by Databricks and Dataproc. See our Installation Instructions for Scala 2.13 for details on using it in your project.

There are a few things to consider when using the Scala 2.13 version:

  1. You need to adjust your dependency from spark-nlp_2.12 to spark-nlp_2.13.
  2. If you install PySpark from PyPI, the session uses Scala 2.12 by default. To start a Scala 2.13 session, either set the SPARK_HOME environment variable to a Spark installation built for Scala 2.13, or install PySpark from the official Spark archives.
  3. If you want to load DependencyParserModel or TextMatcherModel from Scala 2.12 into Scala 2.13, you will need to manually export them again with the latest version. See the notebook
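
As a rough sketch of points 1 and 2, the snippet below builds the Maven coordinate for a given Scala binary version and passes it to a PySpark session through the standard `spark.jars.packages` setting. The helper names `spark_nlp_coordinate` and `start_session` are illustrative and not part of Spark NLP. The sketch assumes that `SPARK_HOME` already points to a Spark build compiled for the matching Scala version, and it omits the serializer and memory settings a real session would usually need.

```python
GROUP_ID = "com.johnsnowlabs.nlp"


def spark_nlp_coordinate(scala_binary: str = "2.13", version: str = "6.3.2") -> str:
    """Return the Maven coordinate of the spark-nlp artifact for a Scala binary version."""
    if scala_binary not in ("2.12", "2.13"):
        raise ValueError(f"unsupported Scala binary version: {scala_binary}")
    return f"{GROUP_ID}:spark-nlp_{scala_binary}:{version}"


def start_session(scala_binary: str = "2.13"):
    """Start a local Spark session that pulls the matching Spark NLP artifact."""
    # Imported lazily so the coordinate helper works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("spark-nlp-scala-" + scala_binary)
        .master("local[*]")
        # The artifact suffix must match the Scala version of the Spark build.
        .config("spark.jars.packages", spark_nlp_coordinate(scala_binary))
        .getOrCreate()
    )


print(spark_nlp_coordinate("2.13"))
```

Calling `start_session("2.13")` against a Scala 2.12 Spark build is expected to fail at class-loading time, so the coordinate and the Spark installation have to be changed together.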

Layout-Aware Image Metadata in Reader2Image

The Reader2Image annotator now extracts spatial image coordinates from rich document formats, adding layout awareness to image annotations.

  • Supported formats:
    • HTML
    • Word (DOCX)
    • PowerPoint (PPTX)
  • New metadata fields:
    • x, y, width, height
  • Coordinates are included alongside existing metadata such as image format, type, and DOM position

This enables:

  • Layout-aware document and multimodal pipelines
  • Visual reconstruction of documents
  • More accurate association of images with surrounding text content
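
To illustrate how the new fields might be consumed downstream, here is a minimal sketch that parses `x`, `y`, `width`, and `height` from annotation metadata and sorts images into reading order. It assumes the values arrive as numeric strings and that `y` grows downward from the top of the page; neither is stated in these notes. The `Box` type, the helper functions, the `name` key, and the sample values are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    x: float
    y: float
    width: float
    height: float


def box_from_metadata(metadata: Dict[str, str]) -> Box:
    """Parse the x, y, width and height metadata fields into numbers."""
    return Box(*(float(metadata[key]) for key in ("x", "y", "width", "height")))


def reading_order(metadatas: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort image metadata top to bottom, then left to right."""
    def sort_key(metadata: Dict[str, str]):
        box = box_from_metadata(metadata)
        return (box.y, box.x)

    return sorted(metadatas, key=sort_key)


# Hypothetical metadata for three images; the values are made up for illustration.
images = [
    {"name": "b", "x": "300", "y": "40", "width": "200", "height": "100"},
    {"name": "c", "x": "20", "y": "400", "width": "200", "height": "100"},
    {"name": "a", "x": "20", "y": "40", "width": "200", "height": "100"},
]

print([metadata["name"] for metadata in reading_order(images)])
```

The same boxes could be compared against text positions to attach each image to its nearest paragraph, provided both use the same coordinate system.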

Document ID Support in LightPipeline

LightPipeline now supports passing document IDs together with text inputs, improving traceability in batch and production inference scenarios.

Key capabilities:

  • New overloads:
    • fullAnnotate(ids, texts)
    • annotate(ids, texts)
  • Document IDs are propagated as annotation metadata (doc_id)
  • New output_cols parameter to restrict returned annotation types

Benefits:

  • Reliable document-to-result mapping
  • Easier debugging and downstream integration
  • Reduced memory usage through selective outputs

Existing LightPipeline usage remains unchanged and backward compatible.
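
The sketch below shows one way the propagated `doc_id` could be used to map results back to their source documents. It does not call Spark NLP: `FakeAnnotation` is a stand-in with only the two fields used here, and the sample rows assume that `fullAnnotate(ids, texts)` returns one dictionary per input, keyed by output column, as the existing `fullAnnotate` does. Treat that shape and the `group_by_doc_id` helper as assumptions to verify against the actual API.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeAnnotation:
    """Stand-in for a Spark NLP annotation; only the fields used below."""
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def group_by_doc_id(full_results, column: str) -> Dict[str, List[str]]:
    """Collect the results of one output column under each document ID."""
    grouped = defaultdict(list)
    for row in full_results:
        for annotation in row.get(column, []):
            grouped[annotation.metadata["doc_id"]].append(annotation.result)
    return dict(grouped)


# Hypothetical rows shaped like the assumed fullAnnotate(ids, texts) output.
rows = [
    {"token": [FakeAnnotation("Hello", {"doc_id": "doc-1"}),
               FakeAnnotation("world", {"doc_id": "doc-1"})]},
    {"token": [FakeAnnotation("Bye", {"doc_id": "doc-2"})]},
]

print(group_by_doc_id(rows, "token"))
```

With a real pipeline, `rows` would presumably come from a call such as `light_pipeline.fullAnnotate(ids, texts)`, per the overloads listed above.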

🐛 Bug Fixes

  • Fixed an out-of-memory error when copying large models to cloud storage

❤️ Community Support

  • Slack – real-time discussion with the Spark NLP community and team
  • GitHub – issue tracking, feature requests, and contributions
  • Discussions – community ideas and showcases
  • Medium – latest Spark NLP articles and tutorials
  • YouTube – educational videos and demos

💻 Installation

Python

:::bash
pip install spark-nlp==6.3.2

Spark Packages

CPU

:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.2

GPU

:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.2

Apple Silicon

:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.2

AArch64

:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.2

Maven

Supported on Apache Spark 3.x.

spark-nlp

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.3.2</version>
</dependency>

spark-nlp-gpu

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.3.2</version>
</dependency>

spark-nlp-silicon

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.3.2</version>
</dependency>

spark-nlp-aarch64

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.3.2</version>
</dependency>

FAT JARs

What's Changed

  • [SPARKNLP-1136] JSON Serialization for Features [#14722] by @DevinTDHa
  • [SPARKNLP-1329] Adding image coordinates to metadata for Reader2Image [#14725] by @danilojsl
  • [SPARKNLP-1333] Adding ids input for LightPipeline [#14726] by @danilojsl

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.3.1...6.3.2

Source: README.md, updated 2026-01-27