
πŸ“’ Spark NLP 6.3.3: ModernBERT Embeddings, Vector DB Integration, and Layout-Aware Document Processing

Spark NLP 6.3.3 is a feature-packed release aimed at practitioners building modern NLP and multimodal document pipelines. This release introduces ModernBertEmbeddings for dramatically faster and more memory-efficient text embeddings, and VectorDBConnector to close the gap between embedding pipelines and vector search infrastructure.

Moreover, we introduce a new suite of document-understanding annotators: LayoutAlignerForVision and LayoutAlignerForText, which together enable coherent end-to-end multimodal pipelines over complex documents like PDFs and PowerPoint files. For an in-depth walkthrough on building pipelines with your own documents, please see our Medium blog post, Efficient Document Ingestion with Layout Aware Annotators: A Case Study on Mixed-Type Documents.

In addition, a new MultiColumnAssembler can merge multiple annotation columns into one, and LightPipeline gains metadata column support for more powerful batch inference workflows.

πŸ”₯ Highlights

  • ModernBertEmbeddings: A state-of-the-art encoder that is 8x faster and uses 5x less memory than traditional BERT, with native support for sequences up to 8192 tokens, making it ideal for long-document NLP tasks.
  • VectorDBConnector: Bridges Spark NLP embedding pipelines with external vector databases (initially Pinecone), so teams building semantic search and RAG systems no longer need custom glue code to store embeddings.
  • LayoutAlignerForVision and LayoutAlignerForText: New annotators for multimodal document pipelines that spatially align text and images from complex documents, giving downstream Vision-Language Models (VLMs) the coherent context they need to produce better results.
  • MultiColumnAssembler: Closes a common pipeline gap when ReaderAssembler splits document content across multiple columns (text, table, image captions). This annotator merges them back into a single column that downstream annotators like AutoGGUFVisionModel expect.
  • Enhanced LightPipeline with metadata column support for richer, context-aware inference workflows.

πŸš€ New Features & Enhancements

ModernBertEmbeddings

Teams working with long documents, code, or large-scale embedding workloads will benefit from ModernBertEmbeddings, which brings the latest generation of bidirectional encoder models to Spark NLP. Based on the paper Smarter, Better, Faster, Longer, ModernBERT was trained on 2 trillion tokens with a native sequence length of up to 8,192 tokens (eight times the limit of classic BERT), enabling faster, cheaper embeddings for longer sequences without truncation.

  • Default pretrained model: "modernbert-base" (English)
  • 768-dimensional token-level WORD_EMBEDDINGS output

:::python
embeddings = ModernBertEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("modernbert_embeddings")

See the ModernBertEmbeddings notebook for extended examples, including how to import custom HuggingFace ModernBERT models via ONNX.
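Because the annotator emits token-level vectors, downstream tasks that need a single vector per document typically mean-pool the token embeddings. A minimal plain-Python sketch of that pooling step (3-dimensional toy vectors stand in for the real 768-dimensional output; this is an illustration, not part of the annotator itself):

```python
def mean_pool(token_embeddings):
    """Average a list of token-level vectors into one document vector."""
    if not token_embeddings:
        return []
    dim = len(token_embeddings[0])
    # Sum each component across tokens, then divide by the token count.
    totals = [0.0] * dim
    for vec in token_embeddings:
        for i, value in enumerate(vec):
            totals[i] += value
    return [t / len(token_embeddings) for t in totals]

# Toy 3-dimensional vectors; real ModernBERT output is 768-dimensional.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_pool(tokens))  # [2.0, 3.0, 4.0]
```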

VectorDBConnector

For teams building semantic search, retrieval-augmented generation (RAG), or similarity-based recommendation systems, manually bridging Spark NLP with a vector database has historically required custom integration code. VectorDBConnector eliminates this gap by letting you store embeddings from any Spark NLP embedding annotator directly into a vector database as part of the pipeline. It initially supports Pinecone, with more providers planned.

:::python
vectorDB = VectorDBConnector() \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("vectordb_result") \
    .setProvider("pinecone") \
    .setIndexName("my-semantic-index") \
    .setNamespace("production") \
    .setIdColumn("doc_id") \
    .setMetadataColumns(["text", "category"]) \
    .setBatchSize(100)

The Pinecone API key is configured via spark.jsl.settings.vectordb.api_key. See the VectorDBConnector Pinecone Demo notebook for a full walkthrough.
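The batchSize parameter bounds how many vectors go out per upsert request, which matters because vector databases typically cap request payload sizes. As a rough illustration of the batching behavior (a sketch, not the connector's actual internals):

```python
def batched(records, batch_size=100):
    """Yield successive slices of records, mirroring what
    setBatchSize controls for upsert requests."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# 250 (id, embedding) records would be upserted as 100 + 100 + 50.
records = [(f"doc_{i}", [0.0] * 8) for i in range(250)]
sizes = [len(b) for b in batched(records, 100)]
print(sizes)  # [100, 100, 50]
```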

LayoutAlignerForVision and LayoutAlignerForText

When processing rich documents like PDFs or PowerPoint presentations, text and images are spatially interleaved: a chart sits next to the paragraph it illustrates, and a diagram is surrounded by its explanation. Without layout awareness, VLMs operating on extracted content lose this spatial context entirely. LayoutAlignerForVision and LayoutAlignerForText solve this problem for teams building multimodal document intelligence pipelines.

LayoutAlignerForVision takes document chunks and images extracted by ReaderAssembler and aligns each image with its spatially nearby text paragraphs based on actual page coordinates. It produces three output columns <outputCol>_doc, <outputCol>_image, and <outputCol>_prompt, ready to be fed directly into a VLM (e.g. AutoGGUFVisionModel) for captioning or question answering.

Key parameters:

  • setMaxDistance(int): Maximum vertical distance (px) for image-paragraph alignment
  • setIncludeContextWindow(bool): Include neighboring paragraphs as context for floating images
  • setAddNeighborText(bool): Include aligned text in the prompt output
  • setImageCaptionBasePrompt(str): Customize the captioning prompt sent to downstream VLMs
  • setNeighborTextCharsWindow(int): Include surrounding text characters as prompt context
  • setExplodeDocs(bool): Emit one output row per aligned doc/image pair
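Conceptually, the vertical-distance matching that setMaxDistance governs pairs each image with the paragraphs whose page position falls within the threshold. A simplified plain-Python sketch of that idea (the annotator works from the full page coordinates produced by ReaderAssembler; this toy version uses only a vertical position per element):

```python
def align_images_to_paragraphs(images, paragraphs, max_distance=100):
    """Pair each image with paragraphs whose vertical position lies
    within max_distance pixels, loosely mimicking setMaxDistance."""
    aligned = {}
    for img_id, img_y in images:
        aligned[img_id] = [
            text for text, para_y in paragraphs
            if abs(para_y - img_y) <= max_distance
        ]
    return aligned

# An image at y=200 picks up the paragraphs at y=150 and y=260,
# but not the one at y=500.
paragraphs = [("intro", 150), ("chart caption", 260), ("appendix", 500)]
print(align_images_to_paragraphs([("img_1", 200)], paragraphs, 100))
# {'img_1': ['intro', 'chart caption']}
```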

LayoutAlignerForText takes the VLM-generated image captions produced after LayoutAlignerForVision and weaves them back into the document's text flow, replacing raw image placeholders with meaningful captions and re-computing begin/end offsets so the resulting document is coherent for downstream NLP tasks.

Key parameters:

  • setJoinDelimiter(str) – Delimiter used to join rebuilt text segments
  • setExplodeElements(bool) – Emit one output row per aligned text element
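The caption-weaving step can be pictured as replacing each image placeholder with its generated caption while tracking the new character offsets. A minimal sketch of that idea in plain Python (the "[IMAGE]" placeholder token is a hypothetical stand-in for illustration, not the annotator's actual internal representation):

```python
def weave_captions(text, captions, placeholder="[IMAGE]"):
    """Replace image placeholders with captions in order, returning the
    rebuilt text and the (begin, end) offsets of each inserted caption."""
    offsets = []
    for caption in captions:
        idx = text.find(placeholder)
        if idx == -1:
            break
        text = text[:idx] + caption + text[idx + len(placeholder):]
        offsets.append((idx, idx + len(caption) - 1))
    return text, offsets

doc = "Sales rose sharply. [IMAGE] See appendix."
rebuilt, spans = weave_captions(doc, ["Bar chart of Q3 revenue by region."])
print(rebuilt)
# Sales rose sharply. Bar chart of Q3 revenue by region. See appendix.
```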

For extended examples and walkthroughs, refer to the notebook Spark NLP LayoutAligners for Document Understanding and our Medium blog post Efficient Document Ingestion with Layout Aware Annotators: A Case Study on Mixed-Type Documents.

MultiColumnAssembler

When using ReaderAssembler to process documents such as PDFs or PPTX files, content is extracted into separate typed columns: document_text, document_table, and image-related outputs. However, many downstream annotators expect a single input column. Previously, bridging this split required custom Spark transformations. MultiColumnAssembler fills this gap directly within Spark NLP pipelines.

It merges any number of DOCUMENT-type annotation columns into a single output column, preserving all annotation metadata and adding a source_column key to track provenance. Annotations can optionally be sorted by their begin offset using setSortByBegin(True).

:::python
multiColumnAssembler = MultiColumnAssembler() \
    .setInputCols(["document_text", "document_table"]) \
    .setOutputCol("merged_document")

Key parameters:

  • setInputCols([...]) – List of DOCUMENT-type annotation columns to merge
  • setOutputAsAnnotatorType(str) – Override the output annotator type (default: "document")
  • setSortByBegin(bool) – Sort merged annotations by begin position (default: False)

Note: Columns using the AnnotationImage schema (i.e., IMAGE-typed columns from ReaderAssembler) are not supported.
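In effect, the merge behaves like the following plain-Python sketch, where simple dicts stand in for Spark NLP's Annotation rows (an illustration of the described behavior, not the annotator's implementation):

```python
def merge_columns(columns, sort_by_begin=False):
    """Merge annotation lists from several columns into one list,
    tagging each annotation's metadata with its source column."""
    merged = []
    for name, annotations in columns.items():
        for ann in annotations:
            tagged = dict(ann)
            # Preserve existing metadata and record provenance.
            tagged["metadata"] = {**ann.get("metadata", {}), "source_column": name}
            merged.append(tagged)
    if sort_by_begin:
        merged.sort(key=lambda a: a["begin"])
    return merged

columns = {
    "document_text": [{"begin": 0, "end": 10, "result": "Intro text"}],
    "document_table": [{"begin": 5, "end": 20, "result": "Table body"}],
}
merged = merge_columns(columns, sort_by_begin=True)
print([a["metadata"]["source_column"] for a in merged])
# ['document_text', 'document_table']
```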

See the Merging Annotation Columns notebook for a full walkthrough.

LightPipeline Metadata Support

Users running inference with LightPipeline on data that carries additional context β€” such as document source, language, or category β€” previously had no way to pass that context through alongside the text. LightPipeline now supports passing metadata columns alongside text inputs in both annotate() and fullAnnotate(), enabling richer, context-aware inference for applications like routing, filtering, and conditional processing.

New supported call signatures:

  • fullAnnotate(text: str, metadata: dict[str, list[str]])
  • fullAnnotate(texts: list[str], metadata: list[dict])
  • fullAnnotate(texts: list[str], metadata: dict[str, list[str]]) (columnar format)
  • Same patterns apply to annotate()

Metadata can be passed as a keyword argument or as a positional trailing argument:

:::python
result = light_pipeline.fullAnnotate(
    "U.N. official Ekeus heads for Baghdad.",
    metadata={"source": ["news_article"]}
)

This feature is also surfaced through PretrainedPipeline.annotate() and PretrainedPipeline.fullAnnotate().
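In the columnar format, the i-th entry of each metadata list is paired with the i-th input text. A small sketch of that correspondence (illustrative only; LightPipeline handles this internally):

```python
def columnar_to_rows(metadata):
    """Expand a columnar metadata dict (one list per key) into
    one per-text dict, matching the columnar call signature."""
    keys = list(metadata)
    n = len(metadata[keys[0]]) if keys else 0
    return [{k: metadata[k][i] for k in keys} for i in range(n)]

meta = {"source": ["news", "blog"], "lang": ["en", "de"]}
print(columnar_to_rows(meta))
# [{'source': 'news', 'lang': 'en'}, {'source': 'blog', 'lang': 'de'}]
```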

πŸ› Bug Fixes

  • Apache POI upgraded to 5.4.1: The Apache POI dependency used by document readers has been upgraded from 4.1.2 to 5.4.1 (poi-ooxml-full) to avoid deprecated dependencies.

❀️ Community Support

  • Slack real-time discussion with the Spark NLP community and team
  • GitHub issue tracking, feature requests, and contributions
  • Discussions community ideas and showcases
  • Medium latest Spark NLP articles and tutorials
  • YouTube educational videos and demos

πŸ’» Installation

Python

:::bash
pip install spark-nlp==6.3.3

Spark Packages

CPU

:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.3

GPU

:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.3

Apple Silicon

:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.3

AArch64

:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.3

Maven

Supported on Apache Spark 3.x.

spark-nlp

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.3.3</version>
</dependency>

spark-nlp-gpu

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.3.3</version>
</dependency>

spark-nlp-silicon

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.3.3</version>
</dependency>

spark-nlp-aarch64

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.3.3</version>
</dependency>

FAT JARs

What's Changed

  • [SPARKNLP-1334] Updating POI dependency for readers [#14727] by @danilojsl
  • [SPARKNLP-1335] Implement Layout Aligner annotators [#14737] by @danilojsl
  • [SPARKNLP-1336] Enhancements to LightPipeline [#14734] by @danilojsl
  • [SPARKNLP-1287] vector db connector annotator [#14729] by @ahmedlone127
  • [SPARKNLP-1231] implement modern bert embeddings [#14736] by @ahmedlone127
  • [SPARKNLP-1341] Add MultiColumnAssembler [#14743] by @AbdullahMubeenAnwar

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.3.2...6.3.3

Source: README.md, updated 2026-03-10