📢 Spark NLP 6.3.3: ModernBERT Embeddings, Vector DB Integration, and Layout-Aware Document Processing
Spark NLP 6.3.3 is a feature-packed release aimed at practitioners building modern NLP and multimodal document pipelines. This release introduces ModernBertEmbeddings for dramatically faster and more memory-efficient text embeddings, and VectorDBConnector to close the gap between embedding pipelines and vector search infrastructure.
Moreover, we introduce a new suite of document-understanding annotators: LayoutAlignerForVision and LayoutAlignerForText, which together enable coherent end-to-end multimodal pipelines over complex documents like PDFs and PowerPoint files. For an in-depth walkthrough on building pipelines over your own documents, please see our Medium blog post Efficient Document Ingestion with Layout Aware Annotators: A Case Study on Mixed-Type Documents.
In addition, a new MultiColumnAssembler can merge multiple annotation columns into one, and LightPipeline gains metadata column support for more powerful batch inference workflows.
🔥 Highlights
- ModernBertEmbeddings: A state-of-the-art encoder that is 8x faster and uses 5x less memory than traditional BERT, with native support for sequences up to 8,192 tokens, ideal for long-document NLP tasks.
- VectorDBConnector: Bridges Spark NLP embedding pipelines with external vector databases (initially Pinecone), so teams building semantic search and RAG systems no longer need custom glue code to store embeddings.
- LayoutAlignerForVision and LayoutAlignerForText: New annotators for multimodal document pipelines that spatially align text and images from complex documents, giving downstream Vision-Language Models (VLMs) the coherent context they need to produce better results.
- MultiColumnAssembler: Closes a common pipeline gap when ReaderAssembler splits document content across multiple columns (text, table, image captions). This annotator merges them back into a single column that downstream annotators like AutoGGUFVisionModel expect.
- Enhanced LightPipeline with metadata column support for richer, context-aware inference workflows.
🚀 New Features & Enhancements
ModernBertEmbeddings
Teams working with long documents, code, or large-scale embedding workloads will benefit from ModernBertEmbeddings, which brings the latest generation of bidirectional encoder models to Spark NLP. Based on the paper Smarter, Better, Faster, Longer, ModernBERT was trained on 2 trillion tokens with a native sequence length of up to 8,192 tokens (sixteen times the 512-token limit of classic BERT), enabling faster, cheaper embeddings for longer sequences without truncation.
- Default pretrained model: "modernbert-base" (English)
- 768-dimensional token-level WORD_EMBEDDINGS output

:::python
embeddings = ModernBertEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("modernbert_embeddings")
See the ModernBertEmbeddings notebook for extended examples, including how to import custom HuggingFace ModernBERT models via ONNX.
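To see concretely why the longer native context matters, here is a quick back-of-the-envelope comparison in plain Python (independent of Spark NLP) of how many windows a long document must be split into under a 512-token limit versus ModernBERT's 8,192:

```python
import math

def num_windows(n_tokens: int, max_len: int) -> int:
    """Number of fixed-size windows needed to cover a token sequence."""
    return math.ceil(n_tokens / max_len)

doc_tokens = 20_000  # e.g. a long contract or research paper
print(num_windows(doc_tokens, 512))    # classic BERT limit: 40 windows
print(num_windows(doc_tokens, 8192))   # ModernBERT limit: 3 windows
```

Fewer windows means fewer forward passes per document and less context lost at window boundaries.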
VectorDBConnector
For teams building semantic search, retrieval-augmented generation (RAG), or similarity-based recommendation systems, manually bridging Spark NLP with a vector database has historically required custom integration code. VectorDBConnector eliminates this gap by letting you store embeddings from any Spark NLP embedding annotator directly into a vector database as part of the pipeline. It initially supports Pinecone, with more providers planned.
:::python
vectorDB = VectorDBConnector() \
.setInputCols(["document", "sentence_embeddings"]) \
.setOutputCol("vectordb_result") \
.setProvider("pinecone") \
.setIndexName("my-semantic-index") \
.setNamespace("production") \
.setIdColumn("doc_id") \
.setMetadataColumns(["text", "category"]) \
.setBatchSize(100)
The Pinecone API key is configured via spark.jsl.settings.vectordb.api_key. See the VectorDBConnector Pinecone Demo notebook for a full walkthrough.
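The effect of setBatchSize(100) can be illustrated with a small standalone chunking helper. This is a sketch of the general batching pattern, not the connector's actual implementation:

```python
from typing import Iterator

def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield successive fixed-size batches, mirroring how a connector
    groups embedding records into upsert requests."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 250 embedding records with batch_size=100 -> request sizes 100, 100, 50
records = [{"id": f"doc-{i}", "values": [0.0] * 768} for i in range(250)]
sizes = [len(batch) for batch in batched(records, 100)]
print(sizes)  # [100, 100, 50]
```

Smaller batches keep individual upsert requests within provider payload limits; larger batches reduce round trips.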
LayoutAlignerForVision and LayoutAlignerForText
When processing rich documents like PDFs or PowerPoint presentations, text and images are spatially interleaved: a chart sits next to the paragraph it illustrates, and a diagram is surrounded by its explanation. Without layout awareness, VLMs operating on extracted content lose this spatial context entirely. LayoutAlignerForVision and LayoutAlignerForText solve this problem for teams building multimodal document intelligence pipelines.
LayoutAlignerForVision takes document chunks and images extracted by ReaderAssembler and aligns each image with its spatially nearby text paragraphs based on actual page coordinates. It produces three output columns <outputCol>_doc, <outputCol>_image, and <outputCol>_prompt, ready to be fed directly into a VLM (e.g. AutoGGUFVisionModel) for captioning or question answering.
Key parameters:
- setMaxDistance(int): Maximum vertical distance (px) for image-paragraph alignment
- setIncludeContextWindow(bool): Include neighboring paragraphs as context for floating images
- setAddNeighborText(bool): Include aligned text in the prompt output
- setImageCaptionBasePrompt(str): Customize the captioning prompt sent to downstream VLMs
- setNeighborTextCharsWindow(int): Include surrounding text characters as prompt context
- setExplodeDocs(bool): Emit one output row per aligned doc/image pair
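The vertical-distance alignment governed by setMaxDistance can be modeled in plain Python. This is an illustrative sketch of the matching rule, not the annotator's internal code; the coordinate fields stand in for the page geometry ReaderAssembler extracts:

```python
def align_images_to_paragraphs(images, paragraphs, max_distance):
    """For each image, collect paragraphs whose vertical (y) distance
    from the image is within max_distance pixels."""
    aligned = {}
    for img in images:
        aligned[img["name"]] = [
            p["text"] for p in paragraphs
            if abs(p["y"] - img["y"]) <= max_distance
        ]
    return aligned

images = [{"name": "chart_1.png", "y": 400}]
paragraphs = [
    {"text": "Revenue grew 12% in Q3.", "y": 380},  # 20 px away -> aligned
    {"text": "Unrelated footer text.", "y": 900},   # 500 px away -> skipped
]
print(align_images_to_paragraphs(images, paragraphs, max_distance=50))
# {'chart_1.png': ['Revenue grew 12% in Q3.']}
```

Tightening max_distance yields fewer, more precise text neighbors per image; loosening it trades precision for context.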
LayoutAlignerForText takes the VLM-generated image captions produced after LayoutAlignerForVision and weaves them back into the document's text flow, replacing raw image placeholders with meaningful captions and re-computing begin/end offsets so the resulting document is coherent for downstream NLP tasks.
Key parameters:
- setJoinDelimiter(str): Delimiter used to join rebuilt text segments
- setExplodeElements(bool): Emit one output row per aligned text element
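The caption-weaving step described above can be sketched with a small standalone function (illustrative only, with a hypothetical placeholder format) that swaps image placeholders for their VLM captions and recomputes begin/end offsets over the rebuilt text:

```python
def weave_captions(segments, captions, join_delimiter=" "):
    """Replace placeholder segments (e.g. '[IMAGE:chart_1]') with their
    captions, then recompute begin/end offsets over the rebuilt text."""
    texts = [captions.get(seg, seg) for seg in segments]
    rebuilt = join_delimiter.join(texts)
    annotations, cursor = [], 0
    for text in texts:
        # Spark NLP annotations use inclusive end offsets
        annotations.append({"begin": cursor, "end": cursor + len(text) - 1, "text": text})
        cursor += len(text) + len(join_delimiter)
    return rebuilt, annotations

segments = ["Quarterly results follow.", "[IMAGE:chart_1]", "Costs fell."]
captions = {"[IMAGE:chart_1]": "Bar chart of revenue by quarter."}
text, anns = weave_captions(segments, captions)
print(text)
# Quarterly results follow. Bar chart of revenue by quarter. Costs fell.
```

Recomputing offsets keeps downstream annotators (sentence detection, NER) consistent with the new, caption-enriched text.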
For extended examples and walkthroughs, refer to the notebook Spark NLP LayoutAligners for Document Understanding and our Medium blog post Efficient Document Ingestion with Layout Aware Annotators: A Case Study on Mixed-Type Documents.
MultiColumnAssembler
When using ReaderAssembler to process documents such as PDFs or PPTX files, content is extracted into separate typed columns: document_text, document_table, and image-related outputs. However, many downstream annotators expect a single input column. Previously, bridging this split required custom Spark transformations. MultiColumnAssembler fills this gap directly within Spark NLP pipelines.
It merges any number of DOCUMENT-type annotation columns into a single output column, preserving all annotation metadata and adding a source_column key to track provenance. Annotations can optionally be sorted by their begin offset using setSortByBegin(True).
:::python
multiColumnAssembler = MultiColumnAssembler() \
.setInputCols(["document_text", "document_table"]) \
.setOutputCol("merged_document")
Key parameters:
- setInputCols([...]): List of DOCUMENT-type annotation columns to merge
- setOutputAsAnnotatorType(str): Override the output annotator type (default: "document")
- setSortByBegin(bool): Sort merged annotations by begin position (default: False)

Note: Columns using the AnnotationImage schema (i.e., IMAGE-typed columns from ReaderAssembler) are not supported.
See the Merging Annotation Columns notebook for a full walkthrough.
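The merge-and-provenance behavior can be sketched with plain dictionaries standing in for annotations. This is an illustration of the semantics, not the annotator's implementation:

```python
def merge_annotation_columns(columns, sort_by_begin=False):
    """Merge several lists of DOCUMENT annotations into one, tagging each
    with a source_column metadata key; optionally sort by begin offset."""
    merged = []
    for col_name, annotations in columns.items():
        for ann in annotations:
            tagged = dict(ann)
            tagged["metadata"] = {**ann.get("metadata", {}), "source_column": col_name}
            merged.append(tagged)
    if sort_by_begin:
        merged.sort(key=lambda a: a["begin"])
    return merged

columns = {
    "document_text": [{"begin": 0, "end": 9, "result": "Some text."}],
    "document_table": [{"begin": 120, "end": 128, "result": "| a | b |"}],
}
merged = merge_annotation_columns(columns, sort_by_begin=True)
print([a["metadata"]["source_column"] for a in merged])
# ['document_text', 'document_table']
```

The source_column key lets downstream logic (e.g. prompt construction) distinguish text paragraphs from table content after the merge.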
LightPipeline Metadata Support
Users running inference with LightPipeline on data that carries additional context, such as document source, language, or category, previously had no way to pass that context through alongside the text. LightPipeline now supports passing metadata columns alongside text inputs in both annotate() and fullAnnotate(), enabling richer, context-aware inference for applications like routing, filtering, and conditional processing.
New supported call signatures:
- fullAnnotate(text: str, metadata: dict[str, list[str]])
- fullAnnotate(texts: list[str], metadata: list[dict])
- fullAnnotate(texts: list[str], metadata: dict[str, list[str]]) (columnar format)
- Same patterns apply to annotate()
Metadata can be passed as a keyword argument or as a positional trailing argument:
:::python
result = light_pipeline.fullAnnotate(
"U.N. official Ekeus heads for Baghdad.",
metadata={"source": ["news_article"]}
)
This feature is also surfaced through PretrainedPipeline.annotate() and PretrainedPipeline.fullAnnotate().
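The columnar format maps each metadata key to one value per input text. A small standalone helper (an illustration of the calling convention, not LightPipeline's code) shows how it expands into one dict per row:

```python
def columnar_to_rows(texts, metadata):
    """Expand columnar metadata {'key': [v1, v2, ...]} into one dict per text."""
    if len({len(values) for values in metadata.values()}) > 1:
        raise ValueError("all metadata columns must have the same length")
    return [
        {key: values[i] for key, values in metadata.items()}
        for i in range(len(texts))
    ]

texts = ["U.N. official Ekeus heads for Baghdad.", "Stocks rallied on Friday."]
metadata = {"source": ["news_article", "market_report"], "lang": ["en", "en"]}
print(columnar_to_rows(texts, metadata))
# [{'source': 'news_article', 'lang': 'en'}, {'source': 'market_report', 'lang': 'en'}]
```

The columnar form is convenient when the metadata already lives in DataFrame-style columns; the list-of-dicts form is convenient when each record arrives pre-assembled.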
🐛 Bug Fixes
- Apache POI upgraded to 5.4.1: The Apache POI dependency used by document readers has been upgraded from 4.1.2 to 5.4.1 (poi-ooxml-full) to avoid deprecated dependencies.
❤️ Community Support
- Slack real-time discussion with the Spark NLP community and team
- GitHub issue tracking, feature requests, and contributions
- Discussions community ideas and showcases
- Medium latest Spark NLP articles and tutorials
- YouTube educational videos and demos
💻 Installation
Python
:::bash
pip install spark-nlp==6.3.3
Spark Packages
CPU
:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.3.3
GPU
:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.3.3
Apple Silicon
:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.3.3
AArch64
:::bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.3.3
Maven
Supported on Apache Spark 3.x.
spark-nlp
:::xml
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>6.3.3</version>
</dependency>
spark-nlp-gpu
:::xml
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>6.3.3</version>
</dependency>
spark-nlp-silicon
:::xml
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>6.3.3</version>
</dependency>
spark-nlp-aarch64
:::xml
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>6.3.3</version>
</dependency>
FAT JARs
- CPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.3.3.jar
- GPU: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.3.3.jar
- Apple Silicon: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.3.3.jar
- AArch64: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.3.3.jar
What's Changed
- [SPARKNLP-1334] Updating POI dependency for readers [#14727] by @danilojsl
- [SPARKNLP-1335] Implement Layout Aligner annotators [#14737] by @danilojsl
- [SPARKNLP-1336] Enhancements to LightPipeline [#14734] by @danilojsl
- [SPARKNLP-1287] vector db connector annotator [#14729] by @ahmedlone127
- [SPARKNLP-1231] implement modern bert embeddings [#14736] by @ahmedlone127
- [SPARKNLP-1341] Add MultiColumnAssembler [#14743] by @AbdullahMubeenAnwar
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.3.2...6.3.3