Name	Modified	Size	Downloads / Week
6.0.3 source code.tar.gz	2025-06-11	252.4 MB	0
6.0.3 source code.zip	2025-06-11	433.6 MB	0
README.md	2025-06-11	7.5 kB	0
Totals: 3 Items		686.0 MB	0

📢 Spark NLP 6.0.3: Multimodal E5-V Embeddings and Enhanced Document Partitioning

We are excited to announce the release of Spark NLP 6.0.3! This version introduces significant advancements in multimodal capabilities and further refines document processing workflows. Upgrade to 6.0.3 to leverage these cutting-edge features and expand your NLP and vision task capabilities at scale.

🔥 Highlights

Introducing E5-V Universal Multimodal Embeddings: Support for E5VEmbeddings, enabling universal multimodal embeddings with Multimodal Large Language Models (MLLMs). It can express semantic similarity between texts, images, or a combination of both.
Enhanced Document Partitioning: Improvements to the Partition and PartitionTransformer annotators with new character and title-based chunking strategies.
New XML Reader: Added sparknlp.read().xml() and integrated XML support into the Partition annotator for streamlined XML document processing.

🚀 New Features & Enhancements

E5-V Multimodal Embeddings

This release further boosts Spark NLP's multimodal processing power with the integration of E5-V.

E5VEmbeddings is designed to adapt MLLMs for achieving universal multimodal embeddings. It leverages MLLMs with prompts to effectively bridge the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. (Link to notebook)
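Embeddings that share one space are usually compared with cosine similarity. The following is a minimal sketch in plain Python of that comparison. The vectors are made-up toy values, not real E5-V outputs, and `cosine_similarity` is a hypothetical helper that is not part of Spark NLP.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (assumed values, for illustration only).
text_vec = [0.9, 0.1, 0.0]   # e.g. a caption
image_vec = [0.8, 0.2, 0.1]  # e.g. an image matching the caption
other_vec = [0.0, 0.1, 0.9]  # e.g. an unrelated image

sim_match = cosine_similarity(text_vec, image_vec)
sim_other = cosine_similarity(text_vec, other_vec)
print(sim_match > sim_other)
```

With real embeddings, a text and an image describing the same thing would be expected to score higher than an unrelated pair, which is what the toy values imitate.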

Enhanced Unstructured Document Processing

The Partition and PartitionTransformer components now include additional chunking strategies and enhancements, which divide content into meaningful units based on the document's structure or number of characters.
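To make the two ideas concrete, here is a rough sketch in plain Python of character-based and title-based chunking. It is not Spark NLP's implementation: the function names, the `(kind, text, page)` tuple layout, and the `"title"`/`"text"` labels are assumptions made for illustration, and an element longer than the hard limit is left unsplit for brevity.

```python
def chunk_by_characters(text, max_characters, overlap=0):
    """Split text into windows of at most max_characters, each repeating
    the last `overlap` characters of the previous window."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += max_characters - overlap
    return chunks


def chunk_by_title(elements, max_characters, new_after_n_chars=None):
    """Group (kind, text, page) tuples into title-led sections."""
    soft_limit = max_characters if new_after_n_chars is None else new_after_n_chars
    chunks, current, current_page = [], [], None
    for kind, text, page in elements:
        current_len = len(" ".join(current))
        starts_section = kind == "title" or page != current_page
        too_long = current_len + 1 + len(text) > max_characters
        past_soft_limit = current_len >= soft_limit
        if current and (starts_section or too_long or past_soft_limit):
            chunks.append(" ".join(current))
            current = []
        current.append(text)  # a title stays attached to the text that follows it
        current_page = page
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical parsed document: (kind, text, page number).
elements = [
    ("title", "Intro", 1),
    ("text", "Spark NLP runs on Apache Spark.", 1),
    ("text", "It scales to clusters.", 1),
    ("title", "Setup", 1),
    ("text", "Install with pip.", 1),
    ("text", "Then start a session.", 2),
]

windows = chunk_by_characters("abcdefghij", max_characters=4, overlap=1)
sections = chunk_by_title(elements, max_characters=200)
soft = chunk_by_title(elements[:3], max_characters=200, new_after_n_chars=20)
```

In this sketch a new section starts at every title and at every page change, the soft limit closes a chunk early once it has grown past `new_after_n_chars`, and the hard limit closes it before an element would push it past `max_characters`.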

- New Chunking Strategies (Link to notebook)
  - Character Number Strategy (maxCharacters): Split documents by number of characters.
  - Title-Based Chunking Strategy (byTitle): Split documents by titles in the documents. Additional settings:
    - Soft Chunking Limit (newAfterNChars): Allows for early section breaks before reaching the maxCharacters threshold.
    - Contextual Overlap (overlapAll): Adds trailing context from the previous chunk to the next, improving semantic continuity.
- Enhancements
  - Page Boundary Splitting: Respects pageNumber metadata and starts a new section when a page changes.
  - Title Inclusion Behavior: Ensures titles are embedded within the following content rather than forming isolated chunks.
- New XML Reader: This release introduces a new feature that enables reading and parsing XML files into a structured Spark DataFrame. (Link to notebook)
  - Added sparknlp.read().xml(): This method accepts file paths of XML content.
  - Use in Partition: XML content can now be processed using the Partition annotator by setting content_type = "application/xml".
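As an illustration of what turning XML into structured rows can look like, here is a minimal sketch that uses only Python's standard library. It does not call Spark NLP, and the row layout (`path`, `tag`, `text`) is an assumption chosen for the example rather than the schema the new reader produces.

```python
import xml.etree.ElementTree as ET


def xml_to_rows(xml_text):
    """Flatten an XML document into one row per element that carries text."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(node, parent_path):
        path = f"{parent_path}/{node.tag}"
        text = (node.text or "").strip()
        if text:  # skip elements that only contain whitespace or children
            rows.append({"path": path, "tag": node.tag, "text": text})
        for child in node:
            walk(child, path)

    walk(root, "")
    return rows


# Hypothetical sample document.
sample = """<library>
  <book id="1">
    <title>Spark NLP Basics</title>
    <author>Jane Doe</author>
  </book>
</library>"""

rows = xml_to_rows(sample)
for row in rows:
    print(row["path"], "->", row["text"])
```

Rows shaped like these map naturally onto DataFrame columns, which is the general idea behind reading XML into a tabular structure.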

🐛 Bug Fixes

@thec0dewriter fixed a typo in our Excel reader notebook (https://github.com/JohnSnowLabs/spark-nlp/pull/14591). Thanks a lot!

❤️ Community Support

- Slack: For live discussion with the Spark NLP community and the team
- GitHub: Bug reports, feature requests, and contributions
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- JohnSnowLabs: Official Medium
- YouTube: Spark NLP video tutorials

⚙️ Installation

Python

```sh
# PyPI
pip install spark-nlp==6.0.3
```

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

```sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.3
```

GPU

```sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.3
```

Apple Silicon

```sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.3
```

AArch64

```sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.3
```

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.0.3</version>
</dependency>
```

spark-nlp-gpu:

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.0.3</version>
</dependency>
```

spark-nlp-silicon:

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.0.3</version>
</dependency>
```

spark-nlp-aarch64:

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.0.3</version>
</dependency>
```

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.3.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.3.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.0.3.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.0.3.jar
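The package coordinates and FAT JAR URLs above follow one naming pattern per build variant. As a small convenience sketch, the helpers below rebuild those strings in Python. The function names and the variant keys (`cpu`, `gpu`, `silicon`, `aarch64`) are hypothetical and exist only for this example; the resulting strings are the ones listed in this section.

```python
VERSION = "6.0.3"
SCALA_VERSION = "2.12"
JAR_BASE_URL = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars"

# Hypothetical variant keys mapped to the artifact names listed in this section.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def maven_coordinate(variant="cpu"):
    """Return the value to pass to --packages for the given variant."""
    return f"com.johnsnowlabs.nlp:{ARTIFACTS[variant]}_{SCALA_VERSION}:{VERSION}"


def fat_jar_url(variant="cpu"):
    """Return the FAT JAR download URL for the given variant."""
    return f"{JAR_BASE_URL}/{ARTIFACTS[variant]}-assembly-{VERSION}.jar"


print(maven_coordinate("gpu"))
print(fat_jar_url("aarch64"))
```

Keeping the version in one place makes it harder to mix, for example, a 6.0.3 Python package with an older JAR.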

What's Changed

[SPARKNLP-1138] Adding semantic chunking to partition [#14593] by @danilojsl
[SPARKNLP-1163] Adding title chunking strategy [#14594] by @danilojsl
SparkNLP 1143 - Introducing e5-v universal embeddings with multimodal large language models [#14597] by @prabod
Fix reference copy pasted from Excel reader [#14591] by @thec0dewriter
[SPARKNLP-1119] Adding XML reader [#14598] by @danilojsl

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.0.2...6.0.3

Source: README.md, updated 2025-06-11

Spark NLP Files

State of the Art Natural Language Processing
