
📢 Spark NLP 6.0.3: Multimodal E5-V Embeddings and Enhanced Document Partitioning

We are excited to announce the release of Spark NLP 6.0.3! This version introduces significant advancements in multimodal capabilities and further refines document processing workflows. Upgrade to 6.0.3 to leverage these cutting-edge features and expand your NLP and vision task capabilities at scale.

🔥 Highlights

  • Introducing E5-V Universal Multimodal Embeddings: Support for E5VEmbeddings, enabling universal multimodal embeddings with Multimodal Large Language Models (MLLMs). It can express semantic similarity between texts, images, or a combination of both.
  • Enhanced Document Partitioning: Improvements to the Partition and PartitionTransformer annotators with new character and title-based chunking strategies.
  • New XML Reader: Added sparknlp.read().xml() and integrated XML support into the Partition annotator for streamlined XML document processing.

🚀 New Features & Enhancements

E5-V Multimodal Embeddings

This release further boosts Spark NLP's multimodal processing power with the integration of E5-V.

  • E5VEmbeddings is designed to adapt MLLMs for achieving universal multimodal embeddings. It leverages MLLMs with prompts to effectively bridge the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. (Link to notebook)
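
The release notes do not say how E5-V vectors are compared downstream. A common choice, assumed here, is cosine similarity between the embedding of a text and the embedding of an image. The sketch below uses plain Python and made-up three-dimensional vectors (not real E5VEmbeddings output) purely to illustrate that comparison step.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings; the values are invented for illustration.
text_vec = [0.9, 0.1, 0.0]       # e.g. the embedding of a caption
image_vec = [0.8, 0.2, 0.1]      # e.g. the embedding of a matching image
unrelated_vec = [0.0, 0.1, 0.9]  # e.g. the embedding of an unrelated image

print(cosine_similarity(text_vec, image_vec))
print(cosine_similarity(text_vec, unrelated_vec))
```

Because texts and images share one embedding space, the same function would apply to text-text, image-image, and text-image pairs.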

Enhanced Unstructured Document Processing

The Partition and PartitionTransformer components now include additional chunking strategies and enhancements that divide content into meaningful units based on the document's structure or number of characters.

  • New Chunking Strategies (Link to notebook)
    • Character Number Strategy (maxCharacters): Split documents by number of characters.
    • Title-Based Chunking Strategy (byTitle): Split documents by titles in the documents. Additional settings:
      • Soft Chunking Limit (newAfterNChars): Allows for early section breaks before reaching the maxCharacters threshold.
      • Contextual Overlap (overlapAll): Adds trailing context from the previous chunk to the next, improving semantic continuity.
  • Enhancements
    • Page Boundary Splitting: Respects pageNumber metadata and starts a new section when a page changes.
    • Title Inclusion Behavior: Ensures titles are embedded within the following content rather than forming isolated chunks.
  • New XML Reader: This release introduces a new feature that enables reading and parsing XML files into a structured Spark DataFrame. (Link to notebook)
    • Added sparknlp.read().xml(): This method accepts file paths of XML content.
    • Use in Partition: XML content can now be processed using the Partition annotator by setting content_type = "application/xml".
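
To make the settings above concrete, here is a minimal pure-Python sketch of the idea. It is not Spark NLP's implementation and does not call the Partition annotator. It flattens a small XML string into rows with the standard library, then groups the rows into title-led chunks. The sample XML, its tag names, the helper functions, and the snake_case parameters (`max_characters`, `new_after_n_chars`, `overlap`) are all hypothetical stand-ins for the behavior described in the list. Page-boundary splitting is not modeled.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names are illustrative only.
SAMPLE_XML = """
<doc>
  <title>Intro</title>
  <p>Spark NLP scales NLP pipelines.</p>
  <p>It runs on Apache Spark.</p>
  <title>Usage</title>
  <p>Install the package and start a session.</p>
</doc>
"""


def xml_to_rows(xml_text):
    """Flatten an XML string into (kind, text) rows in document order."""
    rows = []
    for elem in ET.fromstring(xml_text.strip()).iter():
        text = (elem.text or "").strip()
        if text:
            kind = "title" if elem.tag == "title" else "text"
            rows.append((kind, text))
    return rows


def chunk_by_title(rows, max_characters=80, new_after_n_chars=None, overlap=0):
    """Group rows into chunks; a title opens a new chunk and stays with its content.

    Simplification: a single element longer than max_characters is not split.
    """
    soft_limit = max_characters if new_after_n_chars is None else new_after_n_chars
    chunks = []
    current = ""
    for kind, text in rows:
        if kind == "title":
            if current:
                chunks.append(current)
            current = text  # the title is kept with the content that follows it
            continue
        candidate = f"{current} {text}" if current else text
        # Break early once the soft limit is reached, or when the hard limit
        # would be exceeded by adding this element.
        if current and (len(current) >= soft_limit or len(candidate) > max_characters):
            chunks.append(current)
            tail = current[-overlap:] if overlap > 0 else ""
            candidate = f"{tail} {text}" if tail else text
        current = candidate
    if current:
        chunks.append(current)
    return chunks


rows = xml_to_rows(SAMPLE_XML)
for chunk in chunk_by_title(rows, max_characters=80, new_after_n_chars=30, overlap=10):
    print(repr(chunk))
```

Lowering `new_after_n_chars` produces more, shorter chunks without changing the hard limit, and a non-zero `overlap` copies the tail of the previous chunk into the next one within a section.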

🐛 Bug Fixes

❤️ Community Support

  • Slack: For live discussion with the Spark NLP community and the team
  • GitHub: Bug reports, feature requests, and contributions
  • Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
  • Medium: Spark NLP articles
  • JohnSnowLabs: Official Medium
  • YouTube: Spark NLP video tutorials

⚙️ Installation

Python

:::sh
#PyPI
pip install spark-nlp==6.0.3

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

:::sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.3

GPU

:::sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.3

Apple Silicon

:::sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.3

AArch64

:::sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.3

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.0.3</version>
</dependency>

spark-nlp-gpu:

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.0.3</version>
</dependency>

spark-nlp-silicon:

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.0.3</version>
</dependency>

spark-nlp-aarch64:

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.0.3</version>
</dependency>

FAT JARs

What's Changed

  • [SPARKNLP-1138] Adding semantic chunking to partition [#14593] by @danilojsl
  • [SPARKNLP-1163] Adding title chunking strategy [#14594] by @danilojsl
  • SparkNLP 1143 - Introducing e5-v universal embeddings with multimodal large language models [#14597] by @prabod
  • Fix reference copy pasted from Excel reader [#14591] by @thec0dewriter
  • [SPARKNLP-1119] Adding XML reader [#14598] by @danilojsl

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.0.2...6.0.3

Source: README.md, updated 2025-06-11