Name | Modified | Size | Downloads / Week |
---|---|---|---|
6.0.3 source code.tar.gz | 2025-06-11 | 252.4 MB | |
6.0.3 source code.zip | 2025-06-11 | 433.6 MB | |
README.md | 2025-06-11 | 7.5 kB | |
Totals: 3 Items | 686.0 MB | 0 |
📢 Spark NLP 6.0.3: Multimodal E5-V Embeddings and Enhanced Document Partitioning
We are excited to announce the release of Spark NLP 6.0.3! This version introduces significant advancements in multimodal capabilities and further refines document processing workflows. Upgrade to 6.0.3 to leverage these cutting-edge features and expand your NLP and vision task capabilities at scale.
🔥 Highlights
- Introducing E5-V Universal Multimodal Embeddings: Support for `E5VEmbeddings`, enabling universal multimodal embeddings with Multimodal Large Language Models (MLLMs). It can express semantic similarity between texts, images, or a combination of both.
- Enhanced Document Partitioning: Improvements to the `Partition` and `PartitionTransformer` annotators with new character and title-based chunking strategies.
- New XML Reader: Added `sparknlp.read().xml()` and integrated XML support into the `Partition` annotator for streamlined XML document processing.
New Features & Enhancements
E5-V Multimodal Embeddings
This release further boosts Spark NLP's multimodal processing power with the integration of E5-V. `E5VEmbeddings` is designed to adapt MLLMs for achieving universal multimodal embeddings. It leverages MLLMs with prompts to effectively bridge the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. (Link to notebook)
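Embeddings that share one space across modalities are usually compared with cosine similarity. The following is a minimal sketch of that comparison in plain Python. It does not call Spark NLP, and the three vectors are made-up toy values standing in for embeddings of a text, an image, and an unrelated input; they are not output of `E5VEmbeddings`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only).
text_vec = [0.9, 0.1, 0.3]        # stands in for a text embedding
image_vec = [0.8, 0.2, 0.4]       # stands in for a matching image embedding
unrelated_vec = [-0.5, 0.9, -0.1]  # stands in for an unrelated input

matching = cosine_similarity(text_vec, image_vec)
unrelated = cosine_similarity(text_vec, unrelated_vec)
print(f"text vs. image:     {matching:.3f}")
print(f"text vs. unrelated: {unrelated:.3f}")
```

With real embeddings, a higher score between a caption and an image than between the caption and unrelated content is what a shared embedding space is meant to provide.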
Enhanced Unstructured Document Processing
The `Partition` and `PartitionTransformer` components now include additional chunking strategies and enhancements that divide content into meaningful units based on the document's structure or on a number of characters.
- New Chunking Strategies (Link to notebook)
    - Character Number Strategy (`maxCharacters`): Split documents by number of characters.
    - Title-Based Chunking Strategy (`byTitle`): Split documents by titles in the documents. Additional settings:
        - Soft Chunking Limit (`newAfterNChars`): Allows for early section breaks before reaching the `maxCharacters` threshold.
        - Contextual Overlap (`overlapAll`): Adds trailing context from the previous chunk to the next, improving semantic continuity.
- Enhancements
    - Page Boundary Splitting: Respects `pageNumber` metadata and starts a new section when a page changes.
    - Title Inclusion Behavior: Ensures titles are embedded within the following content rather than forming isolated chunks.
- New XML Reader: This release introduces a new feature that enables reading and parsing XML files into a structured Spark DataFrame. (Link to notebook)
    - Added `sparknlp.read().xml()`: This method accepts file paths of XML content.
    - Use in Partition: XML content can now be processed using the `Partition` annotator by setting `content_type = "application/xml"`.
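To make the chunking settings above concrete, here is a minimal plain-Python sketch of title-based chunking with a hard limit, a soft limit, overlap, and page-boundary splitting. It is not Spark NLP's implementation. The `Element` class, the `chunk_by_title` function, and the snake_case parameter names are hypothetical and only mirror `maxCharacters`, `newAfterNChars`, and `overlapAll` for illustration. It also assumes each element is shorter than the hard limit and does not split oversized elements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """One partitioned piece of a document (hypothetical shape, for illustration)."""
    kind: str              # "Title" or "NarrativeText" (illustrative labels)
    text: str
    page_number: int = 1


def chunk_by_title(
    elements: List[Element],
    max_characters: int,
    new_after_n_chars: Optional[int] = None,
    overlap: int = 0,
) -> List[str]:
    """Group elements into chunks, starting a new chunk at titles and page changes."""
    soft_limit = max_characters if new_after_n_chars is None else new_after_n_chars
    chunks: List[str] = []
    current = ""
    last_page: Optional[int] = None

    for el in elements:
        # A title or a page change opens a new section.
        starts_section = el.kind == "Title" or (
            last_page is not None and el.page_number != last_page
        )
        last_page = el.page_number

        candidate = f"{current} {el.text}" if current else el.text
        over_hard_limit = len(candidate) > max_characters
        past_soft_limit = len(current) >= soft_limit

        if current and (starts_section or over_hard_limit or past_soft_limit):
            chunks.append(current)
            current = el.text  # a title opens the chunk that its content will join
        else:
            current = candidate

    if current:
        chunks.append(current)

    # Simplification: overlap is applied afterwards, so an overlapped chunk
    # can exceed max_characters by up to overlap + 1 characters.
    if overlap > 0 and len(chunks) > 1:
        tails = [c[-overlap:] for c in chunks[:-1]]
        chunks = [chunks[0]] + [f"{t} {c}" for t, c in zip(tails, chunks[1:])]
    return chunks


elements = [
    Element("Title", "Intro"),
    Element("NarrativeText", "Spark NLP adds chunking."),
    Element("NarrativeText", "It splits by title."),
    Element("Title", "Usage"),
    Element("NarrativeText", "Set the strategy first."),
    Element("NarrativeText", "Page two text.", page_number=2),
]

for chunk in chunk_by_title(elements, max_characters=60):
    print(repr(chunk))
```

In this sketch each title stays attached to the text that follows it, and the element on page 2 starts its own chunk even though it would still fit under the hard limit. Passing a smaller `new_after_n_chars` closes chunks earlier, and a positive `overlap` prefixes each chunk with the tail of the previous one.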
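As a rough picture of what parsing XML into structured rows involves, the sketch below flattens an XML string into a list of records using only Python's standard library. It is not the Spark NLP XML reader and does not reproduce its output schema. The `xml_to_rows` helper, the `tag`/`text`/`attributes` field names, and the sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def xml_to_rows(xml_text: str) -> List[Dict[str, object]]:
    """Flatten every element that carries text into one row (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    rows: List[Dict[str, object]] = []
    for node in root.iter():  # document-order traversal, root included
        text = (node.text or "").strip()
        if text:  # skip pure container elements
            rows.append(
                {"tag": node.tag, "text": text, "attributes": dict(node.attrib)}
            )
    return rows


# Made-up sample document.
sample = """<library>
  <book id="1"><title>First title</title><year>2020</year></book>
  <book id="2"><title>Second title</title><year>2023</year></book>
</library>"""

rows = xml_to_rows(sample)
for row in rows:
    print(row)
```

Rows shaped like these map naturally onto DataFrame columns, which is the general idea behind loading XML into a tabular structure.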
Bug Fixes
- @thec0dewriter fixed a typo in our Excel reader notebook (https://github.com/JohnSnowLabs/spark-nlp/pull/14591). Thanks a lot!
❤️ Community Support
- Slack: For live discussion with the Spark NLP community and the team
- GitHub: Bug reports, feature requests, and contributions
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- JohnSnowLabs: official Medium
- YouTube: Spark NLP video tutorials
⚙️ Installation
Python
    :::sh
    # PyPI
    pip install spark-nlp==6.0.3

Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

    :::sh
    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.3
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.3

GPU

    :::sh
    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.3
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.3

Apple Silicon

    :::sh
    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.3
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.3

AArch64

    :::sh
    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.3
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.3
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
    :::xml
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>6.0.3</version>
    </dependency>

spark-nlp-gpu:

    :::xml
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>6.0.3</version>
    </dependency>

spark-nlp-silicon:

    :::xml
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-silicon_2.12</artifactId>
        <version>6.0.3</version>
    </dependency>

spark-nlp-aarch64:

    :::xml
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-aarch64_2.12</artifactId>
        <version>6.0.3</version>
    </dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.3.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.3.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.0.3.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.0.3.jar
What's Changed
- [SPARKNLP-1138] Adding semantic chunking to partition [#14593] by @danilojsl
- [SPARKNLP-1163] Adding title chunking strategy [#14594] by @danilojsl
- SparkNLP 1143 - Introducing e5-v universal embeddings with multimodal large language models [#14597] by @prabod
- Fix reference copy pasted from Excel reader [#14591] by @thec0dewriter
- [SPARKNLP-1119] Adding XML reader [#14598] by @danilojsl
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.0.2...6.0.3