Name | Modified | Size | Downloads / Week |
---|---|---|---|
6.0.4 source code.tar.gz | 2025-06-30 | 253.5 MB | |
6.0.4 source code.zip | 2025-06-30 | 434.8 MB | |
README.md | 2025-06-30 | 7.6 kB | |
Totals: 3 Items | | 688.4 MB | 0 |
📢 Spark NLP 6.0.4: MiniLMEmbeddings, DataFrame Optimization, and Enhanced PDF Processing
We are excited to announce the release of Spark NLP 6.0.4! This version brings advancements in text embeddings with the introduction of the MiniLM family, Spark DataFrame optimizations, and enhanced PDF document parsing. Upgrade to 6.0.4 to leverage these cutting-edge features and expand your NLP capabilities at scale.
Stay updated with our latest examples and tutorials by visiting our Medium - Spark NLP blog!
🔥 Highlights
- Introducing MiniLMEmbeddings: Support for the efficient and powerful MiniLMEmbeddings models, providing state-of-the-art text representations.
- New DataFrameOptimizer: A new DataFrameOptimizer transformer to streamline and optimize Spark DataFrame operations, offering configurable repartitioning, caching, and persistence options.
- Advanced PDF Reader Features: Enhancements to the PDF Reader with extractCoordinates for spatial metadata, normalizeLigatures for improved text consistency, and a new exception column for enhanced fault tolerance.
🚀 New Features & Enhancements
Advanced Text Embeddings
This release introduces a new family of efficient text embedding models:
- MiniLMEmbeddings: Support for the MiniLMEmbeddings annotator, enabling the use of MiniLM models for generating highly efficient and effective sentence embeddings. These models are designed to provide strong performance while being significantly smaller and faster than larger alternatives, making them ideal for a wide range of NLP tasks requiring compact and powerful text representations. (Link to notebook)
Spark DataFrame Optimization
- DataFrameOptimizer: Introducing the new DataFrameOptimizer transformer, designed to enhance the performance and manageability of Spark DataFrames within your NLP pipelines. (Link to notebook)
- Configurable Repartitioning: Allows for automatic repartitioning of DataFrames, ensuring optimal data distribution for downstream processing.
- Optional Caching: Supports DataFrame caching (doCache) to significantly speed up iterative computations.
- Persistent Output: Adds robust support for persisting DataFrames to disk in various formats (csv, json, parquet) with custom writer options via outputOptions.
- Schema Preservation: Efficiently preserves the original DataFrame schema, making it a seamless utility for complex Spark NLP pipelines.
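The three steps listed above (repartition, cache, persist) can be sketched with the plain Spark DataFrame API. This is not the DataFrameOptimizer implementation or its parameter list: the function `optimize_dataframe` and the argument names `num_partitions`, `persist_path`, and `persist_format` are hypothetical and chosen only for illustration. Only `doCache` and `outputOptions` are named in these notes.

```python
# Illustrative sketch only: mirrors the repartition -> cache -> persist steps
# using standard DataFrame methods. All names here are hypothetical and are
# not taken from the DataFrameOptimizer API.
SUPPORTED_FORMATS = ("csv", "json", "parquet")


def optimize_dataframe(df, num_partitions=None, do_cache=False,
                       persist_path=None, persist_format="parquet",
                       output_options=None):
    """Optionally repartition, cache, and write a DataFrame, then return it."""
    if num_partitions is not None:
        # Redistribute rows across a fixed number of partitions.
        df = df.repartition(num_partitions)
    if do_cache:
        # Mark the DataFrame for caching so repeated downstream actions reuse it.
        df = df.cache()
    if persist_path is not None:
        if persist_format not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {persist_format!r}")
        writer = df.write.format(persist_format)
        if output_options:
            # Pass custom writer options, e.g. {"header": "true"} for csv.
            writer = writer.options(**output_options)
        writer.save(persist_path)
    # The columns are untouched, so the schema is the same as the input's.
    return df
```

With a running SparkSession, a call such as `optimize_dataframe(spark.range(100), num_partitions=8, do_cache=True)` would return a DataFrame with the same schema as its input. None of the three steps adds or removes columns.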
Enhanced PDF Document Processing
The PDF Reader and PdfToText transformer have been significantly improved for more comprehensive and fault-tolerant document parsing. (Link to notebook)
- Spatial Metadata Extraction (extractCoordinates): A new configurable parameter extractCoordinates in PdfToText and the PDF Reader. When enabled, this outputs detailed spatial metadata (text position and dimensions) for each character in the PDF.
- Ligature Normalization (normalizeLigatures): When extractCoordinates is enabled, the normalizeLigatures option ensures that ligature characters (e.g., ﬁ, ﬂ, œ) are automatically normalized to their decomposed forms (fi, fl, oe).
- Fault Tolerance with Exception Column: A new exception output column has been introduced to capture and log any processing errors encountered while handling individual PDF documents.
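To make the ligature normalization described above concrete, here is a minimal standalone sketch in plain Python. It is not the PDF Reader's implementation, and the function name `normalize_ligatures` is hypothetical. It assumes that the Latin presentation-form ligatures (U+FB00 to U+FB06) can be expanded with Unicode NFKC normalization. NFKC leaves œ unchanged, so that character needs an explicit mapping.

```python
import unicodedata

# NFKC does not decompose these characters, so they are mapped explicitly.
EXPLICIT_MAP = {"œ": "oe", "Œ": "OE"}


def normalize_ligatures(text):
    """Replace ligature characters with their decomposed letter sequences."""
    out = []
    for ch in text:
        if ch in EXPLICIT_MAP:
            out.append(EXPLICIT_MAP[ch])
        elif 0xFB00 <= ord(ch) <= 0xFB06:
            # Latin ligature presentation forms (ff, fi, fl, ffi, ffl, ...)
            # carry a compatibility decomposition that NFKC applies.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


print(normalize_ligatures("ﬁnal ﬂow cœur"))  # prints "final flow coeur"
```

Normalizing only the listed ranges, rather than applying NFKC to the whole string, leaves unrelated characters such as superscripts or full-width forms untouched.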
❤️ Community Support
- Slack: For live discussion with the Spark NLP community and the team
- GitHub: Bug reports, feature requests, and contributions
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- JohnSnowLabs: Official Medium
- YouTube: Spark NLP video tutorials
⚙️ Installation
Python
    :::sh
    # PyPI
    pip install spark-nlp==6.0.4
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
    :::sh
    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.4
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.4
GPU
    :::sh
    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.4
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.4
Apple Silicon
    :::sh
    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.4
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.4
AArch64
    :::sh
    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.4
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.4
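The `--packages` coordinates above all follow one pattern: group ID, artifact name with the Scala suffix, and version. The sketch below builds that coordinate for each variant and wraps it in the `spark.jars.packages` setting, which is the Spark configuration equivalent of `--packages`. The helper names `package_coordinate` and `spark_packages_conf` and the variant keys are hypothetical and used only for illustration.

```python
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_VERSION = "2.12"

# Variant key (hypothetical label) -> artifact name as listed in these notes.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def package_coordinate(variant="cpu", version="6.0.4"):
    """Return the Maven coordinate string for the given hardware variant."""
    if variant not in ARTIFACTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"{GROUP_ID}:{ARTIFACTS[variant]}_{SCALA_VERSION}:{version}"


def spark_packages_conf(variant="cpu", version="6.0.4"):
    """Return a config dict that can be applied to a SparkSession builder."""
    return {"spark.jars.packages": package_coordinate(variant, version)}


print(package_coordinate("gpu"))
# prints "com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.4"
```

Each key and value of the returned dictionary can be passed to `SparkSession.builder.config(key, value)` when the session is created from Python instead of from the `pyspark` command line.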
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
    :::xml
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>6.0.4</version>
    </dependency>
spark-nlp-gpu:
    :::xml
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>6.0.4</version>
    </dependency>
spark-nlp-silicon:
    :::xml
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-silicon_2.12</artifactId>
        <version>6.0.4</version>
    </dependency>
spark-nlp-aarch64:
    :::xml
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-aarch64_2.12</artifactId>
        <version>6.0.4</version>
    </dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.4.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.4.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.0.4.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.0.4.jar
What's Changed
- Sparknlp 282 Introducing MiniLMEmbeddings [#14610] by @prabod
- [SPARKNLP-1086] Introducing DataFrameOptimizer [#14607] by @danilojsl
- [SPARKNLP-1161] Adding features to PDF Reader [#14596] by @danilojsl
Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.0.3...6.0.4