📢 Spark NLP 6.0.4: MiniLMEmbeddings, DataFrame Optimization, and Enhanced PDF Processing

We are excited to announce the release of Spark NLP 6.0.4! This version adds the MiniLM family of text embeddings, a DataFrameOptimizer transformer for Spark DataFrames, and new PDF parsing options. Upgrade to 6.0.4 to use these features in your NLP pipelines at scale.

Stay updated with our latest examples and tutorials by visiting the Spark NLP blog on Medium!

🔥 Highlights

Introducing MiniLMEmbeddings: Support for MiniLM models through the new MiniLMEmbeddings annotator, providing compact and efficient text representations.
New DataFrameOptimizer: A transformer that optimizes Spark DataFrame operations, offering configurable repartitioning, caching, and persistence options.
Advanced PDF Reader Features: Enhancements to the PDF Reader with extractCoordinates for spatial metadata, normalizeLigatures for improved text consistency, and a new exception column for enhanced fault tolerance.

🚀 New Features & Enhancements

Advanced Text Embeddings

This release introduces a new family of efficient text embedding models:

MiniLMEmbeddings: Support for the MiniLMEmbeddings annotator, which uses MiniLM models to generate sentence embeddings. These models are designed to deliver strong performance while being significantly smaller and faster than full-size transformer models, making them a good fit for NLP tasks that need compact text representations. (Link to notebook)

Spark DataFrame Optimization

DataFrameOptimizer: Introducing the new DataFrameOptimizer transformer, designed to enhance the performance and manageability of Spark DataFrames within your NLP pipelines. (Link to notebook)
Configurable Repartitioning: Allows for automatic repartitioning of DataFrames, ensuring optimal data distribution for downstream processing.
Optional Caching: Supports DataFrame caching (doCache) to significantly speed up iterative computations.
Persistent Output: Adds robust support for persisting DataFrames to disk in various formats (csv, json, parquet) with custom writer options via outputOptions.
Schema Preservation: Efficiently preserves the original DataFrame schema, making it a seamless utility for complex Spark NLP pipelines.

Enhanced PDF Document Processing

The PDF Reader and PdfToText transformer have been significantly improved for more comprehensive and fault-tolerant document parsing. (Link to notebook)

Spatial Metadata Extraction (extractCoordinates): A new configurable parameter extractCoordinates in PdfToText and the PDF Reader. When enabled, this outputs detailed spatial metadata (text position and dimensions) for each character in the PDF.
Ligature Normalization (normalizeLigatures): When extractCoordinates is enabled, the normalizeLigatures option ensures that ligature characters (e.g., ﬁ, ﬂ, œ) are automatically normalized to their decomposed forms (fi, fl, oe).
Fault Tolerance with Exception Column: A new exception output column has been introduced to capture and log any processing errors encountered while handling individual PDF documents.
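
The two text-level ideas above, ligature normalization and a per-document exception column, can be sketched in plain Python. This is an illustration only and not Spark NLP's implementation: `LIGATURES`, `normalize_ligatures`, `parse_all`, and the stand-in parser are hypothetical names, and the ligature table covers only the three examples named above.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical table limited to the ligatures named in the release notes:
# U+FB01 (fi ligature), U+FB02 (fl ligature), U+0153 (oe ligature).
LIGATURES = str.maketrans({"\ufb01": "fi", "\ufb02": "fl", "\u0153": "oe"})


def normalize_ligatures(text: str) -> str:
    """Replace each known ligature character with its decomposed letters."""
    return text.translate(LIGATURES)


def parse_all(
    docs: Dict[str, bytes],
    parse: Callable[[bytes], str],
) -> List[Dict[str, Optional[str]]]:
    """Parse every document, recording failures per row instead of raising.

    Each row carries an "exception" field that is None on success and an
    error description on failure, mirroring the idea of an exception column.
    """
    rows: List[Dict[str, Optional[str]]] = []
    for path, raw in docs.items():
        try:
            text = normalize_ligatures(parse(raw))
            rows.append({"path": path, "text": text, "exception": None})
        except Exception as err:  # one bad document must not stop the batch
            rows.append(
                {"path": path, "text": None,
                 "exception": f"{type(err).__name__}: {err}"}
            )
    return rows


def utf8_parser(raw: bytes) -> str:
    """Stand-in for a real PDF parser: decodes bytes, fails on invalid UTF-8."""
    return raw.decode("utf-8")
```

Calling `parse_all` with one valid and one corrupt input returns two rows: the first with normalized text and an empty exception field, the second with no text and the error recorded. Rows with a non-empty exception field can then be filtered out or inspected separately.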

❤️ Community Support

Slack: For live discussion with the Spark NLP community and the team
GitHub: Bug reports, feature requests, and contributions
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium: Spark NLP articles
JohnSnowLabs: Official Medium
YouTube: Spark NLP video tutorials

⚙️ Installation

Python

:::sh
#PyPI
pip install spark-nlp==6.0.4
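
After installing, it can help to confirm which version pip actually resolved. The following is a minimal sketch using only the Python standard library. The helper names `installed_version` and `describe` are hypothetical, and the distribution name `spark-nlp` is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional

EXPECTED_VERSION = "6.0.4"  # the release described in these notes


def installed_version(dist: str = "spark-nlp") -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def describe(found: Optional[str], expected: str = EXPECTED_VERSION) -> str:
    """Summarize whether the found version matches the expected one."""
    if found is None:
        return "spark-nlp is not installed"
    if found == expected:
        return f"spark-nlp {found} is installed"
    return f"spark-nlp {found} is installed, expected {expected}"


if __name__ == "__main__":
    print(describe(installed_version()))
```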

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

:::sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.4

GPU

:::sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.4

Apple Silicon

:::sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.4

AArch64

:::sh
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.4

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.0.4</version>
</dependency>

spark-nlp-gpu:

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.0.4</version>
</dependency>

spark-nlp-silicon:

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.0.4</version>
</dependency>

spark-nlp-aarch64:

:::xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.0.4</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.4.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.4.jar
Apple Silicon on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.0.4.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.0.4.jar
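
The Maven coordinates and FAT JAR URLs listed in this section all follow one naming pattern, which differs only in the variant suffix. As a convenience for scripted deployments, the pattern can be sketched as below. The helper names and the `VARIANTS` keys are hypothetical; the group ID, Scala suffix, version, and base URL are copied from the entries above.

```python
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_VERSION = "2.12"
RELEASE = "6.0.4"
JAR_BASE = "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars"

# Hypothetical lookup keys mapped to the artifact suffixes used in this section.
VARIANTS = {
    "cpu": "",
    "gpu": "-gpu",
    "silicon": "-silicon",
    "aarch64": "-aarch64",
}


def maven_coordinate(variant: str = "cpu") -> str:
    """Build the value passed to --packages for the given variant."""
    suffix = VARIANTS[variant]  # raises KeyError for an unknown variant
    return f"{GROUP_ID}:spark-nlp{suffix}_{SCALA_VERSION}:{RELEASE}"


def fat_jar_url(variant: str = "cpu") -> str:
    """Build the FAT JAR download URL for the given variant."""
    suffix = VARIANTS[variant]
    return f"{JAR_BASE}/spark-nlp{suffix}-assembly-{RELEASE}.jar"
```

For example, `maven_coordinate("gpu")` produces the same string used in the GPU `spark-shell --packages` command, and `fat_jar_url("aarch64")` produces the AArch64 link in the list above.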

What's Changed

Sparknlp 282 Introducing MiniLMEmbeddings [#14610] by @prabod
[SPARKNLP-1086] Introducing DataFrameOptimizer [#14607] by @danilojsl
[SPARKNLP-1161] Adding features to PDF Reader [#14596] by @danilojsl

Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/6.0.3...6.0.4

Source: README.md, updated 2025-06-30
