Cleanlab - Browse /v2.7.0 at SourceForge.net

Name	Modified	Size	Downloads / Week
README.md	2024-09-26	5.2 kB	0
v2.7.0 -- Broadening Data Quality Checks and ML Workflows source code.tar.gz	2024-09-26	1.3 MB	0
v2.7.0 -- Broadening Data Quality Checks and ML Workflows source code.zip	2024-09-26	1.5 MB	0
Totals: 3 Items		2.8 MB	0

This release introduces new features and improvements aimed at helping users detect complex dataset issues and improve their ML models' robustness. As always, we maintain backward compatibility, making this release non-breaking when upgrading from v2.6.6. We continue to support Python 3.8-3.11 in this version, but support for Python 3.8 will be dropped in a future minor release.

Introducing Spurious Correlation Detection in Datalab

With this release, Datalab now detects spurious correlations in image datasets by default, helping users identify potentially misleading patterns that may lead to overfitting or reduced model generalization.

Spurious correlations occur when models pick up on patterns in the data that are coincidental rather than meaningful. For example, a model might incorrectly associate the background color with a particular label, leading to poor generalization on new data. Identifying these correlations helps ensure more reliable models by minimizing the risk of learning from irrelevant or misleading features.
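To make the idea concrete, here is a minimal, self-contained sketch of how a simple image property can end up predicting the label. It is not how Datalab computes its scores; the synthetic images, the helper names (`make_image`, `brightness`, `best_threshold_accuracy`), and the single-threshold rule are all assumptions chosen for illustration. The sketch builds tiny grayscale "images" in which one class happens to be brighter, then checks how well brightness alone predicts the label compared with a majority-class baseline and with a shuffled-label control.

```python
import random
from statistics import mean


def make_image(rng, base, size=8):
    """Hypothetical synthetic grayscale image: noisy pixels around a base brightness."""
    return [
        [min(1.0, max(0.0, rng.gauss(base, 0.05))) for _ in range(size)]
        for _ in range(size)
    ]


def brightness(image):
    """Mean pixel value of an image stored as a list of rows."""
    return mean(pixel for row in image for pixel in row)


def best_threshold_accuracy(values, labels):
    """Best accuracy of any single-threshold rule predicting a binary label."""
    n = len(values)
    best = 0.0
    for t in sorted(set(values)):
        acc = sum((v >= t) == (y == 1) for v, y in zip(values, labels)) / n
        best = max(best, acc, 1.0 - acc)  # allow the rule in either direction
    return best


rng = random.Random(0)
labels = [i % 2 for i in range(200)]

# Spurious setup (assumed for illustration): class 1 images are brighter.
images = [make_image(rng, 0.7 if y == 1 else 0.3) for y in labels]
values = [brightness(img) for img in images]

baseline = max(mean(labels), 1 - mean(labels))  # majority-class accuracy
spurious_acc = best_threshold_accuracy(values, labels)

# Control: shuffle the labels so brightness carries no label information.
shuffled = labels[:]
rng.shuffle(shuffled)
control_acc = best_threshold_accuracy(values, shuffled)

print(f"baseline={baseline:.2f} spurious={spurious_acc:.2f} control={control_acc:.2f}")
```

When a single property such as brightness predicts the label far better than the baseline, a model can lean on that property instead of the actual image content. The shuffled-label control shows what the same check looks like when no such shortcut exists.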

Detecting spurious correlations in image datasets is straightforward:

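    :::python
    from cleanlab import Datalab

    # image_dataset is a placeholder for your own data, e.g. a Hugging Face Dataset
    # with an image column ("image_column") and a label column ("label_column")
    lab = Datalab(data=image_dataset, label_name="label_column", image_key="image_column")

    # Run the default checks, which now include spurious correlations for image data
    lab.find_issues()

    # Print a summary of the detected issues
    lab.report()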

You can find a more detailed workflow for finding spurious correlations in our documentation.

This new issue type aims to give users deeper insights into their data, enabling more robust model development.

New Tutorial: Improving ML Performance with Train and Test Set Curation

We've introduced a new tutorial that demonstrates how to carefully use cleanlab (via Datalab) for both training and test data. This approach helps ensure reliable ML model training and evaluation, particularly for noisy datasets.

You can find this tutorial in our documentation: Improving ML Performance via Data Curation with Train vs Test Splits.

Other Major Improvements

Optimized Internal Functions: Several internal functions have been optimized, including clip_noise_rates, remove_noise_from_class, and clip_values, improving the overall efficiency of cleanlab.
Improved Underperforming Group Detection: The cluster score is now improved for all underperforming groups, providing more accurate identification of problematic data subsets.

If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!

Change Log

Significant changes in this release include:

Added Spurious Correlation feature by @allincowell in https://github.com/cleanlab/cleanlab/pull/1140, https://github.com/cleanlab/cleanlab/pull/1171, https://github.com/cleanlab/cleanlab/pull/1181, https://github.com/cleanlab/cleanlab/pull/1194; @elisno in https://github.com/cleanlab/cleanlab/pull/1170, https://github.com/cleanlab/cleanlab/pull/1192, https://github.com/cleanlab/cleanlab/pull/1193, https://github.com/cleanlab/cleanlab/pull/1201; @jwmueller in https://github.com/cleanlab/cleanlab/pull/1195, https://github.com/cleanlab/cleanlab/pull/1196
Added new CLOS train test split tutorial notebook by @mturk24 in https://github.com/cleanlab/cleanlab/pull/1071; @jwmueller in https://github.com/cleanlab/cleanlab/pull/1178
Update links to Issue Type Guide in workflows tutorials by @elisno in https://github.com/cleanlab/cleanlab/pull/1168
Optimize internal clip_noise_rates and remove_noise_from_class functions by @gogetron in https://github.com/cleanlab/cleanlab/pull/1105
Optimize internal clip_values function by @gogetron in https://github.com/cleanlab/cleanlab/pull/1104
Move models.fasttext wrapper to examples repo by @jwmueller in https://github.com/cleanlab/cleanlab/pull/1173
Mypy fixes by @elisno in https://github.com/cleanlab/cleanlab/pull/1174
Improve tests in Datalab Quickstart tutorial by @allincowell in https://github.com/cleanlab/cleanlab/pull/1166
Improve docs by @mturk24 in https://github.com/cleanlab/cleanlab/pull/1177; @jwmueller in https://github.com/cleanlab/cleanlab/pull/1189; @dduong1603 in https://github.com/cleanlab/cleanlab/pull/1197; @elisno in https://github.com/cleanlab/cleanlab/pull/1204
Update Studio References by @nelsonauner in https://github.com/cleanlab/cleanlab/pull/1182
Update README by @nelsonauner in https://github.com/cleanlab/cleanlab/pull/1188
Improve cluster score for all underperforming groups by @tataganesh in https://github.com/cleanlab/cleanlab/pull/1180
Improve CI test setup by @dduong1603 in https://github.com/cleanlab/cleanlab/pull/1198

New Contributors

@dduong1603 made their first contribution in https://github.com/cleanlab/cleanlab/pull/1197

For a full list of changes, enhancements, and fixes, please refer to the Full Changelog.

Source: README.md, updated 2024-09-26

Cleanlab Files

The standard data-centric AI package for data quality and ML
