Machine learning (ML) practitioners looking to reuse existing datasets to train an ML model often spend a lot of time understanding the data, making sense of its organization, or figuring out what subset to use as features. So much time, in fact, that progress in the field of ML is hampered by a fundamental obstacle: the wide variety of data representations.
ML datasets cover a broad range of content types, from text and structured data to images, audio, and video. Even within datasets that cover the same types of content, every dataset has a unique ad hoc arrangement of files and data formats. This challenge reduces productivity throughout the entire ML development process, from finding the data to training the model. It also impedes development of badly needed tooling for working with datasets.
There are general purpose metadata formats for datasets such as schema.org and DCAT. However, these formats were designed for data discovery rather than for the specific needs of ML data, such as the ability to extract and combine data from structured and unstructured sources, to include metadata that would enable responsible use of the data, or to describe ML usage characteristics such as defining training, test and validation sets.
Today, we’re introducing Croissant, a new metadata format for ML-ready datasets. Croissant was developed collaboratively by a community from industry and academia, as part of the MLCommons effort. The Croissant format doesn’t change how the actual data is represented (e.g., image or text file formats); it provides a standard way to describe and organize it. Croissant builds upon schema.org, the de facto standard for publishing structured data on the Web, which is already used by over 40M datasets. Croissant augments it with comprehensive layers for ML-relevant metadata, data resources, data organization, and default ML semantics.
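To make the layering more concrete, here is a minimal sketch of what a Croissant-style description could look like, written as a Python dictionary and serialized to JSON-LD. The dataset, its URLs, and the `to_jsonld` helper are made up for illustration. The property names (`distribution`, `recordSet`, `field`, `source`) are modeled on the Croissant vocabulary, but the context is abbreviated and the result has not been validated against the specification, so treat it as an approximation and not a reference file. The default ML semantics layer is left out.

```python
import json

# Hypothetical toy dataset: every name and URL below is a placeholder.
croissant_description = {
    # Abbreviated context (assumption); the published format defines a fuller one.
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    # Layer 1: dataset-level metadata, inherited from schema.org.
    "name": "toy_reviews",
    "description": "A made-up dataset of short product reviews.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/toy_reviews",
    # Layer 2: data resources (the files that hold the actual data).
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/toy_reviews/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Layer 3: data organization (how records and fields map onto the files).
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/rating",
                    "dataType": "sc:Integer",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "rating"},
                    },
                },
            ],
        }
    ],
}


def to_jsonld(description: dict) -> str:
    """Serialize the description as a JSON-LD string."""
    return json.dumps(description, indent=2)


if __name__ == "__main__":
    print(to_jsonld(croissant_description))
```

Note that the CSV file itself is untouched: the description only points at it and says which columns become which fields.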
In addition, we are announcing support from major tools and repositories. Starting today, three widely used collections of ML datasets (Kaggle, Hugging Face, and OpenML) will begin supporting the Croissant format for the datasets they host. The Dataset Search tool lets users search for Croissant datasets across the Web. And popular ML frameworks, including TensorFlow, PyTorch, and JAX, can load Croissant datasets easily using the TensorFlow Datasets (TFDS) package.
Croissant
This 1.0 release of Croissant includes a complete specification of the format, a set of example datasets, an open source Python library to validate, consume and generate Croissant metadata, and an open source visual editor to load, inspect and create Croissant dataset descriptions in an intuitive way.
Supporting Responsible AI (RAI) was a key goal of the Croissant effort from the start. We are also releasing the first version of the Croissant RAI vocabulary extension, which augments Croissant with key properties needed to describe important RAI use cases such as data life cycle management, data labeling, participatory data, ML safety and fairness evaluation, explainability, and compliance.
Why a shared format for ML data?
The majority of ML work is actually data work. The training data is the “code” that determines the behavior of a model. Datasets can vary from a collection of text used to train a large language model (LLM) to a collection of driving scenarios (annotated videos) used to train a car’s collision avoidance system. However, the steps to develop an ML model typically follow the same iterative data-centric process: (1) find or collect data, (2) clean and refine the data, (3) train the model on the data, (4) test the model on more data, (5) discover the model does not work, (6) analyze the data to find out why, (7) repeat until a workable model is achieved. Many steps are made harder by the lack of a common format. This “data development burden” is especially heavy for resource-limited research and early-stage entrepreneurial efforts.
The goal of a format like Croissant is to make this entire process easier. For instance, the metadata can be leveraged by search engines and dataset repositories to make it easier to find the right dataset. The data resources and organization information make it easier to develop tools for cleaning, refining, and analyzing data. This information and the default ML semantics make it possible for ML frameworks to use the data to train and test models with a minimum of code. Together, these improvements substantially reduce the data development burden.
Additionally, dataset authors care about the discoverability and ease of use of their datasets. Adopting Croissant improves the value of their datasets, while only requiring a minimal effort, thanks to the available creation tools and support from ML data platforms.
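As an illustration of why machine-readable resource and organization information helps tool builders, the sketch below walks a Croissant-style description and lists every field together with the file and column it is read from, then flags fields whose source file was never declared. It uses only the Python standard library. The helper names (`list_fields`, `dangling_sources`) and the toy description are hypothetical, and the key names are an assumption modeled on the Croissant vocabulary; this is not the API of the official Croissant Python library.

```python
from typing import List, Optional, Tuple

# (record set id, field id, data type, source file id, source column)
FieldRow = Tuple[Optional[str], Optional[str], Optional[str], Optional[str], Optional[str]]


def list_fields(description: dict) -> List[FieldRow]:
    """Flatten the record sets of a description into one row per field."""
    rows = []
    for record_set in description.get("recordSet", []):
        for field in record_set.get("field", []):
            source = field.get("source", {})
            rows.append((
                record_set.get("@id"),
                field.get("@id"),
                field.get("dataType"),
                source.get("fileObject", {}).get("@id"),
                source.get("extract", {}).get("column"),
            ))
    return rows


def dangling_sources(description: dict) -> List[Optional[str]]:
    """Return ids of fields whose source file is not listed under 'distribution'."""
    declared = {resource.get("@id") for resource in description.get("distribution", [])}
    return [field_id for _, field_id, _, file_id, _ in list_fields(description)
            if file_id not in declared]


# Hypothetical description with one deliberate mistake: "labels.csv" is never declared.
toy_description = {
    "distribution": [{"@id": "images.csv"}],
    "recordSet": [{
        "@id": "examples",
        "field": [
            {"@id": "examples/path", "dataType": "sc:Text",
             "source": {"fileObject": {"@id": "images.csv"}, "extract": {"column": "path"}}},
            {"@id": "examples/label", "dataType": "sc:Integer",
             "source": {"fileObject": {"@id": "labels.csv"}, "extract": {"column": "label"}}},
        ],
    }],
}

if __name__ == "__main__":
    for row in list_fields(toy_description):
        print(row)
    print("Fields with undeclared sources:", dangling_sources(toy_description))
```

The same traversal works for any dataset described this way, which is the point: a checker or loader is written once against the description and not once per dataset layout.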
What can Croissant do today?
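Today, users can find Croissant datasets at:
- Kaggle, Hugging Face, and OpenML, the dataset repositories that support the format.
- The Dataset Search tool, which lets users search for Croissant datasets across the Web.
With a Croissant dataset, it is possible to:
- Load the data into popular ML frameworks such as TensorFlow, PyTorch, and JAX using the TensorFlow Datasets (TFDS) package.
- Load, inspect, and edit the dataset description with the Croissant visual editor.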
To publish a Croissant dataset, users can:
- Use the Croissant editor UI to generate a large portion of Croissant metadata automatically by analyzing the data the user provides, and to fill in important metadata fields such as RAI properties.
- Publish the Croissant information as part of their dataset Web page to make it discoverable and reusable.
- Publish their data in one of the repositories that support Croissant, such as Kaggle, Hugging Face, and OpenML, and have the Croissant metadata generated automatically.
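For the second option above, publishing the description as part of a dataset Web page, the usual schema.org convention is to embed the JSON-LD in a `<script type="application/ld+json">` element. The sketch below shows one way to produce that element in Python. The `jsonld_script_tag` helper and the example description are hypothetical, and the sketch assumes the description is already available as a dictionary.

```python
import json


def jsonld_script_tag(description: dict) -> str:
    """Wrap a JSON-LD description in a script element for embedding in HTML."""
    payload = json.dumps(description, indent=2)
    # "\/" is a valid JSON escape for "/", so this keeps the JSON equivalent while
    # making sure a string such as "</script>" cannot close the element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical description; names and URLs are placeholders.
example_description = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "toy_reviews",
    "description": "Reviews that may contain markup such as </script> in the text.",
    "url": "https://example.org/toy_reviews",
}

if __name__ == "__main__":
    print(jsonld_script_tag(example_description))
```

The resulting string can be placed in the page's HTML, where crawlers that understand schema.org markup can pick it up.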
Future direction
We are excited about Croissant’s potential to help ML practitioners, but making this format truly useful requires the support of the community. We encourage dataset creators to consider providing Croissant metadata. We encourage platforms hosting datasets to provide Croissant files for download and to embed Croissant metadata in dataset Web pages so that they can be discovered by dataset search engines. Tools that help users work with ML datasets, such as labeling or data analysis tools, should also consider supporting Croissant datasets. Together, we can reduce the data development burden and enable a richer ecosystem of ML research and development.
We encourage the community to join us in contributing to the effort.
Acknowledgements
Croissant was developed by the Dataset Search, Kaggle and TensorFlow Datasets teams from Google, as part of an MLCommons community working group, which also includes contributors from these organizations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Harvard, Hugging Face, King's College London, LIST, Meta, NASA, North Carolina State University, Open Data Institute, Open University of Catalonia, Sage Bionetworks, and TU Eindhoven.
As a programmer, I appreciate the effort to standardize ML dataset metadata. Croissant has the potential to simplify data preprocessing and feature engineering pipelines. However, it would be helpful if the metadata schema were available in a machine-readable format, such as JSON or YAML, to enable automated parsing and integration.
I’m not convinced that Croissant is the solution to all our ML dataset woes. Standardization is great, but it’s only part of the puzzle. We also need tools that address data quality, security, and governance. Croissant may not be the silver bullet we’re hoping for.
Croissant sounds a bit too technical for me. I’m not a data scientist, so I’m not sure how much it will directly benefit me. Maybe it’s more suited for advanced users.
I’m eager to try Croissant! The ability to standardize metadata for ML datasets sounds like a game-changer. It has frustrated me to deal with fragmented or missing metadata in the past. With Croissant, data discovery and understanding should become much smoother. Can’t wait to see how it simplifies the ML development process for me.
One potential concern with Croissant is its scalability. As the number of ML datasets grows, it will be important to ensure that Croissant can handle large-scale metadata management effectively. Additionally, it would be beneficial to have tools that can automatically generate Croissant metadata from existing datasets, reducing the manual effort involved.
Croissant has the potential to revolutionize the way we work with ML datasets. By standardizing metadata, Croissant enables us to unlock the full potential of our data and accelerate the development of more accurate and reliable ML models. I believe that Croissant will become an indispensable tool for the ML community.
I have used Croissant in several of my projects, and I have found it to be a valuable tool for managing and sharing ML datasets. The ability to standardize metadata has significantly improved the efficiency of our data discovery and collaboration processes. I highly recommend Croissant to other ML practitioners.
Croissant seems like a valuable tool for data engineers and scientists. The ability to capture essential information about datasets, including their properties, usage guidelines, and provenance, is crucial for effective data management. I’m excited to see how Croissant will contribute to the standardization and management of ML datasets.
Croissant aligns with the FAIR principles of data management: findability, accessibility, interoperability, and reusability. By providing a common metadata format, Croissant promotes data sharing and collaboration, fostering a more open and transparent data ecosystem.
Oh, Croissant, the metadata format we didn’t know we needed! It’s like putting a designer label on a pile of raw data. I’m sure it will make our datasets look fabulous, but will it actually improve the quality of our ML models?
I can’t help but imagine a Croissant as a metadata format. It’s like trying to fit a square peg into a round hole. Datasets are messy, and I’m not sure a fancy metadata format is going to make them any less so. Maybe we should just embrace the chaos!
Croissant? More like Croissan’t! This metadata format is just another layer of complexity that we don’t need. Why can’t we just use plain English to describe our datasets? It’s not rocket science!
Croissant sounds like a promising tool for data engineers and scientists. However, it’s important to consider its limitations and potential biases. Ensuring that metadata is accurate and representative is crucial. Otherwise, we risk perpetuating biases in our datasets and ML models.