Data-Centric AI is the process of building and testing AI systems by focusing on data-centric operations (e.g. cleaning, pre-processing, balancing, augmentation) rather than model-centric operations (e.g. hyperparameter selection, architectural changes).
Our summit takes place over two days, Thursday and Friday, from 8 am PT / 11 am ET / 5 pm CET, for four hours each day. Don't miss the second day!
From the systems that manage data to the data scientists and engineers who work with it directly, data profiles can standardize and automate ML development processes. This talk will describe the basics of data profiles as well as their use cases within ML systems.
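To make the idea concrete, here is a minimal sketch, in plain pandas, of the kind of summary a data profile captures (the file names are hypothetical; dedicated profiling libraries record far richer statistics):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: type, counts, missing rate, cardinality, range."""
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "count": int(s.count()),
            "null_rate": float(s.isna().mean()),
            "distinct": int(s.nunique()),
            "min": s.min() if numeric else None,
            "max": s.max() if numeric else None,
        })
    return pd.DataFrame(rows)

# Profiles of training vs. serving data can be diffed to catch schema or
# distribution changes before they reach the model (file names hypothetical):
baseline = profile(pd.read_csv("train.csv"))
production = profile(pd.read_csv("serving.csv"))
```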
In this talk, Fabiana explains what data preparation tuning is, why it matters in light of the Data-Centric AI approach, and how data scientists benefit from it. She demonstrates YData Fabric, a platform geared toward data preparation, and showcases data preparation tuning on the example of data-centric credit score modelling, through extensive data profiling, synthetic data generation, and data improvement and experimentation pipelines.
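The synthesizers in a platform like YData Fabric are far more sophisticated, but as an illustration of the basic idea, here is a deliberately naive sketch that samples each column independently from its empirical distribution, preserving per-column marginals but not cross-column correlations:

```python
import pandas as pd

def naive_synthesize(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Draw each column independently from its empirical distribution.
    This preserves per-column marginals but NOT cross-column correlations,
    which is exactly what real synthesizers (copulas, GANs) add on top."""
    return pd.DataFrame({
        col: df[col].sample(n=n, replace=True, random_state=seed + i).to_numpy()
        for i, col in enumerate(df.columns)
    })
```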
Data-centric AI is bridging the gap between research and practice. Instead of optimizing our algorithms and architectures, pivoting to focus on data as the primary way to improve our machine learning models is yielding tremendous results. But this shift to data has left gaps in our development process, and it forces us to rethink how we develop AI, from tooling to processes.
Join us in this presentation as we examine these gaps.
In this talk, Sergey shares his experience, best practices, and pitfalls from building data pipelines for search relevance evaluation. He focuses on a human-in-the-loop approach to obtaining relevance judgments at scale; these judgments can then be used to improve the performance of search engines.
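As one concrete example of how such judgments are consumed downstream, here is a small sketch computing NDCG, a standard relevance metric, from graded human labels (the labels below are hypothetical):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded judgments."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_judgments, k=10):
    """NDCG@k: the system's DCG divided by the DCG of an ideal reordering."""
    ideal = sorted(ranked_judgments, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_judgments[:k]) / denom if denom > 0 else 0.0

# Graded judgments (0 = bad .. 3 = perfect) for one query's top results:
print(ndcg([3, 2, 0, 1, 2]))
```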
Good data rather than big data: I will show you practical examples of how you can improve your model by following Data-Centric AI principles. Modulos has developed a set of tools that let you locate the sources of error and bias at the individual-sample level for maximum impact on model performance.
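Modulos' tooling is its own; as a generic illustration of sample-level error hunting, here is a sketch that ranks training samples by per-sample loss so the most suspicious examples surface first:

```python
import numpy as np

def per_sample_log_loss(y_true: np.ndarray, proba: np.ndarray) -> np.ndarray:
    """Cross-entropy of each sample under the model's predicted probabilities."""
    eps = 1e-12
    return -np.log(np.clip(proba[np.arange(len(y_true)), y_true], eps, 1.0))

# The highest-loss samples are the best candidates for inspection:
y = np.array([0, 1, 1, 0])
p = np.array([[0.9, 0.1], [0.2, 0.8], [0.95, 0.05], [0.6, 0.4]])
suspects = np.argsort(per_sample_log_loss(y, p))[::-1]
print(suspects[:2])  # sample 2 (confidently mislabeled or misfit) ranks first
```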
What are the key things you need to know about setting up ML monitoring? Prof. Anupam Datta covers best-practice concepts in just 10 minutes, including why keeping an eye on your data quality is critical.
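As a minimal illustration of monitoring data quality, here is a sketch that flags distribution drift in a single feature with a two-sample Kolmogorov-Smirnov test (synthetic data stands in for real baseline and serving windows):

```python
import numpy as np
from scipy.stats import ks_2samp

# Compare a production window of one feature against its training baseline.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)    # feature values seen at training time
production = rng.normal(0.3, 1.0, 5000)  # a (synthetically) drifted window

stat, p_value = ks_2samp(baseline, production)
if p_value < 0.01:
    print(f"Drift alert: KS statistic = {stat:.3f}")
```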
Tracking embedding drift in your unstructured data.
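One hedged sketch of what tracking embedding drift can look like: compare the centroid of production embeddings against a reference set (real systems use richer distances; the dimensions and data here are synthetic):

```python
import numpy as np

def centroid_cosine_drift(ref: np.ndarray, prod: np.ndarray) -> float:
    """Cosine distance between mean embeddings of a reference set and a
    production window: a cheap, coarse signal of embedding drift."""
    a, b = ref.mean(axis=0), prod.mean(axis=0)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, (1000, 768))  # embeddings at training time
live = rng.normal(0.2, 1.0, (1000, 768))       # embeddings from live traffic
print(centroid_cosine_drift(reference, live))
```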
Iterating on datasets to improve model performance poses many challenges: it is hard to find, access, and evaluate datasets across multiple data systems; it is hard to manage unstructured data; and it is hard to re-purpose data preparation and feature engineering work. Given the rise of Data-Centric AI, there is an emerging need for DataOps for unstructured data: best practices to systematically orchestrate data-centric operations such as data ingestion, data labeling, data quality assurance, data curation, and data augmentation.
This talk will touch on key principles of DataOps and their relevance to working with visual data; propose a DataOps workflow for the modern computer vision stack; examine the three challenges of data labeling, data curation, and data quality assurance in the context of developing real-world computer vision applications; propose promising angles to address these challenges; and suggest ways the AI infrastructure ecosystem can elevate Data-Centric AI best practices by highlighting potential collaborations between ML data management tools.
Good data-centric AI starts with a good data management philosophy: where best to store your data, how to version it, how to make it accessible from any machine, and how to create visibility into not only what's in the dataset itself but also how it is used throughout your stack. Using the open source ClearML platform, we'll run down some of the most interesting and useful workflows to keep your data managed and healthy. And for those of you who know this already, we'll turn it up a level or two and introduce Hyper-Datasets: an MLOps-oriented abstraction of your data that facilitates parametrized data access and metadata version control. It creates new data-centric opportunities such as hyperparameter optimization of the data itself, QA/QC pipelining, and CD/CT (continuous training) during deployment.
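As a hedged sketch of the dataset versioning workflow described above, using ClearML's Dataset API (dataset and project names are hypothetical; consult the ClearML docs for the authoritative interface):

```python
from clearml import Dataset

# Create a new version of a dataset (names and paths are hypothetical).
ds = Dataset.create(dataset_name="images-v2", dataset_project="data-mgmt")
ds.add_files(path="./new_images")  # stage local files into this version
ds.upload()                        # push the files to the configured storage
ds.finalize()                      # freeze the version so it becomes immutable

# Any machine with access can now fetch an exact, versioned copy:
local_path = Dataset.get(dataset_project="data-mgmt",
                         dataset_name="images-v2").get_local_copy()
```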
Visual data management systems are lacking in all aspects: storage, quality (deduplication, anomaly detection), search, analytics, and visualization. As a consequence, companies and researchers lose product reliability and working hours, waste storage and compute, and, most importantly, forfeit the ability to unlock the full potential of their data. To start addressing this, we analyzed numerous state-of-the-art computer vision datasets and found that common problems such as corrupted images, outliers, wrong labels, and duplicated images can reach levels of up to 46%! As a first step toward solving this problem, we introduce our simple free Python package, fastdup, which quickly and accurately computes statistics, detects outliers and duplicates, and identifies wrong labels. In the couple of months since its inception, we have had fantastic reactions from the computer vision community: fastdup has been installed more than 71,000 times and is used in production at several Fortune 500 companies.
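A minimal usage sketch, assuming the early-release fastdup API (argument names may differ in current versions; see the project README):

```python
import fastdup

# Analyze an image folder; fastdup writes similarity, outlier, and statistics
# reports into work_dir (early-release API; check the README for current usage).
fastdup.run(input_dir="images/", work_dir="fastdup_workdir/")

# Build a visual gallery of the most similar (likely duplicate) image pairs:
fastdup.create_duplicates_gallery("fastdup_workdir/similarity.csv",
                                  save_path="fastdup_workdir/")
```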
Learn how to automate time-aware feature engineering, feature reduction using feature importance rank ensembling, and model training and tuning. With only 2-3 hours of work and 8-10 hours of compute, see how our 2022 March Madness Basketball predictions placed in the top 5% on Kaggle.
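The speakers' automated pipeline is their own; as a generic illustration of feature reduction via importance rank ensembling, here is a sketch that averages feature ranks across two model families and keeps the top-ranked features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Rank features by importance under several model families, then average.
ranks = []
for model in (RandomForestClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X, y)
    # argsort of argsort turns importances into 0..n-1 ranks (higher = better)
    ranks.append(np.argsort(np.argsort(model.feature_importances_)))

mean_rank = np.mean(ranks, axis=0)
keep = np.argsort(mean_rank)[::-1][:10]  # keep the 10 best-ranked features
X_reduced = X[:, keep]
```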
We introduce Lhotse, a Python library for handling speech data. Lhotse greatly simplifies common tasks around speech data preparation, preprocessing, and loading into machine learning workflows. In this talk we'll explain the motivation behind Lhotse, and its core concepts: how the data and the metadata are represented and stored, optimized data access, as well as Lhotse's pythonic API and integration with PyTorch. Finally, we give a glimpse into the larger k2 ecosystem that Lhotse is a part of.
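A brief, hedged sketch of the core Lhotse concepts mentioned above (the manifest paths are hypothetical; Lhotse's recipes can generate such manifests for common corpora):

```python
from lhotse import CutSet, RecordingSet, SupervisionSet

# Manifests describe the audio files and their transcribed segments.
recordings = RecordingSet.from_file("recordings.jsonl.gz")
supervisions = SupervisionSet.from_file("supervisions.jsonl.gz")

# A Cut pairs a span of audio with its supervision: Lhotse's core unit of data.
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)
cuts = cuts.trim_to_supervisions()  # one cut per transcribed segment
cuts.describe()                     # print duration and supervision statistics
```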
In this session, we will describe the challenges in operationalizing machine and deep learning. We’ll explain the production-first approach to MLOps pipelines, using a modular strategy in which the different components provide a continuous, automated, and far simpler way to move from research and development to scalable production pipelines, without the need to refactor code, add glue logic, or spend significant effort on data and ML engineering. We will cover various real-world implementations and examples, and discuss the different stages, including automating feature creation using a feature store, building CI/CD automation for models and apps, deploying real-time application pipelines, observing model and application results, creating a feedback loop, and re-training with fresh data.
We’ll demonstrate how to use Iguazio & Snowflake to create a simple, seamless, and automated path to production at scale!
Data-centric AI doesn't stop with cleaning and preparing data for model training; there are rich insights to be gleaned from production data. By analyzing, segmenting, and selectively re-labeling your production inference data, you can generate datasets for future model retraining. This talk will show you how to use human-in-the-loop oversight to build such high-quality, labeled datasets from your prediction data.
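As a minimal sketch of one selection strategy for human-in-the-loop review (not necessarily the speaker's), flag production predictions with low top-class confidence:

```python
import numpy as np

def select_for_relabeling(proba: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Flag predictions whose top-class confidence is low: prime candidates
    for human review and relabeling before the next retraining round."""
    return np.where(proba.max(axis=1) < threshold)[0]

# Softmax outputs logged from production inference (hypothetical values):
proba = np.array([[0.98, 0.02], [0.55, 0.45], [0.40, 0.60]])
print(select_for_relabeling(proba))  # -> [1 2]: route these to annotators
```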
An in-depth look at the current state of evaluation and its data-centric impact on downstream MLOps processes.
Today’s hottest role in data is the data engineer: building data platforms in the cloud that power digital ways of doing business, operating data pipelines that transform raw data (the new oil) into value, practicing new ways of data observability, powering models, and more. There are more tools and technologies to process data than ever before. Simultaneously, there are more personas (data scientists, analysts, privacy folks, …) involved in the data process. And let’s not get started on regulations… How can data teams best organize themselves to be successful in this modern data landscape? In this talk we’ll fly by the past decade in data, drawing inspiration and lessons for the current data wave. We’ll explore the data mesh, which continues to grow in popularity for taming the data estate across teams and departments. We’ll share our own data team's experiences and give an outlook on what is up and coming in the exciting world of data.
This talk will briefly cover what bitmaps are, where you may have seen them before, and why they are something you should know about in the future. It will transition into a use-case breakdown that highlights how bitmaps enabled real-time segmentation from data collected across 6B devices.
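To preview the mechanics, here is a toy sketch using Python integers as bitmaps (a production system at this scale would use compressed structures such as Roaring bitmaps):

```python
# Each device gets an integer ID; a segment is a bitmap with bit i set
# when device i belongs to it. Python ints act as arbitrary-size bitmaps.

def add(bitmap: int, device_id: int) -> int:
    return bitmap | (1 << device_id)

ios_users = 0
recent_buyers = 0
for d in (3, 5, 9):
    ios_users = add(ios_users, d)
for d in (5, 9, 12):
    recent_buyers = add(recent_buyers, d)

# Segmentation becomes cheap bitwise logic over the whole population at once:
segment = ios_users & recent_buyers
members = [i for i in range(segment.bit_length()) if segment >> i & 1]
print(members)  # [5, 9]
```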
The MLCommons Association aims to grow the ML ecosystem through benchmarks, best practices, and public datasets. Public datasets are vital to the future of ML: they fuel training, serve as benchmarks of progress, and enable research communication. MLCommons is committed to an open-source approach to creating public datasets that will enable the next decade of ML research. To date, MLCommons has created multiple new public datasets, including the People's Speech, a 30,000+ hour speech-to-text dataset, and the Multilingual Spoken Words Corpus, containing keywords in 50 languages. This talk will describe MLCommons' approach to data and the People's Speech and Multilingual Spoken Words Corpus datasets.
The ML and MLOps community is talking about DCAI for good reason: issues in your dataset are a common cause of AI system failures and poor performance. However, new discussions on DCAI most often focus on the experimental and model-building phases of ML, framing DCAI as a paradigm shift from model architectures and optimization to dataset iteration and quality as the way to build a better model. This has led to DCAI principles that don't work for production systems, such as removing data that hurts the overall quality of your dataset. In this session, we'll discuss a number of DCAI principles I've collected and describe practices and tools that make them feasible, where possible, for production systems with massive scale, streaming data, and real-world data drift.
Machine learning models have a peculiar vulnerability: a small perturbation of the data may cause a model to misclassify. Robustness is a measure of how resilient AI models are against such small, targeted distortions. Hewlett Packard Enterprise is using ML techniques and tools to analyze and synthesize ML models for robustness. Join this session to learn how to enhance and create resilient AI models.
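HPE's tooling is its own; as a standard illustration of the kind of small, targeted distortion involved, here is a sketch of the Fast Gradient Sign Method (FGSM) in PyTorch:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.03):
    """Fast Gradient Sign Method: craft a small, targeted distortion that
    pushes the input in the direction that most increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()        # small per-pixel step, bounded by eps
    return x_adv.clamp(0.0, 1.0).detach()  # stay in the valid pixel range

# A model that misclassifies fgsm_perturb(model, x, y) while classifying x
# correctly is fragile; training on such examples is one common hardening step.
```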
Building AI infrastructure is complex! With AI gaining more and more traction, organisations are looking at building the right infrastructure for AI. The focus, however, is put on finding the right tools and services to start the experimentation and training process of the AI development lifecycle. With a continuously evolving ecosystem and new practices like MLOps, it is likely that companies will choose multiple tools to fit their needs. These tools all have one thing in common: the need for compute. Learn how Run:ai Atlas lets you build a solid compute foundation for your AI infrastructure.
One out of three ML projects fails due to the lack of a solid data foundation. Projects suffer from low-quality data, under-utilized compute resources, and the significant labor overhead required to build and maintain large amounts of data. For projects involving tabular data, traditional data lakes provide critical features such as time travel, SQL queries, ingesting data with ACID transactions, and visualizing terabyte-scale datasets for analytical workloads. These features break down data silos, enable data-driven decision making, improve operational efficiency, and reduce costs across organizations. However, most of these features are not available for deep learning workloads. Deep Lake maintains the benefits of a vanilla data lake with one key difference: it stores complex data such as images, videos, and annotations, as well as tabular data, as columns, and it rapidly streams the data to deep learning frameworks without sacrificing GPU utilization. As deep learning rapidly takes over traditional computational pipelines, storing datasets in a Deep Lake is becoming the new norm.
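A hedged usage sketch based on the Deep Lake docs (the dataset path and tensor names are assumptions; check the Activeloop hub listings):

```python
import deeplake

# Load a public dataset hosted by Activeloop (path assumed from their docs).
ds = deeplake.load("hub://activeloop/cifar10-train")

# Stream batches straight into PyTorch without downloading the whole dataset.
loader = ds.pytorch(batch_size=64, num_workers=2, shuffle=True)
for batch in loader:
    images, labels = batch["images"], batch["labels"]
    break  # hand off to a training loop from here
```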
Machine Learning (ML) is increasingly used to make business-critical decisions across multiple industries. The surge of deep learning has further accelerated this trend across multiple data modalities. But unlike traditional software, which has well-defined standards and practices, ML systems lack a systematic way to measure the full spectrum of model quality.
In this talk, we will dive deeper into what ML Data Intelligence means and how it solves the data quality problem in machine learning, the primary determinant of model quality. We'll discuss the key principles, techniques, and indicators employed in curating high-quality, error-free datasets for ML and in productionizing high-quality models.
We will also talk about the key issues plaguing ML data quality across the industry, the standard techniques used to identify and fix errors, and the key metrics monitored: where they fall short and how they can be improved to significantly boost model performance in production.
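As one simplified illustration of error identification (inspired by confident-learning-style approaches, not necessarily Galileo's method), flag samples where the model confidently disagrees with the assigned label:

```python
import numpy as np

def likely_label_errors(labels: np.ndarray, proba: np.ndarray,
                        confidence: float = 0.9) -> np.ndarray:
    """Flag samples where the model confidently disagrees with the given
    label: a simplified, confident-learning-style error detector."""
    pred = proba.argmax(axis=1)
    conf = proba.max(axis=1)
    return np.where((pred != labels) & (conf > confidence))[0]

labels = np.array([0, 1, 0])
proba = np.array([[0.97, 0.03], [0.60, 0.40], [0.02, 0.98]])
print(likely_label_errors(labels, proba))  # -> [2]: labeled 0, model 98% sure it's 1
```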
The talk is based on my past experiences as a Staff Software Engineer and Tech Lead for Uber's Machine Learning Platform (Michelangelo), leading ML Data Quality at Uber AI, and my journey as founder of Galileo, seeing the big problems the industry faces in this field.