AI Infrastructure Alliance
Data-Centric AI Summit
Data-Centric AI is the practice of building and testing AI systems by focusing on data-centric operations (e.g. cleaning, pre-processing, balancing, augmentation) rather than model-centric operations (e.g. hyperparameter selection, architectural changes).

The summit runs over two days, Thursday and Friday, starting each day at 8 am PT / 11 am ET / 5 pm CET and lasting 4 hours. Don't miss the second day!

Speakers
Christoph Schuhmann
Organizational Lead / Founder @ LAION
Daniel Jeffries
Managing Director @ AI Infrastructure Alliance
Peter Mattson
ML Metric Lead @ Google
Kai Yang
VP of Product @ Landing AI
Joe Reis
CEO @ Ternary Data
Davit Buniatyan
CEO @ Activeloop
Kevin Schawinski
CEO & Co-founder @ Modulos
Bernease Herman
Sr. Data Scientist @ WhyLabs
Fabiana Clemente
CDO @ YData
Victor Sonck
Evangelist @ ClearML
Andrea Kropp
Customer-Facing Data Scientist @ DataRobot
Jimmy Whitaker
Chief Scientist of AI @ Pachyderm
Atalia Horenshtien
Global Technical Product Advocacy Lead @ DataRobot
Seth Clark
Head of Product & Co-founder @ Modzy
Amber Roberts
Machine Learning Engineer @ Arize
Greg Throne
Technical Product Manager @ Molecula
Danny Bickson
CEO @ Visual Layer
Atindriyo Sanyal
CTO and Founder @ Galileo
Sergey Koshelev
Sales Engineer & Crowd Solutions Architect @ Toloka
Goku Mohandas
Founder @ Made With ML
Nicholas Schenone
Pre-sales Engineer, MLOps @ Iguazio
Anupam Datta
President, Chief Scientist, and cofounder @ TruEra
James Le
Data Advocate and Partnerships Lead @ Superb.ai
Mark Mazumder
PhD Candidate @ Harvard University
Daniel Galvez
AI Developer Technology Engineer @ NVIDIA
Jeremy Goodsitt
Senior Manager, Machine Learning Engineer @ Capital One
Pardhu Gunnam
CEO/Co-founder @ Metaphor.io
Piotr Zelasko
Head of Research @ Meaning
Stijn “Stan” Christiaens
Founder, Chief Data Citizen @ Collibra
Joe Doliner
CEO & Co-founder @ Pachyderm
Hyun Kim
CEO and Co-founder @ Superb AI
Higinio ("H.O.") Maycotte
CEO @ FeatureBase
Chad Sanderson
Head of Product @ Convoy Inc
Fedor Zhdanov
Head of Machine Learning @ Toloka
Alon Gubkin
CTO @ Aporia
Soumyendu Sarkar
Senior Director and Senior Distinguished Technologist @ Hewlett Packard Enterprise
Gijsbert Janssen van Doorn
Director Technical Product Marketing @ Run:AI
Gonçalo Martins Ribeiro
CEO @ YData
Ce Zhang
Assistant Professor @ ETH Zurich
Mario Figueiredo
Distinguished Professor and Feedzai Professor of Machine Learning @ Instituto Superior Técnico
Stavros Zervoudakis
Head of Data Science, Machine Learning and Advanced Analytics @ MOA
Frank Chang
Co-Founder and Managing Partner @ Flying Fish Partners
James Alcorn
Partner @ Zetta Venture Partners
Assaf Araki
Investment Manager @ Intel Capital
Masamba Senghore
Investor @ MMC Ventures
Agenda
Day 1
Day 2
Track 1
Track 2
Track 3
3:00 PM
3:25 PM
Presentation

Solving Data Discovery with Modern Metadata Platform

From systems that manage data to the data scientists and engineers who work directly with the data itself, data profiles can standardize and automate ML development processes. This talk will describe the basics of data profiles as well as their use cases within ML systems.

Pardhu Gunnam
3:00 PM
3:45 PM
Presentation

Hands-on Data-Centric AI: Data preparation tuning - why and how?

In this talk, Fabiana explains what data preparation tuning is, why it matters in light of the Data-Centric AI approach, and how data scientists benefit from it. She demonstrates YData Fabric, a platform geared toward data preparation, and illustrates data preparation tuning with a data-centric credit score modelling example that uses extensive data profiling, synthetic data generation, and data improvement and experimentation pipelines.

Fabiana Clemente
3:00 PM
4:00 PM
Panel Discussion

Production Data-Centric AI and What It Means

Chad Sanderson
Atalia Horenshtien
Fedor Zhdanov
Anupam Datta
3:25 PM
4:00 PM
Presentation

Rethinking ML Development - A Data-Centric Approach

Data-centric AI is bridging the gap between research and practice. Instead of optimizing our algorithms and architectures, pivoting to focus on data as the primary way to improve our machine learning models is yielding tremendous results. But this shift to data has left some gaps in our development process, and with this shift, we need to rethink how we develop AI from tooling to processes.

In this presentation join us as we examine:

  • Data-Centric AI: how did we get here?
  • Data as the new ‘Source Code’
  • What are the practical steps towards Data-Centric AI?
Jimmy Whitaker
3:45 PM
4:00 PM
Presentation

AI – Moving Beyond the Software Industry

Kai Yang
4:00 PM
4:20 PM
1:1 networking

Networking

4:20 PM
4:35 PM
Presentation

Search relevance evaluation using crowdsourcing

In this talk, Sergey shares his experience, best practices, and pitfalls in building data pipelines for search relevance evaluation. He focuses on a human-in-the-loop approach to obtaining judgments at scale. These judgments can then be used to improve the performance of search engines.

Sergey Koshelev
4:20 PM
5:00 PM
Presentation

Democratizing AI: Mastering the Massive Open Datasets that Power Imagen and Stable Diffusion

Christoph Schuhmann
4:20 PM
5:10 PM
Presentation

How to build better and fairer models with Data-Centric AI

Good data rather than big data: I show you practical examples of how you can improve your model by following Data-Centric AI principles. Modulos has developed a set of tools that let you locate the sources of error and bias at the level of individual samples for maximum impact on model performance.

Kevin Schawinski
4:35 PM
4:45 PM
Presentation

Fundamentals of ML Monitoring in 10 Minutes

What are the key things you need to know about setting up ML monitoring? Prof. Anupam Datta covers best practices in just 10 minutes, including why keeping an eye on your data quality is critical.

Anupam Datta
4:45 PM
4:55 PM
Presentation

Data-Centric AI for Unstructured Use Cases

Tracking embedding drift in your unstructured data.

Amber Roberts
4:55 PM
5:10 PM
Presentation

Data Curation for Computer Vision 101

There are many existing challenges with iterating on datasets to improve model performance: It is hard to find, access, and evaluate datasets across multiple data systems. It is hard to manage unstructured data. It is hard to re-purpose the data preparation and feature engineering work. Given the rise of Data-Centric AI, there is an emerging need for DataOps for unstructured data - best practices to systematically orchestrate data-centric operations such as data ingestion, data labeling, data quality assurance, data curation, and data augmentation.

This talk will:

  • Touch on key principles of DataOps and their relevance for working with visual data
  • Propose a DataOps workflow for the modern computer vision stack
  • Examine the three challenges of data labeling, data curation, and data quality assurance in the context of developing real-world computer vision applications
  • Propose promising angles to address these challenges
  • Suggest ways the AI Infrastructure ecosystem can elevate Data-Centric AI best practices by highlighting potential collaborations between ML data management tools

James Le
5:00 PM
5:30 PM
Presentation

From a simple CLI tool to a powerful platform: your data management options in plain English.

Good data-centric AI starts with a good data management philosophy: where best to store your data, how to version it, how to make it accessible from any machine, and how to create visibility into not only what is in the dataset itself but also how it is used throughout your stack. Using the open source ClearML platform, we'll run down some of the most interesting and useful workflows to keep your data managed and healthy. And for those of you who know this already, we'll turn it up a level or two and introduce Hyper-Datasets, an MLOps-oriented abstraction of your data that facilitates parametrized data access and metadata version control. It creates new data-centric opportunities such as hyperparameter optimization of the data itself, QA/QC pipelining, and CD/CT (continuous training) during deployment.

Victor Sonck
5:10 PM
5:25 PM
Presentation

Large Image Datasets are a Mess

Visual data management systems are lacking in all aspects: storage, quality (deduplication, anomaly detection), search, analytics, and visualization. As a consequence, companies and researchers lose product reliability and working hours, waste storage and compute, and, most importantly, lose the ability to unlock the full potential of their data. To start addressing this, we analyzed numerous state-of-the-art computer vision datasets and found that common problems such as corrupted images, outliers, wrong labels, and duplicated images can reach a level of up to 46%! As a first step in solving this problem, we introduce our simple, free Python package, fastdup, which quickly and accurately computes statistics, detects outliers and duplicates, and identifies wrong labels. In the couple of months since its inception, we have received fantastic reactions from the computer vision community. Fastdup has been installed more than 71,000 times and is used in production at several Fortune 500 companies.

Danny Bickson
5:10 PM
5:40 PM
Presentation

Extreme ML Automation: A March Madness Case Study

Learn how to automate time-aware feature engineering, feature reduction using feature importance rank ensembling, and model training and tuning. With only 2-3 hours of work and 8-10 hours of compute, see how our 2022 March Madness Basketball predictions placed in the top 5% on Kaggle.

Andrea Kropp
5:25 PM
5:40 PM
Presentation

Lhotse: A speech data representation library for the modern deep learning ecosystem

We introduce Lhotse, a Python library for handling speech data. Lhotse greatly simplifies common tasks around speech data preparation, preprocessing, and loading into machine learning workflows. In this talk we'll explain the motivation behind Lhotse, and its core concepts: how the data and the metadata are represented and stored, optimized data access, as well as Lhotse's pythonic API and integration with PyTorch. Finally, we give a glimpse into the larger k2 ecosystem that Lhotse is a part of.

Piotr Zelasko
5:30 PM
5:40 PM
Presentation

Transforming Snowflake into an MLOps ‘Feature Factory’ using Iguazio

In this session, we will describe the challenges in operationalizing machine and deep learning. We’ll explain the production-first approach to MLOps pipelines: a modular strategy in which the different components provide a continuous, automated, and far simpler way to move from research and development to scalable production pipelines, without the need to refactor code, add glue logic, or spend significant effort on data and ML engineering. We will cover various real-world implementations and examples, and discuss the different stages, including automating feature creation using a feature store, building CI/CD automation for models and apps, deploying real-time application pipelines, observing the model and application results, creating a feedback loop, and re-training with fresh data.

We’ll demonstrate how to use Iguazio & Snowflake to create a simple, seamless, and automated path to production at scale!

Nicholas Schenone
5:40 PM
6:00 PM
1:1 networking

Networking

6:00 PM
6:15 PM
Presentation

Leveraging Prediction Data for Model Retraining

Data-centric AI doesn't just stop with cleaning and preparing data for model training - there are rich insights to be gleaned from production data. By analyzing, segmenting, and selectively re-labeling your production inference data, you can generate datasets for future model retraining. This talk will show you how you can use human-in-the-loop oversight to generate high-quality, labeled datasets from your prediction data for future model retraining.

Seth Clark
6:00 PM
6:30 PM
Presentation

Enhancing ML with Data Profiles

From systems that manage data to the data scientists and engineers who work directly with the data itself, data profiles can standardize and automate ML development processes. This talk will describe the basics of data profiles as well as their use cases within ML systems.

Jeremy Goodsitt
6:00 PM
7:00 PM
Presentation

An Epic Overview of the AI Infrastructure Ecosystem of 2022

Daniel Jeffries
6:15 PM
6:45 PM
Presentation

Evaluating Machine Learning Models: Industry Adoption and Research Trends

An in-depth look at the current state of evaluation and its data-centric impact on downstream MLOps processes.

Goku Mohandas
6:30 PM
7:00 PM
Presentation

The Data Mesh: A new hope in the modern data stack

Today’s hottest role in data is the data engineer: building data platforms in the cloud that power digital ways of doing business, operating data pipelines transforming raw data oil into value, practicing new ways of data observability, powering models, … There are more tools and technologies to process data than ever before. Simultaneously, there are more personas (data scientists, analysts, privacy folks, …) involved in the data process. And let’s not get started on regulations… How can data teams best organize themselves to be successful in this modern data landscape? In this talk we’ll fly by the past decade in data, drawing inspiration and lessons for the current data wave. We’ll explore the data mesh, which continues to grow in popularity for taming the data beast across teams and departments. We’ll share our own data team's experiences, and give an outlook on what is up and coming in the exciting world of data.

Stijn “Stan” Christiaens
6:45 PM
6:55 PM
Presentation

Bitmaps: A Tale Of Handling Scale Without Having To Scale

This talk will briefly cover what bitmaps are, where you may have seen them before, and why they are something you should know about in the future. It will transition into a use-case breakdown that highlights how bitmaps enabled real-time segmentation from data collected across 6B devices.

Greg Throne
Day 2
3:00 PM
3:50 PM
Presentation

MLCommons and Public Data

The MLCommons Association aims to grow the ML ecosystem through benchmarks, best practices, and public datasets. Public datasets are vital to the future of ML: they fuel training, serve as benchmarks of progress, and enable research communication. MLCommons is committed to an open-source approach to creating public datasets that will enable the next decade of ML research. To date, MLCommons has created multiple new public datasets, including the People's Speech (a 30,000+ hour speech-to-text dataset) and the Multilingual Spoken Words Corpus (containing keywords in 50 languages). This talk will describe MLCommons' approach to data and the People's Speech and Multilingual Spoken Words Corpus datasets.

Mark Mazumder
Daniel Galvez
Peter Mattson
3:00 PM
4:00 PM
Presentation

Can we adapt experimental data-centric AI principles for production ML systems?

The ML and MLOps community is talking about DCAI for good reason: issues in your dataset are a common cause of AI system failures and poor performance. However, new discussions on DCAI most often focus on the experimental and model-building phases of ML, framing DCAI as a paradigm shift from model architectures and optimization to dataset iteration and quality in order to build a better model. This has led to DCAI principles that don't work for production systems, such as removing data that hurts the overall quality of your dataset. In this session, we'll discuss a number of DCAI principles that I've collected and describe practices and tools that make them feasible, where possible, for production systems with massive scale, streaming data, and real-world data drift.

Bernease Herman
3:50 PM
4:00 PM
Presentation

Why Cloud Native MLOps Rocks

Alon Gubkin
4:00 PM
4:20 PM
1:1 networking

Networking

4:20 PM
4:45 PM
Presentation

Refining ML models with ML

Machine learning models have a peculiar vulnerability: a small perturbation of the data may cause a model to misclassify. Robustness is a measure of how resilient AI models are against small, targeted distortions. Hewlett Packard Enterprise is using ML techniques and tools to analyze and synthesize ML models for robustness. Join this session to learn how to enhance and create resilient AI models.

Soumyendu Sarkar
4:20 PM
5:15 PM
Panel Discussion

Wrangling Big Datasets in the Real World

Daniel Jeffries
Joe Doliner
Hyun Kim
Higinio ("H.O.") Maycotte
4:45 PM
5:30 PM
Panel Discussion

Investment trends for the Data-Centric AI domain

Gonçalo Martins Ribeiro
Frank Chang
James Alcorn
Assaf Araki
Masamba Senghore
5:15 PM
5:40 PM
Presentation

Data Engineering is the Foundation of Data-Centric AI

Joe Reis
5:30 PM
5:40 PM
Presentation

Building the compute foundation for your AI infrastructure

Building AI infrastructure is complex! With AI gaining more and more traction, organisations are looking to build the right infrastructure for it. The focus, however, is usually on finding the right tools and services to start the experimentation and training phases of the AI development lifecycle. With a continuously evolving ecosystem and new practices like MLOps, companies will likely choose multiple tools to fit their needs. These tools all have one thing in common: the need for compute. Learn how Run:ai Atlas lets you build a solid compute foundation for your AI infrastructure.

Gijsbert Janssen van Doorn
5:40 PM
6:00 PM
1:1 networking

Networking

6:00 PM
6:40 PM
Presentation

Deep Lake: Optimizing Data Lakes for Deep Learning

One out of three ML projects fails due to the lack of a solid data foundation. Projects suffer from low-quality data, under-utilized compute resources, and significant labor overhead required to build and maintain large amounts of data. For projects involving tabular data, traditional data lakes provide critical features such as time traveling, SQL queries, ingesting data with ACID transactions, and visualizing terabyte-scale datasets for analytical workloads. These features break down data silos, enable data-driven decision making, improve operational efficiency, and reduce costs across organizations. However, most of these features are not available for deep learning workloads. Deep Lake maintains the benefits of a vanilla data lake with one key difference: it stores complex data such as images, videos, and annotations, as well as tabular data, as columns, and it rapidly streams the data to deep learning frameworks without sacrificing GPU utilization. As deep learning rapidly takes over traditional computational pipelines, storing datasets in a Deep Lake is becoming the new norm.

Davit Buniatyan
6:00 PM
6:50 PM
Presentation

Data-Centric AI Community

Fabiana Clemente
6:40 PM
7:25 PM
Panel Discussion

The Impact of Academia on AI/ML Innovation

Gonçalo Martins Ribeiro
Ce Zhang
Mario Figueiredo
Stavros Zervoudakis
6:50 PM
7:20 PM
Presentation

Principles for Building High Quality Models Using High Quality Data at Scale

Machine Learning (ML) is increasingly used to make business-critical decisions across multiple industries. The surge of deep learning has further accelerated this trend across multiple data modalities. But unlike traditional software, which has well-defined standards and practices, ML systems lack a systematic way to measure the full spectrum of model quality.

In this talk, we will dive deeper into what ML Data Intelligence means and how it solves the Data Quality problem in Machine Learning, the primary determinant of Model Quality. We'll be discussing the key principles, techniques and indicators employed in curating high quality, error-free datasets for ML and productionizing high quality models.

We will also talk about the key issues plaguing ML data quality across the industry, the standard techniques used to identify and fix errors, and the key metrics monitored: where they fall short, and how they can be improved to greatly improve model performance in production.

The talk is based on my past experience as a Staff Software Engineer and Tech Lead for Uber's Machine Learning Platform (Michelangelo) and leading ML Data Quality at Uber AI, as well as on my journey as founder of Galileo, seeing the big problems the industry faces in this field.

Atindriyo Sanyal
Event has finished
September 29, 3:00 PM, GMT
Online
Hosted by
AI Infrastructure Alliance