2025-09-12 –, Room 7
When developing large language model (LLM) applications, data preparation is among the most crucial and most often overlooked stages. The quality, structure, and alignment of training data largely determine model performance. Skipping the time needed to get data preparation right at the outset leads to poor workflows, longer development time, and models that do not meet your performance expectations.
In this workshop, participants will work through real-world problems such as duplication, variability, and noise in large datasets. Using the open-source Data Prep Kit (DPK), we will show how to clean, deduplicate, and structure data for LLM tasks, including building a Retrieval-Augmented Generation (RAG) chatbot. Each participant will leave with hands-on experience and reusable workflows that speed up development and improve outcomes.
The session is planned in the following 3 sections:
Session Outline:
1. Establishing the Data Layer:
a. The significance of data quality when working with LLMs
b. Common data preparation difficulties faced when working with LLMs: data duplication, content noise, format inconsistencies, and preparing data in a way that scales across data volumes.
c. Developing Better Foundations: A brief overview of ways to turn raw data into usable model input for downstream use cases such as chatbots, summarization, and retrieval-augmented generation (RAG).
2. Data Prep Kit (DPK): A Practical Toolkit for LLM Data Pipelines
a. Introduction to DPK: A free and open-source Python library designed to simplify and scale data preparation for LLM-based applications.
b. Practical application of Data Prep Kit's capabilities:
- How to extract usable data from various messy and inconsistent sources.
- Prebuilt transformations such as automatic deduplication and scoring and filtering of lower-quality inputs, plus the option to bring your own transformation.
- A lab session to build a RAG-powered chatbot, Allycat Chat, using Data Prep Kit (DPK). Experience how raw, dirty, real-world data can be converted into structured, high-quality, production-ready inputs for LLM applications, and learn how to export them seamlessly into a production-grade architecture.
- How DPK accelerates time-to-model with high-speed processing.
3. Takeaways & Q&A
- Learn about the benefits of open-source resources and how to use the documentation and community support, and see a live step-by-step demo of how to get started.
- Q&A, likely including an interactive discussion: addressing questions and demoing further data preparation techniques using DPK.
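To make the deduplication and quality-filtering steps in the outline concrete, here is a minimal sketch in plain Python. It does not use the DPK API; the function names, the character-based quality heuristic, and the 0.8 threshold are all hypothetical choices made for illustration, and the toolkit's own transforms may work quite differently.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def quality_score(text: str) -> float:
    """Hypothetical heuristic: share of characters that are letters or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)


def clean(docs: list[str], min_score: float = 0.8) -> list[str]:
    """Deduplicate first, then drop documents scoring below min_score (assumed threshold)."""
    return [doc for doc in deduplicate(docs) if quality_score(doc) >= min_score]


if __name__ == "__main__":
    raw = [
        "Data quality drives model quality.",
        "data   quality drives model quality.",  # duplicate after normalization
        "$$$ ### 1234 @@@ !!!",                  # mostly symbols, filtered as noise
        "Retrieval needs clean chunks.",
    ]
    for doc in clean(raw):
        print(doc)
```

Exact hashing only catches duplicates that match after normalization; near-duplicate detection needs a fuzzy technique such as MinHash, which this sketch does not attempt.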
Python 3.12
VS Code
Set up environment
pip install data-prep-toolkit
Package management: pip or conda
Supported OS: Linux, macOS, or Windows
Beginner
Additional Resources –
I am a distinguished AI specialist and advocate based in Bengaluru, India, with over 12 years of experience in AI/ML and product innovation. I currently work as a Lead AI Advocate at IBM, where I support the adoption of AI technologies, including IBM's Granite models and WatsonX platform, across various industries. My role involves delivering technical content, building product proficiency, and offering insights on AI trends to shape strategies and solutions aligned with organizational goals. I am also pursuing a PhD in Applied AI from IIITA.