PyCon India 2025

Scatter The Learning. Reduce The Time.
2025-09-12, Room 6

What does it take to move the training of a neural network from a single device to multiple devices? The data dependencies and memory layout, which are easy to handle in a simple, single-device setup, now need to be managed in the distributed computing realm.

We'll take a simple network-training example and, through first principles, introduce the basic primitives that, when supported by a distributed computing framework, enable spreading the computation across multiple nodes.

We'll cover the principles with the help of PyTorch and libraries from Huggingface. We'll use DDP/FSDP to describe the computation and introduce the fundamental primitives, such as reduce-scatter and all-gather, that enable DDP/FSDP.


Goal

Help participants arrive at a distributed training code setup through first principles. Use a small dataset and a small model so everything can run on a personal computing device, but write the code so that it extends to capable, distributed hardware too.

Process

  • We first set up and train a small network on a small dataset using PyTorch and Huggingface Transformers.
  • We introduce the problem of training a larger model with a larger training dataset
  • We bring the idea of "at least one more node" into the mix
  • Surface the challenges of splitting and shuffling the training dataset in a distributed manner
  • Introduce the audience to the need for (and the mechanics of) PyTorch's DistributedSampler (a minimal sketch follows this list)
  • Move on to the theory of distributed training and its core communication primitives: broadcast, reduce, all-reduce, all-gather, reduce-scatter, etc.
  • These primitives motivate the introduction of communication libraries like NCCL and gloo.
  • We will focus on gloo, as it works on the CPU without any special hardware (see the collectives sketch after this list)
  • Bring the above together into a single setup with the help of Huggingface's Accelerate toolkit (see the Accelerate sketch after this list)
  • Demonstrate an actual run in a distributed setup (using two processes on a single machine)
  • Stretch: demonstrate a run on multiple cloud nodes with Nvidia GPUs using NCCL, on the same codebase with a few tweaks
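
As a rough illustration of the DistributedSampler mechanics referenced above, here is a minimal sketch. The toy dataset, the two-replica world size, and the fixed rank are illustrative assumptions, not the talk's actual code:

    import torch
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    # Toy dataset: 16 samples with 4 features each (placeholder values).
    dataset = TensorDataset(torch.randn(16, 4), torch.randint(0, 2, (16,)))

    # DistributedSampler hands each rank a disjoint shard of the indices.
    # num_replicas/rank are passed explicitly here to show the mechanics
    # without initialising torch.distributed.
    sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reseed so the shuffle differs per epoch
        for features, labels in loader:
            pass  # forward/backward would go here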
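
A minimal sketch of the communication primitives on the gloo backend, runnable as two CPU processes on one machine; the spawn-based launcher, port number, and tensor values are assumptions made for illustration:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # gloo runs on plain CPUs, so this works on a laptop.
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # broadcast: rank 0's value overwrites every other rank's copy.
        t = torch.tensor([float(rank)])
        dist.broadcast(t, src=0)

        # all-reduce: every rank ends up with the sum over all ranks.
        g = torch.tensor([float(rank + 1)])
        dist.all_reduce(g, op=dist.ReduceOp.SUM)

        # all-gather: collect each rank's tensor into a list on all ranks.
        gathered = [torch.zeros(1) for _ in range(world_size)]
        dist.all_gather(gathered, torch.tensor([float(rank)]))

        print(f"rank {rank}: broadcast={t.item()}, sum={g.item()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(2,), nprocs=2)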
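
And a sketch of how Accelerate could tie the pieces together around an ordinary training loop. The tiny linear model, random data, and hyperparameters are placeholders (a real run would use a Transformers model); the accelerate launch command in the comment is one way to start two CPU processes on a single machine:

    # Launch as two processes on one machine, for example:
    #   accelerate launch --num_processes 2 --cpu train.py
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    accelerator = Accelerator()  # reads rank/world size from the launcher

    # Placeholder model and data.
    model = nn.Linear(4, 2)
    dataset = TensorDataset(torch.randn(64, 4), torch.randint(0, 2, (64,)))
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # prepare() wraps the model for DDP and shards the dataloader per process.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for features, labels in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(features), labels)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()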

Prerequisites

A basic understanding of how computation flows in neural networks. This will help participants appreciate the challenges of distributing the compute and the data (both the training data and the weights and activations). Participants needn't have programmed neural networks before.

Additional Resources

https://pytorch.org/
https://huggingface.co/docs/transformers/index
https://huggingface.co/docs/transformers/accelerate
https://github.com/pytorch/gloo
https://engineering.fb.com/2021/07/15/open-source/fsdp/

Target Audience

Intermediate

I consider myself a generalist with a deep interest in software engineering and in leveraging software to solve problems. I've worked in multiple domains and in multiple roles as a technologist. My primary interest in the early days was in text mining and information retrieval, along with traditional NLP. Over the years, I have moved across different domains, including advertising, large-scale data processing, edtech and manufacturing, as well as technical coaching and consulting. Of late, I've been revisiting my interest in text and natural language processing through the lens of modern deep learning, and have been spending time learning and using the relevant tools and techniques. My primary focus areas are SLMs and running them on edge or consumer-grade devices.

Preethi has an MS (by Research) from IIT Mandi; her thesis focused on Computer Vision, specifically medical image post-processing. She is the first author on publications at ACCV, WiML, and IEEE CBMS. At Sahaj, she has built ML prototypes for video understanding, LLM fine-tuning, and RAG-based QA systems. She authored a blog series on LoRA and Intrinsic Dimension, which led to speaking engagements at PyCon India 2024 and The Fifth Elephant 2025.
