PyCon India 2025

Scatter The Learning. Reduce The Time.
2025-09-12, Room 6

What does it take to move the training of a neural network from a single device to multiple devices? The data dependencies and memory layout, which are easy to handle in a simple, single-device setup, now need to be managed in the distributed computing realm.

We'll take a simple network-training example and, through first principles, introduce the basic primitives that, when supported by a distributed computing framework, enable spreading the computation across multiple nodes.

We'll cover the principles with the help of PyTorch and libraries from Huggingface. We'll use DDP/FSDP to describe the computation and introduce the fundamental primitives, such as reduce-scatter and all-gather, that enable DDP/FSDP.


Goal

Help participants arrive at a distributed training code setup through first principles. Use a small dataset and a small model so everything can run on a personal computing device, but write the code so that it extends to capable, distributed hardware too.

Process

  • We first set up and train a small network on a small dataset using PyTorch and Huggingface Transformers.
  • We introduce the problem of training a larger model with a larger training dataset
  • We bring the idea of "at least one more node" into the mix
  • Surface the challenges of splitting and shuffling the training dataset in a distributed manner
  • Introduce the audience to the need for (and the mechanics of) PyTorch's DistributedSampler (a minimal sketch follows this list)
  • Move on to the theory of distributed training and its core communication primitives: broadcast, reduce, all-reduce, all-gather, reduce-scatter, etc.
  • These primitives motivate the introduction of communication libraries like NCCL and gloo.
  • We will focus on gloo, as it works on the CPU without any special hardware (see the collectives sketch after this list)
  • Bring the above together into a single setup with the help of Huggingface's Accelerate toolkit (see the Accelerate sketch after this list)
  • Demonstrate an actual run in a distributed setup (using two processes on a single machine)
  • Stretch: demonstrate a run on multiple cloud nodes with Nvidia GPUs using NCCL, on the same codebase with a few tweaks
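
As a rough illustration of the DistributedSampler mechanics referenced above, here is a minimal sketch. The toy dataset, the two-replica world size, and the fixed rank are illustrative assumptions, not the talk's actual code:

    import torch
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    # Toy dataset: 16 samples with 4 features each (placeholder values).
    dataset = TensorDataset(torch.randn(16, 4), torch.randint(0, 2, (16,)))

    # DistributedSampler hands each rank a disjoint shard of the indices.
    # num_replicas/rank are passed explicitly here to show the mechanics
    # without initialising torch.distributed.
    sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reseed so the shuffle differs per epoch
        for features, labels in loader:
            pass  # forward/backward would go here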
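
A minimal sketch of the communication primitives on the gloo backend, runnable as two CPU processes on one machine; the spawn-based launcher, port number, and tensor values are assumptions made for illustration:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # gloo runs on plain CPUs, so this works on a laptop.
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # broadcast: rank 0's value overwrites every other rank's copy.
        t = torch.tensor([float(rank)])
        dist.broadcast(t, src=0)

        # all-reduce: every rank ends up with the sum over all ranks.
        g = torch.tensor([float(rank + 1)])
        dist.all_reduce(g, op=dist.ReduceOp.SUM)

        # all-gather: collect each rank's tensor into a list on all ranks.
        gathered = [torch.zeros(1) for _ in range(world_size)]
        dist.all_gather(gathered, torch.tensor([float(rank)]))

        print(f"rank {rank}: broadcast={t.item()}, sum={g.item()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(2,), nprocs=2)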
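
And a sketch of how Accelerate could tie the pieces together around an ordinary training loop. The tiny linear model, random data, and hyperparameters are placeholders (a real run would use a Transformers model); the accelerate launch command in the comment is one way to start two CPU processes on a single machine:

    # Launch as two processes on one machine, for example:
    #   accelerate launch --num_processes 2 --cpu train.py
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    accelerator = Accelerator()  # reads rank/world size from the launcher

    # Placeholder model and data.
    model = nn.Linear(4, 2)
    dataset = TensorDataset(torch.randn(64, 4), torch.randint(0, 2, (64,)))
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # prepare() wraps the model for DDP and shards the dataloader per process.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for features, labels in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(features), labels)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()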

Prerequisites

A basic understanding of how computation flows in neural networks. This will help participants appreciate the challenges of distributing the compute and the data (both the training data and the weights and activations). Participants needn't have programmed neural networks before.

Additional Resources

https://pytorch.org/
https://huggingface.co/docs/transformers/index
https://huggingface.co/docs/transformers/accelerate
https://github.com/pytorch/gloo
https://engineering.fb.com/2021/07/15/open-source/fsdp/

Target Audience

Intermediate

I consider myself a generalist with a deep interest in software engineering and in leveraging software to solve problems. I've worked in multiple domains and in multiple roles as a technologist. My primary interest in the early days was in text mining and information retrieval, along with traditional NLP. Over the years, I have moved across different domains, including advertising, large-scale data processing, edtech and manufacturing, as well as technical coaching and consulting. Of late, I've been revisiting my interest in text and natural language processing through the lens of modern deep learning, and have been spending time learning and using the relevant tools and techniques. My primary focus areas are SLMs and running them on edge or consumer-grade devices.

Preethi has an MS (by Research) from IIT Mandi; her thesis focused on Computer Vision, specifically medical image post-processing. She is the first author on publications at ACCV, WiML, and IEEE CBMS. At Sahaj, she has built ML prototypes for video understanding, LLM fine-tuning, and RAG-based QA systems. She authored a blog series on LoRA and Intrinsic Dimension, which led to speaking engagements at PyCon India 2024 and The Fifth Elephant 2025.
