Ravindra Jaju
I consider myself a generalist with a deep interest in software engineering and in leveraging software to solve problems. I've worked in multiple domains and in multiple roles as a technologist. My primary interest in the early days was in text mining and information retrieval, along with traditional NLP. Over the years, I moved across different domains, including advertising, large-scale data processing, edtech, and manufacturing, as well as technical coaching and consulting. Of late, I've been revisiting my interest in text and natural language processing through the lens of modern deep learning, and have been spending time learning and using the relevant tools and techniques. My primary focus areas are SLMs (small language models) and running them on edge or consumer-grade devices.
Session
What does it take to move the process of training a neural network from a single device to many? Data dependencies and memory layout, which are straightforward to handle in a simple, single-device setup, now need to be managed in a distributed computing setting.
We'll take a simple network training example and, from first principles, introduce the basic primitives that, when supported by a distributed computing framework, allow the computation to be spread over multiple nodes.
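For orientation, the kind of single-device starting point the session builds on might look like the sketch below (an illustrative example, not the session's actual code; the model, synthetic data, and hyperparameters are placeholders):

```python
# Minimal single-device training loop: model, data, gradients, and optimizer
# state all live in one process's memory space.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(100):
    x = torch.randn(32, 10)          # a synthetic batch
    y = x.sum(dim=1, keepdim=True)   # a toy regression target
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                  # gradients computed locally, on one device
    optimizer.step()
```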
We'll cover the principles with the help of PyTorch and libraries from Hugging Face. We'll use DDP/FSDP to describe the computation and introduce the fundamental primitives, such as reduce-scatter and all-gather, that enable DDP/FSDP.
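To give a flavour of those primitives, here is a small, self-contained sketch (illustrative only, not the session material): two CPU processes communicate via torch.distributed with the gloo backend. All-reduce and all-gather are called directly; reduce-scatter is emulated here as an all-reduce followed by each rank keeping only its own chunk, which is the operation backends such as NCCL expose natively as dist.reduce_scatter.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Each process joins the default process group; gloo runs on CPU.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Pretend these are this rank's locally computed gradients.
    grads = torch.arange(world_size, dtype=torch.float32) + rank

    # all-reduce: every rank ends up with the element-wise sum across ranks.
    # This is the collective behind DDP's gradient synchronization.
    summed = grads.clone()
    dist.all_reduce(summed, op=dist.ReduceOp.SUM)

    # reduce-scatter (emulated): reduce across ranks, then keep only this
    # rank's chunk of the result. FSDP uses it to shard reduced gradients.
    my_chunk = summed.chunk(world_size)[rank].clone()

    # all-gather: every rank collects all chunks, reassembling the full tensor.
    # FSDP uses it to materialize full parameters before a forward pass.
    gathered = [torch.empty_like(my_chunk) for _ in range(world_size)]
    dist.all_gather(gathered, my_chunk)

    print(f"rank {rank}: chunk={my_chunk.tolist()} "
          f"full={[t.item() for t in gathered]}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```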