Abhijit Roy
I’m a Software Maintenance Engineer with a focus on quantization and fine-tuning large language models (LLMs). I work with tools and frameworks like vLLM, LLM Compressor, InstructLab, and RHEL AI to optimize and maintain high-performance AI systems.
My work revolves around making LLMs more efficient, scalable, and adaptable for real-world use cases, whether that means reducing inference costs, enhancing model alignment, or supporting enterprise-grade AI deployments.
Session
LLMs like GPT-4 can consume as much energy per query as an entire web search session. What if we could cut that with Python-powered vLLM? In this session, we'll explore how vLLM, a Python-powered, high-throughput inference engine, enables green AI deployment by drastically improving GPU efficiency. We'll cover techniques like PagedAttention, continuous batching, and speculative decoding, showing how they reduce latency, memory overhead, and energy usage per token. We'll also dive into the role of LLM Compressor, a lightweight compression framework that shrinks model size while preserving accuracy, further slashing inference costs and power consumption. If you're interested in sustainable LLM deployment, GPU optimization, or how Python can lead the charge in green computing, this talk is for you.
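To give a flavor of the ideas covered, here is a minimal toy sketch of PagedAttention's core trick: storing the KV cache in fixed-size blocks addressed through a per-sequence block table, so GPU memory is allocated on demand rather than reserved contiguously for the maximum sequence length. All names and sizes here are illustrative assumptions, not vLLM's actual implementation, which lives in optimized CUDA kernels.

```python
# Toy sketch of PagedAttention-style KV-cache paging (illustrative only;
# class and method names are made up for this example).

BLOCK_SIZE = 4  # tokens per KV-cache block (vLLM uses a larger size, e.g. 16)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))      # pool of physical blocks
        self.block_tables: dict[int, list[int]] = {}    # seq_id -> physical block ids

    def append_token(self, seq_id: int, num_tokens_so_far: int) -> int:
        """Return the physical block for the next token, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:   # current block full (or first token)
            table.append(self.free_blocks.pop())  # grab a block only when needed
        return table[-1]

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for i in range(6):            # a sequence generates 6 tokens...
    cache.append_token(0, i)  # ...so it occupies ceil(6/4) = 2 blocks, not a fixed max
print(len(cache.block_tables[0]))  # 2
cache.free_sequence(0)
print(len(cache.free_blocks))      # 8: all blocks immediately reusable
```

Because short sequences occupy only the blocks they actually need, many more requests fit in GPU memory at once, which is what makes the continuous batching described above, and the energy savings per token, possible.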