2025-09-14 – Track 3
LLMs like GPT-4 can consume as much energy per query as an entire web search session. What if we could cut that dramatically? In this session, we'll explore how vLLM, a Python-powered, high-throughput inference engine, enables green AI deployment by drastically improving GPU efficiency. We'll cover techniques like PagedAttention, continuous batching, and speculative decoding, showing how they reduce latency, memory overhead, and energy usage per token. We'll also dive into LLM Compressor, a lightweight compression framework that shrinks model size while preserving accuracy, further slashing inference costs and power consumption. If you're interested in sustainable LLM deployment, GPU optimization, or how Python can lead the charge in green computing, this talk is for you.
-
Intro: Why LLM Inference Needs to Go Green
- Rising costs of serving models like LLaMA, Mistral, and the GPT family.
- Environmental & energy concerns of always-on inference systems.
-
Enter vLLM: A Python-Based, Efficient LLM Serving Engine
- Quick intro to vLLM and its architecture; minimal usage sketch after this list.
- How Python + CUDA-backed C++ kernels power it under the hood.
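To ground the discussion, here is a minimal offline-inference sketch using vLLM's Python API; the model name is illustrative, and defaults vary across vLLM releases:

```python
# Minimal vLLM offline-inference sketch.
# Assumes `pip install vllm` and a CUDA-capable GPU; the model name is
# illustrative (a small model keeps the sketch runnable on modest hardware).
from vllm import LLM, SamplingParams

# LLM wraps the whole engine: Python front end, CUDA/C++ kernels underneath.
llm = LLM(model="facebook/opt-125m")

sampling = SamplingParams(temperature=0.8, max_tokens=64)

# generate() runs prompts through the continuous-batching scheduler.
outputs = llm.generate(["Why is LLM inference so energy-hungry?"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```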
-
Core Innovations That Reduce Energy Footprint
- PagedAttention: Minimizing GPU memory fragmentation (toy sketch after this list).
- Speculative Decoding: Fewer wasted compute cycles.
- Continuous Batching: Maximizing throughput without GPU idle time.
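To make the PagedAttention bullet concrete, here is a toy, framework-free sketch of the idea (a conceptual illustration, not vLLM's actual allocator): the KV cache is carved into fixed-size blocks, each sequence keeps a block table mapping logical positions to physical blocks, and a finished sequence returns whole blocks to a shared pool instead of leaving fragmented holes.

```python
# Toy illustration of the PagedAttention idea: a KV cache split into
# fixed-size blocks tracked by per-sequence block tables. Conceptual only;
# vLLM's real allocator manages GPU tensors, not Python lists.
BLOCK_SIZE = 16  # tokens per KV block

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # indices of free physical blocks

    def alloc(self) -> int:
        return self.free.pop()  # any free block works: no contiguity required

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)  # whole blocks return to the pool

class Sequence:
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []  # logical position -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # current block full (or first token)
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1

pool = BlockPool(num_blocks=64)
seq = Sequence(pool)
for _ in range(40):            # 40 tokens -> ceil(40/16) = 3 blocks,
    seq.append_token()         # not one contiguous 40-slot reservation
print(seq.block_table)         # e.g. [63, 62, 61]
pool.release(seq.block_table)  # freeing returns exactly the blocks used
```

Continuous batching leans on the same property: blocks freed by finished sequences can be handed to newly arriving requests immediately, which is what keeps the GPU from idling between batches.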
-
LLM Compressor: Lightweight Models for Sustainable Inference
- Introduction to LLM Compressor: Architecture-agnostic model shrinking while preserving accuracy; quantization sketch after this list.
- Real-world impact: Lower power usage, smaller carbon footprint, and latency gains.
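For a taste of the workflow, a one-shot quantization sketch with LLM Compressor; import paths and argument names have shifted across llm-compressor releases, and the model, dataset, and scheme below are illustrative assumptions:

```python
# Sketch: one-shot W4A16 (GPTQ-style) quantization with LLM Compressor.
# Assumes `pip install llmcompressor`; exact import paths and arguments vary
# by release, and the model/dataset names are illustrative.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize Linear layers to 4-bit weights / 16-bit activations,
# leaving the output head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative base model
    dataset="open_platypus",                     # calibration data
    recipe=recipe,
    output_dir="TinyLlama-1.1B-W4A16",           # loadable directly by vLLM
    max_seq_length=2048,
    num_calibration_samples=512,
)
```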
-
Benchmark: Performance vs. Energy Trade-offs
- Real-world comparison: vLLM vs. Hugging Face Transformers
- Metrics: Throughput (tokens/sec), latency, memory usage, and energy per token (joules/token); measurement sketch after this list
- Demo: Live side-by-side runs of the same model under vLLM and Hugging Face Transformers on the same GPU
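For the energy column, a rough measurement sketch using NVML power sampling; `generate_fn` and `count_tokens` are hypothetical placeholders for a vLLM or Transformers generation call and a token counter, and a real benchmark would sample power on a background thread at a fixed interval rather than at two points:

```python
# Rough sketch: joules per generated token via NVML power readings.
# Assumes `pip install nvidia-ml-py` and a single-GPU box; generate_fn and
# count_tokens are hypothetical placeholders supplied by the caller.
import time
import pynvml

def joules_per_token(generate_fn, count_tokens) -> float:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    start = time.time()
    p_before = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    result = generate_fn()                                      # run inference
    p_after = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    elapsed = time.time() - start

    pynvml.nvmlShutdown()
    avg_watts = (p_before + p_after) / 2                  # crude two-point average
    return (avg_watts * elapsed) / count_tokens(result)   # energy / tokens
```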
-
Deployment Patterns for Sustainable Inference
- Quantization tips (see the serving sketch below).
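One concrete pattern, sketched below: serving a pre-quantized checkpoint with vLLM. The checkpoint name and settings are illustrative assumptions; recent vLLM releases can also auto-detect the quantization method from checkpoint metadata.

```python
# Sketch: serving a pre-quantized checkpoint with vLLM.
# The model name is an illustrative AWQ checkpoint, not a recommendation.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative 4-bit AWQ checkpoint
    quantization="awq",               # match the checkpoint's method
    gpu_memory_utilization=0.85,      # leave headroom for other processes
    max_model_len=4096,               # cap context to bound KV-cache memory
)
```

Quantized weights free HBM for KV-cache blocks, which typically allows larger batches and better throughput per watt on the same card.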
-
Closing Thoughts
- Why Python is still central to sustainable ML.
- Practical takeaways for engineers looking to scale responsibly.
Intermediate
With over a decade in Data and Decision Sciences, I design NLP and AI solutions that solve complex business challenges. Currently a Data Scientist at Red Hat and former researcher at Tata Research Development and Design Center, I have presented research at premier conferences and hold patents, advancing AI-driven innovations. Explore my Google Scholar page: https://scholar.google.com/citations?user=5GCQcVkAAAAJ&hl=en&oi=ao
I’m a Software Maintenance Engineer with a focus on quantization and fine-tuning large language models (LLMs). I work with tools and frameworks like vLLM, LLM Compressor, InstructLab, and RHEL AI to optimize and maintain high-performance AI systems.
My work revolves around making LLMs more efficient, scalable, and adaptable for real-world use cases—whether it’s reducing inference costs, enhancing model alignment, or supporting enterprise-grade AI deployments.