2025-09-13, Track 1
Python is often criticized for its Global Interpreter Lock (GIL), which is seen as a bottleneck for high-performance computing. However, this talk shows how Python, when used with the right design principles, can deliver a 40x throughput improvement over the baseline implementation in a real-world image segmentation post-processing pipeline. We present a generic rectangle partitioning algorithm that converts irregular segments into precise, non-overlapping rectangles.
By leveraging multiprocessing, ProcessPoolExecutor, and Numba JIT compilation with shared memory and inter-process communication (IPC), we overcome GIL limitations and scale efficiently across cores. This session will walk through the architectural decisions, performance bottlenecks, and Pythonic optimizations that made this possible—demonstrating that Python, with the right tools and mindset, can be both elegant and fast.
This talk begins with a real-world problem from semiconductor manufacturing: converting irregular image segments into precise, non-overlapping rectangles for downstream processing. We’ll start by introducing the application context and the initial naive solution built using Python’s cv2 and NumPy libraries. A flowchart will illustrate the basic pipeline of the naive approach and its limitations in terms of performance and scalability.
From there, the presentation follows a step-by-step journey of iterative optimization. First, we explore basic multiprocessing to parallelize the workload, which brings modest gains but introduces memory overhead. Next, we add multiprocessing.Manager to share state across processes and reduce memory duplication, followed by a producer-consumer model using Queue to better distribute tasks.
The final breakthrough comes with ProcessPoolExecutor, which simplifies process management and maximizes CPU utilization. Along the way, we also use Numba for JIT compilation of compute-heavy functions. The key message: Python’s tools are powerful, but engineering design and correct usage are what unlock real performance.
Detailed Outline
The talk will progressively tackle the problem, addressing challenges and improvements through an iterative, step-by-step approach.
Introduction: The Real-World Challenge
In semiconductor manufacturing, precision is paramount. One critical task involves converting irregular image segments—often noisy, overlapping, or imprecise—into clean, non-overlapping rectangles.
1. Naive Approach
The initial solution uses OpenCV (cv2) and NumPy. A compute-bound algorithm was written from scratch to solve the problem. This approach is straightforward, but it runs on a single core and does not scale.
Pros:
- Simple to implement
- Uses well-known libraries
Cons:
- Single-threaded
- High memory usage
- Poor scalability
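The abstract does not spell out the partitioning algorithm itself, so the following is only a minimal sketch of one simple way to turn an irregular segment into non-overlapping rectangles. It assumes the segment is given as a 2-D binary NumPy mask and uses a greedy grow-right-then-grow-down strategy; the function name mask_to_rectangles is hypothetical, and cv2 is left out so the sketch runs with NumPy alone.

```python
import numpy as np


def mask_to_rectangles(mask):
    """Greedily cover the True pixels of a 2-D mask with non-overlapping rectangles.

    Returns a list of (row, col, height, width) tuples.
    Illustrative only: this is not necessarily the algorithm used in the talk.
    """
    remaining = np.asarray(mask, dtype=bool).copy()
    n_rows, n_cols = remaining.shape
    rects = []
    for r in range(n_rows):
        c = 0
        while c < n_cols:
            if not remaining[r, c]:
                c += 1
                continue
            # Grow the rectangle to the right along the current row.
            c_end = c
            while c_end < n_cols and remaining[r, c_end]:
                c_end += 1
            # Grow downward while the whole column span is still uncovered foreground.
            r_end = r + 1
            while r_end < n_rows and remaining[r_end, c:c_end].all():
                r_end += 1
            rects.append((r, c, r_end - r, c_end - c))
            remaining[r:r_end, c:c_end] = False  # mark these pixels as covered
            c = c_end
    return rects
```

The pure-Python loops make this compute-bound and single-core, which is the kind of baseline the later steps try to speed up. A greedy cover like this yields valid non-overlapping rectangles but not necessarily the smallest possible number of them.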
2. Basic Multiprocessing
Python’s multiprocessing module is used to parallelize the workload across CPU cores. This improves speed but increases memory usage.
Pros:
- Utilizes multiple CPU cores
- Faster than naive approach
Cons:
- High memory overhead
- Complex process management
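A minimal sketch of this step, assuming the image can be split into horizontal strips that are processed independently. The helper names (count_foreground, split_into_tiles, process_parallel) and the per-tile work are hypothetical stand-ins, not the talk's actual pipeline.

```python
import multiprocessing as mp

import numpy as np


def count_foreground(tile):
    """Stand-in for per-tile post-processing: count the nonzero pixels."""
    return int(np.count_nonzero(tile))


def split_into_tiles(image, tile_rows):
    """Split an image into horizontal strips of at most tile_rows rows each."""
    return [image[r:r + tile_rows] for r in range(0, image.shape[0], tile_rows)]


def process_parallel(image, tile_rows=64, workers=2):
    """Process the strips in a pool of worker processes."""
    tiles = split_into_tiles(image, tile_rows)
    # Each tile is pickled and copied into a worker process; this copying
    # is one source of the memory overhead listed above.
    with mp.Pool(processes=workers) as pool:
        return pool.map(count_foreground, tiles)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 2, size=(256, 256), dtype=np.uint8)
    counts = process_parallel(image)
    print(sum(counts) == int(np.count_nonzero(image)))
```

The worker function is defined at module level so that it can be pickled, and the pool is started under the __main__ guard, which is required on platforms that spawn rather than fork worker processes.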
3. Manager-based Sharing
Using multiprocessing.Manager allows shared data structures across processes, reducing memory duplication.
Pros:
- Reduced memory usage
- Shared state across processes
Cons:
- Slower inter-process communication
- Slightly more complex code
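As a rough illustration of the trade-off above, the sketch below lets workers write results into a single manager-backed dictionary instead of returning copies. The rectangle format (row, col, height, width) and the function names are assumptions made for this example.

```python
import multiprocessing as mp


def record_area(shared_results, key, rect):
    """Worker: store the area of one (row, col, height, width) rectangle."""
    _, _, height, width = rect
    # With a manager proxy, every access to shared_results is a round trip
    # to the manager process, which is why this kind of sharing is slower.
    shared_results[key] = height * width


def compute_areas(rects, workers=2):
    """Fill a manager-backed dict from several worker processes."""
    with mp.Manager() as manager:
        shared_results = manager.dict()
        tasks = [(shared_results, i, rect) for i, rect in enumerate(rects)]
        with mp.Pool(processes=workers) as pool:
            pool.starmap(record_area, tasks)
        # Copy into a plain dict before the manager shuts down.
        return dict(shared_results)


if __name__ == "__main__":
    print(compute_areas([(0, 0, 3, 2), (2, 2, 1, 1)]))
```

Because the worker only uses item assignment, the same function also works with an ordinary dict, which makes it easy to test without starting any processes.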
4. Producer-Consumer Model
A producer-consumer pattern using Queue is implemented to dynamically distribute tasks.
Pros:
- Better load balancing
- Scales well with complexity
Cons:
- Requires careful synchronization
- More complex architecture
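One possible shape for such a pipeline is sketched below, assuming tasks are small picklable tuples and a sentinel value marks the end of the work. All names here (SENTINEL, rect_area, consumer, run_pipeline) are hypothetical.

```python
import multiprocessing as mp

SENTINEL = None  # tells a consumer that no more work is coming


def rect_area(rect):
    """Stand-in for real per-task work on a (row, col, height, width) rectangle."""
    _, _, height, width = rect
    return height * width


def consumer(task_queue, result_queue):
    """Pull tasks until the sentinel arrives, pushing (index, result) pairs."""
    while True:
        item = task_queue.get()
        if item is SENTINEL:
            break
        index, rect = item
        result_queue.put((index, rect_area(rect)))


def run_pipeline(rects, workers=2):
    """Producer: start consumers, enqueue tasks, then collect ordered results."""
    task_queue = mp.Queue()
    result_queue = mp.Queue()
    procs = [mp.Process(target=consumer, args=(task_queue, result_queue))
             for _ in range(workers)]
    for proc in procs:
        proc.start()
    for item in enumerate(rects):
        task_queue.put(item)
    for _ in procs:
        task_queue.put(SENTINEL)  # one sentinel per consumer
    # Drain the results before joining, so no consumer blocks on a full pipe.
    results = dict(result_queue.get() for _ in rects)
    for proc in procs:
        proc.join()
    return [results[i] for i in range(len(rects))]


if __name__ == "__main__":
    print(run_pipeline([(0, 0, 3, 2), (2, 2, 1, 1)]))
```

Idle consumers pull the next task as soon as they finish the previous one, which is where the load balancing comes from. The sentinel handling and the drain-before-join ordering are examples of the synchronization care this model needs.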
5. ProcessPoolExecutor
The concurrent.futures.ProcessPoolExecutor simplifies process management and improves CPU utilization.
Pros:
- Cleaner syntax
- Automatic process pooling
- High CPU utilization
Cons:
- Less control over individual processes
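A short sketch of the same fan-out written with ProcessPoolExecutor, assuming each tile can be processed independently. The per-tile function tile_bounding_box is a hypothetical stand-in for the real post-processing.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def tile_bounding_box(tile):
    """Return (row_min, col_min, row_max, col_max) of the nonzero pixels, or None."""
    rows, cols = np.nonzero(tile)
    if rows.size == 0:
        return None
    return (int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max()))


def process_tiles(tiles, workers=None, chunksize=1):
    """Map tile_bounding_box over the tiles in a process pool, preserving order."""
    # max_workers=None lets the executor choose a default based on the CPU count.
    with ProcessPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(tile_bounding_box, tiles, chunksize=chunksize))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tiles = [rng.integers(0, 2, size=(32, 32), dtype=np.uint8) for _ in range(8)]
    print(process_tiles(tiles, workers=2))
```

The executor owns worker start-up, task distribution, and shutdown, so the calling code reduces to a single map call. A larger chunksize can reduce IPC overhead when there are many small tasks.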
6. Bonus: Numba JIT Optimization
Numba is used to apply Just-In-Time (JIT) compilation to compute-heavy functions.
Pros:
- 5x–10x speedup
- Easy to apply with @jit decorator
Cons:
- Limited to numerical functions
- May require code refactoring
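To illustrate the kind of function that benefits, here is a small loop-heavy kernel decorated with Numba's njit, which is shorthand for jit(nopython=True). The kernel (longest_run_per_row) is a hypothetical example, and the import fallback is an assumption added only so the sketch still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch runs as plain Python when Numba is unavailable.
    def njit(func):
        return func


@njit
def longest_run_per_row(mask):
    """For each row of a 2-D mask, return the longest run of nonzero pixels."""
    n_rows, n_cols = mask.shape
    out = np.zeros(n_rows, dtype=np.int64)
    for r in range(n_rows):
        best = 0
        current = 0
        for c in range(n_cols):
            if mask[r, c]:
                current += 1
                if current > best:
                    best = current
            else:
                current = 0
        out[r] = best
    return out
```

Explicit nested loops over array elements are slow in the interpreter but compile well under Numba. The first call includes compilation time, so any timing comparison should measure a later call.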
7. Key Takeaways
- Python is powerful, but performance comes from engineering design, not just libraries.
- Parallelism must be carefully managed to avoid memory and synchronization issues.
- ProcessPoolExecutor and Numba are game-changers for CPU-bound tasks.
- Profiling and iteration are essential—what works for one dataset may not scale.
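On the memory point, the abstract mentions shared memory as part of the final design. The sketch below shows one way this could look with the standard library's multiprocessing.shared_memory: the image is copied once into a shared block, and workers attach to it by name instead of receiving pickled copies. The function names and the row-range task split are assumptions, not the talk's implementation.

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import shared_memory

import numpy as np


def create_shared_array(array):
    """Copy array into a new shared-memory block; return (shm, meta)."""
    shm = shared_memory.SharedMemory(create=True, size=array.nbytes)
    view = np.ndarray(array.shape, dtype=array.dtype, buffer=shm.buf)
    view[:] = array
    # meta is small and picklable: block name, shape, and dtype string.
    meta = (shm.name, array.shape, array.dtype.str)
    return shm, meta


def count_rows_foreground(meta, row_start, row_stop):
    """Worker: attach to the shared block and count nonzero pixels in a row range."""
    name, shape, dtype = meta
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=np.dtype(dtype), buffer=shm.buf)
    count = int(np.count_nonzero(view[row_start:row_stop]))
    del view  # drop the NumPy view before closing the underlying buffer
    shm.close()
    return count


def count_parallel(array, workers=2, rows_per_task=64):
    """Count nonzero pixels using workers that read from shared memory."""
    shm, meta = create_shared_array(array)
    try:
        starts = range(0, array.shape[0], rows_per_task)
        with ProcessPoolExecutor(max_workers=workers) as executor:
            futures = [
                executor.submit(count_rows_foreground, meta, s, s + rows_per_task)
                for s in starts
            ]
            return sum(f.result() for f in futures)
    finally:
        shm.close()
        shm.unlink()  # free the block once all workers are done


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 2, size=(256, 256), dtype=np.uint8)
    print(count_parallel(image) == int(np.count_nonzero(image)))
```

Only the small meta tuple and two integers cross the process boundary per task, so the per-task IPC cost no longer grows with the image size. The creator is responsible for calling unlink exactly once after the workers have finished.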
Basic Knowledge of Python
Familiarity with Python Performance
Understanding of Compilation Concepts
Knowledge of Performance Profiling
Intermediate Python Experience
Intermediate
I'm a computer vision engineer with a strong foundation in classical image processing and a passion for modern AI. My journey began at BARC, where I tackled scientific imaging challenges, and evolved at Applied Materials, where I built deep learning-based solutions for semiconductor manufacturing. I specialize in fusing traditional vision techniques with deep learning to create scalable, interpretable, and high-performance systems. I thrive at the intersection of research and real-world application—bringing AI from concept to product.
Beyond the lab and code, I find balance and inspiration in the outdoors and the arts. Trekking fuels my sense of exploration, much like research does. Reading and writing in Hindi keep me grounded in language and expression, while cooking—both a science and an art—mirrors my love for experimentation and creativity. Whether it's crafting a model or a meal, I enjoy the process of building something meaningful from the ground up.