Data Morph: A Cautionary Tale of Summary Statistics PyCon India 2025

2025-09-13 10:50–11:20, Track 3

Statistics does not come intuitively to humans, who tend to look for simple ways to describe complex things. Given a complex dataset, it is tempting to describe it with simple summary statistics like the mean, median, or standard deviation. However, these numbers are not a replacement for visualizing the distribution.
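
As a small illustration of this point, the sketch below builds two toy datasets (invented here for illustration, not taken from the talk) that agree on mean, median, and population standard deviation, yet are distributed very differently. The function name `describe` is hypothetical.

```python
import math
import statistics

# Two toy datasets invented for this sketch (not from the talk).
two_clusters = [-1.0, 1.0, -1.0, 1.0]                        # no values near zero
mostly_centered = [-math.sqrt(2), 0.0, 0.0, math.sqrt(2)]   # half the values at zero


def describe(data):
    """Return (mean, median, population standard deviation), rounded to 6 places."""
    return (
        round(statistics.mean(data), 6),
        round(statistics.median(data), 6),
        round(statistics.pstdev(data), 6),
    )


if __name__ == "__main__":
    for name, data in [("two_clusters", two_clusters), ("mostly_centered", mostly_centered)]:
        print(name, describe(data), sorted(data))
```

Both datasets produce the same summary tuple, but printing the sorted values (or plotting them) shows that one has no observations near the center while the other has half of them there.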

To illustrate this fact, researchers have generated many datasets that are very different visually, but share the same summary statistics. In this talk, I will discuss Data Morph, an open source package that builds on previous research using simulated annealing to perturb an arbitrary input dataset into a variety of shapes, while preserving the mean, standard deviation, and correlation to multiple decimal places. I will showcase how it works, discuss the challenges faced during development, and explore the limitations of this approach.
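
To give a feel for the general technique, here is a minimal sketch of simulated annealing under a summary-statistics constraint. It is not Data Morph's implementation or API: the function names, the circle target, the jitter size, the linear cooling schedule, and the use of rounding to compare statistics are all assumptions made for illustration.

```python
import math
import random


def summary(points, decimals=2):
    """Means, sample standard deviations, and Pearson correlation, rounded."""
    n = len(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = math.fsum(xs) / n, math.fsum(ys) / n
    ss_x = math.fsum((x - mean_x) ** 2 for x in xs)
    ss_y = math.fsum((y - mean_y) ** 2 for y in ys)
    ss_xy = math.fsum((x - mean_x) * (y - mean_y) for x, y in points)
    stats = (
        mean_x,
        mean_y,
        math.sqrt(ss_x / (n - 1)),
        math.sqrt(ss_y / (n - 1)),
        ss_xy / math.sqrt(ss_x * ss_y),
    )
    return tuple(round(value, decimals) for value in stats)


def mean_distance_to_circle(points, center, radius):
    """Average distance from each point to the circle's outline."""
    return math.fsum(
        abs(math.hypot(x - center[0], y - center[1]) - radius) for x, y in points
    ) / len(points)


def morph_toward_circle(points, center, radius, iterations=20_000, decimals=2, seed=0):
    """Nudge points toward a circle while keeping the rounded summary fixed."""
    rng = random.Random(seed)
    points = list(points)
    target = summary(points, decimals)

    def distance(point):
        return abs(math.hypot(point[0] - center[0], point[1] - center[1]) - radius)

    for step in range(iterations):
        # Assumed linear cooling: worse moves become less likely over time.
        temperature = 0.4 * (1 - step / iterations)
        index = rng.randrange(len(points))
        old = points[index]
        new = (old[0] + rng.gauss(0, 0.05), old[1] + rng.gauss(0, 0.05))
        # Skip moves that are not closer to the shape, unless the temperature allows them.
        if distance(new) >= distance(old) and rng.random() >= temperature:
            continue
        points[index] = new
        # Undo the move if any rounded summary statistic changed.
        if summary(points, decimals) != target:
            points[index] = old
    return points


def make_start(n=100, seed=1):
    """Hypothetical starting dataset: an uncorrelated Gaussian cloud."""
    rng = random.Random(seed)
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]


if __name__ == "__main__":
    start = make_start()
    result = morph_toward_circle(start, center=(0.0, 0.0), radius=1.4)
    print("summary before:", summary(start))
    print("summary after: ", summary(result))
```

Because every accepted move is checked against the starting summary and reverted on a mismatch, the rounded statistics of the result match those of the input by construction, while the points drift toward the target shape.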

Summary statistics aren't enough (10 minutes)
- Walk through various commonly used summary statistics and why, on their own, they tell you little about the distribution of the data, with visual examples
- Introduce statistics some attendees may not have heard of (no equations, just descriptions of what they measure): moments, kurtosis, skewness
- Anscombe's Quartet, Datasaurus Dozen, "A hypothesis is a liability" experiment
Introduction to the Data Morph Python package I built (10 minutes)
- Explain the gaps it fills
- Show fun example
- Code samples for installation and morphing
- Discuss shape creation, sizing, and positioning
- Discuss the new ideas that extend the previous research
Limitations and future work (5 minutes)
- Show some of the things that could be improved
- Discuss why these limitations occur
Lessons learned (3 minutes)
- Discuss my experience extending external research
- Discuss a little bit of what it takes to start an open source project
- Share helpful resources for the things I needed to learn along the way
- Closing remarks
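
For readers who want to see the moments named in the outline in concrete terms, the sketch below computes skewness and excess kurtosis as standardized population moments. The function names and both toy datasets are assumptions made for illustration and are not part of the talk.

```python
import math


def standardized_moment(data, order):
    """Population standardized moment: the mean of ((value - mean) / std) ** order."""
    n = len(data)
    mean = math.fsum(data) / n
    std = math.sqrt(math.fsum((v - mean) ** 2 for v in data) / n)
    return math.fsum(((v - mean) / std) ** order for v in data) / n


def skewness(data):
    """Third standardized moment: the sign indicates which tail is longer."""
    return standardized_moment(data, 3)


def excess_kurtosis(data):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return standardized_moment(data, 4) - 3.0


# Toy datasets invented for this sketch.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]

if __name__ == "__main__":
    for name, data in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
        print(name, skewness(data), excess_kurtosis(data))
```

The symmetric dataset has a skewness of zero, while the dataset with one large value has a clearly positive skewness, reflecting its long right tail.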

Prerequisites –

This talk is intended for people of all levels. People should understand what the mean, standard deviation, variance, and correlation are at a high level (no equations necessary).

Additional Resources –

https://stefaniemolin.com/data-morph/
https://stefaniemolin.com/data-morph-talk/
https://github.com/stefmolin/data-morph

Target Audience –

Beginner

Stefanie Molin

Stefanie Molin is a software engineer at Bloomberg in New York City, where she tackles tough problems in information security, particularly those revolving around data wrangling/visualization, building tools for gathering data, and knowledge sharing. She is also a core developer of numpydoc and the author of “Hands-On Data Analysis with Pandas: A Python data science handbook for data collection, wrangling, analysis, and visualization,” which is currently in its second edition and has been translated into Korean and Chinese. She holds a Bachelor of Science degree in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, as well as a master’s degree in computer science, with a specialization in machine learning, from Georgia Tech. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.
