PyCon India 2025

Data Morph: A Cautionary Tale of Summary Statistics
2025-09-13 , Track 3

Statistics do not come intuitively to humans, who tend to look for simple ways to describe complex things. Given a complex dataset, it is tempting to reach for simple summary statistics like the mean, median, or standard deviation to describe it. However, these numbers are no replacement for visualizing the distribution.

To illustrate this fact, researchers have generated many datasets that look very different visually, but share the same summary statistics. In this talk, I will discuss Data Morph, an open source package that builds on previous research using simulated annealing to perturb an arbitrary input dataset into a variety of shapes, while preserving the mean, standard deviation, and correlation to multiple decimal places. I will showcase how it works, discuss the challenges faced during development, and explore the limitations of this approach.
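The idea in the paragraph above can be sketched in a few lines of standard-library Python. This is a simplified illustration only, not Data Morph's actual implementation: the function names (`summary`, `rounded`, `morph_to_circle`), the circle target, the linear cooling schedule, the step size, and the choice to compare statistics rounded to two decimal places are all assumptions made for the example.

```python
import math
import random


def summary(points):
    """Return (mean x, mean y, sd x, sd y, Pearson r) for a list of (x, y)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    ss_x = sum((x - mean_x) ** 2 for x, _ in points)
    ss_y = sum((y - mean_y) ** 2 for _, y in points)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return (
        mean_x,
        mean_y,
        math.sqrt(ss_x / (n - 1)),  # sample standard deviation
        math.sqrt(ss_y / (n - 1)),
        ss_xy / math.sqrt(ss_x * ss_y),
    )


def rounded(stats, decimals=2):
    """Round every statistic so two datasets can be compared for equality."""
    return tuple(round(value, decimals) for value in stats)


def distance_to_circle(point, center, radius):
    """Distance from a point to the nearest point on the circle's edge."""
    return abs(math.hypot(point[0] - center[0], point[1] - center[1]) - radius)


def morph_to_circle(points, iterations=20_000, step=0.1, seed=0):
    """Hypothetical helper: nudge points toward a circle, keeping the stats."""
    rng = random.Random(seed)
    current = list(points)
    n = len(current)
    mean_x, mean_y, sd_x, sd_y, _ = summary(current)
    target = rounded(summary(current))
    center = (mean_x, mean_y)
    # A circle of this radius around the mean has the same total variance.
    radius = math.sqrt((n - 1) / n * (sd_x**2 + sd_y**2))
    for i in range(iterations):
        temperature = 0.4 * (1 - i / iterations)  # assumed linear cooling
        index = rng.randrange(n)
        old = current[index]
        new = (old[0] + rng.gauss(0, step), old[1] + rng.gauss(0, step))
        closer = (
            distance_to_circle(new, center, radius)
            < distance_to_circle(old, center, radius)
        )
        # Always consider moves toward the shape; consider worse moves
        # with a probability that shrinks as the temperature falls.
        if not (closer or rng.random() < temperature):
            continue
        current[index] = new
        if rounded(summary(current)) != target:
            current[index] = old  # reject: the summary statistics changed
    return current, center, radius


def mean_distance(points, center, radius):
    return sum(distance_to_circle(p, center, radius) for p in points) / len(points)


data_rng = random.Random(42)
start = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(60)]
morphed, center, radius = morph_to_circle(start)

before = mean_distance(start, center, radius)
after = mean_distance(morphed, center, radius)
# Always True: any move that changes the rounded statistics is rejected.
print(rounded(summary(start)) == rounded(summary(morphed)))
print(f"mean distance to circle: {before:.3f} -> {after:.3f}")
```

Because every accepted move is checked against the starting statistics, the output dataset matches the input to two decimal places by construction; how closely it resembles the target shape depends on the number of iterations and the cooling schedule.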


  1. Summary statistics aren't enough (10 minutes)
    • Walk through various commonly used summary statistics and why they don't tell the full story about the distribution of the data, with visual examples
    • Introduce statistics some attendees may not have heard of (no equations, just descriptions of what they measure): moments, kurtosis, skewness
    • Anscombe's Quartet, Datasaurus Dozen, "A hypothesis is a liability" experiment
  2. Introduction to the Data Morph Python package I built (10 minutes)
    • Explain the gaps it fills
    • Show a fun example
    • Code samples for installation and morphing
    • Discuss shape creation, sizing, and positioning
    • Discuss the new ideas that make it possible to extend the previous research
  3. Limitations and future work (5 minutes)
    • Show some of the things that could be improved
    • Discuss why those limitations occur
  4. Lessons learned (3 minutes)
    • Discuss my experience extending external research
    • Discuss a little bit of what it takes to start an open source project
    • Share helpful resources for the things I needed to learn along the way
    • Closing remarks
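
The first outline item can be illustrated with Anscombe's Quartet. The sketch below uses only the standard library; the data values are the published quartet, while the names `QUARTET` and `summarize` and the choice of two-decimal rounding are assumptions made for this example.

```python
import math

# Anscombe's Quartet: four datasets of 11 (x, y) points each.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys, decimals=2):
    """Return rounded (mean x, mean y, sd x, sd y, Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    stats = (
        mean_x,
        mean_y,
        math.sqrt(ss_x / (n - 1)),  # sample standard deviation
        math.sqrt(ss_y / (n - 1)),
        ss_xy / math.sqrt(ss_x * ss_y),
    )
    return tuple(round(value, decimals) for value in stats)


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summaries.items():
    print(name, stats)
```

All four rows agree to two decimal places, even though a scatter plot of each dataset looks entirely different (a noisy line, a curve, a line with one outlier, and a vertical stack with one outlier).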

Prerequisites

This talk is intended for people of all levels. Attendees should understand, at a high level, what the mean, standard deviation, variance, and correlation are (no equations necessary).

Additional Resources

https://stefaniemolin.com/data-morph/
https://stefaniemolin.com/data-morph-talk/
https://github.com/stefmolin/data-morph

Target Audience

Beginner

Stefanie Molin is a software engineer at Bloomberg in New York City, where she tackles tough problems in information security, particularly those revolving around data wrangling/visualization, building tools for gathering data, and knowledge sharing. She is also a core developer of numpydoc and the author of “Hands-On Data Analysis with Pandas: A Python data science handbook for data collection, wrangling, analysis, and visualization,” which is currently in its second edition and has been translated into Korean and Chinese. She holds a bachelor of science degree in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, as well as a master’s degree in computer science, with a specialization in machine learning, from Georgia Tech. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages, both human and computer.