Question

What is the Wasserstein distance in 1D using Python?

Answer and Explanation

The Wasserstein distance, also known as the Earth Mover's Distance (EMD), is a metric used to measure the distance between two probability distributions. In 1D, it can be intuitively understood as the minimum amount of "work" required to transform one distribution into another, where work is defined as the amount of probability mass that needs to be moved multiplied by the distance it is moved. Let's explore how to calculate this in Python.

Understanding the 1D Wasserstein Distance

Imagine you have two piles of dirt, representing two 1D distributions. The Wasserstein distance is like finding the most efficient way to reshape the first pile into the second, where "efficient" means minimizing the total mass moved multiplied by the distance it travels. If one pile of mass 1 sits at position 2 and another pile of mass 1 sits at position 5, for example, the work (and hence the distance) is 3. In 1D this has a simple closed form: sort the samples, build the two empirical cumulative distribution functions (CDFs), and integrate the absolute difference between them.
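Before turning to SciPy, here is a minimal hand-rolled sketch of that sort-and-cumulate computation for two equal-weight samples, using only NumPy. The helper name wasserstein_1d_manual is just an illustrative choice, not a standard API.

import numpy as np

def wasserstein_1d_manual(u_values, v_values):
    # Sketch: 1D Wasserstein-1 distance between two equal-weight samples,
    # computed as the integral of |CDF_u - CDF_v| over the real line.
    u = np.sort(np.asarray(u_values, dtype=float))
    v = np.sort(np.asarray(v_values, dtype=float))

    # Evaluate both empirical CDFs on the merged, sorted set of sample points.
    all_values = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_values)  # widths of the intervals between points

    # CDF value of each distribution on every interval between consecutive points.
    u_cdf = np.searchsorted(u, all_values[:-1], side="right") / u.size
    v_cdf = np.searchsorted(v, all_values[:-1], side="right") / v.size

    # Total work = sum of |CDF difference| times interval width.
    return np.sum(np.abs(u_cdf - v_cdf) * deltas)

print(wasserstein_1d_manual([2], [5]))              # 3.0, the dirt-pile example above
print(wasserstein_1d_manual([1, 2, 3], [4, 5, 6]))  # 3.0, every point moves by 3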

Python Implementation

We can use scipy.stats, which provides a wasserstein_distance function that computes the EMD (Wasserstein distance) between two sets of 1D sample values. Here's a simple example.

from scipy.stats import wasserstein_distance

# Example 1D distributions
distribution_1 = [1, 2, 3, 4, 5]
distribution_2 = [3, 4, 5, 6, 7]

# Calculate the Wasserstein distance
distance = wasserstein_distance(distribution_1, distribution_2)

print(f"Wasserstein distance: {distance}")

Here is another example, this time with float values:

from scipy.stats import wasserstein_distance

# Example distributions with float values
distribution_1_float = [1.1, 2.2, 3.3, 4.4, 5.5]
distribution_2_float = [3.1, 4.2, 5.3, 6.4, 7.5]

# Calculate the Wasserstein distance
distance_float = wasserstein_distance(distribution_1_float, distribution_2_float)

print(f"Wasserstein distance (float): {distance_float}")

Key points to remember:

- The wasserstein_distance function from scipy.stats makes it straightforward to calculate the Wasserstein distance in Python.
- The input distributions are given as sample values (optionally with weights), so the two inputs do not need to have the same length, and any weights do not need to sum to 1; wasserstein_distance normalizes them internally (see the example after this list).
- The returned value is always non-negative and is zero if and only if the distributions are equal.
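As a quick illustration of the second point, here is a small example with samples of different lengths plus explicit weights (u_weights and v_weights are optional arguments of scipy.stats.wasserstein_distance; the particular values below are arbitrary and only for demonstration):

from scipy.stats import wasserstein_distance

# Samples of different lengths: no padding or truncation is needed.
u_values = [0.0, 1.0, 3.0]
v_values = [5.0, 6.0]

print(wasserstein_distance(u_values, v_values))

# Optional weights; they do not have to sum to 1 (scipy normalizes them).
u_weights = [3.0, 1.0, 1.0]
v_weights = [1.0, 1.0]

print(wasserstein_distance(u_values, v_values, u_weights, v_weights))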

With scipy.stats.wasserstein_distance, the 1D Wasserstein distance becomes an efficient, easy-to-use measure of dissimilarity between two distributions in a wide range of applications.
