Question
Answer and Explanation
The Wasserstein distance, also known as the Earth Mover's Distance (EMD), is a metric that measures how far apart two probability distributions are. In 1D, it can be intuitively understood as the minimum amount of "work" required to transform one distribution into the other, where work is the amount of probability mass moved multiplied by the distance it is moved. Let's explore how to calculate this in Python.
Understanding the 1D Wasserstein Distance
Imagine you have two piles of dirt, representing two 1D distributions. The Wasserstein distance is like finding the most efficient way to move the dirt from the first pile into the shape of the second, where the cost is the amount of dirt moved multiplied by how far it travels. If one pile of mass 1 sits at position 2 and the other at position 5, for example, the work (and hence the distance) is 3. For samples, the distance can be computed by sorting both samples and accumulating the absolute differences between the matched values; equivalently, it is the area between the two cumulative distribution functions.
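For two equal-sized, unweighted samples this reduces to the mean absolute difference between the sorted samples. Here is a minimal sketch (with illustrative values) that checks the manual calculation against scipy:
import numpy as np
from scipy.stats import wasserstein_distance
# Illustrative, unsorted samples of equal size
u = np.array([2.0, 0.0, 3.0])
v = np.array([6.0, 1.0, 2.0])
# Sort both samples and average the absolute differences between matched values
manual = np.mean(np.abs(np.sort(u) - np.sort(v)))
print(manual)                      # 1.333...
print(wasserstein_distance(u, v))  # 1.333...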
Python Implementation
We can use the scipy.stats library, which provides a function, wasserstein_distance, that calculates the EMD (Wasserstein distance). Here's a simple example.
from scipy.stats import wasserstein_distance
# Example 1D distributions
distribution_1 = [1, 2, 3, 4, 5]
distribution_2 = [3, 4, 5, 6, 7]
# Calculate the Wasserstein distance
distance = wasserstein_distance(distribution_1, distribution_2)
print(f"Wasserstein distance: {distance}")
Let's see another example with a few float values:
from scipy.stats import wasserstein_distance
# Example distributions with float values
distribution_1_float = [1.1, 2.2, 3.3, 4.4, 5.5]
distribution_2_float = [3.1, 4.2, 5.3, 6.4, 7.5]
# Calculate the Wasserstein distance
distance_float = wasserstein_distance(distribution_1_float, distribution_2_float)
print(f"Wasserstein distance (float): {distance_float}")
Key points to remember:
- The wasserstein_distance function from scipy.stats makes it straightforward to calculate the Wasserstein distance in Python.
- The two input samples do not need to have the same length, and optional weights do not need to sum to 1; wasserstein_distance normalizes them for you (see the sketch after this list).
- The returned value is always non-negative and is zero if and only if the distributions are equal.
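As a minimal sketch of that point (illustrative values; u_weights and v_weights are the optional weight arguments of scipy.stats.wasserstein_distance):
from scipy.stats import wasserstein_distance
# Samples of different lengths, with optional per-point weights
u_values = [0.0, 1.0, 3.0]
v_values = [5.0, 9.0]
u_weights = [1, 1, 2]  # weights need not sum to 1; scipy normalizes them
v_weights = [3, 1]
distance_weighted = wasserstein_distance(u_values, v_values, u_weights, v_weights)
print(f"Weighted Wasserstein distance: {distance_weighted}")  # 4.25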
This makes the Wasserstein distance an efficient, easy-to-use measure of dissimilarity between two 1D distributions across a wide range of applications.