Conditioning is grouping by

November 30, 2023 · 12 mins · 2167 words

what’s $y|X$?

Last year I managed to read more papers than in my entire life before. 1 While doing so I developed an intuition about conditional expressions that I think could be of help to more people. tl;dr: conditional expressions can be interpreted as groupby operations.

A common object in the machine learning literature (and the stats literature in general) is the conditional expression, i.e. $y|X$, which reads as “$y$ conditioned on $X$”. For example, one can compute the expected value of a random variable $y$ conditioned on another random variable $X$ being exactly $x$, which is written as $\mathbb{E}(y|X=x)$. To compute it you can use $\mathbb{E}(y|X=x) = \int y P(y|x) dy$, where $P(y|X)$ is the distribution of $y$ conditioned on $X$, i.e. another conditional object.

As a starting point for this post, I’ll use one of the first derivations from “Elements of Statistical Learning” (p. 18), where the authors show that the best option to predict a value $y$ from features $X$ is to use the estimator

\[f(x) = \mathbb{E}(y | X=x)\]

The first time I read that I felt very weird, since I understood all the maths behind the formula but couldn’t get any intuitive interpretation of it. What does it mean that a function of $x$ is defined as a conditional expectation on $x$? How do you compute this function? What does it look like? Why am I learning this shit if xgboost is all you need?

some intuition

After thinking about it for some weeks (I’m a slow learner) I realized that the formula was only saying “the best prediction for a given set of features $x$ is just to take all the other examples with the same features $x$ and average their $y$”. In real machine learning you don’t usually have multiple examples with exactly the same features, and this is why more complex machine learning algorithms are used. But that’s a story for another day; today I want to talk about conditional distributions.
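This reading can be checked numerically: among all constant predictions for a group of examples sharing the same $x$, the group mean of $y$ minimizes the squared error, which is why $f(x) = \mathbb{E}(y|X=x)$ is the best estimator under squared loss. Here is a minimal sketch with toy numbers (the numbers are mine, not from the book):

```python
import numpy as np

# Toy group: three examples that share the same features x.
y = np.array([5.0, 2.0, 5.0])

# Mean squared error of predicting a constant c for this group.
def mse(c):
    return np.mean((y - c) ** 2)

# Scan candidate predictions and keep the one with the lowest error.
candidates = np.linspace(0.0, 10.0, 1001)
best = candidates[np.argmin([mse(c) for c in candidates])]
print(best, y.mean())  # both are 4.0: the best constant prediction is the group mean
```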

After learning how to read the equation I felt a little better, since I now had some intuition about it, but that wasn’t the end. After some more weeks of ruminating about it (sometimes I’m very slow) I realized that this interpretation of $\mathbb{E}(y | X=x)$ was familiar. Wasn’t it following the same logic as .groupby in pandas? If for a given dataframe df I wanted to compute the average value of a column y for each group in column X, I would do df.groupby(X)[y].mean(). Isn’t that quite similar to $\mathbb{E}(y | X=x)$?

formalizing intuition

So here is the formalized version of my intuition:

$\mathbb{E}(y | X=x) \sim$ df.groupby(X)[y].mean()

That is, the idea behind conditional expressions is the same idea behind groupby operations, which are present in multiple languages and packages (itertools, pandas, Scala, Rust, SQL, etc.).

I’ll present now some examples to make my thoughts a little bit clearer. I’ll use pandas’ implementation of groupby since I think almost everyone is familiar with it. Let’s build a dataframe that consists of groups and values, and then compute the conditional expected value for each group

import pandas as pd

df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 3, 3, 3],
    "value": [5, 2, 5, 1, 1, 9, 8, 7],
})
df.groupby("group")["value"].mean()

which returns

| group | value |
|-------|-------|
| 1     | 4     |
| 2     | 1     |
| 3     | 8     |

And this is basically the same as $\mathbb{E}(y | X=x)$ with $y = \text{value}$ and $X = \text{group}$. Here we have computed the conditional expected value, but you can also use groupby to compute the full conditional distribution with value_counts(normalize=True)

df.groupby("group")["value"].value_counts(normalize=True)

and you’ll get

| (group, value) | proportion |
|----------------|------------|
| (1, 5)         | 0.666667   |
| (1, 2)         | 0.333333   |
| (2, 1)         | 1.000000   |
| (3, 7)         | 0.333333   |
| (3, 8)         | 0.333333   |
| (3, 9)         | 0.333333   |

which is a tabular representation of $P(y|X)$, i.e. the distribution of $y$ for each group in $X$. For example, for group 2 we see that the distribution has all its mass at 1, and for group 3 the distribution is uniform over 7, 8, and 9.
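The two objects are linked: summing value × proportion within each group is the discrete version of $\mathbb{E}(y|X=x) = \int y P(y|x) dy$, and it recovers the per-group means from before. A small sketch of this check (the “proportion” column name is an assumption about what your pandas version produces, so I set it explicitly):

```python
import pandas as pd

df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 3, 3, 3],
    "value": [5, 2, 5, 1, 1, 9, 8, 7],
})

# P(y|x) as a table: one row per (group, value) pair
dist = (df.groupby("group")["value"]
          .value_counts(normalize=True)
          .reset_index(name="proportion"))

# E(y|X=x) = sum_y y * P(y|x), evaluated group by group
expectation = (dist["value"] * dist["proportion"]).groupby(dist["group"]).sum()
print(expectation)  # group 1 -> 4.0, group 2 -> 1.0, group 3 -> 8.0
```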

what about continuous variables?

Those readers used to working with conditional probabilities will have noticed some flaws in my reasoning. The main one is that with probabilities we can condition on continuous values, i.e. $P(\text{salary} | \text{height})$, while if we group by a continuous column we get groups of only one element. However, we can overcome this limitation by imagining an infinite dataframe that contains the full distribution in the grouping column. That is, a dataframe with a column named height that contains all the possible heights, and another column named salary that for each height contains the distribution of salaries. This dataframe has the cardinality of $\mathbb{R}^2$ and is impossible to build, but we can imagine it and apply the same intuition as in the previous section.
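If you want to play with this on an actual (finite) dataframe, a common workaround is to bin the continuous column and group on the bins. Here is a sketch with a made-up salary-vs-height relationship (the linear model and all its numbers are invented purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
height = rng.uniform(140, 220, n)                 # continuous conditioning variable
salary = 500 * height + rng.normal(0, 5_000, n)   # made-up model: salary grows with height

df = pd.DataFrame({"height": height, "salary": salary})

# Grouping on raw heights would give ~1-element groups; bin them instead
df["height_bin"] = pd.cut(df["height"], bins=8)
cond_mean = df.groupby("height_bin", observed=True)["salary"].mean()
print(cond_mean)  # approximates E(salary | height) on each bin
```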

bayes theorem

To talk about conditional probabilities is to talk about Bayes’ theorem. The theorem reads

\[P(A|B) = \frac{P(B|A) P(A)}{P(B)}\]

As a final exercise for this post, I’ll show that the presented intuition can be used to reproduce Bayes’ theorem.

To show it we can create a dataframe with two columns: salary takes random integer values between 40000 and 200000, and height takes random integer values between 140 and 220. With the code in 2 we can create the dataframe and compute both $\frac{P(s|h) P(h)}{P(s)}$ and $P(h|s)$.

According to Bayes’ theorem, we expect the column P(s|h) P(h) / P(s) to be equal to P(h|s). If you run the code in 2 and sample 10 random rows you’ll get something similar to

| salary | height | P(h\|s)   | P(s\|h) P(h) / P(s) |
|--------|--------|-----------|---------------------|
| 59693  | 145    | 0.0357143 | 0.0357143           |
| 68419  | 168    | 0.0666667 | 0.0666667           |
| 155131 | 184    | 0.030303  | 0.030303            |
| 69165  | 187    | 0.0487805 | 0.0487805           |
| 49761  | 186    | 0.0344828 | 0.0344828           |
| 196511 | 153    | 0.0238095 | 0.0238095           |
| 113707 | 184    | 0.027027  | 0.027027            |
| 116071 | 203    | 0.025641  | 0.025641            |
| 193425 | 149    | 0.0555556 | 0.0555556           |
| 162955 | 199    | 0.03125   | 0.03125             |

Cool! The actual P(height | salary) given by the data coincides with the values computed using Bayes’ theorem. Of course I wasn’t expecting the opposite - my plan wasn’t to disprove Bayes’ theorem in a 1000-word post - but it’s interesting that you can do all these computations using the intuition I explained here.


In this post, I presented my intuition about conditioning in statistics. I also showed that Bayes’ theorem holds within this intuition. So next time you find a weird $y | X$ formula, don’t panic and remember that it’s just a fancy way of saying “I’m grouping the data”.

I’m sure any mathematician reading this will be horrified and could point out dozens of errors in my reasoning. Too bad I don’t care. But if you can improve my intuition feel free to write and enlighten me.

  1. When I finished my Physics master’s I thought I would never read a paper again, and that made me a little bit sad. But last year my incredible wife bought me a tablet with a stylus and since then I’ve been devouring papers. Being able to read a paper and take handwritten notes directly on it, without having to print it, has been a game changer for me. 

  2. I didn’t want to pollute the text with this monstrosity, but here is how to use groupby to compute posterior distributions.

    import numpy as np
    import pandas as pd

    N = 5_000_000
    height = np.random.randint(140, 220, N)
    salary = np.random.randint(40_000, 200_000, N)
    df = pd.DataFrame({'height': height, 'salary': salary})
    # compute conditional probabilities: one groupby per direction
    p_ba = (df
            .groupby("height")["salary"]
            .value_counts(normalize=True)
            .reset_index()
            .rename(columns={"proportion": "P(salary|height)"}))
    p_ab = (df
            .groupby("salary")["height"]
            .value_counts(normalize=True)
            .reset_index()
            .rename(columns={"proportion": "P(height|salary)"}))
    # compute absolute (marginal) probabilities
    p_a = (df[['height']]
            .value_counts(normalize=True)
            .reset_index()
            .rename(columns={"proportion": "P(height)"}))
    p_b = (df[['salary']]
            .value_counts(normalize=True)
            .reset_index()
            .rename(columns={"proportion": "P(salary)"}))
    # compute P(B|A) * P(A) / P(B) and gather everything in one dataframe
    num = p_ba.merge(p_a)
    num["P(salary|height) P(height)"] = num["P(salary|height)"] * num["P(height)"]
    tot = num.merge(p_b)
    tot["P(salary|height) P(height) / P(salary)"] = tot["P(salary|height) P(height)"] / tot["P(salary)"]
    full_probs = p_ab.merge(tot)