<h1 id="whats-yx">what’s $y|X?$</h1>
<p>Last year I managed to read more papers than in my entire life.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> While doing so I developed an intuition about conditional expressions that I think could help more people. tldr: conditional expressions can be interpreted as <code class="language-plaintext highlighter-rouge">groupby</code> operations.</p>
<p>A common object in the machine learning literature (and the stats literature in general) is the conditional expression, ie $y|X$, which reads as “$y$ conditioned on $X$”. For example, one can compute the expected value of a random variable $y$ conditioned on another random variable $X$ being exactly $x$, written $\mathbb{E}(y|X=x)$. To compute it you can use $\mathbb{E}(y|X=x) = \int y P(y|X=x) dy$, where $P(y|X)$ is the distribution of $y$ conditioned on $X$, ie another conditional object.</p>
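<p>As a toy illustration (with made-up numbers, not from any textbook), for a discrete distribution this integral becomes a weighted sum:</p>

```python
import numpy as np

# toy joint distribution P(X, y) over X in {0, 1} and y in {0, 1, 2}
# (made-up numbers; rows index X, columns index y)
joint = np.array([
    [0.10, 0.20, 0.10],  # X = 0
    [0.30, 0.20, 0.10],  # X = 1
])
ys = np.array([0, 1, 2])

p_x = joint.sum(axis=1)                 # marginal P(X=x)
cond = joint / p_x[:, None]             # conditional P(y|X=x); rows sum to 1
e_y_given_x = (cond * ys).sum(axis=1)   # E(y|X=x) = sum_y y P(y|X=x)

print(e_y_given_x)  # E(y|X=0) = 1, E(y|X=1) = 2/3
```

<p>With continuous $y$ the sum becomes the integral $\int y P(y|X=x) dy$ above.</p>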
<p>As a starting point for this post, I’ll use one of the first derivations from “<a href="https://hastie.su.domains/Papers/ESLII.pdf">Elements of Statistical Learning</a>” (p. 18), where the authors show that the best option to predict a value $y$ from features $X$ is to use the estimator</p>
\[f(x) = \mathbb{E}(y | X=x)\]
<p>The first time I read that I felt very weird: I understood all the maths behind the formula, but I couldn’t get any intuitive interpretation of it. What did it mean that a function of $x$ is defined as a conditional expectation on $x$? How do you compute this function? What does it look like? Why am I learning this shit if <a href="https://twitter.com/tunguz/status/1509197350576672769">xgboost is all you need</a>?</p>
<h1 id="some-intuition">some intuition</h1>
<p>After thinking about it for some weeks (I’m a slow learner) I realized that the formula was only saying “the best prediction for a given set of features $x$ is to take all the other examples with the same features $x$ and average their $y$”. In real machine learning you don’t usually have multiple examples with exactly the same features, which is why more complex algorithms are used. But that’s a story for another day; today I want to talk about conditional distributions.</p>
<p>After learning how to read the equation I felt a little bit better since I got some intuition about it, but it wasn’t the end. After some more weeks of ruminating about it (sometimes I’m very slow) I realized that the interpretation of $\mathbb{E}(y | X=x)$ was familiar. Wasn’t this interpretation following the same logic as <code class="language-plaintext highlighter-rouge">.groupby</code> in pandas? If for a given dataframe <code class="language-plaintext highlighter-rouge">df</code> I wanted to compute the average value of a column <code class="language-plaintext highlighter-rouge">y</code> for each group in column <code class="language-plaintext highlighter-rouge">X</code> I would do <code class="language-plaintext highlighter-rouge">df.groupby(X)[y].mean()</code>. Isn’t it quite similar to $\mathbb{E}(y | X=x)$?</p>
<h1 id="formalizing-intuition">formalizing intuition</h1>
<p>So here goes the formalized version of my intuition:</p>
<blockquote>
<p>$\mathbb{E}(y | X=x) \sim$ <code class="language-plaintext highlighter-rouge">df.groupby(X)[y].mean()</code></p>
</blockquote>
<p>That is, the idea behind conditional expressions is the same idea behind <code class="language-plaintext highlighter-rouge">groupby</code> operations, which are available in many languages and packages (<a href="https://docs.python.org/3/library/itertools.html?highlight=groupby#itertools.groupby">itertools</a>, <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html">pandas</a>, <a href="https://www.scala-lang.org/api/2.12.4/scala/collection/parallel/ParIterableLike$GroupBy.html">scala</a>, <a href="https://docs.rs/itertools/latest/itertools/structs/struct.GroupBy.html">rust</a>, <a href="https://learn.microsoft.com/es-es/sql/t-sql/queries/select-group-by-transact-sql?view=sql-server-ver16">SQL</a>, etc.).</p>
<p>Now I’ll present some examples to make my thoughts a little bit clearer. I’ll use pandas’ implementation of <code class="language-plaintext highlighter-rouge">groupby</code> since I think almost everyone is familiar with it. Let’s build a dataframe of groups and values, and then compute the conditional expected value for each group</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
<span class="s">"group"</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
<span class="s">"value"</span><span class="p">:</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">]</span>
<span class="p">})</span>
<span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"group"</span><span class="p">)[</span><span class="s">"value"</span><span class="p">].</span><span class="n">mean</span><span class="p">()</span>
</code></pre></div></div>
<p>which returns</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">|</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|--------:|--------:|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">8</span><span class="w"> </span><span class="o">|</span><span class="w">
</span></code></pre></div></div>
<p>And this is basically the same as $\mathbb{E}(y | X=x)$ with $y = \text{value}$ and $X = \text{group}$. Here we have computed the conditional expected value, but you can also use <code class="language-plaintext highlighter-rouge">groupby</code> to compute the full conditional distribution using <code class="language-plaintext highlighter-rouge">value_counts(normalize=True)</code></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">"group"</span><span class="p">).</span><span class="n">value_counts</span><span class="p">(</span><span class="n">normalize</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<p>and you’ll get</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">proportion</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|:-------|-------------:|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.666667</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.333333</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.333333</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.333333</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.333333</span><span class="w"> </span><span class="o">|</span><span class="w">
</span></code></pre></div></div>
<p>which is an empirical version of $P(y|X)$, ie: the distribution of $y$ for each group in $X$. For example, for group <code class="language-plaintext highlighter-rouge">2</code> we see that the distribution has all its mass at <code class="language-plaintext highlighter-rouge">1</code>, and for group <code class="language-plaintext highlighter-rouge">3</code> the distribution is uniform over <code class="language-plaintext highlighter-rouge">7</code>, <code class="language-plaintext highlighter-rouge">8</code>, and <code class="language-plaintext highlighter-rouge">9</code>.</p>
<h1 id="what-about-continuous-variables">what about continuous variables?</h1>
<p>Those readers used to working with conditional probabilities will have noticed some flaws in my reasoning. The main one is that with probabilities we can condition on continuous values, ie: $P(\text{salary} | \text{height})$, while if we group by a continuous column we get groups of only one element. However, we can overcome this limitation by imagining an infinite dataframe that contains the full distribution in the grouping column. This is, a dataframe with a column named <code class="language-plaintext highlighter-rouge">height</code> that contains all the possible heights and another column named <code class="language-plaintext highlighter-rouge">salary</code> that for each height contains the distribution of salaries. This dataframe has the cardinality of $\mathbb{R}^2$ and it’s impossible to build, but we can imagine it and apply the same intuition as in the previous section.</p>
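<p>In practice, a common workaround is to bin the continuous column and group by the bins. Here is a minimal sketch with synthetic data (the linear salary-height relation is made up purely for illustration):</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
height = rng.uniform(140, 220, n)                   # continuous feature
salary = 1_000 * height + rng.normal(0, 5_000, n)   # made-up linear relation

df = pd.DataFrame({"height": height, "salary": salary})

# grouping by the raw continuous column would give singleton groups,
# so bin it first and compute E(salary | height in bin) per 10 cm bin
bins = pd.cut(df["height"], bins=range(140, 230, 10))
cond_mean = df.groupby(bins, observed=True)["salary"].mean()
print(cond_mean)
```

<p>As the bin width shrinks (and the data grows), this binned average approaches the conditional expectation $\mathbb{E}(\text{salary} | \text{height})$.</p>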
<h1 id="bayes-theorem">bayes theorem</h1>
<p>To talk about conditional probabilities is to talk about Bayes’ theorem. The theorem reads</p>
\[P(A|B) = \frac{P(B|A) P(A)}{P(B)}\]
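<p>As a quick sanity check, the theorem can be verified on a small discrete joint distribution (a toy example with made-up numbers, separate from the salary/height experiment below):</p>

```python
import numpy as np

# toy joint distribution P(A, B) over two binary variables (made-up numbers)
joint = np.array([
    [0.1, 0.3],   # A = 0
    [0.4, 0.2],   # A = 1
])
p_a = joint.sum(axis=1)   # marginal P(A)
p_b = joint.sum(axis=0)   # marginal P(B)

p_a_given_b = joint / p_b                   # P(A|B), element [a, b]
p_b_given_a = joint / p_a[:, None]          # P(B|A), element [a, b]
bayes = p_b_given_a * p_a[:, None] / p_b    # P(B|A) P(A) / P(B)

print(np.allclose(p_a_given_b, bayes))  # True
```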
<p>As a final exercise for this post, I’ll show that the presented intuition can be used to reproduce Bayes’ theorem.</p>
<p>To show it we can create a dataframe with two columns: <code class="language-plaintext highlighter-rouge">salary</code> takes random integer values between <code class="language-plaintext highlighter-rouge">40000</code> and <code class="language-plaintext highlighter-rouge">200000</code>, and <code class="language-plaintext highlighter-rouge">height</code> takes random integer values between <code class="language-plaintext highlighter-rouge">140</code> and <code class="language-plaintext highlighter-rouge">220</code>. With the code in <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> we can create the dataframe and compute $\frac{P(s|h) P(h)}{P(s)}$ and $P(h|s)$.</p>
<p>According to Bayes’ theorem, we expect the column <code class="language-plaintext highlighter-rouge">P(s|h) P(h) / P(s)</code> to be equal to <code class="language-plaintext highlighter-rouge">P(h|s)</code>. If you run the code in <sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> and sample 10 random rows you’ll get something similar to</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">|</span><span class="w"> </span><span class="n">salary</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">P</span><span class="p">(</span><span class="n">h</span><span class="o">|</span><span class="n">s</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">P</span><span class="p">(</span><span class="n">s</span><span class="o">|</span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="n">P</span><span class="p">(</span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">P</span><span class="p">(</span><span class="n">s</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|---------:|-------:|----------:|--------------------:|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">59693</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">145</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0357143</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0357143</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">68419</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">168</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0666667</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0666667</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">155131</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">184</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.030303</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.030303</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">69165</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">187</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0487805</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0487805</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">49761</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">186</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0344828</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0344828</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">196511</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">153</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0238095</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0238095</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">113707</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">184</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.027027</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.027027</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">116071</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">203</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.025641</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.025641</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">193425</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">149</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0555556</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.0555556</span><span class="w"> </span><span class="o">|</span><span class="w">
</span><span class="o">|</span><span class="w"> </span><span class="m">162955</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">199</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.03125</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">0.03125</span><span class="w"> </span><span class="o">|</span><span class="w">
</span></code></pre></div></div>
<p>Cool! The actual <code class="language-plaintext highlighter-rouge">P(height | salary)</code> given by the data coincides with the values computed using Bayes’ theorem. Of course I wasn’t expecting the opposite - my plan wasn’t to disprove Bayes’ theorem in a 1000-word post - but it’s interesting that you can do all these computations using the intuition I explained here.</p>
<h1 id="conclusions">conclusions</h1>
<p>In this post, I presented my intuition about conditioning in statistics. I also showed that Bayes’ theorem holds within this intuition. So next time you find a weird $y | X$ formula don’t panic and remember that it’s just a fancy way of saying “I’m grouping the data”.</p>
<p>I’m sure any mathematician reading this will be horrified and could point out dozens of errors in my reasoning. <a href="https://www.alexmolas.com/2023/07/15/nobody-cares-about-your-blog.html">Too bad I don’t care</a>. But if you can improve my intuition feel free to write and enlighten me.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>When I finished my Physics master I thought I would never read a paper again, and it made me a little bit sad. But last year my incredible wife bought me a tablet with a stylus and since then I’ve been devouring papers. Being able to read a paper and take handwritten notes directly without having to print it has been a game changer for me. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>I didn’t want to pollute the text with this monstrosity, but here is how to use <code class="language-plaintext highlighter-rouge">groupby</code> to compute posterior distributions.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">N</span> <span class="o">=</span> <span class="mi">5_000_000</span>
<span class="n">height</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">140</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span>
<span class="n">salary</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">40_000</span><span class="p">,</span> <span class="mi">200_000</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'height'</span><span class="p">:</span> <span class="n">height</span><span class="p">,</span> <span class="s">'salary'</span><span class="p">:</span> <span class="n">salary</span><span class="p">})</span>
<span class="c1"># compute conditional and absolute probabilities
</span><span class="n">p_ba</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span>
<span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'height'</span><span class="p">)[[</span><span class="s">'salary'</span><span class="p">]]</span>
<span class="p">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">normalize</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"proportion"</span><span class="p">:</span> <span class="s">"P(salary|height)"</span><span class="p">}))</span>
<span class="n">p_ab</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span>
<span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'salary'</span><span class="p">)[[</span><span class="s">'height'</span><span class="p">]]</span>
<span class="p">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">normalize</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"proportion"</span><span class="p">:</span> <span class="s">"P(height|salary)"</span><span class="p">}))</span>
<span class="n">p_a</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="p">[[</span><span class="s">'height'</span><span class="p">]]</span>
<span class="p">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">normalize</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">.</span><span class="n">to_frame</span><span class="p">()</span>
<span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"proportion"</span><span class="p">:</span> <span class="s">"P(height)"</span><span class="p">}))</span>
<span class="n">p_b</span> <span class="o">=</span> <span class="p">(</span><span class="n">df</span><span class="p">[[</span><span class="s">'salary'</span><span class="p">]]</span>
<span class="p">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">normalize</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">.</span><span class="n">to_frame</span><span class="p">()</span>
<span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"proportion"</span><span class="p">:</span> <span class="s">"P(salary)"</span><span class="p">}))</span>
<span class="c1"># compute P(B|A) * P(A) / P(B)
</span><span class="n">num</span> <span class="o">=</span> <span class="n">p_ba</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">p_a</span><span class="p">)</span>
<span class="n">num</span><span class="p">[</span><span class="s">"P(salary|height) P(height)"</span><span class="p">]</span> <span class="o">=</span> <span class="n">num</span><span class="p">[</span><span class="s">"P(salary|height)"</span><span class="p">]</span> <span class="o">*</span> <span class="n">num</span><span class="p">[</span><span class="s">"P(height)"</span><span class="p">]</span>
<span class="n">tot</span> <span class="o">=</span> <span class="n">num</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">p_b</span><span class="p">)</span>
<span class="n">tot</span><span class="p">[</span><span class="s">"P(salary|height) P(height) / P(salary)"</span><span class="p">]</span> <span class="o">=</span> <span class="n">tot</span><span class="p">[</span><span class="s">"P(salary|height) P(height)"</span><span class="p">]</span> <span class="o">/</span> <span class="n">tot</span><span class="p">[</span><span class="s">"P(salary)"</span><span class="p">]</span>
<span class="n">full_probs</span> <span class="o">=</span> <span class="n">p_ab</span><span class="p">.</span><span class="n">merge</span><span class="p">(</span><span class="n">tot</span><span class="p">)</span>
</code></pre></div> </div>
<p><a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
</ol>
</div>
<hr />
<h1 id="the-least-controversial-movie">The least controversial movie</h1>
<h1 id="tldr">tldr</h1>
<p><em>Escape from Alcatraz</em> is the least controversial movie ever, and <em>The Room</em> is the most controversial.</p>
<h1 id="background">Background</h1>
<p>I argue a lot with my friends, even more so after a beer or two. I believe our friendship is built upon countless hours discussing almost every topic you can imagine. One of the recurring topics is movies and series. We recommend each other movies or series we enjoyed -or not <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">1</a></sup>- and then talk about them. A lot of our discussions end up in an “ask the audience” moment where we look up the movie’s score on IMDB, which usually ends with one of us saying “Meh, what do people know about good cinema anyway?”. But from time to time a rare event happens and we find a movie we all agree is great or bad.</p>
<p>After one of these rare moments, I started wondering about controversial movies and how often people agree on the quality of a cinematographic piece. Imagine you ask people to rate a movie from $0$ to $5$. A movie with a narrow distribution around its average means that everyone agrees on its quality, while a movie with a wide rating distribution means that some people love it and some people hate it.</p>
<h1 id="maths">Maths</h1>
<p>Mathematically: two random variables $A$ and $B$ can have the same mean, ie $\mu_A = \mu_B$, but different variances, ie $\sigma^2_A \neq \sigma^2_B$. It’s clear that the minimum value of $\sigma^2$ is $0$, ie everyone agrees on the quality of the movie, but what about the maximum value? Can we put an upper bound on how much people can disagree about a movie? Formally: given an interval $I = (a, b)$ and a random variable $X$ with support $I$ and mean $\mu$, what is the maximum variance $\sigma^2$ that $X$ can have?</p>
<p>To answer this question we first need to note that $\sigma$ measures how spread out the values of a random variable are. Therefore, to maximize the variance of a random variable we need to maximize its spread. The farthest apart two values in $I$ can be is $b - a$, ie: one value is $a$ and the other is $b$. So to get the maximum spread we want all our values to be either $a$ or $b$, with the constraint that the average value should be $\mu$. We can achieve this by assigning a fraction $f$ of votes to $a$ and a fraction $1-f$ to $b$. Using the average constraint we get</p>
\[f\times a + (1 - f) \times b = \mu\]
<p>which means $f = \frac{b-\mu}{b-a}$.</p>
<p>The variance is defined as $\sigma^2 = E\left[X^2\right] - E\left[X\right]^2$, where we already know $E[X] = \mu$. Using the value we got for $f$, we can compute $E\left[X^2\right] = f a^2 + (1-f) b^2$, which simplifies to</p>
\[E\left[X^2\right] = \mu b + \mu a - ab\]
<p>which means</p>
\[\sigma^2 = (b - \mu)(\mu -a).\]
<p>This value agrees with the <a href="https://en.wikipedia.org/wiki/Bhatia%E2%80%93Davis_inequality">Bhatia-Davis inequality</a>, which is a nice check. In our case we have $a=0$ and $b=5$, so the expression simplifies to $\sigma^2 = 5\mu - \mu^2$. This means that movies with average scores around $0$ or $5$ have a smaller maximum variance than movies with scores around the middle, which makes sense intuitively (to get a very low/high score you need everyone to agree on your quality). Notice also that the maximum variance you can get is $\sigma^2 = 6.25$, attained at $\mu = 2.5$, ie a maximum standard deviation of $\sigma = 2.5$.</p>
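<p>We can numerically check the derivation with the two-point distribution used above (the values of $a$, $b$, and $\mu$ are arbitrary examples):</p>

```python
a, b, mu = 0, 5, 3.2           # example interval and target mean
f = (b - mu) / (b - a)         # fraction of mass at a, from the derivation

# two-point random variable: value a with probability f, b with 1 - f
mean = f * a + (1 - f) * b
var = f * a**2 + (1 - f) * b**2 - mean**2

print(mean)                          # 3.2, the mean constraint holds
print(var, (b - mu) * (mu - a))      # 5.76 5.76, matching the formula
```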
<h1 id="data">Data</h1>
<p>So far we have done some maths around the concept of being controversial but as someone said “In God we trust, all the others must bring data”, and thanks to God we are in the era of data. A quick search takes us to the <a href="https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset">Movies Dataset in Kaggle</a> with <em>26 million ratings from 270,000 users for 45,000 movies</em>, and we’ll use this dataset to ask and answer questions about the least and most controversial movies.</p>
<p>Before starting to answer questions I’ll filter out movies with fewer than $300$ reviews. If you want to do an EDA of the data feel free to do so, but I won’t do it here <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">2</a></sup>.</p>
<h2 id="least-controversial-movies">Least controversial movies</h2>
<p>As discussed above, we’ll define the “controversiality” of a movie as the variance of the ratings given by users; that is, for two movies with the same average rating, the more controversial one is the one with the higher rating variance.</p>
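<p>With pandas this is just a <code class="language-plaintext highlighter-rouge">groupby</code> aggregation. Here’s a minimal sketch with a tiny made-up table (the column names are assumptions, and in the real analysis the cutoff is 300 reviews, not 3):</p>

```python
import pandas as pd

# ratings: one row per (user, movie) pair; column names are assumptions
ratings = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B", "B"],
    "rating": [4, 4, 4, 1, 5, 3],
})

stats = (ratings.groupby("title")["rating"]
         .agg(n="count", mean="mean", std="std")
         .query("n >= 3")        # the post uses n >= 300
         .sort_values("std"))    # least controversial first
print(stats)
```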
<p>Having said that, here are the least controversial movies in history</p>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Title | mean | std
---------------------|------|------
Escape from Alcatraz | 3.78 | 0.67
The Counterfeiters | 3.95 | 0.67
The Lookout | 3.56 | 0.69
Wordplay | 3.81 | 0.69
Presumed Innocent | 3.65 | 0.70
</code></pre></div></div>
<p>Here you can check the rating distributions: <a href="https://www.imdb.com/title/tt0079116/ratings">Escape from Alcatraz</a>, <a href="https://www.imdb.com/title/tt0813547/ratings">The Counterfeiters</a>, <a href="https://www.imdb.com/title/tt0427470/ratings">The Lookout</a>, <a href="https://www.imdb.com/title/tt0492506/ratings">Wordplay</a>, and <a href="https://www.imdb.com/title/tt0100404/ratings">Presumed Innocent</a>. As you can see the rating distributions are pretty concentrated around the average value. Curiously, the least controversial movies are not those with the lowest/highest score, as one would expect from the theoretical upper bound I derived in the last section.</p>
<h2 id="most-controversial-movies">Most controversial movies</h2>
<p>Now we can answer the opposite question and look for the most controversial movies. Here, instead of looking for the single most controversial movie, I decided to stratify the search, ie: look for the most controversial movie with an average rating between 1 and 2, between 2 and 3, etc.</p>
<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Range | Title | mean | std
-------|-------------------------|------|------
(0, 1) | From Justin to Kelly | 0.99 | 0.85
(1, 2) | Digimon: The Movie | 1.98 | 1.39
(2, 3) | The Room | 2.41 | 1.72
(3, 4) | Repo! The Genetic Opera | 3.27 | 1.35
(4, 5) | Heart of a Dog | 4.08 | 1.25
</code></pre></div></div>
<p>This means that the most controversial movie ever is The Room, what a surprise eh? Notice that even The Room is far from the maximum variance one could get, which means we haven’t yet seen the most controversial movie possible. Who knows what the future holds for us? In the next figure I’ve plotted the variance of the most controversial movies against the theoretical upper bound. We still have a huge margin for controversial movies.</p>
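<p>The stratified search can be sketched with pandas’ <code class="language-plaintext highlighter-rouge">cut</code>; the snippet below uses a few rows from the table above as stand-in data, and the column names are assumptions:</p>

```python
import pandas as pd

# per-movie aggregates; in the real analysis these come from the ratings table
movies = pd.DataFrame({
    "title": ["From Justin to Kelly", "The Room", "Heart of a Dog"],
    "mean":  [0.99, 2.41, 4.08],
    "std":   [0.85, 1.72, 1.25],
})

# bin each movie by average rating, then take the highest-std movie per bin
movies["range"] = pd.cut(movies["mean"], bins=[0, 1, 2, 3, 4, 5])
most_controversial = (movies.sort_values("std", ascending=False)
                      .groupby("range", observed=True)
                      .first())
print(most_controversial)
```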
<figure>
<img src="/docs/controversial-movies/max-real-vs-max-pred.png" alt="Distribution of The Room and Escape from Alcatraz" width="500" class="center" />
<figcaption class="center">Maximum variance possible for each rating versus the actual variance of the most controversial movie</figcaption>
</figure>
<p>As a last gift, I leave you the rating distributions of the most and the least controversial films in history</p>
<figure>
<img src="/docs/controversial-movies/distribution.png" alt="Distribution of The Room and Escape from Alcatraz" width="500" class="center" />
<figcaption class="center">Rating distributions of the least and most controversial movies</figcaption>
</figure>
<h1 id="conclusions">Conclusions</h1>
<p>In this post I’ve studied the topic of controversy in movies from two perspectives: the theoretical one and the experimental one. I derived an upper bound on the “controversiality” of a movie, and then discovered that The Room is the most controversial movie and Escape from Alcatraz is the movie whose quality people agree on the most.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:2" role="doc-endnote">
<p>I know one person that used to recommend shitty movies to his friends just to make them lose their time, but that story is for another day. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I’m tired of EDAs. I’m tired of forcing myself to ask boring questions about the data. If I have a specific question I’ll go and answer that question, no less no more. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>tldrReading S3 data from local PySpark2023-10-10T00:00:00+00:002023-10-10T00:00:00+00:00https://alexmolas.com/2023/10/10/local-pyspark-s3<p>Today I wanted to run some experiments with PySpark in EMR. Since running an EMR cluster is expensive I decided to first try the code on my local machine, and once I know what to try run it on the EMR cluster. It seems pretty straightforward, isn’t it? But it wasn’t…</p>
<p>My main problem was reading data from S3 with a locally installed PySpark. After googling I discovered several blog posts with overcomplicated solutions that didn’t work. After spending some hours fighting with this problem I present you a minimal guide that works. No need to download jar files, no need to compile Spark yourself, no need to install specific versions of PySpark.</p>
<h1 id="0-the-problem">0. The problem</h1>
<p>Let’s start with the problem I started with. If you install PySpark as <code class="language-plaintext highlighter-rouge">pip install pyspark</code> and then run</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="n">foo</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">parquet</span><span class="p">(</span><span class="s">"s3://path/to/data"</span><span class="p">)</span>
</code></pre></div></div>
<p>you’ll get the error</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
</code></pre></div></div>
<p>There are two problems with this approach: (1) problems with the filesystem, and (2) problems with the credentials. I’ll solve these problems in the following sections.</p>
<h1 id="1-filesystem">1. Filesystem</h1>
<p>S3 implements multiple interfaces to access the data, namely s3, s3a and s3n. I’m not an expert on this topic; for more info read this <a href="https://luminousmen.com/post/choosing-the-right-aws-storage-service-a-comprehensive-guide-to-s3-s3n-and-s3a">post</a>.</p>
<p>For our case we only need to know</p>
<blockquote>
<p>S3A (Amazon S3A File System) is a newer and recommended Hadoop-compatible interface for accessing data stored in S3. S3A was introduced as part of Apache Hadoop 2.7.0. It is built on top of the S3 protocol and uses the S3 object API to provide better performance, scalability, and functionality compared to S3N.</p>
</blockquote>
<p>So, first we need to fix the URL: from <code class="language-plaintext highlighter-rouge">s3://path/to/data</code> to <code class="language-plaintext highlighter-rouge">s3a://path/to/data</code>. Notice the extra “a”. After applying this change we get a new error (<a href="https://www.commitstrip.com/en/2018/05/09/progress/?">hooray</a>)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>java.lang.RuntimeException: java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
</code></pre></div></div>
<p>Spark no longer complains about the filesystem but about a missing class, which means we’re missing the libraries needed to read from s3a. To solve this we only need to add the missing dependency</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.jars.packages"</span><span class="p">,</span> <span class="s">"org.apache.hadoop:hadoop-aws:3.2.0"</span><span class="p">)</span>
</code></pre></div></div>
<p>Now the code looks like</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark</span> <span class="kn">import</span> <span class="n">SparkConf</span>
<span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="n">conf</span> <span class="o">=</span> <span class="n">SparkConf</span><span class="p">()</span>
<span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.jars.packages"</span><span class="p">,</span> <span class="s">"org.apache.hadoop:hadoop-aws:3.2.0"</span><span class="p">)</span>
<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="n">conf</span><span class="o">=</span><span class="n">conf</span><span class="p">).</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="n">path</span> <span class="o">=</span> <span class="s">"s3a://path/to/data/"</span>
<span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</code></pre></div></div>
<p>The solution some blogs provide for this problem is to download some jar files from Maven and then pass them to the Spark configuration. I don’t know if this was needed in the past, but currently you don’t need to download external dependencies; you only need to specify them in the code and Spark will take care of them.</p>
<p>However, after this change we still have one error (<a href="https://www.commitstrip.com/en/2018/05/09/progress/?">hooray</a>)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>23/10/09 16:32:26 WARN FileSystem: Failed to initialize fileystem [s3a://path/to/data/](s3a://path/to/data):
java.nio.file.AccessDeniedException: BUCKET:
org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider :
org.apache.hadoop.fs.s3a.CredentialInitializationException: Access key, secret key or session token is unset
</code></pre></div></div>
<h1 id="2-credentials">2. Credentials</h1>
<p>From the last Spark error we see that we’re missing the credentials to read from S3. We can configure the credentials using</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.hadoop.fs.s3a.access.key"</span><span class="p">,</span> <span class="s">"..."</span><span class="p">)</span>
<span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.hadoop.fs.s3a.secret.key"</span><span class="p">,</span> <span class="s">"..."</span><span class="p">)</span>
<span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.hadoop.fs.s3a.session.token"</span><span class="p">,</span> <span class="s">"..."</span><span class="p">)</span>
</code></pre></div></div>
<p>Since you don’t want your secrets hard-coded in your scripts, I recommend reading them from the credentials file and injecting them</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">configparser</span>
<span class="n">config</span> <span class="o">=</span> <span class="n">configparser</span><span class="p">.</span><span class="n">ConfigParser</span><span class="p">()</span>
<span class="n">config</span><span class="p">.</span><span class="n">read</span><span class="p">(</span><span class="s">"/Users/alexmolas/.aws/credentials"</span><span class="p">)</span>
<span class="n">access_key</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"prod"</span><span class="p">,</span> <span class="s">"aws_access_key_id"</span><span class="p">)</span>
<span class="n">secret_key</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"prod"</span><span class="p">,</span> <span class="s">"aws_secret_access_key"</span><span class="p">)</span>
<span class="n">session_token</span> <span class="o">=</span> <span class="n">config</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"prod"</span><span class="p">,</span> <span class="s">"aws_session_token"</span><span class="p">)</span>
</code></pre></div></div>
<h1 id="3-putting-everything-together">3. Putting everything together</h1>
<p>If we put all the changes together we get the following snippet</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark</span> <span class="kn">import</span> <span class="n">SparkConf</span>
<span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
<span class="n">conf</span> <span class="o">=</span> <span class="n">SparkConf</span><span class="p">()</span>
<span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.jars.packages"</span><span class="p">,</span>
<span class="s">"org.apache.hadoop:hadoop-aws:3.2.0"</span><span class="p">)</span>
<span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.hadoop.fs.s3a.aws.credentials.provider"</span><span class="p">,</span>
<span class="s">"org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"</span><span class="p">)</span>
<span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.hadoop.fs.s3a.access.key"</span><span class="p">,</span> <span class="s">"..."</span><span class="p">)</span>
<span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.hadoop.fs.s3a.secret.key"</span><span class="p">,</span> <span class="s">"..."</span><span class="p">)</span>
<span class="n">conf</span><span class="p">.</span><span class="nb">set</span><span class="p">(</span><span class="s">"spark.hadoop.fs.s3a.session.token"</span><span class="p">,</span> <span class="s">"..."</span><span class="p">)</span>
<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="p">.</span><span class="n">builder</span><span class="p">.</span><span class="n">config</span><span class="p">(</span><span class="n">conf</span><span class="o">=</span><span class="n">conf</span><span class="p">).</span><span class="n">getOrCreate</span><span class="p">()</span>
<span class="n">path</span> <span class="o">=</span> <span class="s">"s3a://path/to/data/"</span>
<span class="n">spark</span><span class="p">.</span><span class="n">read</span><span class="p">.</span><span class="n">parquet</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</code></pre></div></div>
<p>and now everything runs smoothly and you can play with PySpark locally.</p>Today I wanted to run some experiments with PySpark in EMR. Since running an EMR cluster is expensive I decided to first try the code on my local machine, and once I know what to try run it on the EMR cluster. It seems pretty straightforward, isn’t it? But it wasn’t…Embracing dumbness2023-09-18T00:00:00+00:002023-09-18T00:00:00+00:00https://alexmolas.com/2023/09/18/embracing-dumbness<h2 id="a-short-bio">A short bio</h2>
<p>My grandmother-in-law is amazing. She studied physics in Spain during the 50s, then she moved to Switzerland to do physics research, and then she moved back to Spain to teach physics and algebra at the university (some of her exercise books are still used today at her university). She also had 4 kids and was widowed when she was 40. All of that was during a period when Spain was under a dictatorship, and women weren’t allowed to do a lot of things, e.g.: they needed the approval of a man to open a bank account. So I think it’s safe to say she didn’t have things easy. I truly believe that a movie about her life would be a huge hit.</p>
<p>Since I also studied physics I really enjoy talking with her and learning about her experiences. She’s also an avid reader and always has a lot of book recommendations (which are usually better than the ones pushed to me by typical RecSys). I usually have the opportunity to talk with her once a month, and I always enjoy it.</p>
<p>One might think that, with so much experience and wisdom, this woman would be a person who always has advice or an opinion to share, but instead she never says one word louder than another. She spends her life reading, taking walks near the beach, eating paella on weekends, and spending time with her family. So when she shares a piece of advice you know you have to pay attention. This is the story of the funniest and most practical advice I have ever received from her.</p>
<h2 id="dumbness-as-a-simplifier">Dumbness as a simplifier</h2>
<p>Some weeks ago we were talking about how she manages today’s technology and how it isn’t trivial for people from her generation to deal with technologies that are straightforward for younger people. At some point she said</p>
<blockquote>
<p>When I don’t want to do something I act like I’m dumb and I ask for help.</p>
</blockquote>
<p>And after saying it with a completely straight face we both burst out laughing. She, with tears of laughter in her eyes, told me that she wasn’t joking. “I’ve been doing this all my life and it has always worked.” she explained to me. You have probably heard similar advice from other people (e.g.: <a href="https://danluucination.github.io/look-stupid/">Willingness to look stupid</a> by Dan Luu, <a href="https://nabeelqu.co/understanding">How To Understand Things</a> by Nabeel S. Qureshi, or <a href="https://nlopes.dev/writing/dont-be-afraid-to-be-wrong">Don’t be Afraid to be wrong</a> among many others) where the authors argue that you shouldn’t worry about looking dumb and should ask questions until you fully understand the topic in question. But she meant something completely different; her advice went in the opposite direction. The recommendation is to look dumb to avoid learning about something you are not interested in. It’s saying “I don’t care about this and I don’t want to care”. If you don’t want to do something, make another person do it for you, and playing dumb is a great way of doing so. For example, every time she needs to withdraw money from her account she goes to the banker and says something like “Look, I’m an old woman, I can barely see, and I don’t understand how this works. Can you give me money from my account?”. The point is she isn’t interested in learning how the ATM works, and she doesn’t want the banker to explain it to her. She only wants her money, and making herself look dumb is the easiest and quickest way to get it.</p>
<p>The first thought I had was that this approach is only useful for old people, or for people who aren’t interested in learning new skills, and I don’t consider myself part of either group. But she has been doing this all her life, not only now that she’s older, and it’s clear she isn’t the kind of person who lacks curiosity. So what are the benefits of this approach? What am I missing? Here I’ve compiled a list of benefits you get from playing dumb about things you don’t want to know about.</p>
<ul>
<li><strong>Reduce cognitive load</strong>. Actively refusing to acquire knowledge you don’t need frees space in your mind. For example, there was a team at my last job that was in charge of broadcasting some data but not producing it. After some time broadcasting the data, they became the de facto experts on the meaning of this data, which made the team move slower (meetings, questions, etc.), until at some point they decided to act as if they knew nothing about the data’s meaning and refused to answer questions about it. Instead, they redirected all the questions to the team that was producing the data.</li>
<li><strong>Things go faster</strong>. Like in the example of the ATM you won’t lose time learning things you don’t want to learn. You can use the knowledge of an expert on the topic to accelerate the process.</li>
<li><strong>You have time to dedicate to other interesting things</strong>. Time is limited, and as so it should be optimized and used for things you enjoy.</li>
<li><strong>It’s extremely easy to do</strong>. Since you don’t care about the topic you don’t care about the opinion other people have about you, so you don’t have to pretend to be smart. You only need to say something like “It’s the first time I do X and I’m a little bit confused. Can you help me with X?”</li>
<li><strong>Maintain focus on your priorities</strong>. It’s more or less known that you can only maintain focus on one thing at a time, so removing tasks that distract you from your focus improves your chances of succeeding. Also, <a href="https://en.wiktionary.org/wiki/yak_shaving">Yak shaving</a> is a real problem, and it’s more likely to happen in tasks where you aren’t an expert. Delegate these tasks to experts on the topic and focus on your priorities.</li>
<li><strong>One man’s trash is another man’s treasure</strong>. Some people truly enjoy topics that completely bore you. Asking them for help about these things is a gift for them.</li>
</ul>
<p>After reading this list one could get the impression that this technique is the definitive one to never have to work again in life. However, if I want to be honest with myself and I want to honor the memory of my grandmother-in-law, I cannot finish the post here. Here is a list of situations in which you don’t want to use this approach.</p>
<ul>
<li><strong>Sometimes you need to be and look smart</strong>. For example, if you are in a professional environment and you are expected to have a certain level of expertise, pretending to be dumb might damage your credibility.</li>
<li><strong>You are the expert</strong>. If you are an expert on the topic it’s probably a better idea to solve your problems yourself. Also, if you’re an expert on a topic you’ll probably enjoy solving the problem, otherwise, you wouldn’t be an expert. This also applies when you have strong requirements about the solution you’re looking for. If you want something well done, do it yourself.</li>
<li><strong>You can get used to it and stop learning</strong>. Learning is an active process in which you force yourself to be in a situation outside your comfort zone. If you always avoid these situations you’ll stop learning, so you need to be careful in which situations you use the approach of acting like a dumb person.</li>
<li><strong>It can be rude</strong>. Asking for the same thing a dozen times could be seen as not polite.</li>
<li><strong>It’s not always free</strong>. A lot of people won’t help you if you don’t pay them, even if you look lost or confused since their income depends on charging you for their services.</li>
<li><strong>You become too dependent on others</strong>. If at some point the person who was helping you is no longer available you’ll have a hard time solving the problem by yourself.</li>
</ul>A short bioDifferent Types of Means2023-09-14T00:00:00+00:002023-09-14T00:00:00+00:00https://alexmolas.com/2023/09/14/types-of-means<p>Some months ago, someone posted some interviewing tips in r/datascience (the original post has been deleted, but you can read it <a href="https://www.reddit.com/r/datascience/comments/w9jl5m/comment/ihvhbpz/?utm_source=share&utm_medium=web2x&context=3">here</a>). The post quickly became a meme, mainly due to misogynistic comments and tips that were far from useful. One of the proposed questions was about the difference between the arithmetic and the harmonic mean. I know both formulas, so I thought I would pass this user’s interview. However, after some reflection, I realized that I don’t really understand the concepts in depth. When could it be useful to use the harmonic mean instead of the arithmetic one? And how do they relate to the geometric mean? Apparently, this is common knowledge, but for some reason, I never came across this information during my studies. So, I’m writing this post for my future self and to better understand the concepts.</p>
<h1 id="summarizing-random-variables">Summarizing Random Variables</h1>
<p>Arithmetic, harmonic, and geometric means are methods to summarize a set of random variables. While you can apply any of these methods to any series of numbers, it doesn’t always make sense to do so since each one has its own meaning and area of application.</p>
<p>For example, if I want to summarize how much I earn each month, I can compute the arithmetic mean of the monthly salaries and call it $x$. Then, I can say, ‘I’ve earned $x$ dollars a month during one year.’ This makes sense since I can substitute the actual salary of each month with $x$, and the total salary received would still be the same, i.e., $x \times 12$.</p>
<p>However, if I’m investing in the stock market, I can’t use the arithmetic mean to summarize the monthly returns since I would overestimate the total returns at the end of the year. Instead, I should use the geometric mean.</p>
<p>In the following sections, I’ll present these means and explain in which cases we should use each one.</p>
<h1 id="arithmetic-mean">Arithmetic Mean</h1>
<ul>
<li>
<p><strong>When</strong>: When the changing values are aggregated linearly, for example, to compute the average salary during a year.</p>
</li>
<li>
<p><strong>How</strong>:</p>
</li>
</ul>
\[\bar{x}_a = \frac{1}{n}\left( \sum_{i=1}^n x_i \right)\]
<ul>
<li><strong>Example</strong>: Following the example from the last section about salaries, we can compute the average salary as $\frac{1}{7}(5 + 0 + 5 + 10 + 0 + 5 + 10) = 5$.</li>
</ul>
<h1 id="harmonic-mean">Harmonic Mean</h1>
<ul>
<li>
<p><strong>When</strong>: When the changing value appears in the denominator, for example, to compute the average rate of change of some variable.</p>
</li>
<li>
<p><strong>How</strong>:</p>
</li>
</ul>
\[\bar{x}_h = n \left( \sum_{i=1}^n \frac{1}{x_i} \right)^{-1}\]
<ul>
<li><strong>Example</strong>: Speed is defined as the rate of change of distance with respect to time. Therefore, if I travel from $A$ to $B$ at a speed of 100 km/h and back from $B$ to $A$ at a speed of 50 km/h, the distance traveled in both cases is the same (i.e., $\overline{AB}$) but the time spent is different. Then, to compute the average speed, we should use the harmonic mean as $\frac{2}{\frac{1}{100} + \frac{1}{50}} \approx 67$ km/h, which is different from the arithmetic mean $\frac{100 + 50}{2} = 75$ km/h.</li>
</ul>
<h1 id="geometric-mean">Geometric Mean</h1>
<ul>
<li>
<p><strong>When</strong>: When the changing values are multiplied together, for example, in exponential growth or compound interest.</p>
</li>
<li>
<p><strong>How</strong>:</p>
</li>
</ul>
\[\bar{x}_g = \left( \prod_{i=1}^n x_i \right)^{1/n}\]
<ul>
<li><strong>Example</strong>: If I’m investing my money and the annual returns are $[5\%, 0\%, 5\%, 10\%, 0\%, 5\%, 10\%]$, then the arithmetic mean is 5%. But if I substitute all the values with 5%, I would overestimate the total returns since $1.05 \times 1 \times 1.05 \times 1.1 \times 1 \times 1.05 \times 1.1 < 1.05^7$. The correct value is given by the geometric mean as $\sqrt[7]{1.05 \times 1 \times 1.05 \times 1.1 \times 1 \times 1.05 \times 1.1} \approx 1.04932$.</li>
</ul>
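<p>The three means above can be computed directly with Python’s <code class="language-plaintext highlighter-rouge">statistics</code> module. A quick check with the example numbers from this post:</p>

```python
from statistics import mean, harmonic_mean, geometric_mean

# monthly salaries: linear aggregation, so the arithmetic mean applies
salaries = [5, 0, 5, 10, 0, 5, 10]
print(mean(salaries))          # 5

# same distance at each speed: the value varies in the denominator
speeds = [100, 50]
print(harmonic_mean(speeds))   # ≈ 66.67 km/h, not 75

# compound growth factors: the values multiply together
growth = [1.05, 1.0, 1.05, 1.1, 1.0, 1.05, 1.1]
print(geometric_mean(growth))  # ≈ 1.0493, not 1.05
```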
<h1 id="conclusions">Conclusions</h1>
<p>To be honest, there are not a lot of conclusions to add here, but I’m used to writing this section at the end of every post, so here we are. The objective of this post was to help me understand and clarify these concepts, so I could just keep it to myself, but maybe it can be helpful to someone else in the future (or maybe to the future Alex), so I’ll publish it anyway.</p>Some months ago, someone posted some interviewing tips in r/datascience (the original post has been deleted, but you can read it here). The post quickly became a meme, mainly due to misogynistic comments and tips that were far from useful. One of the proposed questions was about the difference between the arithmetic and the harmonic mean. I know both formulas, so I thought I would pass this user’s interview. However, after some reflection, I realized that I don’t really understand the concepts in depth. When could it be useful to use the harmonic mean instead of the arithmetic one? And how do they relate to the geometric mean? Apparently, this is common knowledge, but for some reason, I never came across this information during my studies. So, I’m writing this post for my future self and to better understand the concepts.How far can you jump from a swing?2023-08-18T00:00:00+00:002023-08-18T00:00:00+00:00https://alexmolas.com/2023/08/18/how-far-can-you-jump<blockquote>
<p>Discussion on <a href="https://news.ycombinator.com/item?id=37313493">HackerNews</a>.<br />
Some people pointed out some flaws in my modelling (eg: assuming zero distance from swing to floor) which I’ve tried to fix. The original maximum distance estimation was around $1m$.</p>
</blockquote>
<p>This summer I’ve spent an absurd amount of time reading and learning about the physics of swings. Yes, you read it right, I’ve been learning about the physical processes that happen when a kid is playing with a swing in the park. Blame it on my kids and the countless hours spent enjoying these moments with them. In particular, I read about the physics of pumping a swing and about the physics of jumping from a swing. Amidst my deep dive into swing physics, I came up with a new Olympic sport in which you start seated on a swing with length $L$, your feet comfortably touching the ground. As a countdown of $T$ seconds commences, you embark on the art of swing-pumping. Your challenge is to execute a skillful leap before the countdown reaches zero. With your jump, you travel a distance $d$ from your initial point, aiming to achieve the greatest possible $d$.</p>
<p>The question is then, which is the best method to maximize $d$?</p>
<p>Before I present you with the answer to the question I’ll summarize the learnings I got from reading about the physics of a swing. As usual, you can find all the code I used for this post in my <a href="https://github.com/alexmolas/alexmolas.github.io/tree/master/notebooks/swing">repo</a>.</p>
<figure>
<img src="/docs/swing/swing_drawing.png" alt="Swing Drawing" width="300" class="center" />
<figcaption class="center">I love this image, and I wish more papers had this kind of picture on them. Image from [^2]</figcaption>
</figure>
<p>Notice that <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">1</a></sup> and <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">2</a></sup> answer a very similar question: what is the optimal time $t$ to jump off so as to reach farthest? However, these references do not deal with the pumping of the swing; they just assume that you start swinging at some angle $\lambda$ and jump at an angle $\phi$, without pumping the swing at any point. Solving this problem is interesting, but I think it’s more exciting to solve it when the person swinging can control the system. This makes it feel more like a real game that you could play in a park or at the Olympic Games.</p>
<h1 id="pumping-a-swing">Pumping a swing</h1>
<p>There are several papers about the pumping of a swing <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">3</a></sup>, <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">4</a></sup>, and <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">5</a></sup>, among others. In this section, I’ll focus in particular on <sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">4</a></sup>.</p>
<p>The model for a swing I’ll use is a rigid dumbbell made up of three masses, suspended by a rigid rod of length $l_1$ attached to the middle mass $m_1$. The distances from $m_1$ to the other masses $m_2$ and $m_3$ are $l_2$ and $l_3$ respectively. The angle of the rod $l_1$ with the vertical is $\phi$ and the angle of the dumbbell with the rod is $\theta$. In the next figure, you can see a diagram of the system</p>
<figure>
<img src="/docs/swing/swing_diagram.png" alt="Swing diagram" width="500" class="center" />
</figure>
<p>The Lagrangian of this system is</p>
\[\begin{align}
\mathcal{L} = & \frac{1}{2} I_1 \dot\phi^2 + \frac{1}{2} I_2 \left(\dot\phi + \dot\theta\right)^2 - l_1 N \dot\phi\left( \dot \phi + \dot\theta \right) \cos \theta \\
&+ M l_1 g \cos \phi - N g \cos\left(\phi + \theta\right)
\end{align}\]
<p>where $M = m_1 + m_2 + m_3$, $N = m_3 l_3 - m_2 l_2$, $I_1 = M l_1^2$, and $I_2 = m_2 l_2^2 + m_3 l_3^2$. Therefore, Lagrange’s equation for $\phi$ is</p>
\[\begin{align}
(I_1 + I_2) \ddot \phi + M g l_1 \sin \phi = & -I_2 \ddot \theta - l_1 N \dot\theta^2 \sin \theta \\
& + l_1 N \ddot\theta \cos \theta + N g \sin(\phi + \theta) \\
& - 2 l_1 N \dot \theta \dot \phi \sin \theta \\
& + 2l_1 N \ddot \phi \cos \theta
\end{align}\]
<p>The paper proceeds by assuming the swinger pumps the swing by forcing $\theta(t) = \theta_0 \cos(\omega t)$, where $\omega$ is the natural angular frequency of the pendulum. They then show that there are two regimes: one with $\phi < \phi_{\text{crit}}$, where the movement is that of a driven harmonic oscillator, and another with $\phi > \phi_{\text{crit}}$, where the movement is a harmonic oscillator with parametric terms. They then analyze the different regimes and solve their equations. In summary, they show that for small amplitudes the swing follows the equation</p>
\[\ddot \phi + \omega_0^2\phi = F \cos(\omega t)\]
<p>where</p>
\[\begin{align}
&\omega_0 = \sqrt{K_0/I_0}\\
&I_0=I_1 + I_2 - 2l_1 N (1 - \theta_0^2/4 )\\
&K_0=Ml_1g - Ng(1 - \theta_0^2/4)\\
&F=\theta_0 \left[ \omega^2 I_2 + N(g - \omega_0^2l_1)(1 - \theta_0^2/8)\right]/I_0
\end{align}\]
<p>The differential equation has the solution</p>
\[\phi(t) = \left(\frac{F}{\omega_0^2 - \omega^2}\right)\left(\cos \omega t - \cos \omega_0t\right)\]
<p>which, for small $t$, looks like this:</p>
<figure>
<img src="/docs/swing/phi_t.svg" alt="phi(t)" width="500" class="center" />
<figcaption class="center">Solution for $\phi(t)$, with $F=0.085$, $\omega=2.21$, $\omega_0=2.23$</figcaption>
</figure>
<p>This solution is good enough for our purposes, since I assume $T$ is small enough that the swinger cannot pump the swing to large values of $\phi$.</p>
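<p>This closed-form solution is easy to evaluate numerically. Below is a minimal sketch, assuming the parameter values quoted in the figure caption above ($F=0.085$, $\omega=2.21$, $\omega_0=2.23$); the function name is mine.</p>

```python
import math

# Assumed values, taken from the figure caption above.
F = 0.085        # driving amplitude
OMEGA = 2.21     # driving angular frequency
OMEGA_0 = 2.23   # natural angular frequency

def phi(t: float) -> float:
    """Small-amplitude swing angle: F/(w0^2 - w^2) * (cos(w t) - cos(w0 t))."""
    return F / (OMEGA_0**2 - OMEGA**2) * (math.cos(OMEGA * t) - math.cos(OMEGA_0 * t))
```

<p>Because $\omega \approx \omega_0$, the prefactor is large and the two cosines beat slowly, which is why the amplitude in the plot grows over time.</p>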
<h1 id="jumping-from-a-swing">Jumping from a swing</h1>
<p>Now let’s study how a swinger should jump from a swing to maximize the traveled distance. The analysis presented here is based on the work of Jason Cole <sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">1</a></sup> and Hiroyuki Shima <sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">2</a></sup>. Notice that the naive solution of jumping at $\phi=\pi/4$ is not optimal. For instance, imagine a swing that oscillates in the range $\pm \pi/4$; then jumping at $\pi/4$ is clearly suboptimal, since the swinger would start the flight with zero speed.</p>
<figure>
<img src="/docs/swing/swing_jump.png" alt="Jumping from a swing" width="500" class="center" />
<figcaption class="center">Diagram showing the situation just after jumping from the swing. Notice that instead of $h$ I'm using $l_1$, and I'm using $h$ as the distance from the swing to the ground, sorry for the confusing notation. Image from Jason Cole's blog.</figcaption>
</figure>
<p>Notice I’m adding a new variable $h$ that represents the distance from the swing to the ground. This variable is not present in Jason’s blog post, but it is in Shima’s paper. Once you jump from the swing, the equations of motion for the horizontal and vertical directions are</p>
\[\begin{cases}
x(t) = l_1 \sin \phi + v t\cos\phi \\
y(t) = h + l_1(1 - \cos \phi) + vt \sin \phi - \frac{1}{2}gt^2
\end{cases}\]
<p>Now, we can compute the total flight time by solving $y(t_{\text{flight}})=0$ and then compute the flight distance as $d = x(t_{\text{flight}})$. The flight time is then</p>
\[t_{\text{flight}} = \frac{v\sin\phi \pm\sqrt{v^2\sin^2\phi+2g(h+l_1(1-\cos\phi))}}{g}\]
<p>Now notice that only the positive root has physical meaning (we don’t want negative times), so the distance is</p>
\[\begin{align}
d = & l_1 \sin\phi + \frac{v^2 \sin\phi\cos\phi}{g} \\
& + \sqrt{\frac{2v^2\cos^2\phi(h+l_1(1-\cos\phi))}{g}+\left(\frac{v^2\sin\phi\cos\phi}{g}\right)^2}
\end{align}\]
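<p>The distance formula above can be transcribed directly into code. A minimal sketch (the function name is mine, and the default values for $l_1$ and $h$ are illustrative, not fixed by the derivation):</p>

```python
import math

G = 9.8  # gravity, m/s^2

def flight_distance(phi: float, v: float, l1: float = 2.0, h: float = 0.4) -> float:
    """Horizontal distance d for a jump at angle phi (rad) with speed v (m/s)."""
    b = v**2 * math.sin(phi) * math.cos(phi) / G        # v^2 sin(phi) cos(phi) / g
    height = h + l1 * (1 - math.cos(phi))               # launch height above ground
    return l1 * math.sin(phi) + b + math.sqrt(2 * v**2 * math.cos(phi)**2 * height / G + b**2)
```

<p>A quick sanity check: with $v=0$ both velocity-dependent terms vanish and $d = l_1\sin\phi$, i.e. the swinger simply drops from the displaced position.</p>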
<h1 id="putting-everything-together">Putting everything together</h1>
<p>We now have all the pieces we need to solve the problem. On one hand, we can compute the swinger angle $\phi$ for any given $t$; on the other, we can compute the distance that the swinger will travel when leaving the swing at an angle $\phi$.</p>
<p>Notice that to compute $d$ we need to know $v$. Other sources that study this problem obtain $v$ through energy conservation; in our case, however, we know $\phi(t)$, so we can get the velocity at the moment of leaving the swing as $v(t) = l_1 \dot\phi(t)$</p>
\[v(t) = \frac{l_1F}{\omega^2_0 - \omega^2}\left(\omega_0 \sin \omega_0 t - \omega\sin\omega t\right)\]
<p>Now, putting everything together we have this set of equations</p>
\[\begin{cases}
d(t) = l_1 \sin\phi(t) + \frac{v(t)^2 \sin\phi(t)\cos\phi(t)}{g} + \sqrt{\frac{2v(t)^2\cos^2\phi(t)\left(h+l_1(1-\cos\phi(t))\right)}{g}+\left(\frac{v(t)^2\sin\phi(t)\cos\phi(t)}{g}\right)^2} \\
\phi(t) = \frac{F}{\omega_0^2 - \omega^2}\left(\cos \omega t - \cos \omega_0t\right) \\
v(t) = \frac{l_1F}{\omega_0^2 - \omega^2}\left(\omega_0 \sin \omega_0 t - \omega\sin\omega t\right)
\end{cases}\]
<p>To compute $d(t)$ we just need to compute $\phi(t)$ and $v(t)$ and substitute the values in the first equation. I’ll use the following constants: $M=1$, $m_1=0.4M$, $m_2 = 0.2M$, $m_3 = 0.4M$, $l_1 = 2$, $l_2 = 0.4$, $l_3 = 0.4$, $h=l_3$, $\theta_0=1$, $g= 9.8$, $T=2 \pi \sqrt{l_1 / g}$, and $\omega= 2\pi/T$.</p>
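<p>This set of equations can be solved numerically with a simple grid search over the jump time. Below is a minimal sketch; the helper names and the grid search are my own, and instead of recomputing $F$, $\omega$, and $\omega_0$ from the mass parameters I plug in the values quoted in the earlier figure caption (an assumption of this sketch):</p>

```python
import math

# Assumed values: F, omega, omega_0 from the earlier figure caption;
# l1, h, g from the list of constants in the text.
F, W, W0 = 0.085, 2.21, 2.23
L1, H, G = 2.0, 0.4, 9.8

def phi(t: float) -> float:
    """Swing angle at time t (small-amplitude solution)."""
    return F / (W0**2 - W**2) * (math.cos(W * t) - math.cos(W0 * t))

def speed(t: float) -> float:
    """Tangential speed v(t) = l1 * dphi/dt."""
    return L1 * F / (W0**2 - W**2) * (W0 * math.sin(W0 * t) - W * math.sin(W * t))

def distance(t: float) -> float:
    """Horizontal distance reached when jumping at time t."""
    p, v = phi(t), speed(t)
    b = v**2 * math.sin(p) * math.cos(p) / G
    height = H + L1 * (1 - math.cos(p))
    return L1 * math.sin(p) + b + math.sqrt(2 * v**2 * math.cos(p)**2 * height / G + b**2)

def best_jump(T: float, dt: float = 0.01):
    """Grid search for the jump time t* <= T that maximizes the distance."""
    t_star = max((i * dt for i in range(int(T / dt) + 1)), key=distance)
    return t_star, distance(t_star)
```

<p>The grid resolution <code>dt</code> bounds how precisely $t^*$ is located; a finer grid or a local refinement step would sharpen it.</p>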
<p>With these parameters, we can now plot the traveled distance as a function of the jumping time.</p>
<figure>
<img src="/docs/swing/distance_vs_t.svg" alt="Traveled distance as a function of jumping time" width="500" class="center" />
</figure>
<p>Let’s remember that we’re interested in the optimal jumping time $t^*$ for a given maximum time $T$. To find it we just need to fix a time $T$ and find the $t^* < T$ at which the distance $d(t)$ is maximized. I did that numerically and plotted the results in the next image.</p>
<figure>
<img src="/docs/swing/optimal_time.svg" alt="Optimal jumping time" width="500" class="center" />
</figure>
<p>Of course, the optimal jumping time follows a ladder-like curve. This is because you’re not interested in jumping backward, and sometimes it’s better to jump a few seconds before $T$ than to wait until $T$ and find yourself in a worse position.</p>
<p>Finally, we can also get the maximum traveled distance as a function of $T$.</p>
<figure>
<img src="/docs/swing/max_distance.svg" alt="Traveled distance as a function of jumping time" width="500" class="center" />
</figure>
<p>For example, if $T=20\,\mathrm{s}$, which seems like a reasonable value to make the sport interesting, one would expect to achieve $d\approx 2\,\mathrm{m}$.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Well, that’s all for today. In this post, I’ve presented a new Olympic sport that consists of pumping a swing for a given amount of time, then jumping and trying to achieve the maximum distance. With the parameters used in my simulations, I would expect the world record to be around two meters.</p>
<p>The analysis presented here is full of simplifications. Here I list some of the ones I’m aware of:</p>
<ul>
<li>The swinger model is oversimplified. For instance, authors in <sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">5</a></sup> present a model which is more accurate than the one used here. However, I wanted to keep the analysis “simple”.</li>
<li>The swinger is assumed to operate in the regime of small $\phi$, which allows us to use an analytical equation for $\phi(t)$. However, an experienced swinger could achieve large oscillation angles in a short amount of time, and then our simplification wouldn’t be valid anymore.</li>
<li>As a good physicist I’ve neglected any kind of friction (swing-rod, swinger-air, etc.).</li>
</ul>
<p>Even with all these simplifications, I think the analysis still sheds some light on the problem of maximizing the flight distance. The only step missing now is the experimental one: go to a park and try to beat the theoretical maximum distance. According to my numbers, you shouldn’t be able to beat the 2-meter mark.</p>
<p>All of this reminds me of the anecdote of the mathematician and his wife moving a sofa. The mathematician spent a lot of time computing whether it was possible to move the sofa from one room to the other, and finally proved it was impossible. Then he went to show it to his wife, who had already moved the sofa to the other room. So I’m pretty sure that it’s going to be possible to beat my theoretical maximum distance. Unfortunately, I don’t have a swing near me right now, so I’ll have to wait until the next visit to the park.</p>
<hr />
<h1 id="appendix-1-optimal-distribution-of-masses">Appendix 1: optimal distribution of masses</h1>
<p>Some days after publishing this post I started wondering which combination of masses $m_1$, $m_2$, and $m_3$ allowed for the best results, ie what a Swing Jumping world champion should look like. To do so I fixed all the parameters except the masses, and I imposed $m_1 + m_2 + m_3 = 1$, since the behavior of the system is independent of the total mass $M$.</p>
<p>In the next plot we see how the maximum distance depends on $m_2$ and $m_3$. The red star marks the combination of masses that maximizes the distance.</p>
<figure>
<img src="/docs/swing/distance_vs_masses.png" alt="$m1$ & $m2$ vs max distance" width="500" class="center" />
</figure>
<p>The best combination of masses is $m_1=0$, $m_2 \approx 0.625$, and $m_3 \approx 0.375$. Of course, this is not a feasible solution. Setting a minimum value of $m_1 > 0.1$ we get a different optimal distribution of masses, ie $m_1\approx 0.2$, $m_2 \approx 0.5$, and $m_3 \approx 0.3$. So we see that the optimal solution is always to minimize $m_1$.</p>
<figure>
<img src="/docs/swing/distance_vs_masses_clipped.png" alt="$m1$ & $m2$ vs max distance for a clipped value of $m1$" width="500" class="center" />
</figure>
<p>With these new masses the maximum distance is around $3\,\mathrm{m}$, which is considerably higher than our first result.</p>
<p>We could also analyze the combination of lengths $l_*$ and masses $m_*$ that maximizes the distance; however, I don’t think it would add much value to the study, so I’ll leave the analysis as it is.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:4" role="doc-endnote">
<p><a href="https://jasmcole.com/2021/02/07/swing-and-a-miss/">https://jasmcole.com/2021/02/07/swing-and-a-miss/</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p><a href="https://arxiv.org/abs/1208.4355">How far can Tarzan jump?, Hiroyuki Shima</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:1" role="doc-endnote">
<p>Pumping on a Swing, Peter L. Tea and Harold Falk, American Journal of Physics, 36, 1165 (1968) <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The pumping of a swing from the seated position, William B. Case and Mark A. Swanson, American Journal of Physics, 58, 463 (1990) <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Initial phase and frequency modulations of pumping a playground swing, Chiaki Hirata et al, (2023) <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
</ol>
</div>Discussion on HackerNews. Some people pointed out some flaws in my modelling (eg: assuming zero distance from swing to floor) which I’ve tried to fix. The original maximum distance estimation was around $1m$.Analyzing Gender Gap in Chess2023-08-12T00:00:00+00:002023-08-12T00:00:00+00:00https://alexmolas.com/2023/08/12/chess-gender-gap<p>Imagine a race in which 10,000 men and women participate. Before starting the race a random color is assigned to each participant regardless of shape, gender, race, or whatever. There are 9,000 red runners and 1,000 blue runners. If you had to bet, you would probably bet on a red runner, not because you know they are better runners, but because you know they are more likely to win. A similar situation occurs with the gender gap in chess. In this post, I will show that the domination of men in chess can be largely explained by a matter of participation.</p>
<p>While other factors like cultural and societal biases may be at play, the numbers don’t lie. I hope that by shedding light and data on this issue, we can start a conversation about how to promote greater gender diversity and inclusivity in the world of chess.</p>
<h1 id="tldr">tl;dr</h1>
<p>Using data from FIDE and using the methods defined in this <a href="https://cognition.aau.at/download/Publikationen/Bilalic/Bilalic_etal_2009.pdf">paper</a> I:</p>
<ul>
<li>extend the analysis of the paper to all the available countries. You can find all the code and results <a href="https://github.com/alexmolas/alexmolas.github.io/blob/master/notebooks/chess-gender-gap">here</a>.</li>
<li>show that the gender gap in chess can be largely explained as a statistical artifact due to participation differences. In more than $90\%$ of the countries, the difference between the top male and the top female players can be explained by the difference in participation.</li>
<li>explain that even if participation imbalance can explain some of the Elo rating differences, it’s not enough to explain all of them. This indicates that we still have a lot of work to do to reduce these differences and give equal opportunities to everyone.</li>
<li>show that in the paper some math approximations can be improved. I suggest using another approach that gives more accurate results.</li>
<li>show that making normal approximations when you want to compare extreme events is not a good idea. A lot of things are normal around the middle, but not in the tails.</li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>The usual readers of the blog know that I love analyzing games from a statistical/math point of view (<a href="https://www.alexmolas.com/blog/counterintuitive-coin-game/">this</a> and <a href="https://www.alexmolas.com/blog/continuous-blackjack-i/">this</a>). I like also playing chess (<a href="https://www.alexmolas.com/blog/chess-960-initial-position/">this</a> and <a href="https://www.alexmolas.com/blog/mate-with-only-pawns/">this</a>), so today’s post is going to be focused on analyzing some chess data.</p>
<p>The other day -a few months ago- I was watching the World Chess Cup (WCC) and I started wondering why no woman has ever played in a WCC. The highest rating ever achieved by a woman is 2735 by Judit Polgár. Achieving such an Elo is at the hand of very few, however, the highest Elo ever achieved by a man is 2853, around 100 points more, and this difference is not negligible. In other words, Judit Polgár at her peak would be ranked 17th in today’s global ranking. And currently, the highest-rated woman, Hou Yifan, is ranked 55th.</p>
<p>All of this made me start thinking about these differences, and how can they be explained. I’m not a big fan of theories that rely on “intrinsic” differences between men and women to explain this kind of situation. I believe that these differences usually have their roots in sociology and not in biology.</p>
<p>After some googling, I found a lot of explanations about this gap (sociological, biological, theological, etc.), but one of them seemed really interesting. It basically said that there’s no real gap between genders in chess; it’s just a statistical artifact. The idea is that the gap between top players can be completely explained by the imbalance between the numbers of women and men playing chess. Two sources support this theory (<a href="https://cognition.aau.at/download/Publikationen/Bilalic/Bilalic_etal_2009.pdf">Bilalic et al.</a> and <a href="https://en.chessbase.com/post/what-gender-gap-in-chess">Chessbase</a>), but both only study data from one country (Germany and India, respectively), so I wondered whether their findings apply to other countries or were just isolated cases. The Chessbase post follows the same ideas as the paper by Bilalic et al.</p>
<p>In this blog post, I’ll review the paper by Bilalic et al. and apply their methods to more countries. I’ll also propose other approaches that I believe are better suited for this kind of data. All the code and results can be reproduced using this <a href="https://github.com/alexmolas/alexmolas.github.io/blob/master/notebooks/chess-gender-gap/gender-gap.ipynb">notebook</a>.</p>
<h1 id="methods">Methods</h1>
<h2 id="data">Data</h2>
<p>Data was downloaded from <a href="https://ratings.fide.com/download_lists.phtml">FIDE website</a>. I removed underage players (born after 2005) and inactive players. After cleaning the data I ended up with $114580$ players ($94\%$ male and $6\%$ female), with an average Elo=$1712 \pm 321$.</p>
<h2 id="maths">Maths</h2>
<p>In this section, we will dive into the mathematical approach Bilalic et al. used to analyze the gender gap in chess. The methodology is based on the idea that the more players in a group, the higher the chance of having a top-performing player. By comparing the expected and actual differences between the top male and female players, the analysis can determine whether the gender gap is due to participation imbalance or other factors. The expected ranking is computed using a formula that takes into account the number of players and the distribution of their ratings. This methodology is grounded in statistical theory and provides a rigorous way to analyze the gender gap in chess.</p>
<p>Concretely, the implementation goes as this</p>
<ol>
<li>Compute the total number of male and female players, $n_{m}$ and $n_{f}$.</li>
<li>Fit the global rating distribution to a normal distribution and obtain $\mu$ and $\sigma$.</li>
<li>Compute the expected $k$-th highest value after drawing $n_{m}$ and $n_{f}$ samples from $\mathcal{N}(\mu, \sigma)$, this is $\hat{E}(k, n_{m})$ and $\hat{E}(k, n_{f})$ respectively. Define $\Delta \hat{E}(k) = \hat{E}(k, n_{m}) - \hat{E}(k, n_{f})$ as the expected difference between the $k$-th top male and female players.</li>
<li>Compute the actual difference between the $k$-th male and female players $\Delta E(k)$.</li>
<li>Compare $\Delta E(k)$ and $\Delta \hat{E}(k)$. If the difference between ratings is caused by the participation imbalance, one would expect these values to be similar.</li>
</ol>
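<p>The five steps can be sketched end-to-end on synthetic data. Everything below is illustrative: the ratings are fabricated normal draws, the group sizes are invented, and $\hat{E}(1, n)$ is estimated by direct Monte-Carlo simulation rather than with a closed-form approximation.</p>

```python
import random
import statistics

rng = random.Random(42)

# Step 0: fabricate a rating pool (a synthetic stand-in for the FIDE data).
ratings = [rng.gauss(1712, 321) for _ in range(20000)]
n_m, n_f = 18000, 2000                                              # step 1 (invented sizes)
mu, sigma = statistics.fmean(ratings), statistics.pstdev(ratings)   # step 2: normal fit

def expected_top(n: int, n_experiments: int = 50) -> float:
    """Step 3 (k = 1): Monte-Carlo expected maximum of n draws from N(mu, sigma)."""
    return statistics.fmean(
        max(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(n_experiments)
    )

# Steps 3-4: expected top-1 gap caused purely by group size.
delta_hat = expected_top(n_m) - expected_top(n_f)
# Step 5 would compare delta_hat with the observed top-1 difference.
```

<p>Note that <code>delta_hat</code> is positive even though both groups are drawn from the same distribution: the larger group simply gets more chances at an extreme value.</p>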
<p>To compute the expected ranking they use the following formula</p>
\[E_{\text{Bilalic}}(k, n) \approx (\mu + c_1 \sigma) + c_2 \sigma \frac{n!}{(n-k)!n^k} (\log n - H(k-1))\]
<p>where $c_1 = 1.25$, $c_2 = 0.287$, and $H(k)$ is the $k$-th harmonic number. The formula relies on some assumptions such as normality and big numbers, and its derivation can be found in the original paper.</p>
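<p>For reference, here is a direct transcription of this approximation (the function name is mine; <code>math.lgamma</code> evaluates $n!/(n-k)!$ in log-space, since $n!$ overflows for the values of $n$ we deal with):</p>

```python
import math

def expected_elo_bilalic(k: int, n: int, mu: float, sigma: float,
                         c1: float = 1.25, c2: float = 0.287) -> float:
    """Bilalic et al. approximation for the expected k-th highest of n draws."""
    # n! / ((n-k)! * n^k), computed in log-space to avoid overflow.
    log_ratio = math.lgamma(n + 1) - math.lgamma(n - k + 1) - k * math.log(n)
    harmonic = sum(1 / i for i in range(1, k))  # H(k-1), with H(0) = 0
    return (mu + c1 * sigma) + c2 * sigma * math.exp(log_ratio) * (math.log(n) - harmonic)
```

<p>For $k=1$ the combinatorial factor reduces to $1$ and the formula collapses to $\mu + c_1\sigma + c_2\sigma\log n$.</p>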
<h3 id="criticism">Criticism</h3>
<p>I have some criticisms of the methodology explained in the last section, based on two points: (1) the formula for $E(k, n)$ doesn’t give accurate results, and (2) the normal approximation is not a good model.</p>
<p>I explained the first point in this <a href="/2023/08/12/gaussian-order-statistics.html">post</a>. Since this criticism doesn’t apply directly to this post I won’t talk more about it.</p>
<p>On the other hand, we have the assumption of normality. In the original paper, it doesn’t seem like a bad decision. In the figure below we see the Elo distribution for German players, and the normal fit seems good enough.</p>
<figure>
<img src="/docs/chess-gender-gap/german-distribution.png" alt="german-distribution" width="300" class="center" />
<figcaption class="center">The distribution of the German chess rating with the best-fit normal curve superimposed. $n = 120399$, $\mu = 1461$, $\sigma=342$, $16 : 1$ men to women ratio. Figure from Bilalic et al.</figcaption>
</figure>
<p>However, for our use case it is not a good idea to use a normal distribution. There’s a saying that states <em><a href="https://twitter.com/ProbFact/status/1640809801671233544">many things are normal around the middle but not in the tails</a></em>, and we’re particularly interested in what happens in the tails, because we want to compare the ratings of the best male and female players. Assuming normality is therefore particularly risky here.</p>
<p>Moreover, if we look at the rating distribution of other countries, it’s obvious that it doesn’t always follow a Gaussian. Here we see the distribution for India</p>
<figure>
<img src="/docs/chess-gender-gap/india-distribution.png" alt="indian-distribution" width="300" class="center" />
<figcaption class="center">The distribution of the Indian chess rating</figcaption>
</figure>
<p>Following these two arguments, I’ve decided to drop the normality assumption. Instead, I’ll use bootstrapping when needed, ie to compute the expected $k$-th highest rating after drawing $n$ values, I’ll draw $n$ values with replacement from the actual data and average the $k$-th highest rating over many experiments. This can be done with the following method</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from typing import Sequence, Tuple

def expected_elo_bootstrapping(n: int,
                               k: int,
                               ratings: Sequence[float],
                               n_experiments: int = 100) -&gt; Tuple[float, float]:
    # Draw n ratings with replacement, n_experiments times.
    sample = np.random.choice(ratings, size=(n_experiments, n))
    sorted_sample = np.sort(sample)
    # Column -k holds the k-th highest rating of each experiment.
    k_mean = np.mean(sorted_sample[:, -k])
    k_std = np.std(sorted_sample[:, -k])
    return k_mean, k_std
</code></pre></div></div>
<h1 id="results">Results</h1>
<h2 id="difference-between-the-best-players">Difference between the best players</h2>
<p>Now let’s study the difference between the best male and female players for each country. To do so I computed the actual difference between the best players of each country and the expected difference due to gender imbalance (using bootstrapping). Notice that the expected difference is computed without using the gender data, to compute it we only need to know how many players are in each group.</p>
<p>The results for each country are plotted below. The blue dot is the simulated difference between the top male and female (with an error bar of 2 standard deviations, ie 95%), and the black dot is the actual difference between the top players. If the actual difference is inside the error bar it means that there’s more than a $5\%$ probability that the difference can be explained by chance.</p>
<div style="text-align:center">
<embed type="text/html" src="/docs/chess-gender-gap/countries.html" width="550" height="1100" />
</div>
<p>In around $92\%$ of the countries, the difference can be explained by participation rates alone, ie the gender gap is compatible with chance. I find it interesting that even in Norway -the country of the current best chess player- the gender gap is explainable by participation imbalance.</p>
<p>However, it cannot be ignored that in most cases the observed difference is greater than the expected one. Imho this means that there’s still a lot of work to be done to reduce these differences and give equal opportunities to everyone.</p>
<h2 id="difference-between-all-top-players">Difference between all top players</h2>
<p>In the last section, we compared the results of the top 1 players, but it can be generalized to the top $k$ players in each country.</p>
<p>In the following plots, I show the difference between the top-$k$ players in different countries. The x-axis is the rank of the players, the white square is the actual difference between the players with that rank, and the black square is the expected difference computed via bootstrapping.</p>
<p>The case of India is particularly interesting, since the expected difference is much higher than the actual difference: Indian women are playing much better than expected, which I find completely amazing.</p>
<figure>
<img src="/docs/chess-gender-gap/rank-vs-difference-IND.png" alt="ind-rank-vs-difference" width="600" class="center" />
<figcaption class="center">Expected difference and actual difference for each rank $k$ in India</figcaption>
</figure>
<p>In the case of Israel, all the differences can be clearly explained just by population imbalance. This is the country where it is clearest that the difference is just a statistical artifact.</p>
<figure>
<img src="/docs/chess-gender-gap/rank-vs-difference-ISR.png" alt="isr-rank-vs-difference" width="600" class="center" />
<figcaption class="center">Expected difference and actual difference for each rank $k$ in Israel</figcaption>
</figure>
<p>Then we have cases like Spain, where the difference can’t be explained by participation rates alone. In these cases, one would need to study the sociological situation of the country in more depth to understand why women are not developing their chess skills as expected.</p>
<figure>
<img src="/docs/chess-gender-gap/rank-vs-difference-ESP.png" alt="esp-rank-vs-difference" width="600" class="center" />
<figcaption class="center">Expected difference and actual difference for each rank $k$ in Spain</figcaption>
</figure>
<p>If you’re interested in other countries you can use the code <a href="https://github.com/alexmolas/alexmolas.github.io/blob/master/notebooks/chess-gender-gap">here</a> to generate the corresponding plots.</p>
<h1 id="conclusions">Conclusions</h1>
<p>In conclusion, the gender gap in chess can be largely explained as a statistical artifact due to the difference in the number of players between genders. By analyzing the data from more than 100 countries, we found that in more than 90% of them, the difference between the top male and female players can be explained by chance. This is contrary to popular beliefs, which attribute the gap to intrinsic differences between men and women. Our analysis is based on a rigorous mathematical approach, which involves using bootstrapping for computing the expected highest ranking for male and female players. We also highlighted a few issues with previous studies that attempted to address this topic, and we proposed an alternative approach that provides more accurate results.</p>
<p>However, it’s also clear from the analysis that the participation imbalance can’t be held as the only cause of the gender gap in chess. As in any other sociological situation, the root cause is usually complex and can’t be explained by a single reason. Imho it’s important to avoid reductionism when dealing with this kind of problem, since the first step to achieving a good answer is to ask the correct question. In this case, the correct question is not “How do we make more women participate in chess?”, as if answering it would close the gender gap in chess. There’s more to it than just the number of players. We need to look at many things to understand why there’s a difference: how society thinks, how kids are introduced to chess when they’re young, and who they see as examples. People often think chess is mostly for guys, and that affects things too. It’s important to realize that all these factors mix together, and addressing them is necessary for making chess open to everyone. If we change how people think and make sure everyone feels welcome, we can start to fix the difference between men and women in chess. As we’ve seen, there are some countries where the gender gap can be explained by participation imbalance alone, which means that the other problems can be solved if tackled properly. In the end, it’s about talent, not just being a guy or a girl.</p>Expected $k$ highest value from $n$ Gaussian draws.2023-08-12T00:00:00+00:002023-08-12T00:00:00+00:00https://alexmolas.com/2023/08/12/gaussian-order-statistics<p>The other day I was writing about the chess gender gap, and at some point I had to compute the expected value of the $k$-th highest of $n$ samples drawn from a Gaussian distribution. The <a href="https://cognition.aau.at/download/Publikationen/Bilalic/Bilalic_etal_2009.pdf">paper</a> I was following used an approximation that felt a little weird to me, so I decided to dive deeper into the topic. Here are some notes on the results.</p>
<h2 id="bilalic">Bilalic</h2>
<p>The methodology followed by Bilalic is based on the idea that the more players in a group, the higher the chance of having a top-performing player. By comparing the expected and actual differences between the top male and female players, the analysis can determine whether the gender gap is due to participation imbalance or to other factors.</p>
<p>To compute the expected ranking they use the following formula</p>
\[E_{\text{Bilalic}}(k, n) \approx (\mu + c_1 \sigma) + c_2 \sigma \frac{n!}{(n-k)!n^k} (\log n - H(k-1))\]
<p>where $c_1 = 1.25$, $c_2 = 0.287$, and $H(k)$ is the $k$-th harmonic number. The formula relies on some assumptions such as normality and big numbers, and its derivation can be found in the original paper.</p>
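<p>A direct implementation of this approximation is straightforward. Here is a sketch (the function names and the stable-product evaluation of $n!/((n-k)!\,n^k)$ are my own choices; the default $\mu$ and $\sigma$ are the values used later in the comparison):</p>

```python
from math import log

def harmonic(k):
    """k-th harmonic number, with H(0) = 0."""
    return sum(1.0 / i for i in range(1, k + 1))

def e_bilalic(k, n, mu=1800, sigma=300, c1=1.25, c2=0.287):
    """Bilalic approximation for the expected k-th highest of n Gaussian draws."""
    # n! / ((n-k)! n^k) computed as a running product to avoid huge factorials
    ratio = 1.0
    for i in range(k):
        ratio *= (n - i) / n
    return (mu + c1 * sigma) + c2 * sigma * ratio * (log(n) - harmonic(k - 1))
```

<p>For $n=10000$, $\mu=1800$, $\sigma=300$ this gives roughly $2968$ for $k=1$, a plausible value for the maximum of $10^4$ Gaussian draws.</p>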
<h2 id="blom">Blom</h2>
<p>At first, I was surprised that the author of the paper had to derive a formula for the expected $k$-th highest value after $n$ draws from a normal distribution, since my intuition told me that this should be a more or less well-known result. After some googling I discovered this <a href="https://stats.stackexchange.com/a/9007/350686">answer</a>, which proposed the formula</p>
\[E_{\text{Blom}}(k, n) \approx \mu + \Phi^{-1} \left( \frac{k - \alpha}{n-2\alpha+1}\right)\sigma\]
<p>where $\alpha = 0.375$ and $\Phi^{-1}(x)$ is the inverse cumulative distribution function (also known as the quantile function) of the normal distribution.</p>
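<p>Since Python 3.8 the standard library ships $\Phi^{-1}$ as <code class="language-plaintext highlighter-rouge">statistics.NormalDist.inv_cdf</code>, so no scipy is needed. Note that Blom's approximation is usually stated for the $r$-th <em>smallest</em> order statistic; for the $k$-th highest I evaluate it at $r = n - k + 1$, which is my reading of the formula rather than something spelled out above:</p>

```python
from statistics import NormalDist

def e_blom(k, n, mu=1800, sigma=300, alpha=0.375):
    """Blom approximation for the expected k-th highest of n Gaussian draws."""
    r = n - k + 1  # k-th highest = (n - k + 1)-th smallest order statistic
    q = (r - alpha) / (n - 2 * alpha + 1)
    return mu + NormalDist().inv_cdf(q) * sigma
```

<p>A sanity check: for odd $n$ the middle order statistic should land on the mean, and the expected values should decrease as $k$ grows.</p>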
<h2 id="comparison">Comparison</h2>
<p>It’s clear that the two formulas look very different, so the natural question is whether they agree and which one gives better results. To find out I’ve done a simple experiment: given a normal distribution $\mathcal{N}(\mu, \sigma)$, take $n$ draws from $\mathcal{N}$, record the $k$-th highest value, and average it over many repetitions. Then compute the Blom and the Bilalic values and plot them together with the simulated value. I’ve used $n=10000$, $\mu=1800$ and $\sigma=300$ since these are the typical values we are dealing with. The results are in the following plot</p>
<figure>
<img src="/docs/chess-gender-gap/bilalic-vs-blom.png" alt="blom-vs-bilalic" width="300" class="center" />
<figcaption class="center">Simulated $k$ top value after $n$ draws from a normal distribution and the corresponding Blom and Bilalic values.</figcaption>
</figure>
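<p>A minimal version of the simulation behind the plot, using only the standard library (the trial count and seed are arbitrary choices of mine):</p>

```python
import random

def simulated_kth_highest(k, n, mu=1800, sigma=300, trials=200, seed=42):
    """Average the k-th highest of n Gaussian draws over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        draws = sorted(rng.gauss(mu, sigma) for _ in range(n))
        total += draws[n - k]  # sorted ascending, so index n-k is the k-th highest
    return total / trials
```

<p>Plotting <code class="language-plaintext highlighter-rouge">simulated_kth_highest(k, 10000)</code> for a range of $k$ against both approximations reproduces the comparison above.</p>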
<p>It’s clear that the Blom formula gives better results for large values of $k$.</p>Good Science, Good Engineering.2023-08-01T00:00:00+00:002023-08-01T00:00:00+00:00https://alexmolas.com/2023/08/01/good-science-good-engineering<p>To do great science you need to do great engineering. I’ve said that a few times during the last few years, and I’ve thought about writing a post on it several times, but never had the time. This last week, with the superconducting material events, I found myself reflecting on it again, and taking advantage of my summer holidays I took some time to write about it.</p>
<p>Science and engineering are tightly coupled, but usually, we think that one follows the other. Scientists first discover laws about the universe and then engineers build things with these laws. However, I believe that engineering is a critical piece of science and that you can’t do great science if you don’t do great engineering. This might sound weird to you if you came from a scientific background. In my case, I studied physics, and the idea I got during my degree was that physics is altruistic, solely seeking knowledge, while engineering is capitalistic, using that knowledge to generate income (notice how engineers are not even included in this <a href="https://xkcd.com/435/">comic</a>).</p>
<p>However, this view changed after I started working as a data scientist, where -surprise surprise- I had to do science with data. There I discovered that without following best engineering practices it was almost impossible to get good scientific results. Sure, I could run analyses and experiments on my local laptop with hardcoded parameters, poor code quality, and no version control system. But replicating those results became a pain after a few days. Then, after much pain and tears, I realized that to do great science I needed to stick to great engineering practices.</p>
<p>You might think this only applies to data science or computer engineering, but it’s not the case. Let’s study the biggest experiment ever: CERN. I think everyone agrees that CERN is an incredible engineering project. Without great engineering, it would be impossible to accelerate particles almost to the speed of light. Also, high-energy results are usually reported at a confidence level of $5 \sigma$ or more, which means that experiments need to be highly replicable. Without excellent engineering work, it would be impossible to do so.</p>
<p>Another example of good engineering practices is the creation of the COVID vaccines. The first version of the vaccine was designed only a few days after the virus genome was published. This would have been impossible without a great engineering platform. Also, all the tests that were run to determine the safety of the vaccines were an example of good engineering practices. If the vaccine discovery process had not been designed following the best practices, we would still be waiting for the vaccine, locked up in our houses.</p>
<p>Basically, all I’ve said can be derived from the definition of the scientific method, which roughly consists of four steps: (1) observation, (2) hypothesis, (3) experiments, and (4) analysis. Steps (1) and (2) don’t require reproducibility; you can conceive a hypothesis while skiing (like <a href="https://www.forbes.com/sites/chadorzel/2018/02/06/why-vacations-are-essential-for-physics/">Schrödinger</a>) or even while sleeping (like <a href="https://thesublimeblog.org/2022/08/09/it-came-to-me-in-a-dream-the-intuitive-mathematician-srinivasa-ramanujan/">Ramanujan</a>). No one will ask how you arrived at your hypothesis. Preliminary experiments can be done with less concern about result replicability, with the objective of hypothesis generation. However, once a hypothesis is generated, steps (3) and (4) must be reproducible almost by definition, and achieving reproducibility necessitates adhering to sound engineering practices. Essentially, engineering involves designing processes to achieve specific goals. So, to carry out steps (3) and (4), you need to embrace engineering principles. While serendipity is a huge driver of science, it’s always followed by a meticulous scientific process.</p>
<p>IMHO, this is what has failed with the current superconducting papers and reviews. The authors had to rush to publish their work to avoid having the credit taken from them (apparently one of the authors decided to publish the paper without the others knowing). This lack of time made it impossible for the authors to clarify the exact methods they followed to generate the superconducting material. Also, the authors said they only succeed around 10% of the time in creating this material, which makes the method even harder to explain. This is why the replications of the experiment in other laboratories are giving such different results. I’m not blaming the authors for that, probably in the same situation I’d have done exactly the same; I’m only using this situation to highlight why good engineering practices are important for doing science.</p>
<p>My objective with this rant is to challenge the idea that science somehow surpasses engineering, a misconception I encountered all too often during my study of physics, where it wasn’t uncommon to see physicists looking down on engineers. For those who deal with engineering processes daily, all this may sound obvious; however, I know a bunch of people doing “pure” science who will grind their teeth at this perspective. The truth is, both fields are intrinsically interconnected and rely on each other’s strengths for progress. I hope this text can help some “science hooligans” appreciate the need for and beauty of engineering and stop looking down on our science-applied colleagues.</p>You’re the best at something.2023-07-25T00:00:00+00:002023-07-25T00:00:00+00:00https://alexmolas.com/2023/07/25/best-at-something<hr />
<p><strong>Disclaimer</strong>. I wrote this text as a funny mathematical exercise. It’s not meant to be a rigorous derivation. It’s just some “motivational” text inspired by maths, similar to the motivational equation $1.01^{365} \approx 37$. <br />
I’m saying that because every time I write a text like that some people take it literally and say things like “This text doesn’t make any sense, please go back to school and learn some maths”.</p>
<p>Don’t worry about the haters, let them be lost in the fog of confusion while we bask in the glory of math humor.</p>
<hr />
<p>In machine learning, when dealing with a lot of features an undesired phenomenon arises: the feared “curse of dimensionality.” One of the effects of this curse is that as the dimension increases the volume of a sphere tends to concentrate near the boundary. This can be a problem when doing statistics, but in this text I’ll show how it is a blessing for you, since it means that when it comes to skills you’re unique, like a snowflake in a blizzard of possibilities.</p>
<p>Let’s start with some assumptions</p>
<ul>
<li>A person is defined by $N$ different traits. From my experience it’s fair to assume that $N$ is very big; let’s set $N=500$ for the moment.</li>
<li>We can assign to each skill a value between $0$ and $1$, and for the sake of simplicity assume also that the distribution of these values is uniform (it would probably be better to assume a Gaussian or a fat-tailed distribution, but let’s keep it simple).</li>
<li>Skills values are independent.</li>
<li>When you’re born you get assigned a value at random for each different trait.</li>
<li>We say that you are very good at some skill if your value for that trait is bigger than that of $99\%$ of the population.</li>
</ul>
<p>With these assumptions, the question we ask is what is the probability of being the best at something?</p>
<p>Since each trait is independent, we have a probability $p=1-0.99=0.01$ of being very good at any given trait. Since we have $N$ traits, the probability of being very good at at least one of them is given by</p>
\[1 - (1 - p)^N = 0.9934\]
<p>which means that with probability of almost $1$ you’re one of the best in the world at at least one thing, which is pretty cool.</p>
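<p>You can check this number with a couple of lines of Python:</p>

```python
# Probability of being in the top 1% for at least one of N independent traits
p, N = 0.01, 500
print(1 - (1 - p) ** N)  # ≈ 0.9934
```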
<p>However, you may be better than the rest at more than one thing. For example, you can be a good mathematician and a good football player (like <a href="https://en.wikipedia.org/wiki/Harald_Bohr">Harald Bohr</a>). The probability of being better than the rest at exactly $M$ things is given by the binomial distribution $B(M;N,p) = \binom{N}{M} p^M(1-p)^{N-M}$ . But we are interested in the probability of being better than the rest at $M$ or more traits. We can get this value by summing over all values $\geq M$ , ie</p>
\[\sum_{i=M}^{N} B(i;N,p) = 1 - \sum_{i=0}^{M-1} B(i;N,p)\]
<p>where we have used that the sum of $B(i;N,p)$ over all $i$ from $0$ to $N$ must add up to $1$.</p>
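<p>This tail sum is easy to evaluate exactly with the standard library (the function name is mine; the defaults are the $N$ and $p$ from above):</p>

```python
from math import comb

def prob_at_least(M, N=500, p=0.01):
    """Probability of being in the top 1% for M or more of N independent traits."""
    # 1 minus the binomial CDF up to M-1
    return 1.0 - sum(comb(N, i) * p**i * (1 - p) ** (N - i) for i in range(M))
```

<p>For example, <code class="language-plaintext highlighter-rouge">prob_at_least(1)</code> recovers the $0.9934$ from before, and <code class="language-plaintext highlighter-rouge">prob_at_least(3)</code> is still above $0.87$.</p>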
<p>If we compute this probability for different values of $M$ we get the following curve</p>
<figure>
<img src="/docs/best-at-something/distribution.svg" alt="Probability of being better than the other on $M$ or more traits is" width="500" class="center" />
</figure>
<p>In the image, we see there’s a decent chance of being the best at at least $2$ or $3$ things out of the $500$ possibilities. Maybe it doesn’t sound impressive, but now we’ll see that your specific combination of skills is something rare.</p>
<p>The binomial coefficient $\binom{n}{m} = \frac{n!}{m!(n-m)!}$ counts the number of ways, disregarding order, that $m$ objects can be chosen from among $n$ objects. In our case we can use it to compute the number of different combinations of being the best at $3$ things out of $500$ possibilities, ie $\binom{500}{3} \approx 2 \times 10^{7}$ . This means that there are about twenty million different combinations of $3$ skills, which implies that your specific combination of skills is pretty unique. Only one out of twenty million people has the same combination as you, this is, only a few hundred people in the world.</p>
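<p>The count itself is one stdlib call:</p>

```python
from math import comb

# Number of ways to choose the 3 traits you excel at, out of 500
print(comb(500, 3))  # 20708500, i.e. about 2 x 10^7
```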
<p>The next question to answer is “So what?”. What can you do now with this information? I’m afraid not much. Maybe you’re the fastest at multiplying big numbers, the best at cleaning kitchens, and the best at calming horses. What to do with this super-specific set of skills is up to you, but if you can find something that combines all of them and pays, you’re going to be very rich. The problem is, of course, to find a way to monetize or do something useful with this set of skills.</p>
<p>However, even if you can’t get money out of your skills it’s cool to know that you’re not just another one in the world. You’re one of the few members of an exclusive club of “Horse-Calming, Kitchen-Cleaning, Big-Multiplying.” Move over, Avengers!</p>
<p>Finally, let me clarify that this text makes a lot of assumptions that do not hold in reality, and I’m aware of that. For example, it’s known that skills are not independent of each other, eg: if you’re good at maths you’re probably going to be good at physics too. Also, I’m assuming that your talents are fixed when you are born and that you can’t do anything to improve them, which is of course false. Also, I decided to set the number of skills $N=500$ using my intuition, which has been proven in the past to be not very accurate. If you know of resources about this topic that point to a better $N$ I would be happy to read them and update the post!</p>512KB Club2023-07-20T00:00:00+00:002023-07-20T00:00:00+00:00https://alexmolas.com/2023/07/20/512KB-club<p>Quick blog update. Since last Tuesday (<code class="language-plaintext highlighter-rouge">2023-07-18</code>) I’m a proud member of the <a href="https://512kb.club/">512KB club</a>. I am aware it doesn’t sound very ambitious, but it serves as a small reward after having completely <a href="/2023/07/13/ugly-blog.html">refactored my blog</a>.</p>
<p>To access this club you only need to check that your uncompressed blog size is less than 512KB. You can do it using <a href="https://gtmetrix.com/reports/alexmolas.com/P69MCIQN/">GTMetrix</a>. Then you only need to open a <a href="https://github.com/kevquirk/512kb.club/pull/1224">PR</a> with the correct format, and in a couple of days your site will be added to the official list.</p>
<p>And then you can add a badge like this one to your page</p>
<p><a href="https://512kb.club"><img src="https://512kb.club/assets/images/orange-team.svg" alt="a proud member of the green team of 512KB club" class="center" /></a></p>
<p>You can add it even if you’re not on the list, I don’t think the police will come after you for that.</p>Automate your static blogroll.2023-07-20T00:00:00+00:002023-07-20T00:00:00+00:00https://alexmolas.com/2023/07/20/automatic-blogroll<p>During the <a href="/2023/07/13/ugly-blog.html">recent refactor of my blog</a> I implemented an idea that I had been thinking about for a long time: an <a href="https://www.alexmolas.com/blogroll">automatic blogroll</a>. It actually started as an idea of building a custom RSS aggregator for myself, but at some point I decided to make it public and visible on my blog.</p>
<p>Here’s why I like my implementation</p>
<ol>
<li>It’s free! Everything runs on GitHub (Pages + Actions), so it’s completely free for me.</li>
<li>The blogroll gets updated every 6 hours. I could make it every 15 minutes, but I don’t see the need for that.</li>
<li>Everything is static. No need for databases, APIs, Docker, Kubernetes, and all the hell of modern websites. Everything is stored in HTML files that are updated every 6 hours.</li>
<li>Instead of checking Feedly every morning to see if there are new posts on my favourite blogs, I just check my own website. Become self-sufficient or die.</li>
</ol>
<p>In case you’re wondering, here’s how I did it. The code can be found <a href="https://github.com/alexmolas/alexmolas.github.io/blob/master/_tools/build_rss.py">here</a> and <a href="https://github.com/alexmolas/alexmolas.github.io/blob/master/.github/workflows/rss.yml">here</a>.</p>
<h2 id="python-script">Python Script</h2>
<p>First of all I wrote a script that receives a list of websites and retrieves the latest posts for each site. The idea is basically</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># build_blogroll.py
</span><span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="c1"># Read websites from websites.txt
</span> <span class="n">websites</span> <span class="o">=</span> <span class="n">read_websites</span><span class="p">(</span><span class="s">"_tools/websites.txt"</span><span class="p">)</span>
<span class="c1"># Check for updates
</span> <span class="n">entries</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">website</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">websites</span><span class="p">):</span>
<span class="n">feed</span> <span class="o">=</span> <span class="n">feedparser</span><span class="p">.</span><span class="n">parse</span><span class="p">(</span><span class="n">website</span><span class="p">)</span>
<span class="n">entries</span> <span class="o">+=</span> <span class="n">feed</span><span class="p">.</span><span class="n">entries</span>
<span class="n">write_html_with_updates</span><span class="p">(</span><span class="n">entries</span><span class="p">)</span>
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">write_html_with_updates</code> creates the file <code class="language-plaintext highlighter-rouge">_layouts/blogroll.html</code> with links to the latest posts of each site. To avoid having an infinite list I’m filtering out posts older than 30 days.</p>
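<p>The 30-day filter inside <code class="language-plaintext highlighter-rouge">write_html_with_updates</code> can be sketched like this. A minimal version, assuming each feed entry has already been paired with its published datetime; the helper name is hypothetical:</p>

```python
from datetime import datetime, timedelta, timezone

def recent_entries(dated_entries, days=30):
    """Keep only (published_datetime, entry) pairs from the last `days` days."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    return [entry for published, entry in dated_entries if published >= cutoff]
```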
<h2 id="github-pages--jekyll">GitHub Pages + Jekyll</h2>
<p>On the other hand, my blog is built using the standard combination of Jekyll + GitHub Pages. This means that to publish changes to my blog I only need to commit them to my <code class="language-plaintext highlighter-rouge">main</code> branch and then GitHub will publish them to <code class="language-plaintext highlighter-rouge">alexmolas.github.io</code> (which I’ve redirected to <code class="language-plaintext highlighter-rouge">alexmolas.com</code>).</p>
<p>This makes the writing experience very light, you only need to focus on writing your texts and GitHub takes care of everything else. Also, during the refactor of my blog I cut a lot of extra features, which makes the process of publishing even lighter.</p>
<p>So if I run <code class="language-plaintext highlighter-rouge">build_blogroll.py</code> and then commit <code class="language-plaintext highlighter-rouge">_layouts/blogroll.html</code> to <code class="language-plaintext highlighter-rouge">main</code> I’ll have the updated version of the blogroll. Now all that remains is to automate the process.</p>
<h2 id="github-actions">GitHub Actions</h2>
<p>GitHub Actions allow you to run custom code using GitHub infrastructure without you having to think about servers and so on. This means that I can run the script and then commit the changes using GitHub Actions. In particular, I only need to add this file <code class="language-plaintext highlighter-rouge">.github/workflows/rss.yml</code></p>
<div class="language-yml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">Blogroll Builder</span>
<span class="na">on</span><span class="pi">:</span>
<span class="na">schedule</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">cron</span><span class="pi">:</span> <span class="s2">"</span><span class="s">0</span><span class="nv"> </span><span class="s">*/6</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*"</span>
<span class="na">jobs</span><span class="pi">:</span>
<span class="na">build</span><span class="pi">:</span>
<span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
<span class="na">steps</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout repository</span>
<span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v2</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up Python</span>
<span class="na">uses</span><span class="pi">:</span> <span class="s">actions/setup-python@v2</span>
<span class="na">with</span><span class="pi">:</span>
<span class="na">python-version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.10"</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install dependencies</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">pip install -r requirements.txt</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Execute Blogroll Builder</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">python _tools/build_blogroll.py</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Commit and push changes</span>
<span class="na">env</span><span class="pi">:</span>
<span class="na">GITHUB_TOKEN</span><span class="pi">:</span> <span class="s">...</span>
<span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">git config --local user.email "user@email.com"</span>
<span class="s">git config --local user.name "GitHub Action"</span>
<span class="s">git add _layouts/blogroll.html</span>
<span class="s">git diff --quiet && git diff --staged --quiet || git commit -m "Update blogroll"</span>
<span class="s">git push</span>
</code></pre></div></div>
<p>and then GitHub will run <code class="language-plaintext highlighter-rouge">_tools/build_blogroll.py</code> and commit the updated <code class="language-plaintext highlighter-rouge">_layouts/blogroll.html</code> every 6 hours, and automagically the changes will be published in my blogroll.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>This is a first version of the code, and I know it doesn’t follow best practices and code standards, but it does the job. My idea is to use it as a base to build an open-source RSS aggregator that can be run using GitHub Actions and that’s completely free. If this sounds interesting to you, feel free to contact me and we can work on the project together :)</p>About math limitations.2023-07-18T00:00:00+00:002023-07-18T00:00:00+00:00https://alexmolas.com/2023/07/18/math-limitations<hr />
<p><em>This is an old note I took to myself a few months ago when I was learning about computability theory. It was just a couple of bullet points, so I rewrote them in a better format and added some images. Hope you enjoy it.</em></p>
<hr />
<p>During these last weeks I’ve been reading about computability theory. Of all the amazing things I learned, the ones that surprised me most are those related to the Halting Problem and Gödel’s Incompleteness theorems. The idea that the almighty maths have some limitations just made my head explode. I mean, I’ve been making heavy use of maths all my life, first during my Physics degree and later while doing Machine Learning, and they never failed me - usually it was me who failed maths. So the very idea that maths can have limitations was completely new to me. This first realization that maths have some intrinsic limitations - eg: you can’t compute all the Busy Beaver numbers - made me start wondering if there are other factors that limit maths. In this post I’ll write down some random ideas I got from thinking about this topic.</p>
<figure>
<img src="/docs/maths-limitations/xkcd-435.png" alt="figure-1" width="500" class="center" />
<figcaption class="center">Maths are usually classified as the purest area of knowledge. But even so, they are limited by less pure fields.</figcaption>
</figure>
<h1 id="maths-are-limited-by-maths">Maths are limited by maths</h1>
<p>As I said in the introduction, there are some intrinsic limitations of maths. In some of my posts I’ve been talking about the <a href="/2022/10/19/halting-problem.html">Halting Problem</a> and <a href="/2022/10/20/uncomputable-numbers.html">numbers that can’t be computed</a> where it’s shown that not everything is possible in maths.</p>
<h1 id="maths-are-limited-by-physics">Maths are limited by physics</h1>
<p>In the last section, we saw that maths are not as omnipotent as one could think. However, we can go further and say that maths are also limited by the universe we live in. The argument goes as follows</p>
<ol>
<li>Our brain is made of matter.</li>
<li>Matter follows physical laws.</li>
<li>Then, our brain follows physical laws.</li>
<li>We use our brain to generate mathematical ideas.</li>
<li>Then, the ideas we can generate are limited by physical laws.</li>
</ol>
<p>Basically, this means that all the mathematical knowledge we have nowadays has been generated using our limited-by-physics brains, so all this knowledge must be compatible with what our limited-by-physics brains can do, ie: maths are limited by the laws of our universe. Even mathematical ideas generated by computers need to follow physical laws, since computers (even quantum ones) follow the laws of our universe.</p>
<h1 id="maths-are-limited-by-biology">Maths are limited by biology</h1>
<p>I’ve already made two hot takes, so why stop here? Let’s go even further and make the whole mathematical community hate me. So far we’ve seen that since maths need a physical substrate to exist they should follow physical laws. Along the same lines, we can say that maths need a biological substrate to be generated, so the ideas we can produce are limited by how humanity has evolved. One could argue that the neural processes our brain can currently perform are what constrains the space of possible ideas.</p>
<p>Another limitation that biology imposes on maths is that we only live for a limited number of years. So, at some point, the amount of time you would need to reach the boundaries of math knowledge is going to be larger than the human lifespan.</p>
<h1 id="final-thoughts">Final thoughts</h1>
<p>Let’s stop here since I don’t want to wake up tomorrow and find a mob of mathematicians outside my house ready to burn me in the town square. Here are some final thoughts about this topic.</p>
<p>First of all, at some point while writing this post I wrote the following line</p>
<blockquote>
<p>There’s no pure knowledge, since the substrate where the knowledge lies is material and hence not pure.</p>
</blockquote>
<p>and suddenly I realized that someone else already said that a couple of years ago.</p>
<figure>
<img src="/docs/maths-limitations/platos.jpg" alt="figure-2" width="500" class="center" />
<figcaption class="center">Plato already warned us about knowledge limitations in his cave allegory.</figcaption>
</figure>
<p>Even if what I’ve just said has been already known for centuries it’s nice to see that my random ramblings are not that absurd. However, I’m aware that the topic of math limitations has been probably studied much more in depth -and with much more rigor- and that what I’ve just said is probably wrong. If you have some resource about this topic I would be grateful to read it.</p>
<p>Also, after writing this text I asked myself “ok, so what?” Do we care about maths that we can’t even comprehend? As I read somewhere <em>you don’t know what you don’t know</em>, although in this case it should be <em>you don’t know what you can’t know</em>. Maybe we just care about exploring ideas within the boundaries defined by our limited existence.</p>
<p>Furthermore, another thing I noticed is that to show that my theory is right I need to present an idea that can’t be generated by the laws of the universe, but by definition this is impossible since AFAIK I’m still in this universe. So, does the question “are there ideas that can’t be generated in the universe?” even matter?</p>
<p>Let me finish with a quote by Scott Aaronson that exemplifies very well how our limitations shape the way we see the world and the way we create ideas</p>
<blockquote>
<p>Indeed, one could define science as reason’s attempt to compensate for our inability to perceive big numbers. If we could run at 280,000,000 meters per second, there’d be no need for a special theory of relativity: it’d be obvious to everyone that the faster we go, the heavier and squatter we get, and the faster time elapses in the rest of the world. If we could live for 70,000,000 years, there’d be no theory of evolution, and <em>certainly</em> no creationism: we could watch speciation and adaptation with our eyes, instead of painstakingly reconstructing events from fossils and DNA. If we could bake bread at 20,000,000 degrees Kelvin, nuclear fusion would be not the esoteric domain of physicists but ordinary household knowledge. But we can’t do any of these things, and so we have science, to deduce about the gargantuan what we, with our infinitesimal faculties, will never sense. If people fear big numbers, is it any wonder that they fear science as well and turn for solace to the comforting smallness of mysticism?</p>
</blockquote>Nobody cares about your blog.2023-07-15T00:00:00+00:002023-07-15T00:00:00+00:00https://alexmolas.com/2023/07/15/nobody-cares-about-your-blog<p>I started writing on my blog some years ago, and since then I’ve had a lot of reasons to stop writing. Here’s a list of why nobody cares about your blog</p>
<ul>
<li>Your blog is not original. There are hundreds of blogs out there, what makes you think yours is different? You’re just probably repeating things you’ve read in another place.</li>
<li>You’re not an expert in your field, otherwise you wouldn’t be publishing in a blog, but writing papers and giving interviews.</li>
<li>You are only showing the world how stupid you are. If what you say is not better than silence, you better shut up.</li>
<li>If someone, at some point, cares about your blog, it will only be to criticize it. Your work is trash, and exposing it will make people notice you’re trash as well.</li>
</ul>
<p>But none of these things are a problem, because you shouldn’t care even a little bit about what others think. Here are my reasons why you should care about your blog</p>
<ul>
<li>You can use it as notes to your future self. After some years you’ll have a nice journal of how you have evolved over the years. Rereading your old texts is like communicating with your past self.</li>
<li>To release ideas that you have in your head and that you need to get out. Even if nobody else cares about them, writing them down can be a cathartic process.</li>
<li>To learn to write and express complex ideas. The best way to learn is to teach (even if nobody is reading you). As Paul Graham says <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, writing about something, even something you know well, usually shows you that you didn’t know it as well as you thought.</li>
<li>Even if the ideas you’re sharing with the world aren’t original you can enrich them with your personal view. As Bill Thurston said <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> all of us have a clear understanding of a few things and murky concepts of many more. There is no way to run out of ideas in need of clarification.</li>
<li>If you have lost time solving a super-specific problem then you should write about it; you may end up being someone’s hero some day. It can also happen that you’ll have the same problem again, and then you’ll be your own hero.</li>
<li>It’s cool to maintain a blog, even if it’s only from the technical perspective. The feeling of complete ownership over something is really fulfilling, even if it’s just some bytes on a remote server in this ethereal world.</li>
<li>You can say whatever the fuck you want. It’s your blog, you don’t need to follow any rules. I just cursed and you can’t do anything about it, because this is my blog and I do what I want. This will give you a sense of freedom that’s really cool imho.</li>
</ul>
<p>PS. I was just about to publish this post, and then I started to think “why should I publish this text? Who is going to waste their time to read this crap?”, but you know what? I just don’t care what you think, here’s my post and you can do nothing about it :)</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p><a href="http://paulgraham.com/words.html">Words</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="https://mathoverflow.net/a/44213">Mathoverflow</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>I started writing on my blog some years ago, and since then I’ve had a lot of reasons to stop writing. Here’s a list of why nobody cares about your blogWhy did I make my blog uglier?2023-07-13T00:00:00+00:002023-07-13T00:00:00+00:00https://alexmolas.com/2023/07/13/ugly-blog<p>Last night, I completely revamped my blog. This morning, I showed it to my wife and siblings, hoping they would appreciate the new look. However, none of them liked the updated aesthetics. Despite their opinions, I couldn’t care less. I absolutely adore the new appearance and here’s why:</p>
<ul>
<li><strong>Improved writing experience</strong>: my blog is built on Jekyll, and every time I made a change, it took about a minute to refresh the page. This slow and tedious process hindered my writing flow. However, now the refresh time is less than a second, enabling a much faster and smoother writing experience.</li>
<li><strong>I just have what I need</strong>. Before, I had a navigation bar, a search box, a light/dark button, and a lot of other things I wasn’t using. Maintaining all these extras became burdensome and slowed down any changes I wanted to make. Now I just have a couple of html templates and that’s all. This gives my blog a sensation of solidity, in the software sense, and I don’t expect weird bugs that are difficult to solve due to code complexity. I own every part of the project, which relieves me of unnecessary worries.</li>
<li><strong>The blog is designed to avoid distractions</strong>. Before the blog was full of colours and buttons, which in my ADHD opinion diverted attention away from the essential elements: the text and ideas. I wanted to create a design that focused on eliminating distractions and keeping the focus on the content.</li>
<li><strong>It’s 100% mine</strong>. Before I was using a standard Jekyll template (<a href="https://github.com/cotes2020/jekyll-theme-chirpy">Chirpy</a>), but now all the code is mine which gives me a strange feeling of satisfaction. It’s entirely my creation, and that alone brings me joy.</li>
<li><strong>The blog is prettier</strong>. Although everyone claims that the blog is uglier, I can’t help but believe that it has become much more beautiful. I love simple aesthetics and blogs that are direct to the point. Don’t get me wrong: I don’t appreciate simplicity without any thought given to style, like <a href="https://www.danluu.com">Dan Luu’s</a> blog; I believe that the right amount of CSS makes the reading experience much better. For example, one of my favourite blogs in terms of style is this <a href="https://gregorygundersen.com/">one</a>.</li>
</ul>
<p>If you like it, feel free to just copy the code from <a href="https://www.github.com/alexmolas/alexmolas.github.io">here</a>. Besides the simplicity and the style, it has some cool features such as an automatic blogroll generation based on your favourite blogs ;)</p>
<hr />
<p>PS. This is how my blog used to look before last night</p>
<div style="text-align:center">
<img src="/docs/ugly-blog/old-blog.png" width="500px" class="center" />
</div>Last night, I completely revamped my blog. This morning, I showed it to my wife and siblings, hoping they would appreciate the new look. However, none of them liked the updated aesthetics. Despite their opinions, I couldn’t care less. I absolutely adore the new appearance and here’s why:A game for the next 15 years: counting license plates2023-07-01T00:00:00+00:002023-07-01T00:00:00+00:00https://alexmolas.com/2023/07/01/counting-license-plates<p>Since last October, I’ve been playing in a simple game all by myself, with just one rule - I have to spot all the license plates from 0000 to 9999 in sequential order. Where I live, license plates follow this format: <code class="language-plaintext highlighter-rouge">XXXX - AAA</code>, where <code class="language-plaintext highlighter-rouge">XXXX</code> represents 4 numbers, and <code class="language-plaintext highlighter-rouge">AAA</code> are three letters.</p>
<p>At the beginning, things got off to a slow start. I was only able to spot the <code class="language-plaintext highlighter-rouge">0000</code> license plate, and it was a bit frustrating. I realised I had to run the numbers to see what was going on. You see, for each license plate I encountered, I had a probability of $p=\frac{1}{10000}$ of spotting the correct one. My ultimate goal was to observe all 10,000 license plates before my time on this Earth is up - which I predict to be in around 50 years.</p>
<p>Now, figuring out the probability of spotting exactly $k=10000$ license plates out of a total of $n$ observations can be calculated using the binomial distribution equation:</p>
\[B(k; n, p) = \binom{n}{k} p^k(1-p)^{n-k}\]
<p>To succeed, I estimate that an overall success probability of 80% is enough. So, I had to solve $1 - \int_{0}^{10000} B\left(x; n, \frac{1}{10000}\right) dx = 0.8$, which resulted in $n=100840592$. That means I would need to see approximately $10^8$ license plates during my life to have a decent chance of winning this game. In other words, I’d have to spot around 5000 license plates every single day for the next 50 years. Now, even though I love playing games and winning, that seems like a truly uphill task.</p>
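<p>The arithmetic behind that $n$ can be reproduced with a normal approximation to the binomial. This is just a sketch: the exact $n=100840592$ in the text comes from solving the binomial CDF directly, so the approximation lands close to it but not exactly on it.</p>

```python
import math
from statistics import NormalDist

# We want the smallest n such that P(X >= 10000) >= 0.8, where
# X ~ Binomial(n, p) counts correct spottings and p = 1/10000.
# Approximating X ~ Normal(np, np(1-p)) with 1 - p ~ 1, the condition
# becomes m - z * sqrt(m) = 10000, with m = n * p and z the 80th percentile.
p = 1 / 10000
z = NormalDist().inv_cdf(0.8)  # ~0.84

# Solve the quadratic in sqrt(m): sqrt(m) = (z + sqrt(z^2 + 4 * 10000)) / 2
sqrt_m = (z + math.sqrt(z * z + 4 * 10000)) / 2
n = sqrt_m ** 2 / p  # ~1.0e8 plates over 50 years, i.e. ~5000 per day
```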
<p>So, I decided to take it easy and change the game a bit. Instead of aiming to spot all the license plates from <code class="language-plaintext highlighter-rouge">0000</code> to <code class="language-plaintext highlighter-rouge">9999</code>, I opted to look for plates between <code class="language-plaintext highlighter-rouge">*000</code> and <code class="language-plaintext highlighter-rouge">*999</code>, where <code class="language-plaintext highlighter-rouge">*</code> can be any number. This relaxation allowed me to progress up to <code class="language-plaintext highlighter-rouge">*052</code> as of today.</p>
<p>Since starting in October, I’ve had 55 successful spottings. By using some math magic, I estimated that I’ve probably seen around $55/(1/1000) = 55000$ cars, which amounts to roughly 200 cars every day. With these numbers in mind, my probability of success in the next 15 years is quite high, approximately $1$, thanks to $1 - \int_0^{1000} B(x; 200 \times 365 \times 50, 0.001) \, dx \approx 1$, which is good enough for me.</p>
<p>So, there you have it! Using some basic statistics saved me from wasting years of my life pursuing an impossible game to win.</p>
<p>While writing this text I realised that another way to succeed in the original game is to distribute it among $N$ people. I “only” need to convince around $N \sim 5000/190 \sim 25$ people to play this game to have the same winning probabilities as in my current simplified version of the game. What do you say? Do you want to help me?</p>
<p>That wraps up today’s post. Next time, I’ll delve into another fun game involving license plates. Stay tuned!</p>Since last October, I’ve been playing in a simple game all by myself, with just one rule - I have to spot all the license plates from 0000 to 9999 in sequential order. Where I live, license plates follow this format: XXXX - AAA, where XXXX represents 4 numbers, and AAA are three letters.How much for your brain?2023-06-13T00:00:00+00:002023-06-13T00:00:00+00:00https://alexmolas.com/2023/06/13/renting-your-brain<p>In our modern era, we have witnessed a remarkable shift in the way we approach computation. We no longer rely solely on physical machines housed within our premises; instead, we rent computers in the ethereal realm of the cloud. It’s a convenient arrangement that grants us access to computational power without the hassle of maintenance or obsolescence. But the wonders of rental don’t stop there. We have taken a step further and begun leasing machine learning models for predictive tasks, tapping into the immense potential of algorithms to enhance our decision-making. OpenAI and its pre-trained models stand as prime examples of this fascinating trend.</p>
<p>However, amidst this burgeoning rental revolution, another peculiar concept emerges. We pay people to work in our companies, as we’ve done for centuries. Yet, in this age of computation and connectivity, one cannot help but imagine an alternate reality, a dystopian future where instead of selling our workforce to companies, we rent our brains to them. Yes, you read that right. Imagine a world where your cerebral prowess becomes a commodity, and your brain’s computational power is up for lease.</p>
<p>In this surreal future, we connect our brains to a sophisticated interface, akin to the cloud-computing model we have now, for a predetermined duration. Companies, eager to harness our neural prowess, utilize our brain’s computational abilities to crunch numbers, process data, and solve complex problems. And at the end of each working day, we receive monetary compensation for renting out the magnificent, albeit mysterious, workings of our grey matter.</p>
<p>But what determines the value of our rented brains? Ah, that’s the catch. The key to our earning potential lies in the peculiarities of our own minds. The faster our thought processes, the more efficient our computational abilities, the greater our monetary reward. We become the racehorses of the digital age, prized for the swiftness of our mental gallops.</p>
<p>Moreover, just as experience adds value to the machine learning models we lease, so too does it enhance the worth of our rented brains. Think of it as brain training on steroids. The more experiences our minds encounter, the richer our mental tapestry becomes, and the more valuable our computational power. Embarking on a globe-trotting adventure or diving into new and diverse endeavors becomes an investment, fueling the growth of our intellectual capital. The world would become a playground for knowledge seekers, as exploring different cultures and domains transforms from a mere passion to a shrewd business move.</p>
<p>Now, before you dismiss this idea as nothing more than a whimsical flight of fancy, take a moment to reflect on the implications. If the computational power of our brains becomes a rented resource, what would happen to the traditional job market? Would companies cease hiring employees altogether, relying solely on rented brainpower? And what about the delicate balance of work and personal life? Would we become slaves to the digital realm, leasing out our thoughts day in and day out?</p>
<p>Certainly, such a future would present numerous challenges and ethical quandaries. The potential exploitation of human intellect, the erosion of personal privacy, and the dangers of overreliance on technology would all loom large. Yet, it’s crucial to acknowledge the underlying humor in this bizarre vision. After all, who wouldn’t chuckle at the thought of job listings asking for “fast-brained, well-traveled individuals” or the advent of brain rental agencies?</p>
<p>So, dear reader, as we navigate the ever-changing landscape of technology and human interaction, it’s essential to approach these tantalizing prospects with a mix of caution and curiosity. While the idea of renting our brains may seem like a far-flung concept from a science fiction novel, we must remember that our own capacity for innovation often exceeds our imagination’s bounds. With Neuralink’s ambitions inching closer to reality, we would be wise to keep an open mind and a playful spirit as we peer into the looking glass of the future.</p>
<p>And who knows? Perhaps someday, when you find yourself facing a dilemma between a traditional job and renting out your brain, you’ll look back on this peculiar essay and find solace in its whimsy. Until then, let us revel in the fascinating realm of possibilities, where clouds are not just for computers, and our minds possess the potential to reshape the very foundations of labor and productivity.</p>
<hr />
<p>To add more irony to the post, let me confess that this essay was completely generated by ChatGPT. I had the original idea about renting brain power, but I wanted to experiment with how well a machine learning model could concretise an abstract idea into a text. The quality of the text is good enough, but imagine how good it would be if, instead of a machine learning model, I had used the brain of a professional writer.</p>
<p>You can see the original prompt and the generated text <a href="https://chat.openai.com/share/e1f14878-d02d-4ab5-9f6b-83b63e52df47">here</a>.</p>In our modern era, we have witnessed a remarkable shift in the way we approach computation. We no longer rely solely on physical machines housed within our premises; instead, we rent computers in the ethereal realm of the cloud. It’s a convenient arrangement that grants us access to computational power without the hassle of maintenance or obsolescence. But the wonders of rental don’t stop there. We have taken a step further and begun leasing machine learning models for predictive tasks, tapping into the immense potential of algorithms to enhance our decision-making. OpenAI and its pre-trained models stand as prime examples of this fascinating trend.Debunking the Myth of Dollar Cost Averaging2023-06-07T00:00:00+00:002023-06-07T00:00:00+00:00https://alexmolas.com/2023/06/07/dca-is-suboptimal<h1 id="introduction">Introduction</h1>
<p>I never received financial education from my family, the only thing they taught me was the age-old adage of spending less than what one earns, but it never went further than that. After moving out of my parents’ home and starting a new life with my beloved wife we started learning about personal finances and how to maximise our hard-earned money. This involved reading books, watching videos, and speaking with experts, but perhaps most significantly, speaking with friends facing similar circumstances, eager to uncover their strategies. In one of these conversations, I was explaining to one friend that every month we were investing some money in S&P 500 and that so far it was going well. He then told me something like “Yes that’s a good idea since you’re reducing the volatility and avoiding the probability of putting all your money at a high point”. He then suggested that were he blessed with a $10,000 fortune, he would gradually deploy it, in modest increments, over time. At the time it seemed like a sensible idea, an approach worth being considered. However, over time, I began to suspect that maybe it wasn’t as right as I thought at first. In this post, I’ll analyse S&P 500 data from the last 40 years and show that dollar cost averaging is usually suboptimal and that investing all the money at once is better.</p>
<p>As usual, all the data and code are available in my <a href="https://github.com/alexmolas/alexmolas.github.io/tree/master/notebooks/sp500">GH repo</a>.</p>
<h1 id="tldr">tldr</h1>
<ul>
<li>Over the last 40 years dollar cost averaging performed worse than a lump sum strategy 82% of the time.</li>
<li>While dollar cost averaging succeeds in reducing volatility, the lump sum strategy typically outperforms it by 23%.</li>
<li>Even when tuning the parameters of the dollar cost averaging strategy (frequency and duration), lump sum still performs better.</li>
</ul>
<h1 id="definitions">Definitions</h1>
<p>There is some confusion with the terms I’ll discuss in this post, so I’ll start by defining them.</p>
<ul>
<li><em>Lump sum</em> (LS): This strategy involves investing all the available money right away, without delay.</li>
<li><em>Dollar cost averaging</em> (DCA): With this strategy, you invest the available money in fixed intervals, but in smaller portions each time.</li>
<li><em>Systematic investment</em> (SI): This strategy involves investing a small amount of money as soon as possible. Essentially, it’s like making a lump sum investment every month, as a percentage of your salary.</li>
</ul>
<p>The difference between DCA and SI is that in DCA you have all the money available since the beginning, but in SI you have to wait until the next month to have access to the quantity to be invested.</p>
<h1 id="results">Results</h1>
<p>To do this analysis I’ve used daily data for the SP500 index. In particular, I’ve used the closing price of the index. The data availability is from <code class="language-plaintext highlighter-rouge">1980-01-01</code> until <code class="language-plaintext highlighter-rouge">2023-06-01</code>.</p>
<h2 id="monthly-dca-vs-ls">Monthly DCA vs LS</h2>
<p>The first question I try to answer is “When was DCA better than LS over the last 40 years?”. To do so, I’ve computed the performance of an investment over 5 years of (1) a monthly DCA and (2) an LS. I’ve computed these values for all the starting days from <code class="language-plaintext highlighter-rouge">1980</code> until <code class="language-plaintext highlighter-rouge">2018</code>. That is, for each day I assume that I have a fixed amount of money and compute the benefits of DCA and LS over 5 years. The results can be seen in the plot below</p>
<p><img src="/docs/dca-is-suboptimal/dca-vs-ls.svg" alt="DCA-vs-LS" width="500" height="500" /></p>
<p>It’s clear that LS outperforms DCA on the majority of days, but by how much? I computed the percentage difference between LS and DCA and plotted it in the next image. For 82% of the starting days, the LS approach was better than the DCA. In particular, on average, LS made 23% more than DCA.</p>
<p><img src="/docs/dca-is-suboptimal/dca-vs-ls-2.svg" alt="DCA-vs-LS-2" width="500" height="500" /></p>
<p>However, it’s fair to point out that DCA succeeds in reducing the variance (risk) of the investments. The next plot shows the distributions of returns for both DCA and LS. But this risk reduction comes at a price: the expected returns are substantially lower (remember that in 82% of cases LS got better results than DCA).</p>
<p><img src="/docs/dca-is-suboptimal/distributions.svg" alt="distributions" width="500" height="500" /></p>
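<p>For reference, the core of the backtest can be sketched in a few lines of Python. The function names, the 21-trading-day month, and the 60 monthly purchases (5 years) are my assumptions chosen to mirror the setup above, not the exact code from the repo:</p>

```python
def lump_sum_value(prices, budget):
    """LS: buy everything at the first closing price, value at the last one."""
    return (budget / prices[0]) * prices[-1]

def dca_value(prices, budget, n_investments=60, step=21):
    """DCA: split the budget into equal purchases every ~21 trading days."""
    part = budget / n_investments
    shares = sum(part / prices[i * step] for i in range(n_investments))
    return shares * prices[-1]

# On a steadily rising series (as the S&P 500 was over most 5-year windows),
# DCA buys at ever-higher prices, so LS ends up ahead.
prices = [100 * 1.0003 ** day for day in range(1300)]  # ~5 trading years
```

<p>Sliding this comparison over every starting day of the historical series gives the percentages reported above.</p>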
<h2 id="fine-tuning-dca-is-it-worth-it">Fine-tuning DCA: is it worth it?</h2>
<p>In the last section, we saw that monthly DCA wasn’t better than LS for investment periods of 5 years. However, the reader might have noticed that DCA depends on two parameters</p>
<ul>
<li>Investment frequency: how many investments do I want to make over the investment period? Investing every month is not the same as investing every week.</li>
<li>Horizon: how long do I want to maintain the investments? Investing over 1 year is not the same as investing over 10 years.</li>
</ul>
<p>In the next two sections, I’ll explore if DCA can be improved by tuning these two parameters.</p>
<h3 id="frequency">Frequency</h3>
<p>Now let’s delve into the influence of investment frequency on the performance of DCA. Investment frequency refers to how often you make investments within the designated investment period. It can vary from monthly investments to weekly or even daily investments.</p>
<p>To analyse the effect of this parameter, I have simulated DCA over a period of 5 years with different investment frequencies and compared them to the LS strategy. In the plot below you can see the percentage improvement of LS over DCA for different frequencies</p>
<p><img src="/docs/dca-is-suboptimal/period-vs-difference.svg" alt="period-vs-difference" width="500" height="500" /></p>
<p>The difference increases monotonically with the number of investments, which indicates that the fewer investment splits one makes the better, ie: if possible, invest everything as soon as possible.</p>
<h3 id="horizon">Horizon</h3>
<p>Finally, let’s explore the impact of the investment horizon on the efficacy of DCA. The investment horizon represents the duration over which you maintain your investments. It could be as short as one year or as long as ten years, for instance.</p>
<p>In the next image, I plot the difference between a monthly DCA and LS for different investment lengths. Notice that the difference increases linearly with the investment duration.</p>
<p><img src="/docs/dca-is-suboptimal/length-vs-difference.svg" alt="length-vs-difference" width="500" height="500" /></p>
<p>This means that the performance of DCA in comparison to LS gets worse as more time passes, indicating again that LS is the best strategy to follow.</p>
<h1 id="conclusions">Conclusions</h1>
<p>In this post, I’ve shown that DCA is rarely a good investment strategy. This is because when one decides to follow DCA one is implicitly expecting the market to fall in the near future. However, our experience - at least as far as the S&P 500 is concerned - tells us that this is not what happens.</p>
<p>The main conclusion of this post is then <strong>invest all you have as soon as you can</strong>, which is very similar to the well-known adage that the best day to start investing was yesterday.</p>
<h1 id="appendix">Appendix</h1>
<p>As pointed out by Marcel in <a href="https://www.linkedin.com/feed/update/urn:li:activity:7072948689777958912?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7072948689777958912%2C7072961217580933120%29">Linkedin</a> these results are derived for the S&P 500 index which has been mostly up since its beginning, and it’s legit to ask if these results apply to other indices such as Ibex35 or Nikkei. Here I add the plots for these two indices and their numbers</p>
<h2 id="ibex-35">Ibex 35</h2>
<p>Ibex 35 is a market capitalization weighted index comprising the 35 most liquid Spanish stocks. The results are plotted in the next figure</p>
<p><img src="/docs/dca-is-suboptimal/dca-vs-ls-2-ibex.svg" alt="length-vs-difference" width="500" height="500" /></p>
<p>Here LS was better than DCA in 53% of the cases, and it yielded 12% more returns. The results point to LS as the winning strategy; however, the difference is not as sharp as before.</p>
<h2 id="nikkei">Nikkei</h2>
<p>Nikkei is an index that measures the performance of 225 large, publicly owned companies in Japan. The results are plotted in the next figure</p>
<p><img src="/docs/dca-is-suboptimal/dca-vs-ls-2-nikkei.svg" alt="length-vs-difference" width="500" height="500" /></p>
<p>Here LS was better than DCA in 69% of the cases, and it yielded 16% more returns.</p>
<h2 id="dax">Dax</h2>
<p>Dax is a stock market index consisting of the 40 major German blue chip companies. The results are plotted in the next figure</p>
<p><img src="/docs/dca-is-suboptimal/dca-vs-ls-2-dax.svg" alt="length-vs-difference" width="500" height="500" /></p>
<p>Here LS was better than DCA in 77% of the cases, and it yielded 21% more returns.</p>IntroductionA randomized voting strategy2023-05-23T00:00:00+00:002023-05-23T00:00:00+00:00https://alexmolas.com/2023/05/23/randomized-voting-strategy<p>I think it’s safe to say that nobody agrees completely with any political party. Our reality today is complex enough to make it impossible to have a party that represents your beliefs in absolutely all cases. We might agree on economic points but differ on sociological issues, or we may not even agree with all the economic positions taken by our chosen party.</p>
<p>However, when voting you’re asked a binary question: “Do you want to trust this party?”. This makes the voting process a burden since you need to disregard a lot of your views and put all your voting power in just one party. I know a lot of people who have decided to stop voting because of that. They don’t trust any party, and therefore they do not want to be complicit in making them come to power.</p>
<p>Also, this voting process gives results that are not representative of what people really think. Consider a scenario where a country has four parties and all its citizens share the same beliefs. Let’s say people agree with political party $A$ in $28\%$ of their positions, and with each of the other parties in $18\%$. Since they must answer a binary question, they would all vote for party $A$. Thus, the distribution of political representatives would be $A: 100\%$, $B: 0\%$, $C: 0\%$, $D: 0\%$. This creates the illusion that party $A$ faithfully represents the beliefs of all the people in a country, but this is not true, and this is a problem. So, how can this be solved?</p>
<p>Here, I propose a voting process that can help with that problem.</p>
<ol>
<li>Give a score to all the parties that are presented to the elections.</li>
<li>Normalise that score, ie: the scores should sum up to $1$. The normalised score can be interpreted as a probability.</li>
<li>Using the normalised score as a probability choose a party at random.</li>
<li>Go and vote for the selected party.</li>
</ol>
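<p>For what it’s worth, steps 1 to 3 are a one-liner in Python: <code class="language-plaintext highlighter-rouge">random.choices</code> accepts unnormalised scores directly, so the explicit normalisation is only conceptual. This is a sketch using the example scores from above:</p>

```python
import random

scores = {"A": 28, "B": 18, "C": 18, "D": 18}  # step 1; need not sum to 100

def randomized_vote(scores, rng=random):
    # Steps 2 and 3: choices() normalises the weights internally,
    # then draws one party with the corresponding probability.
    parties = list(scores)
    return rng.choices(parties, weights=[scores[p] for p in parties], k=1)[0]
```

<p>Note that with these exact (unnormalised) scores, over many voters party $A$ would collect about $28/82 \approx 34\%$ of the votes once the scores are normalised.</p>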
<p>If everyone followed this approach the results in the elections would be more faithful to what people really think. In the above example, the results would be $A: 28\%$, $B: 18\%$, $C: 18\%$, and $D: 18\%$, which represent exactly the beliefs of the people in that country.</p>
<p>I know that the process I described above can be a little bit cumbersome (<a href="https://stackoverflow.com/questions/13047806/weighted-random-sample-in-python">who the hell knows how to randomly sample with weights from a set of elements???</a>). The process can be simplified as</p>
<ol>
<li>Go to the voting station.</li>
<li>For each political party, take ballots based on how much you like that party (the more you agree with the party the more ballots you take).</li>
<li>Shuffle the ballots and select one of them. Here you can decide if you want to see which party you have selected.</li>
<li>Vote for the selected party.</li>
</ol>
<p>With this methodology, you accomplish at least two significant outcomes. First, the election results align faithfully with people’s beliefs, ensuring a more accurate representation. Second, it relieves the burden of having to select just one party for voters who opt not to view their selected ballot, allowing them to remain unaware of their choice.</p>
<p>Of course, this method has some problems. The most obvious is that assigning a certain probability to a party does not guarantee that they will prioritise the specific points you agree with. For example, if you agree with a party’s economic agenda but not their social views, you might assign a 50% probability to that party, but you don’t have any guarantee that this party will prioritise economic aspects over social ones. However, I still believe that this is a better approach than just putting all your vote in one party.</p>
<p>PS. I haven’t designed this method on my own. To be honest I know at least two persons that more or less use this method of voting. One of them chooses at random between a far-right and a far-left party, which is an extreme version of this process.</p>I think it’s safe to say that nobody agrees completely with any political party. Our reality today is enough complex to make it impossible to have a party that represents your beliefs in absolutely all cases. We might agree on economic points but differ on sociological issues, or we may not even agree with all the economic positions taken by our chosen party.How to initialize your bias.2023-02-24T00:00:00+00:002023-02-24T00:00:00+00:00https://alexmolas.com/2023/02/24/bias-initialization<h1 id="tldr">tldr</h1>
<p>Initializing correctly the bias of the last layer of your network can speed up the training process. In this post, I show first how to derive analytically the best values for the biases, and then I run an experiment to show the impact of using the correct bias.</p>
<p>In particular, the best biases are</p>
<ul>
<li>Classification problem with $M$ classes with frequencies $F_i$, such that $\sum_{j=1}^M F_j = 1$, using softmax activation and categorical cross entropy loss</li>
</ul>
\[b = \begin{pmatrix}
\log F_1 \\
\log F_2 \\
... \\
\log F_M
\end{pmatrix}\]
<ul>
<li>Regression problem using $L^2$ penalization and linear activation</li>
</ul>
\[b = \text{mean}(y)\]
<ul>
<li>Regression problem using $L^1$ penalization and linear activation</li>
</ul>
\[b = \text{median}(y)\]
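<p>A quick numeric check of the classification case: if the last-layer weights start at (or near) zero, the logits are just the bias, so the softmax of $b_i = \log F_i$ must reproduce the class frequencies, and the initial cross entropy equals the entropy of the labels. The three frequencies below are made up for illustration:</p>

```python
import math

freqs = [0.5, 0.3, 0.2]  # hypothetical class frequencies, summing to 1

bias = [math.log(f) for f in freqs]
exps = [math.exp(b) for b in bias]      # exp(log F_i) = F_i
probs = [e / sum(exps) for e in exps]   # softmax(bias) recovers the F_i

# Initial categorical cross entropy = entropy of the label distribution.
init_loss = -sum(f * math.log(p) for f, p in zip(freqs, probs))
```

<p>For balanced classes, $F_i = 1/M$, this reduces to the <code class="language-plaintext highlighter-rouge">-log(1/n_classes)</code> value that Karpathy’s check expects.</p>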
<h1 id="motivation">Motivation</h1>
<p>These last weeks at work I’ve tuned a neural network that is used to predict arrival times. Basically, the network receives a representation of Stuart’s platform state (where are the drivers, where are the packages, etc.) and outputs the estimated time of arrival of some drivers. We decided to use a deep learning approach to avoid doing boring and unmaintainable feature engineering, but the problem then was to choose the model architecture. If we were solving an image classification problem it would have been trivial to design the architecture, in fact, we wouldn’t need to design anything, just take ResNet50 and fine-tune it. However, our problem is not standard in the deep learning world, so we couldn’t rely on pre-trained models or copy the architecture of previously successful models. We ended up defining an architecture based on convolutions, self-attention, and some dense layers here and there. The results were pretty good -it beat the previous model by +30%- and the model was deployed and everyone was happy.</p>
<p>However, not everything is always that easy, and at some point, we noticed that our model was overfitting. This wasn’t surprising since the model architecture and training process had never been tuned. We just took our initial idea, ran some experiments, changed some hyper-parameters by hand and called it a day. But now that the model is deployed and the stakeholders are happy we are working on tuning the model and making it more competitive. To do so I started with the great post by the great Karpathy <a href="http://karpathy.github.io/2019/04/25/recipe/">here</a>. It wasn’t the first time I’d read it, but this time one of the points especially caught my attention.</p>
<blockquote>
<p><strong>verify loss @ init</strong>. Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure <code class="language-plaintext highlighter-rouge">-log(1/n_classes)</code> on a softmax at initialization. The same default values can be derived for L2 regression, Huber losses, etc.</p>
</blockquote>
<p>What does Karpathy mean by verifying that your loss starts at the correct value? How can we achieve the <code class="language-plaintext highlighter-rouge">-log(1/n_classes)</code> loss on a softmax? What are the respective initializations for L2 regression, Huber loss, etc.? In this post, I’ll show how to initialize the network to fulfil these requirements and what the implications are.</p>
<h1 id="problem-statement">Problem statement</h1>
<p>We want to solve the problem of</p>
<blockquote>
<p>Which is the best initialization scheme for our network layers?</p>
</blockquote>
<p>This is a broad question that has been addressed in many works, such as those of Glorot and He. In these works, the authors initialize the weights of the layers by sampling from a distribution with carefully chosen parameters. For instance, Glorot proposes sampling from $\mathcal{N}\left(0, 2/(n_i+n_o)\right)$ and He proposes sampling from $\mathcal{N}(0, 2/n_i)$. What these approaches have in common is that the mean of the distribution is $0$. However, these works focus on the initialization of all the weight matrices of the network, while Karpathy talks only about the initialization of the last layer. So, instead of solving the general question of how to initialize all the layers of the network, I will address the simplified problem of</p>
<blockquote>
<p>Which is the best initialization scheme for the last layer of our network?</p>
</blockquote>
<h1 id="solution">Solution</h1>
<p>In this section, I will answer the above question for several deep learning architectures.</p>
<h2 id="classification">Classification</h2>
<p>Let’s start with a classification problem. We can define a neural network of depth $N$ as a set of weight matrices ${W_1, W_2, …, W_N}$, a set of biases ${b_1, b_2, …, b_N}$, a set of non-linear activations ${f_1, f_2, …, f_{N-1}}$, and a final activation layer $f_N = \text{softmax}$, where $W_i \in \mathbb{R}^{d_{i} \times d_{i+1}}$, $b_i \in \mathbb{R}^{d_{i+1}}$, and $f_i: \mathbb{R}^{d_{i+1}} \to \mathbb{R}^{d_{i+1}}$. Then the output of the network is defined by the recurrence</p>
\[\begin{align}
s_i &= W_i x_{i} + b_i\\
x_{i+1} &= f_i(s_i) \\
\end{align}\]
<p>where $x_1$ is the input to our network. Now, we are interested in the input to the last layer, ie the $\text{softmax}$ layer. As we saw before, the usual initializations sample from a zero-mean normal distribution; therefore, if the input to the network is standardized (ie it has mean zero) we can expect $W_N x_N$ to be zero on average. Therefore, the output of the last layer has the form</p>
\[o = \text{softmax}(b_N)\]
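<p>This claim is easy to check empirically. The following sketch uses made-up dimensions and a Glorot-style zero-mean weight matrix: averaged over standardized inputs, the pre-softmax activations collapse onto the bias.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 64, 10, 5_000                          # hypothetical dimensions
x = rng.standard_normal((n, d_in))                       # standardized inputs (mean ~0)
W_N = rng.standard_normal((d_in, d_out)) * np.sqrt(2 / (d_in + d_out))  # Glorot-style init
b_N = rng.standard_normal(d_out)                         # some fixed bias
s = x @ W_N + b_N                                        # pre-softmax activations
# Averaged over inputs, W_N x vanishes, so the mean activation is ~ b_N.
print(np.allclose(s.mean(axis=0), b_N, atol=0.1))        # True
```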
<p>Cool, we have our first result; let’s now see how we can use this to optimize the initial values of $b_N$. The standard approach to classification problems is to use the cross-entropy loss</p>
\[\mathcal{L} = - \sum_i^M y_i \log \hat y_i\]
<p>where $M$ is the number of classes of our problem. Therefore, if we want to minimize the expected loss at initialization our best guess is to set $b_N$ such that the network output follows the same distribution as our data. That is, if our training dataset has $M$ classes that appear with frequency $F_i$ such that $\sum_j^M F_j = 1$, we would like $\hat{y}_i = F_i$, ie the predicted probability of class $i$ matches its frequency in the training dataset. With such an output the expected loss is then</p>
\[\begin{align}
\mathbb{E}[\mathcal{L}] &= -\sum_i^M \mathbb{E} [y_i \log \hat y_i] \\
&= -\sum_i^M \mathbb{E}[y_i] \log \mathbb{E}[\hat y_i] \\
&= -\sum_i^M F_i \log F_i
\end{align}\]
<p>where we have used that at initialization $y_i$ and $\hat y_i$ are independent, so we can write $\mathbb{E}[y_i \log \hat y_i] = \mathbb{E} [y_i] \log \mathbb{E} [\hat y_i]$. Notice that if the problem is balanced then we have $F_i = 1/M$ and the expected loss at initialization is $\mathcal{L}=-\log 1/M$ as Karpathy says.</p>
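<p>This expected loss can be computed directly from the class frequencies. A minimal numpy sketch (the helper name <code class="language-plaintext highlighter-rouge">expected_init_loss</code> is mine, not from any library):</p>

```python
import numpy as np

def expected_init_loss(freqs):
    """Expected cross-entropy at a well-initialized start:
    the network predicts the class frequencies themselves."""
    freqs = np.asarray(freqs, dtype=float)
    return -np.sum(freqs * np.log(freqs))

# Balanced case: M classes with frequency 1/M each -> loss = log(M), as Karpathy says.
M = 10
print(np.isclose(expected_init_loss(np.full(M, 1 / M)), np.log(M)))  # True
```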
<p>Nice, now we know which value to expect for the loss for a correctly initialized last layer, but now we need to know how to set $b_N$ such that the output has the same distribution as the training dataset. To do so we can use the definition of $\text{softmax}$</p>
\[\text{softmax}_i(x) = \frac{\exp x_i}{\sum_j \exp x_j}\]
<p>Now, using that the last layer is $\text{softmax}(b_N)$ we can write our constraint as</p>
\[F_i = \frac{\exp b_i}{\sum_j \exp b_j}\]
<p>which is solved by</p>
\[b_i = \log F_i + c\]
<p>for any constant $c$, since the softmax is invariant to adding the same constant to all its inputs. Therefore, setting $c=0$, the optimal initialization bias for our last layer has the form</p>
\[b_N = \begin{pmatrix}
\log F_1 \\
\log F_2 \\
... \\
\log F_M
\end{pmatrix}\]
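<p>We can verify this bias numerically: feeding $b_N = \log F$ through a softmax recovers the class frequencies. The frequencies below are made up for illustration.</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift by the max for numerical stability
    return e / e.sum()

freqs = np.array([0.7, 0.2, 0.1])  # hypothetical class frequencies, sum to 1
b_N = np.log(freqs)                # proposed last-layer bias
print(np.allclose(softmax(b_N), freqs))  # True
```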
<h2 id="regression">Regression</h2>
<p>In the last section, I showed how to derive the optimal biases at initialization for a classification problem. In this section, I’ll do the same for a regression problem. The main differences between the two settings are (1) the loss we are using, (2) the last layer activation, and (3) the dimension of the output. In regression, the output is usually one-dimensional, ie we’re just predicting one value, so $b_N \in \mathbb{R}$, and the last layer activation is just the identity. The most common losses for these problems are</p>
\[\mathcal{L}_{\text{MSE}} = (y_i - \hat y_i)^2\]
<p>and</p>
\[\mathcal{L}_{\text{MAE}} = |y_i - \hat y_i|\]
<p>Using the same rationale as before, we want to minimize these losses at initialization. It’s known that without any further information, the value that minimizes MSE is $\textrm{mean}(y)$ and the value that minimizes MAE is $\textrm{median}(y)$. Therefore, since the output of the last layer is just $o = b_N$ (we’re using the identity activation) we have that the values that minimize loss at initialization are</p>
\[b^{\text{MSE}}_N = \text{mean}(y)\]
<p>and</p>
\[b^{\text{MAE}}_N = \text{median}(y)\]
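<p>These two facts are easy to check empirically on a skewed synthetic target (exponential samples, so the mean and the median differ): the mean beats any nearby constant under MSE, and the median under MAE.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.exponential(size=10_000)  # skewed target, so mean != median

mse = lambda c: np.mean((y - c) ** 2)
mse_loss_at_mean = mse(y.mean())
mae = lambda c: np.mean(np.abs(y - c))
mae_loss_at_median = mae(np.median(y))

# The mean beats nearby constants under MSE...
print(mse_loss_at_mean < min(mse(y.mean() + 0.1), mse(y.mean() - 0.1)))          # True
# ...and the median beats nearby constants under MAE.
print(mae_loss_at_median < min(mae(np.median(y) + 0.1), mae(np.median(y) - 0.1)))  # True
```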
<p>The expected loss at initialization for the MSE is then the variance since</p>
\[\mathbb{E}[\mathcal{L}_\textrm{MSE}] = \mathbb{E}[(y - \text{mean}(y))^2]= \text{Var}[y]\]
<p>and for the MAE</p>
\[\mathbb{E}[\mathcal{L}_\textrm{MAE}] = \mathbb{E}[|y - \text{median}(y)|] = \frac{1}{n} \sum_i^n |y_i - \text{median}(y)|\]
<p>which is the mean absolute deviation around the median.</p>
<p>In the original post, Karpathy says that you can also find the optimal values for the Huber loss. However, unlike with MAE and MSE, there’s no closed form for the value that minimizes the Huber loss (<a href="https://stats.stackexchange.com/a/298336/350686">explanation here</a>). Still, we can obtain the value that minimizes the Huber loss for our dataset numerically and then use it as the bias of our layer.</p>
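<p>One way to do this numerical minimization, sketched below: since the Huber loss is convex in the constant, a simple ternary search over the data range finds its minimizer without any external optimizer (the function names and $\delta=1$ are my choices, not a standard API).</p>

```python
import numpy as np

def huber(c, y, delta=1.0):
    """Mean Huber loss of predicting the constant c for targets y."""
    r = np.abs(y - c)
    return np.mean(np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)))

def huber_minimizer(y, delta=1.0, iters=200):
    # The Huber loss is convex in c, so ternary search converges to its minimum.
    lo, hi = float(y.min()), float(y.max())
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if huber(m1, y, delta) < huber(m2, y, delta):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

rng = np.random.default_rng(0)
y = rng.exponential(size=1_000)
b_huber = huber_minimizer(y)
# Using this bias is no worse than using the mean (MSE answer) or the median (MAE answer).
print(huber(b_huber, y) <= min(huber(y.mean(), y), huber(np.median(y), y)))  # True
```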
<h1 id="results">Results</h1>
<p>In the previous sections, I explained how to determine the best initial bias through mathematical analysis. However, in the real world, things are not always precise, and data can show that our assumptions were incorrect. In this section, I will conduct some experiments to see the impact of initializing biases correctly.</p>
<p>To conduct these experiments, I used the CIFAR-10 dataset, made unbalanced by subsampling each class. Then, I created two CNN networks: one with the optimal bias strategy defined above and another with the standard initialization. You can find the code used to generate the models and datasets in this <a href="https://github.com/alexmolas/alexmolas.github.io/blob/master/docs/optimal-biases/Optimal%20biases.ipynb">notebook</a>.</p>
<p>The results are summarized in the following plot. We can see that the network with the optimized initial bias learns faster than the one with the standard initialization. This effect disappears if we train the networks for a sufficient number of epochs. However, training large models is often costly, so if we can save time and money by setting the correct bias, it is worthwhile.</p>
<p><img src="/docs/optimal-biases/opt-vs-normal-loss.svg" alt="loss" width="300" height="300" /></p>