## Coronavirus and Hosting Conferences Remotely

Covid-19 has been spreading rapidly since it was detected in December 2019 in Wuhan, China. When I first heard the news in January, I did not anticipate that schools and offices in the US would close, that Italy would place the entire country under quarantine, that travel between the US and Europe would stop, or that stock markets would suffer their greatest dip since the 1980s. This is the thing about exponential growth: human intuition is often wrong about it. Current estimates for the reproduction number of Covid-19 (the average number of people infected by a single infected person) are around 2.56, which is greater than 1. This means the disease can infect a large number of people at an exponential rate; in short, it is quite contagious. Further, the mortality rate from Covid-19 so far appears to be much higher than that of the normal flu, so we should not mistake it for just another flu. It is therefore important for people to take precautions. Even if you value working from the office or school, or enjoy being outside, you run the risk of catching the disease and spreading it unknowingly, possibly to older or sick people who might experience fatal symptoms. This is one reason why young and healthy people should not dismiss Covid-19 lightly.
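To see why a reproduction number above 1 matters so much, here is a toy back-of-the-envelope calculation (purely illustrative; it ignores recovery, immunity, and interventions) showing how quickly cases compound at R = 2.56:

```python
# Toy illustration of exponential spread: each case infects R new people
# per "generation" of transmission. This deliberately ignores recovery,
# immunity, and interventions -- it only shows how fast R > 1 compounds.
R = 2.56  # estimated reproduction number quoted above

new_cases = 1.0  # start from a single case
total = 1.0      # cumulative cases
for generation in range(1, 11):
    new_cases *= R
    total += new_cases
    print(f"generation {generation:2d}: ~{total:,.0f} cumulative cases")
```

After just ten generations of transmission a single case has compounded to roughly twenty thousand; with R below 1 the same loop would dwindle toward zero instead.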

Many universities have moved their classes fully online for the remainder of the Spring semester, and companies including Microsoft are recommending that people in affected areas (e.g., Seattle, the North-East US) work from home. Conferences are moving online too. The New York Academy of Sciences (NYAS) just announced that it will hold its conference virtually (probably a first in its history), and ICLR, an important machine learning conference, will be held as a webinar.

I feel this is a great idea. Virtual classes and conferences are something that should not have needed a pandemic to happen. However, some people have apprehensions about virtual meetings and conferences that I want to address in the rest of this post.

## Hosting (Machine Learning) Conferences Remotely

This section is specific to machine learning conferences but might apply more generally. A conference has two formal aims:

1. In-person peer review, where people can ask authors questions about their paper.
2. Exchange of ideas and fostering of collaborations.

Both of these can happen virtually. Machine learning conferences consist of four things: (i) oral sessions, where a selected group of papers are presented as talks; (ii) poster sessions, where papers are presented as posters; (iii) informal chit-chat and food; and (iv) panel sessions.

### Hosting Oral Talks

This is quite easy. Imagine software like Microsoft Teams or Google Hangouts. Everyone joins the presentation from wherever in the world they are. They can see the slides, go back to a previous slide if they want, sync back to the current one, or record the talk and watch it later. During the Q&A session, people can ping the organizer with questions, and a random selection will pick 2-3 questions for each talk. No big deal!

### Hosting Poster Talks

Poster sessions require a bit more work, since many posters are presented at the same time. One solution is to create a login for attendees where they can see all the posters at once. They can then click on a poster and be taken to the video session where the author is discussing it.

### Panel Discussion and Chit-Chat

Panel discussions can take place like oral talks. A common webinar starts in which a preselected group of people have audio access and can discuss a topic. Attendees can post questions, and moderators select which questions to ask. For chit-chat, people can create public (or private) groups that others can join (or be invited to) in a fixed time slot, to discuss a topic or just for banter. Sessions should be recorded to ensure no harassment occurs, although private groups should not be made public or investigated unless there is a formal complaint. Private discussions should be deleted after a week.

## Why Virtual Conferences Are Actually Better

My main emphasis in this post is that virtual conferences should not be viewed as a sad alternative to physical conferences in the case of a pandemic; in fact, they are a viable alternative with many advantages over a physical conference. I list the main advantages below.

### 1. People from countries with weak passports do not suffer

If you are a Swiss national at a Swiss university planning to go to ICML, you sit in an airplane a day before the conference and fly there. You have probably never dealt with a visa, or you get one when you land. However, if you are from a country with a weak passport, such as China or India (both countries have heavy attendance at these conferences), then you may first have to fly to another city just to apply for a visa. You generally pay $100-$300 in visa fees, and only if the visa is granted do you get to attend the conference. Further, many people face travel restrictions, such as being unable to travel without renewing the visa for their host country. So many people end up skipping conferences.

### 2. Conference Registration Fees Will Go Down

Conferences are a rich person’s game if you are not reimbursed. Machine learning conferences charge between $500 and $1000 for registration. They are generally held in touristy places; e.g., ICML 2016 was held in Times Square (why?). This means they have to charge higher fees to recover the cost.

Many professors have a simple rule: you do not get reimbursed for a conference if you do not have an accepted paper. This means students have to pay out of their own pockets. Now think about what that means for a student from a poor country who wants to get into machine learning. This essentially stifles knowledge dissemination.

### 3. Many More People Can Attend the Conference

A virtual conference should cost close to $0, since companies would be more than happy to provide conferencing software. Perhaps a small fee of $20 could be charged just to discourage spammers from joining the call and bringing down the system. But this means many more students can attend the conference and interact with people. Note that this is different from talks being posted offline: virtual interactive sessions are not the same as offline, non-interactive content.

### 4. Oral Talks and Poster Sessions are More Accessible

I do not see any advantage of an in-person oral talk over a virtual session. If anything, not having to stare at a tiny screen from the back of a room is a good thing. You can see the slides clearly, go back to a slide, go forward, pause, and resume. This is much better, right?

Poster sessions also benefit. I remember that when presenting posters, I would dread getting a spot close to the toilet or in a corner. People would therefore skip talks and rush to the poster area to put their posters in a preferred spot. With virtual sessions, that is a thing of the past 🙂

### 5. More Convenient for People

You do not have to travel: people with travel anxiety, differently abled people, and people with kids or a sick partner will find this far more convenient. I feel newcomers will also be more at ease asking questions virtually than having to make their way through a huge crowd to reach the microphone.

Overall, I have always questioned the material benefit of spending $2000 and a whole week of traveling to give a 5-minute talk or a 1-hour poster session that happens at the same time as dinner. Sure, meeting friends and exchanging ideas in person is great, but it should not be compulsory, nor a reason to inconvenience a good number of people and stifle knowledge dissemination. Nevertheless, I feel there is some cost to removing in-person meetings. This can be more than remedied either by hosting a physical conference and a virtual conference together, with people deciding which they want, or by holding many regional workshops where people can meet each other.

## Academia and Compute-Intensive AI Research

Artificial Intelligence (AI) research has achieved great empirical success in the last decade. One major reason has been the availability of fast GPU and CPU clusters, which have made it possible to run experiments at very large scale (e.g., training models on millions of documents, or training an image classifier on a hundred thousand images). Some recent AI directions (particularly in architecture search) have been using notoriously large amounts of resources. For example, Strubell et al. 2019 found that training a language model can produce the same carbon emissions as taking a trans-American flight. While creating computationally efficient algorithms is an interesting research direction, the state at the end of 2019 is still that beating the state of the art (SOTA) justifies it all. Getting an empirical paper accepted without SOTA results is generally hard, and my speculation is that reviewers do not care as much about carbon emissions and FLOPs spent as about improvement over the state of the art.

However, an interesting and overlooked side-story is that not everyone has benefited from this resource-heavy research. Research labs, particularly those in academia without connections to well-funded industry, are at a massive disadvantage in playing this SOTA game.
A student in a randomly chosen academic lab will have to use a small shared GPU cluster to play the same numbers game as a person in industry with access to a cluster of 100 GPU machines. This is generally less of a concern for top-tier schools, where many professors are well connected to industry (or otherwise well funded) and are therefore in a better position to provide computational resources for their students.

It is relatively easy to dismiss these concerns as unwarranted: good research should come from well-founded intuition rather than heavy computational use. I believe this sentiment is true in the long run. Maybe in 10 years, when we understand the black-box nature of deep learning, we will be able to derive these models on a whiteboard or find them in an ultra-fast way from basic principles; ResNets, LSTMs, and Transformers will then magically appear as optimal solutions of some functional. That would be a great achievement. However, until that time comes, the problem remains as follows:

- Students in empirical research need to publish.
- Current reviewing practices, such as the use of leaderboards, heavily favour beating SOTA.
- Current deep learning methods are not well enough understood to allow a whiteboard derivation of solutions. E.g., we do not have a proof of why ResNets work better; the need for residual connections is not apparent from a formal whiteboard argument.
- Students from underfunded labs are at a disadvantage.

### What can underfunded labs do?

1. Do theoretical research, or focus on understanding existing empirical solutions. One can avoid the computation-driven try-and-fail rat race by focusing on theoretical questions or on understanding existing solutions. This area is relatively under-explored, and plenty of open questions remain that likely won't be solved anytime soon.

2. Think of existing problems in a different way. The community as a whole can get tunnel vision, resulting in a significant number of researchers working on the same problem.
I remember a time when 3-4 similar papers on the same problem came out concurrently from well-established labs. Working on a different problem, or analyzing the same problem in a different way (for example, by questioning the assumptions or changing the setup), helps get out of this tunnel vision. There is more to contribute to the literature this way than by solving a problem that will surely be solved by 10 other people in the next few months.

3. Focus on the learning problem, or on data-poor problems. Focus on sample-efficient learning, and/or work with setups where one cannot simply scale the data (due to annotation cost, privacy, or other reasons).

4. Reach out to industry for resources. Research labs in universities can reach out to industry for collaboration or funding. For example, Microsoft supports various fellowships (investigator fellowship, PhD fellowship, dissertation grant, Ada Lovelace fellowship). Facebook, Amazon, and others have similar fellowship programs.

### What can the community do?

1. Avoid reducing reviewing to SOTA measurements. A research paper is more than a performance gain over prior work, and reducing reviewing to how many datasets a paper killed is harmful for long-term research. Consider, for example, two contributions: one achieves a 2% improvement over SOTA by tweaking the SOTA solution, while the other takes a less-travelled route for which previous performance was significantly below SOTA and manages to come within 1% of SOTA (while still below it). Which one is more informative? The first reaffirms the belief that current SOTA techniques work well when mixed together. The second says that we have grossly underestimated the performance of another route.

2. Take FLOPs, memory footprint, carbon emissions, etc. into consideration in reviewing. Traditionally, these statistics have either not been reported or, if reported, been ignored by reviewers. Conference guidelines should encourage authors to report these statistics.
This does not require any more experiments, just simple accounting. Reviewers should judge for themselves whether a 1% gain at the cost of 10x more compute is worth acceptance.

3. Ask whether empirical gains are coming from technical novelty or from extensive tuning. It is not rare to see a paper reporting a 5% gain over SOTA where the core idea contributes less than 1% to performance. A simple way to check this is to ask authors to perform ablations.

4. Evaluate candidates on core research contributions rather than citations, number of papers, etc. Many AI students feel a lot of pressure from being unable to match the productivity of large, well-funded labs writing 50-100 papers a year (1-2 papers a week). Generally, most of these papers revolve around a few topics and do not contribute fundamentally in all 100 directions (that would be nearly impossible). Okay, but what does this have to do with compute-intensive work? Firstly, I speculate that if students were not pressured to write many papers, they would rely more on fundamental principles (or take a completely new route) rather than using brute-force compute to climb leaderboards. Secondly, without the pressure to publish constantly, students may be encouraged to try different research problems and increase the diversity of their research.

Heavily tuned deep neural architectures trained with relatively simple SGD methods have revolutionized machine learning. Similarly, the use of leaderboards has made it possible to perform controlled experiments, which has had huge impact. These two practices have generated a large amount of financial interest in machine learning and AI, which also benefits other aspects of AI and ML (and for good). However, we might be starting to see some disadvantages of over-reliance on compute-intensive, leaderboard-driven research.
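As a sketch of what this simple accounting could look like, here is a rough FLOP estimate for one forward pass of a small multi-layer perceptron (the layer sizes are hypothetical, chosen only to illustrate the bookkeeping):

```python
# Rough FLOP accounting for one forward pass of a small MLP.
# A dense layer mapping d_in -> d_out costs about 2 * d_in * d_out FLOPs
# (one multiply and one add per weight). The layer sizes below are
# made-up values for illustration, not from any particular paper.
def dense_flops(d_in: int, d_out: int) -> int:
    return 2 * d_in * d_out

layers = [(784, 512), (512, 512), (512, 10)]
flops_per_example = sum(dense_flops(i, o) for i, o in layers)
print(f"~{flops_per_example:,} FLOPs per example")  # ~1,337,344
```

Multiplying by the number of examples and epochs (and by roughly 3 to account for the backward pass, a commonly used rule of thumb) gives a crude estimate of total training compute that authors could report alongside accuracy.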
Those who are feeling underwhelmed by the SOTA competition can find solace in knowing that there is still lots of work to do in developing reusable first principles, in understanding existing solutions, and in taking the more rigorous, albeit slower, path towards better solutions. On the way, you may find solutions that have already been proposed by people with great intuition, or by exhaustion of computational resources. Still, there is great benefit in discovering a new, elegant path to the same solution.

## Mathursday: PAC with Hoeffding-Bernstein

PAC results based on the Hoeffding and Bernstein inequalities are the bread and butter of machine learning theory. In this post, we'll see how to use Hoeffding's inequality to derive agnostic PAC bounds with an error rate that decreases as $O\left(\sqrt{\frac{1}{n}}\right)$ in the number of training samples $n$. Then, we'll see that in the realizable setting we can use Bernstein's inequality to improve this to $O\left(\frac{1}{n}\right)$.

### Supervised Learning

Consider the task of regression with input space $\mathcal{X}$ and output space $\mathcal{Y}$. We have access to a model class (also known as a hypothesis family) $\mathcal{F} = \left\{ f: \mathcal{X} \rightarrow \mathcal{Y} \right\}$, which we will assume to be finite for simplicity. There exists some unknown data distribution $D \in \Delta\left(\mathcal{X} \times \mathcal{Y}\right)$ that generates examples. Of course, we do not have access to this distribution, but we do have access to samples from it. Let $S = \{(x_i, y_i)\}_{i=1}^n$ be $n$ independent, identically distributed (i.i.d.) samples from the data distribution, i.e., $(x_i, y_i) \sim D(\cdot,\cdot)$, so the probability of observing a sample $S$ is $P(S) = \prod_{i=1}^n D(x_i, y_i)$. In order to evaluate the performance of a model, we need a loss function. Let $\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow [0, 1]$ be a loss function.
Given a datapoint $(x, y)$ and a model $f \in \mathcal{F}$, the loss associated with our prediction is $\ell(y, f(x))$. For now, we make no further assumptions about $D, \mathcal{F}, \ell$. Our aim is to find a function $f \in \mathcal{F}$ that is as good as possible. What is a good function? One that has the least loss across the input space, i.e., one whose generalization error $L_D(f)$ is as low as possible: $L_D(f) = \mathbb{E}_{(x, y) \sim D}\left[\ell(y, f(x)) \right]$. The only signal we have for finding such an $f$ is the training data $S$. We define the training error as $L_S(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))$. The aim of machine learning theory is to (i) allow us to find a “good” $f$, and (ii) bound the performance of the predicted model in terms of $S, \mathcal{F}$, and other relevant parameters. The first result we will look at is the Occam’s Razor bound; then we will see how to improve it.

### Occam’s Razor Bound

One of the first results taught in a machine learning theory class is the Occam’s Razor bound; the formal result is given in the theorem below. The bound suggests using the model that minimizes the error on the observed sample $S$, i.e., the solution of the following optimization: $\hat{f}_S := \arg\min_{f \in \mathcal{F}} L_S(f)$ (the minimum is attained since $\mathcal{F}$ is finite). This solution is called the empirical risk minimizer (ERM) or the ERM solution. We drop the $S$ from the notation when it is clear from context.

Theorem: Let $(\mathcal{X}, \mathcal{Y}, D, \ell)$ be a supervised learning problem with input space $\mathcal{X}$, output space $\mathcal{Y}$, data distribution $D \in \Delta(\mathcal{X} \times \mathcal{Y})$, and loss function $\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow [0, 1]$. We are given $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ from the data distribution $D$. Let $\mathcal{F}$ be a finite collection of models from $\mathcal{X}$ to $\mathcal{Y}$.
Fix any $\delta \in (0, 1)$. Then with probability at least $1 - \delta$ over samples $S$ drawn i.i.d. from $D$, we have:

$\forall f \in \mathcal{F}, \,\, \left|L_D(f) - L_S(f)\right| < \sqrt{\frac{1}{2n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}$, and

$L_D(\hat{f}) < \inf_{f\in \mathcal{F}} L_D(f) + \sqrt{\frac{2}{n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}$.

Proof: Fix any model $f \in \mathcal{F}$ in the model family and let $Z_i = \ell(y_i, f(x_i))$. Observe that $Z_i \in [0, 1]$ and that $\{Z_i\}_{i=1}^n$ are i.i.d. random variables (random with respect to the randomly drawn sample $S$). Further, $L_S(f) = \frac{1}{n}\sum_{i=1}^n Z_i$ and $\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n Z_i\right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[Z_i\right] = \frac{1}{n} \sum_{i=1}^n L_D(f) = L_D(f)$, where the first equality follows from linearity of expectation and the second from the i.i.d. assumption. Then, using Hoeffding’s inequality: $P\left( \left| L_D(f) - L_S(f) \right| \ge t\right) \le 2\exp\left(-2nt^2\right)$. Setting $\delta = 2\exp\left(-2nt^2\right)$, we get $\left|L_D(f) - L_S(f) \right| < \sqrt{\frac{1}{2n}\log\left(\frac{2}{\delta}\right)}$ with probability at least $1 - \delta$. Let $Q_f$ be the event derived in the above equation; then $P(Q_f) \ge 1 - \delta$, or in other words $P(\overline{Q_f}) < \delta$ (where $\overline{X}$ denotes the complement of a measurable set $X$). Even though we derived this result for a fixed $f$, we never used any property of $f$, so it holds for every $f \in \mathcal{F}$. Therefore, we can bound the probability of $\cap_{f \in \mathcal{F}} Q_f$ using the union bound:

$P\left(\cap_{f \in \mathcal{F}} Q_f\right) = 1 - P\left(\cup_{f \in \mathcal{F}} \overline{Q_f}\right) \ge 1 - \sum_{f \in \mathcal{F}} P(\overline{Q_f}) \ge 1 - \sum_{f \in \mathcal{F}} \delta = 1 - \delta |\mathcal{F}|$.
This means that with probability at least $1 - \delta |\mathcal{F}|$ we have: $\forall f \in \mathcal{F}, \,\, \left| L_D(f) - L_S(f) \right| < \sqrt{\frac{1}{2n} \log\left(\frac{2}{\delta}\right)}$. Replacing $\delta$ by $\frac{\delta}{|\mathcal{F}|}$ proves the first result.

Let $\hat{f}$ be the ERM solution and fix any $f \in \mathcal{F}$. Then:

$L_D(\hat{f}) < L_S(\hat{f}) + \sqrt{\frac{1}{2n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}$ (from Hoeffding’s inequality)

$L_S(\hat{f}) \le L_S(f)$ (from the definition of ERM)

$L_S(f) < L_D(f) + \sqrt{\frac{1}{2n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}$ (from Hoeffding’s inequality)

This implies that, for any $f \in \mathcal{F}$, with probability at least $1 - \delta$: $L_D(\hat{f}) < L_D(f) + \sqrt{\frac{2}{n} \log\left(\frac{2|\mathcal{F}|}{\delta}\right)}$. As this holds for any $f \in \mathcal{F}$, we can pick the $f$ with the smallest value of $L_D(f)$, which proves the second result.

A couple of things stand out in this result:

- More training data helps: formally, the bound improves as $O\left(\sqrt{\frac{1}{n}}\right)$ in the sample size $n$. Informally, this means that with more training data we learn a model whose generalization error has a better upper bound.
- A complex model class can be undesirable: the bound becomes weaker as the model family $\mathcal{F}$ becomes larger. This is an important lesson in model regularization: if two model classes $\mathcal{F}_1, \mathcal{F}_2$ achieve the same training error on a sample $S$, and $\mathcal{F}_2$ is the more complex class, i.e., $|\mathcal{F}_1| < |\mathcal{F}_2|$, then the bound on the generalization error for the latter will be weaker. Note that the result does not say that a larger model class is always bad: we can get better bounds with a more complex model class provided we can appropriately improve the training error.
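As a quick sanity check of the concentration at the heart of this bound, here is a small Monte Carlo simulation (an illustrative sketch; the values of $n$, the mean, the threshold, and the trial count are arbitrary choices) comparing the empirical deviation probability of a bounded loss against Hoeffding's bound $2\exp(-2nt^2)$:

```python
import math
import random

# Monte Carlo sanity check of Hoeffding's inequality for a single model:
# the losses Z_i are Bernoulli(mu), hence bounded in [0, 1], and we
# estimate P(|L_S - L_D| >= t), which Hoeffding bounds by 2*exp(-2*n*t^2).
# The values of n, mu, t, and trials are arbitrary illustrative choices.
random.seed(0)
n, mu, t, trials = 100, 0.3, 0.1, 20_000

hits = 0
for _ in range(trials):
    empirical_loss = sum(random.random() < mu for _ in range(n)) / n
    if abs(empirical_loss - mu) >= t:
        hits += 1

deviation_prob = hits / trials
hoeffding_bound = 2 * math.exp(-2 * n * t**2)
print(f"empirical: {deviation_prob:.4f}  bound: {hoeffding_bound:.4f}")
```

The empirical deviation probability comes out well below the bound, as expected: Hoeffding's inequality is loose, but it holds for any bounded distribution, which is exactly what the worst-case PAC analysis needs.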
### Improving the Occam’s Razor Bound with Bernstein’s Inequality

Bernstein’s inequality lets us incorporate the variance of the random variables into the bound, thereby yielding a tighter bound in low-variance settings.

Theorem (Bernstein’s Inequality): Let $\left\{X_i\right\}_{i=1}^n$ be $n$ i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Further, let $\left|X_i - \mu \right| \le M$ almost surely, for all $i$. Then:

$P\left(\left| \frac{1}{n}\sum_{i=1}^n X_i - \mu \right| > t \right) \le 2\exp\left\{- \frac{3nt^2}{6\sigma^2 + 2Mt} \right\}$

If the variance were negligible, $\sigma^2 \approx 0$, we would get $P\left(\left| \frac{1}{n}\sum_{i=1}^n X_i - \mu \right| > t \right) \le 2\exp\left\{-\frac{3nt}{2M}\right\}$. This bound, if we repeat what we did for the Occam’s Razor bound, gives an $O\left( \frac{1}{n}\right)$ rate. Before deriving an $O\left( \frac{1}{n}\right)$ bound in the realizable setting, we first restate Bernstein’s inequality in a more familiar form. Setting the right-hand side to $\delta$ and solving the resulting quadratic in $t$ gives: with probability at least $1-\delta$,

$\left| \frac{1}{n}\sum_{i=1}^n X_i - \mu \right| \le \frac{M}{3n} \log\left( \frac{2}{\delta} \right) + \sqrt{\frac{M^2}{9n^2} \log^2\left(\frac{2}{\delta}\right) + \frac{2\sigma^2}{n}\log\left(\frac{2}{\delta}\right)}$

We now apply the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ (valid for all $a, b \in [0, \infty)$) to get:

$\left| \frac{1}{n}\sum_{i=1}^n X_i - \mu \right| \le \frac{2M}{3n} \log\left( \frac{2}{\delta} \right) + \sqrt{\frac{2\sigma^2}{n}\log\left(\frac{2}{\delta}\right)}$.

Theorem: Let $(\mathcal{X}, \mathcal{Y}, D, \ell)$ be a supervised learning problem with input space $\mathcal{X}$, output space $\mathcal{Y}$, data distribution $D \in \Delta(\mathcal{X} \times \mathcal{Y})$, and loss function $\ell: \mathcal{Y} \times \mathcal{Y} \rightarrow [0, 1]$. Assume realizability: there exists $f^\star \in \mathcal{F}$ such that $L_D(f^\star) = 0$.
We are given $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ from the data distribution $D$. Let $\mathcal{F}$ be a finite collection of models from $\mathcal{X}$ to $\mathcal{Y}$. Fix any $\delta \in (0, 1)$. Then with probability at least $1 - \delta$ we have: $L_D(\hat{f}) < \frac{20}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right)$.

Proof: Fix an $f \in \mathcal{F}$ and define $\{Z_1, \cdots, Z_n\}$ as before. This time we apply Bernstein’s inequality. We have $|Z_i - \mu| \le 1$, as the loss function is bounded in $[0,1]$. Bernstein’s inequality gives us:

$\left|L_D(f) - L_S(f) \right| \le \frac{2}{3n} \log\left( \frac{2}{\delta} \right) + \sqrt{\frac{2\sigma^2}{n}\log\left(\frac{2}{\delta}\right)}$,

with probability at least $1 - \delta$. All we need to do is bound the variance $\sigma^2 = \mathbb{E}_{(x, y) \sim D}[ (\ell(y, f(x)) - L_D(f) )^2] = \mathbb{E}_{(x, y) \sim D}[\ell(y, f(x))^2] - L_D(f)^2$. As the loss function is bounded in $[0, 1]$, we have $\mathbb{E}_{(x, y) \sim D}[\ell(y, f(x))^2] \le \mathbb{E}_{(x, y) \sim D}[1 \cdot \ell(y, f(x))] = L_D(f)$. This gives us $\sigma^2 \le L_D(f)$. Plugging this in, we get:

$\left|L_D(f) - L_S(f) \right| \le \frac{2}{3n} \log\left( \frac{2}{\delta} \right) + \sqrt{\frac{2L_D(f)}{n}\log\left(\frac{2}{\delta}\right)}$.

Applying the AM-GM inequality, we have $\sqrt{\frac{2L_D(f)}{n}\log\left(\frac{2}{\delta}\right)} \le \frac{L_D(f)}{2} + \frac{1}{n}\log\left(\frac{2}{\delta}\right)$, giving us

$\left|L_D(f) - L_S(f) \right| \le \frac{L_D(f)}{2} + \frac{5}{3n} \log\left( \frac{2}{\delta} \right)$,

with probability at least $1-\delta$. Applying the union bound over all hypotheses, we get:

$\forall f \in \mathcal{F}, \,\, \left|L_D(f) - L_S(f) \right| \le \frac{L_D(f)}{2} + \frac{5}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right)$,

with probability at least $1-\delta$.
Let $\hat{f}$ be the ERM solution. Then:

$L_D(\hat{f}) < L_S(\hat{f}) + \frac{L_D(\hat{f})}{2} + \frac{5}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right)$ (Bernstein’s inequality)

$L_S(\hat{f}) \le L_S(f^\star)$ (ERM solution)

$L_S(f^\star) < \frac{3}{2}L_D(f^\star) + \frac{5}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right)$ (Bernstein’s inequality)

$L_D(f^\star) = 0$ (realizability condition)

Combining these, we get $L_D(\hat{f}) < \frac{L_D(\hat{f})}{2} + \frac{10}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right)$, and rearranging gives $L_D(\hat{f}) < \frac{20}{3n} \log\left( \frac{2|\mathcal{F}|}{\delta} \right)$ with probability at least $1-\delta$. This completes the proof.

So what is happening? We are making a realizability assumption, i.e., there is a function $f^\star \in \mathcal{F}$ with $L_D(f^\star) = 0$. As the loss is bounded below by 0, the loss of $f^\star$ is 0 almost surely, and hence the variance of its loss samples is 0. As our ERM solution improves with more data, it converges towards this $f^\star$, so the variance of the ERM solution’s loss (which is bounded by its mean) becomes smaller. This is what speeds up the rate of convergence.

Summary: We saw how Hoeffding’s inequality gives us an $O(\sqrt{\frac{1}{n}})$ bound in the number of samples $n$. In the realizable case, we can get $O(\frac{1}{n})$ using Bernstein’s inequality.

## Growing Bifurcation of AI Scholarship

The field of artificial intelligence (which includes machine learning, computer vision, natural language understanding, etc.) is going through an unprecedented phase of interest and enthusiasm. This is evident in the number of articles written in the popular media, the interest of non-experts in the field, the increase in the audience at AI conferences, and the rise in the number of articles submitted to the top conferences. The latter has posed a challenge in ensuring high-quality reviewing at these conferences. AI, as in other scientific disciplines, relies on a double-blind peer review process to ensure that quality papers are added to the literature.
Alternatives such as arXiv exist, where anyone can upload their paper, but the importance of peer review is evident from the fact that every year several papers get posted on the maths arXiv claiming to have solved the Riemann Hypothesis (a famous unsolved problem), and almost all of them have glaring mistakes. I picked a maths example because an error in a proof is more glaring; however, arXiv CS papers can equally contain errors in proofs or experimental protocol, or ethical concerns that might be caught during peer review.

However, the multiplicative increase in submissions at AI conferences has put this peer review process under stress. Various methods have been proposed to ensure that only quality submissions are received, including: preventing authors from posting on arXiv in the month before the conference deadline (to avoid flag-planting papers), adopting an open reviewing policy (to discourage resubmission without revision), and some interesting proposals that place a limit on the number of papers an author can write every year.

Anecdotally, it appears to me that a large part of the increase in submissions is driven by papers that demonstrate a novel application of an existing technology, or an important but not novel modification of an existing technology. These are perfectly valid and important contributions; however, it might be prudent to publish such papers separately. For example, we could create an applied research/demonstration conference where such papers are published. This conference could be held more frequently to accommodate the high submission load, and could have a more tailored reviewing process (such as a focus on code review and the application itself). The existing research conferences could then focus on more fundamental research questions. This has several advantages:

- These papers will not face heat from NeurIPS/ICML reviewers who demand technical novelty. A paper can have an actual application without being technically novel.
But the current conference system does not do justice in distinguishing between these two types of contributions.

- It encourages engineering contributions, such as hyperparameter tuning, to be published as such. The current conference model inevitably pressures authors to come up with a technical story to sell their paper to reviewers. Oftentimes, these papers are forced to pick or add a story when the biggest empirical gain might actually be coming from hyperparameter tuning. Alternatively, such results could be published as short applied papers that claim hyperparameter tuning as their main contribution. For example, say someone downloads a state-of-the-art RL algorithm and demonstrates better results by adding another layer to the model. Such a result should be communicated, but it would be better suited as a short applied AI conference paper than packaged as an 8-page conference paper. Similarly, if a paper claims to have found a better choice of hyperparameters for an important problem, then this result should be communicated, but as a two-paragraph article in an applied AI magazine rather than as an 8-page conference paper.

- The load on existing conferences will be reduced. These mainstream conferences can focus on the bigger picture, such as understanding why existing neural network methods work, creating better optimization algorithms, or in general creating algorithms that are fundamentally novel and not minor modifications of existing ones.

- The audience at an applied AI conference might be better suited for these papers than the general NeurIPS/ICML/ICLR audience. For example, an engineer working in machine translation might readily benefit from knowledge of improved hyperparameter tuning, or from the observation that curriculum learning is helpful. They would not have to extract this detail from the appendix of an 8-page paper whose main story omits it.
A significant bulk of AI submissions are applied research papers of the type described above. The current conference cycle does not do justice in appreciating these results and demands technical novelty, which often results in a weird incentive system that pushes authors to come up with a story to package their results. A move towards creating applied conferences/workshops/magazines where such results are appreciated would both decrease the load on existing conferences and provide the right audience and appreciation for these applied AI papers.

## I am joining Microsoft Research

I am excited to announce that I have graduated from Cornell and will be joining Microsoft Research, New York from next week. This is one of my dream jobs and I am excited about working alongside a team of great researchers and engineers! I want to express my gratitude to my friends, colleagues, and staff at Cornell University. I learned a lot at Cornell and it will always be a special place for me.

## Mathursday: Dynkin’s π-λ Theorem and CDF (Part 2)

The aim of this post is to prove that a cumulative distribution function (CDF) uniquely determines the probability distribution. This is a well-known fundamental result that is quite intuitive but requires some advanced techniques to prove. In this post, we provide a complete proof using Dynkin’s π-λ Theorem. The previous post on this topic covered the basics of topology and measure theory. If you are unfamiliar with concepts like measure, open set, or Borel algebra then please see the previous post. The main result we want to prove is the following.

Theorem: Let $f: (X_1, \Sigma_1, P_1) \rightarrow (\mathbb{R}, \mathbb{B})$ and $g: (X_2, \Sigma_2, P_2) \rightarrow (\mathbb{R}, \mathbb{B})$ be two random variables with CDFs $F, G$ respectively. If $F(x) = G(x), \,\, \forall x \in \mathbb{R}$ then $P_1(f^{-1}(A))=P_2(g^{-1}(A)), \,\, \forall A \in \mathbb{B}(\mathbb{R})$.
Notice that we want to prove $P_1(f^{-1}(A))=P_2(g^{-1}(A))$ for every $A \in \mathbb{B}(\mathbb{R})$. Proving this directly for every member of a σ-algebra is difficult. Instead we will prove it for a subset of the σ-algebra and then show that it holds for the entire σ-algebra. The last step will be achieved using the powerful Dynkin’s π-λ Theorem. This is going to be our first stop.

## Dynkin’s π-λ Theorem

To state the theorem we will have to define two more things:

Definition (π-system): A collection of subsets $S$ of $X$ is called a π-system if it is nonempty and closed under intersection, i.e., $A, B \in S \Rightarrow A \cap B \in S$. It is trivial to verify that every algebra of subsets is a π-system.

Definition (λ-system): A collection of subsets $S$ of $X$ is called a λ-system if:
1. $X \in S$.
2. For any $A, B \in S$ such that $A \subseteq B$ we have $B - A \in S$.
3. For any $\{A_i\}_{i=1}^\infty$ where $\forall i \in \mathbb{N}, A_i \in S$ and $A_i \subseteq A_{i+1}$, we have $\cup_{i=1}^\infty A_i \in S$.

This definition looks quite similar to that of a σ-algebra. Like a σ-algebra, a λ-system is closed under complement. Why? Let $S$ be a λ-system and let $A \in S$. Then as $X \in S$ and $A \subseteq X$, we have $A^c = X - A \in S$. However, unlike a σ-algebra, which is closed under countable union of arbitrary sets, condition (3) above only holds for an increasing sequence of sets ($A_i \subseteq A_{i+1}$). It is straightforward to show the following:

Lemma 1: Every σ-algebra is both a π-system and a λ-system.
Proof: Trivial.

With these two definitions we are ready to state Dynkin’s π-λ Theorem.

Dynkin’s π-λ Theorem: If $A$ is a π-system, $B$ is a λ-system, and $A \subseteq B$, then $A \subseteq \sigma(A) \subseteq B$.

The proof of this theorem is short but tricky. We defer the proof for now to first demonstrate the usefulness of the above result.
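These two definitions are concrete enough to machine-check on a small finite example. Below is a minimal Python sketch of mine (the helper names are invented) that verifies both properties for the σ-algebra $\{\emptyset, \{1\}, \{2,3\}, X\}$ on $X = \{1,2,3\}$; for a finite collection the increasing-union condition (3) is automatic, so the check omits it.

```python
from itertools import combinations

X = frozenset({1, 2, 3})
# The sigma-algebra generated by the partition {{1}, {2, 3}}.
S = {frozenset(), frozenset({1}), frozenset({2, 3}), X}

def is_pi_system(S):
    """Nonempty and closed under pairwise intersection."""
    return bool(S) and all(a & b in S for a, b in combinations(S, 2))

def is_lambda_system(S, X):
    """Contains X and is closed under nested differences; the
    increasing-union condition is automatic for finite collections."""
    return X in S and all(b - a in S for a in S for b in S if a <= b)

print(is_pi_system(S), is_lambda_system(S, X))   # both hold, as Lemma 1 predicts

T = {frozenset(), frozenset({1}), X}             # not a lambda-system:
print(is_lambda_system(T, X))                    # X - {1} = {2, 3} is missing
```

Note that `a <= b` on Python frozensets is the subset test, which matches the nested-difference condition (2) directly.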
Usefulness: The way we use Dynkin’s π-λ Theorem is by showing that the property of interest holds for a π-system that generates the σ-algebra of interest. Then we show that the collection of sets where the property holds is a λ-system. Dynkin’s π-λ Theorem then shows that the property holds for the entire σ-algebra.

## CDF Uniquely Determines the Distribution

Theorem: Let $f: (X_1, \Sigma_1, P_1) \rightarrow (\mathbb{R}, \mathbb{B})$ and $g: (X_2, \Sigma_2, P_2) \rightarrow (\mathbb{R}, \mathbb{B})$ be two random variables with CDFs $F, G$ respectively. If $F(x) = G(x), \,\, \forall x \in \mathbb{R}$ then $P_1(f^{-1}(A))=P_2(g^{-1}(A)), \,\, \forall A \in \mathbb{B}(\mathbb{R})$.

Proof: Let $S$ be the collection of sets where the two probability measures agree, i.e., $S = \left\{ A \in \mathbb{B}(\mathbb{R}) \mid P_1(f^{-1}(A)) = P_2(g^{-1}(A))\right\}$. We will construct a π-system where the property holds.

Claim 1: Let $T = \left\{ \left(-\infty, x\right] \mid x \in (-\infty, \infty] \right\}$ be the collection of closed rays on the real number line. Then $T$ is a π-system, $T \subseteq S$ and $\sigma(T) = \mathbb{B}(\mathbb{R})$.

Proof of Claim 1: The proof directly exploits the relation between closed rays and the CDF. For any value of $x \in (-\infty, \infty]$ we have $P_1(f^{-1}((-\infty, x])) = F(x) = G(x) = P_2(g^{-1}((-\infty, x]))$. Hence, $(-\infty, x] \in S$, i.e., all closed rays are in $S$, which gives $T \subseteq S$. Secondly, $T$ is closed under intersection, as the intersection of closed rays is another closed ray ($\,\, (-\infty, x] \cap (-\infty, y] = (-\infty, \min(x,y)] \,\,$). As $T$ is nonempty, it is therefore a π-system. Finally, as we proved in the last post, the collection of closed rays generates the Borel σ-algebra, i.e., $\sigma(T) = \mathbb{B}(\mathbb{R})$.

Great! So now we have a π-system where the desired result holds and which generates $\mathbb{B}(\mathbb{R})$, the σ-algebra of interest in this case.
Claim 2: $S$ is a λ-system.

Proof of Claim 2: Note that $Q_1 = P_1 \circ f^{-1}$ and $Q_2 = P_2 \circ g^{-1}$ are probability measures on $\mathbb{B}(\mathbb{R})$; below we apply the properties of probability measures to them. We prove each required property of a λ-system.

• From the unitarity property of a probability measure we have $Q_1(\mathbb{R}) = P_1(X_1) = 1 = P_2(X_2) = Q_2(\mathbb{R})$. Therefore, $\mathbb{R} \in S$.

• Let $A, B \in S$ such that $A \subseteq B$. Let $C = B-A$; then $C$ and $A$ are disjoint and $C \cup A = B$. Therefore, for any probability measure $Q$ we have $Q(C \cup A) = Q(A) + Q(C) \Rightarrow Q(C) = Q(B) - Q(A)$. Then $Q_1(C) = Q_1(B) - Q_1(A) = Q_2(B) - Q_2(A) = Q_2(C)$. Therefore, $C=B-A \in S$.

• Let $\{A_i\}_{i=1}^\infty$ be a sequence of increasing sets with $\forall i \in \mathbb{N}, \,\, A_i \in S$. For any probability measure $Q$ we have $Q(\cup_{i=1}^\infty A_i) = \lim_{n \rightarrow \infty} Q(A_n)$ (continuity from below). We have $Q_1(A_n) = Q_2(A_n)$ for every value of $n$, and two sequences that agree termwise have the same limit. Hence, $\lim_{n \rightarrow \infty} Q_1(A_n) = \lim_{n \rightarrow \infty} Q_2(A_n) \Rightarrow Q_1(\cup_{i=1}^\infty A_i) = Q_2(\cup_{i=1}^\infty A_i)$. Therefore, $\cup_{i=1}^\infty A_i \in S$.

Wrapping up: We showed that $T \subseteq S$ where $T$ is a π-system and $S$ is a λ-system. Therefore, using Dynkin’s π-λ theorem we get $\sigma(T) = \mathbb{B}(\mathbb{R}) \subseteq S$. What does this mean? It means that for any $A \in \mathbb{B}(\mathbb{R})$ we have $A \in S$ and therefore, by the definition of $S$, $P_1(f^{-1}(A)) = P_2(g^{-1}(A))$. This is exactly what we wanted to prove.

## Proof of Dynkin’s π-λ Theorem

In order to prove this result we will make use of a couple of basic lemmas:

Lemma 2: An intersection of λ-systems is a λ-system.
Proof: The proof is trivial; we verify property 2 as an example. Let $\{S_i\}_{i=1}^\infty$ be a set of λ-systems (so this is a set of sets of sets #inception). Let $A, B \in \cap_{i=1}^\infty S_i$ with $A \subseteq B$. As each $S_i$ is a λ-system, $B-A \in S_i$ for every $i \in \mathbb{N}$. This implies $B-A \in \cap_{i=1}^\infty S_i$. The remaining two properties follow similarly.

Lemma 3: An intersection of σ-algebras is a σ-algebra.
Proof: Trivial to verify.
Lemma 4: If a collection of subsets of $X$ is both a π-system and a λ-system then it is a σ-algebra.
Proof: Let $S$ be both a π-system and a λ-system. We will show it satisfies all three properties of a σ-algebra.

• As $S$ is a λ-system, $X \in S$. Using the second property of a λ-system, $X \subseteq X \Rightarrow \emptyset = X - X \in S$.

• Let $E \in S$. Then $E \subseteq X \Rightarrow E^c = X - E \in S$.

• Let $\{E_i\}_{i=1}^\infty$ be a countable collection of members of $S$. Define a sequence of increasing sets as follows: $Y_1 = E_1$ and $Y_i = E_i \cup Y_{i-1}$, so that $Y_1 \subseteq Y_2 \subseteq Y_3 \subseteq \cdots$. If we could show that each $Y_i \in S$, then property (3) of a λ-system would give $\cup_{i=1}^\infty Y_i = \cup_{i=1}^\infty E_i \in S$. Note that we have not used the π-system property so far; if we could establish $Y_i \in S$ without it, we would have shown that every λ-system is a σ-algebra. However, as we will see, we need the π-system property to show $Y_i \in S$. We make an inductive argument: $Y_1 = E_1 \in S$. Assume $Y_k \in S$ for all $k \le i$. Now $Y_{i+1} = E_{i+1} \cup Y_i = \left( E_{i+1}^c \cap Y_i^c \right)^c$. As $S$ is closed under complement (second point above) and intersection ($S$ is a π-system), $Y_{i+1} \in S$. Hence, proved.

Dynkin’s π-λ Theorem: If $A$ is a π-system, $B$ is a λ-system, and $A \subseteq B$, then $A \subseteq \sigma(A) \subseteq B$.

Proof: Let $\ell(A)$ be the smallest λ-system containing $A$. As an intersection of λ-systems is another λ-system (Lemma 2), we can define $\ell(A)$ as the intersection of every λ-system containing $A$. As the power set of $X$ is a λ-system, $\ell(A)$ is well defined. Now $B$ is a λ-system containing $A$, so $\ell(A) \subseteq B$: if this were not the case, the intersection $\ell(A) \cap B$ would be a smaller λ-system containing $A$, which is a contradiction. This argument will be repeated again and again, so make sure you understand it!

Now every σ-algebra is a λ-system (Lemma 1), and $\sigma(A)$ is a λ-system containing $A$; therefore, $\ell(A) \subseteq \sigma(A)$. Here we have used the above argument to show containment. At this point we have shown $\ell(A) \subseteq B$ and $\ell(A) \subseteq \sigma(A)$. If we can show that $\ell(A) = \sigma(A)$ then we will be done. This is exactly what we will show.

Now this is where the proof gets tricky. We first define a rather unintuitive collection of sets:

For any $E \subseteq X$ we define $D_E = \{ F \mid F \subseteq X, E \cap F \in \ell(A) \}$. We will now make a couple of claims:

Claim 1: $D_E$ is a λ-system  for any $E \in \ell(A)$.
Proof of Claim 1:
We prove each property of a λ-system below:

• $X \cap E = E \in \ell(A)$ therefore, $X \in D_E$
• Let $F_1, F_2 \in D_E$ and $F_1 \subseteq F_2$. This implies $F_1 \cap E \in \ell(A)$ and $F_2 \cap E \in \ell(A)$. Further, $F_1 \subseteq F_2 \Rightarrow F_1 \cap E \subseteq F_2 \cap E$.
As $\ell(A)$ is a λ-system therefore, $F_2 \cap E - F_1 \cap E = (F_2 - F_1) \cap E \in \ell(A)$. From the definition of $D_E$ this implies $F_2 - F_1 \in D_E$.
• Let $\{F_i\}_{i=1}^\infty$ be an increasing sequence ($F_i \subseteq F_{i+1}$) of sets in $D_E$. This implies $F_i \cap E \in \ell(A)$ and $F_i \cap E \subseteq F_{i+1} \cap E$. As $\ell(A)$ is a λ-system, $\cup_{i=1}^\infty (F_i \cap E) = \left( \cup_{i=1}^\infty F_i \right) \cap E \in \ell(A)$. This implies $\cup_{i=1}^\infty F_i \in D_E$.

So far we have not used the fact that $A$ is a π-system. The next claim will exploit this property:

Claim 2: $A \subseteq D_E$ for every $E \in A$.
Proof of Claim 2: As $A$ is a π-system therefore, $F \in A \Rightarrow F \cap E \in A$. As $A \subseteq \ell(A)$ therefore, $F \cap E \in \ell(A)$. This implies $F \in D_E$.

Claim 3: Define $A \cap \ell(A) = \left\{ E \cap F \mid E \in A, F \in \ell(A)\right\}$. Then $A \cap \ell(A) \subseteq \ell(A)$.
Proof of Claim 3:  For $E \in A \subseteq \ell(A)$ we have $D_E$ as a λ-system (Claim 1) containing $A$ (Claim 2). This implies $\ell(A) \subseteq D_E$ for every $E \in A$. Therefore, for any $F \in \ell(A)$ we have $F \in D_E$ which implies $F \cap E \in \ell(A)$.

Using Claim 3 we will now strengthen Claim 2.

Claim 4: $A \subseteq D_E$ for every $E \in \ell(A)$.
Proof of Claim 4:  Let $E \in \ell(A)$ and $F \in A$ then $F \cap E \in \ell(A)$ (Claim 3). This implies $F \in D_E$.

Claim 5: $\ell(A)$ is closed under intersection.
Proof of Claim 5: For $E \in \ell(A)$ we have $D_E$ as a λ-system (Claim 1) containing $A$ (Claim 4). This implies $\ell(A) \subseteq D_E$ for every $E \in \ell(A)$. Now let $F \in \ell(A)$. Then $F \in D_E$ and therefore, by definition of $D_E$ we have $F \cap E \in \ell(A)$. Proof is completed by observing that $E, F$ are arbitrary members of $\ell(A)$.

At this point, take a moment to observe the interesting pattern in Claim 2-5.

Wrapping Up: Claim 5 implies $\ell(A)$ is both a π-system and a λ-system. From Lemma 4, it is therefore a σ-algebra. Further, $A \subseteq \ell(A)$, hence we must have $\sigma(A) \subseteq \ell(A)$: otherwise $\ell(A) \cap \sigma(A)$ would be a σ-algebra (Lemma 3) containing $A$ that is smaller than $\sigma(A)$, a contradiction. However, we also showed earlier that $\ell(A) \subseteq \sigma(A)$. This implies $\sigma(A) = \ell(A)$.

At the beginning of the proof, we showed $A \subseteq \ell(A)$ and $\ell(A) \subseteq B$. Therefore, we have $A \subseteq \ell(A) = \sigma(A) \subseteq B$. Hence, proved.
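For a finite universe the theorem can also be verified by brute force. The following sketch of mine (the helper names are invented) computes both $\sigma(A)$ and the smallest λ-system $\ell(A)$ as closures for a small π-system and confirms that they coincide; on a finite set the increasing-union condition is automatic, so $\ell(A)$ is just the closure of $A \cup \{X\}$ under nested differences.

```python
X = frozenset(range(4))

def close(seed, step):
    """Iterate a closure step until a fixed point is reached."""
    cur = set(seed)
    while True:
        new = step(cur) - cur
        if not new:
            return cur
        cur |= new

def sigma_step(S):
    """Complements and pairwise unions (enough on a finite X)."""
    return {X - a for a in S} | {a | b for a in S for b in S}

def lambda_step(S):
    """X and nested differences; increasing unions are automatic here."""
    return {X} | {b - a for a in S for b in S if a <= b}

A = {frozenset({0}), frozenset({0, 1})}    # a pi-system: {0} ∩ {0, 1} = {0}
sigma_A = close(A | {X}, sigma_step)
ell_A = close(A | {X}, lambda_step)
assert ell_A == sigma_A                    # Dynkin: smallest lambda-system equals sigma(A)
print(len(sigma_A))                        # 8 sets: all unions of atoms {0}, {1}, {2, 3}
```

Replacing $A$ with a collection that is not a π-system (e.g., $\{\{0\}, \{1, 2\}\}$ on a larger ground set) can break the equality, which is exactly the role the π-system hypothesis plays in the theorem.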

If you are interested in learning more about probability theory, John Pike’s lecture notes are a useful resource.

## Mathursday: Dynkin’s π-λ Theorem and CDF (Part 1)

I am starting a new series of posts called “Mathursday” (portmanteau: Maths + Thursday) containing short articles on important mathematical results. Today’s article is about Dynkin’s π-λ Theorem and using it to prove that a cumulative distribution function (CDF) uniquely determines the probability distribution. This is one of those results that is quite intuitive and well known but requires a bit of analysis to formally prove, or even to state. There are several such results in statistics that are proved with similar techniques. Generally, the treatment of these proofs is completely omitted in statistics books (like Wasserman’s). The alternative is to meander your way through a mathematical course on probability theory to extract the result. In this series of posts, I hope to give a direct, self-contained mathematical treatment.

The proof uses basic results and definitions from two important areas of mathematics: topology and measure theory. This series is divided into two posts. The first post (this one) covers the basics of these fields. The aim here is not to reproduce the complete treatment found in mathematics books, but to give a concise, concentrated treatment of the basic topics. You can skip to the next post if you already know these basics.

## Set Theory

A set $X$ is a collection of distinct members. It can be finite (e.g., the natural numbers less than 20), countable (meaning its members can be enumerated, e.g., the set of odd natural numbers) or uncountable (e.g., the real numbers). A subset $A$ of $X$ (denoted $A \subseteq X$) is a set containing only members of $X$. The power set of a set $X$ (denoted $2^X$) is the set of all subsets of $X$. The set containing no members is called the empty set (denoted $\emptyset$). The cardinality of a set $S$ (denoted $|S|$) is the number of members in the set.

For two sets $A$ and $B$:  their union (denoted $A \cup B$) is a set containing all members of $A$ and $B$; their intersection (denoted $A \cap B$) is a set containing members common to $A$ and $B$; the set subtraction $A-B$ is a set obtained by removing all members of $B$ present in $A$ while retaining other members. If $A \subseteq X$ is a subset of $X$ then the complement set of $A$ is defined as $A^c = X - A$. Observe that for any two sets $A, B$ we have $A-B = A \cap B^c$. There is a useful identity called De Morgan’s law which states, for any two sets $A, B$: $(A\cap B)^c = A^c \cup B^c$ and $(A\cup B)^c = A^c \cap B^c$.
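These identities are easy to check mechanically with Python’s built-in set type; the small illustration below is my own, not part of the original text:

```python
X = set(range(10))           # a small universe
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(s):
    return X - s

# De Morgan's laws
assert complement(A & B) == complement(A) | complement(B)
assert complement(A | B) == complement(A) & complement(B)
# Set subtraction as intersection with the complement: A - B = A ∩ B^c
assert A - B == A & complement(B)
print("all identities hold")
```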

While the above may all look easy and simple, developing an axiomatic version of set theory has been quite challenging. Naive set theory suffered from Russell’s paradox, while modern treatments like ZFC had their own controversies. For our current purpose, we won’t go into these details.

## General Topology

Definition (Topology): $(X, \tau)$ is called a topological space if $X$ is a set and $\tau \subseteq 2^X$ is a collection of subsets of $X$, called open sets, that satisfies the following properties:
1) $\emptyset, X \in \tau$
2) $\tau$ is closed under arbitrary union, i.e., if $A_i \in \tau\,\, \forall i \in I$ for any index set $I$ then $\cup_{i \in I} A_i \in \tau$.
3) $\tau$ is closed under finite intersection, i.e., if $A_i \in \tau\,\, \forall i \in \{1, \cdots, k\}$ for $k < \infty$ then $\cap_{i=1}^k A_i \in \tau$.
If $\tau$ satisfies these properties then it is called a topology on $X$.

There is a reason why we call the members of a topology open sets, and it will become clear when we consider the topology induced by a metric space.

Example 1: For any set $X$, the power set $2^X$ and $\{\emptyset, X\}$ are topologies on $X$. Let $X = \{1, 2, 3\}$ then $\tau = \{ \emptyset, \{1\}, \{2, 3\}, \{1, 2, 3\}\}$ is a topology on $X$.
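On a finite set the axioms can be verified exhaustively. The sketch below is my own (the helper `is_topology` is a made-up name); it checks the collection from Example 1 and a collection that fails to be a topology:

```python
from itertools import combinations

def is_topology(X, tau):
    """Axioms on a finite space: contains the empty set and X, and is
    closed under union and intersection (finite, hence arbitrary, here)."""
    if frozenset() not in tau or X not in tau:
        return False
    return all(a | b in tau and a & b in tau for a, b in combinations(tau, 2))

X = frozenset({1, 2, 3})
tau = {frozenset(), frozenset({1}), frozenset({2, 3}), X}
bad = {frozenset(), frozenset({1}), frozenset({2}), X}   # {1} ∪ {2} is missing

print(is_topology(X, tau), is_topology(X, bad))
```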

Definition (Metric Space): $(X, d)$ is a metric space if $X$ is a set and $d: X \times X \rightarrow \mathbb{R}$ is a function satisfying the following properties for all $x, y, z \in X$:
1) $d(x, y) = 0 \Longleftrightarrow x=y$ (identity of indiscernibles)
2) $d(x, y) = d(y, x)$ (symmetry)
3) $d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality)
If $d$ satisfies the above properties then it is called a metric on $X$.

Observe that setting $y=x$ in the triangle inequality gives us $d(x, x) \le d(x, z)+ d(z, x)$. Using symmetry we get $d(x, x) \le 2d(x, z)$. As $d(x, x)=0$ we get $d(x, z) \ge 0$. Hence, $d$ is a non-negative valued function.

Example 2: A metric is supposed to abstract commonly used distance functions. For example, the Minkowski distance $d(x, y) = \| x - y\|_p = \left( \sum_{i=1}^n |x_i-y_i|^p \right)^{\frac{1}{p}}$ for $p \in [1, \infty]$ is a metric. It is straightforward to verify that $\| x - y\|_p$ satisfies the first two properties. Proving the triangle inequality is more challenging and uses Minkowski’s inequality.
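As a quick numeric sanity check (a sketch of mine, not a proof), we can compute $\|x-y\|_p$ and verify symmetry and the triangle inequality for a few values of $p \ge 1$:

```python
def minkowski(x, y, p):
    """The l_p distance ||x - y||_p, a metric for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y, z = (0.0, 0.0), (3.0, 4.0), (1.0, 1.0)
for p in (1, 2, 3):
    assert abs(minkowski(x, y, p) - minkowski(y, x, p)) < 1e-12   # symmetry
    assert minkowski(x, y, p) <= minkowski(x, z, p) + minkowski(z, y, p)
print(minkowski(x, y, 2))   # 5.0, the classic 3-4-5 right triangle
```

For $p \in (0, 1)$ the triangle inequality fails in general, which is why the metric property requires $p \ge 1$.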

One can define various interesting geometric subsets using the metric. For example, we can define the open ball of radius $\epsilon$ centered at $x$ as $B_\epsilon(x) = \left\{ y \in X \mid d(x, y) < \epsilon \right\}$.

Metric spaces $(X, d)$ provide a way to induce a useful topology on $X$. We define the topology induced by metric space as follows:

$\tau = \left\{ S \subseteq X \,\,|\,\, \forall x \in S, \exists\,\, \epsilon > 0\,\, \mbox{ such that } \,\, B_\epsilon(x) \subseteq S \right\}$

It is straightforward to verify that $\tau$ is a topology. It is interesting to note that if $X = \mathbb{R}$ and $d(x, y) = |x-y|$, then open intervals $(a, b)$ (open in the geometric sense) are members of $\tau$. This means our standard geometric notion of open set coincides with open set in topology. Unless mentioned otherwise, we will assume the topology defined on real number space $\mathbb{R}^n$ to be the topology defined above using Euclidean distance as our choice of metric.

Lemma 1: Every open set in $\mathbb{R}$ is a countable union of disjoint open intervals.
Proof: Let $S$ be an open set in $\mathbb{R}$. Define an equivalence relation $x \sim y$ if $[\min(x, y), \max(x, y)] \subseteq S$. It is easy to see that each equivalence class is an open interval (possibly unbounded). Further, these equivalence classes are disjoint, else they could be merged together. Finally, to each equivalence class we can associate a rational number contained in it. The proof is completed by observing that the set of rational numbers is countable.

## Measure Theory

Measure theory is the field of mathematics concerned with assigning a notion of size to subsets. It provides the theoretical basis for probability theory, for defining Lebesgue integration, and for studying dynamical systems. Below we cover the main concepts of measure theory that will be useful in our proof.

Definition (Algebra of Subsets): For a set $X$, a nonempty collection of subsets $C \subseteq 2^X$ is an algebra of subsets if:
1. $A \in C \rightarrow A^c \in C$ (closed under complement).
2. $A, B \in C \rightarrow A \cap B \in C$ (closed under finite intersection).

Lemma 2: If $C$ is an algebra on $X$ then $\emptyset, X \in C$ and for all $A, B \in C$ we have $A \cup B \in C$.
Proof: As $C$ is nonempty, let $A \in C$. Then $A^c \in C$ and $\emptyset = A \cap A^c \in C$. From De Morgan’s law, $A \cup B = (A^c \cap B^c)^c$, which belongs to $C$ as the algebra is closed under complement and finite intersection. Lastly, $X = A \cup A^c \in C$.

Definition (σ-algebra): For a set $X$, a σ-algebra is an algebra on $X$ that is closed under countable union i.e., if $\{A_i\}_{i\in \mathbb{N}}$ is a countable collection of sets in the σ–algebra then $\cup_{i\in \mathbb{N}} A_i \in$ σ-algebra.

Observe that the second condition of an algebra of subsets only implies closure under finite union, not under countable union. For example, consider the collection $S$ of subsets of $\mathbb{R}$ that are finite or have finite complement. Then $S$ is an algebra and closed under finite union: the union of any two members still yields a finite set (or a set with finite complement). But a countable union of members of $S$ can give the set of natural numbers, which is neither finite nor has a finite complement. Therefore, the countable union condition makes a σ-algebra a strict generalization of an algebra of subsets.

Using De Morgan’s law we can show that any σ-algebra is also closed under countable intersection: let $\{A_i\}_{i=1}^\infty$ be a countable collection of sets in the σ-algebra. Then $\cap_{i=1}^\infty A_i = \left(\cup_{i=1}^\infty A_i^c\right)^c$. As a σ-algebra is closed under complement and countable union, $\cap_{i=1}^\infty A_i$ belongs to the σ-algebra.

Lemma 3: An arbitrary intersection of σ-algebras is also a σ-algebra.
Proof: Trivial to verify.

The intersection property of σ-algebras allows us to define a useful notion: given a collection of subsets $Y \subseteq 2^X$, we define $\sigma(Y)$ as the smallest σ-algebra that contains $Y$. This can be formally defined as: $\sigma(Y) = \cap \left\{ \Sigma \mid \Sigma \right.$ is a σ-algebra and $\left. Y \subseteq \Sigma \right\}$. This definition is well formed as the power set is always a σ-algebra containing any collection of subsets.
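On a finite set, $\sigma(Y)$ can be computed explicitly: the atoms of $\sigma(Y)$ are the equivalence classes of points that belong to exactly the same members of $Y$, and $\sigma(Y)$ consists of all unions of atoms. A small sketch of mine (the helper name is made up):

```python
from itertools import chain, combinations

def generated_sigma_algebra(X, Y):
    """sigma(Y) on a finite X: group points by which members of Y they
    belong to (the atoms), then take all unions of atoms."""
    atoms = {}
    for x in X:
        signature = frozenset(i for i, a in enumerate(Y) if x in a)
        atoms.setdefault(signature, set()).add(x)
    atoms = [frozenset(a) for a in atoms.values()]
    return {frozenset(chain.from_iterable(combo))
            for r in range(len(atoms) + 1)
            for combo in combinations(atoms, r)}

X = frozenset(range(4))
Y = [frozenset({0}), frozenset({0, 1})]
S = generated_sigma_algebra(X, Y)
print(len(S))   # atoms {0}, {1}, {2, 3}  ->  2**3 = 8 members
```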

σ-algebra and Topology: One cannot fail to observe a remarkable similarity between the definitions of a σ-algebra and a topology on a set $X$: both contain the empty set and $X$, and both are closed under countable union. However, a topology need not be closed under complement and is only required to be closed under finite intersection. This difference stems from their different purposes. For example, assigning a measure to a set intuitively suggests being able to assign a measure to its complement.

In general, measure theory and topology do not always sync. One has to only look at the different topics in the two fields. However, there is an important interplay between topology and measure theory that allows us to talk about measure on top of open sets. Borel set captures this idea:

Definition (Borel Set): Given a topological space $(X, C)$, the Borel σ-algebra $\mathbb{B}$ is the σ-algebra generated by the topology: $\mathbb{B} = \sigma(C)$. A member of $\mathbb{B}$ is called a Borel set.

Lemma 4: For the standard topology on $\mathbb{R}$, the Borel σ-algebra $\mathbb{B}(\mathbb{R})$ on $\mathbb{R}$ is given by any of the following:

1) $\mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ (a, b) \mid a, b \in \mathbb{R} \right\}\right)$
2) $\mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ (-\infty, a] \mid a \in \mathbb{R} \right\}\right)$
3) $\mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ (-\infty, a) \mid a \in \mathbb{R} \right\}\right)$
4) $\mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ [b, \infty) \mid b \in \mathbb{R} \right\}\right)$
5) $\mathbb{B}(\mathbb{R}) = \sigma\left(\left\{ (b, \infty) \mid b \in \mathbb{R} \right\}\right)$

Proof: We prove the first two cases. The proofs for the remaining cases are similar.

1. From Lemma 1, any open set in $\mathbb{R}$ is a countable union of disjoint open intervals. As a σ-algebra is closed under countable union, $\mathbb{B}(\mathbb{R})=\sigma\left(\left\{ (a, b) \mid a, b \in \mathbb{R} \right\}\right)$.

2. Let $A = \sigma\left(\left\{ (-\infty, a] \mid a \in \mathbb{R} \right\}\right)$. As $A$ is closed under complement therefore, we can also express: $A = \sigma\left(\left\{ (b, \infty) \mid b \in \mathbb{R} \right\}\right)$. However, $(b, \infty)$ is an open set and $A$ is a σ-algebra generated by a collection of open sets. Therefore, $A \subseteq \mathbb{B}(\mathbb{R})$.

Let $a < b$; then $(-\infty, b], (a, \infty) \in A$. As a σ-algebra is closed under intersection, $(a, b] = (-\infty, b] \cap (a, \infty) \in A$. Any open interval can now be generated from a countable union: $(a, b) = \cup_{n=1}^\infty \left(a, b - \frac{1}{n}\right] \in A$. As $A$ contains all open intervals, it contains all members of $\mathbb{B}(\mathbb{R})$ (using our proof of case 1). This implies $\mathbb{B}(\mathbb{R}) \subseteq A$. Hence, $\mathbb{B}(\mathbb{R}) = A$.

Definition (Measurable Space): A measurable space is a tuple $(X, \Sigma)$ where $X$ is a set and $\Sigma$ is a σ-algebra on $X$.

Example 3: $(\mathbb{R}, \mathbb{B})$ is a measurable space where the real number line is the set and set of all Borel sets is our σ-algebra.

The next notion is useful in defining the notion of random variables in probability theory.

Definition (Measurable Function): A function $f: (X_1, \Sigma_1) \rightarrow (X_2, \Sigma_2)$ is called measurable if $(X_1, \Sigma_1), (X_2, \Sigma_2)$ are measurable spaces, and $\forall E \in \Sigma_2, f^{-1}(E) \in \Sigma_1$.

In the above definition: $f^{-1}(E) = \left\{ x \in X_1 \mid f(x) \in E \right\}$. The job of measurable functions is to ensure that we stick to members in σ-algebras when doing calculations.

Definition (Measure): Let $X$ be a set and $\Sigma$ be a σ-algebra on $X$. A measure $\mu: \Sigma \rightarrow [0, \infty]$ is a function satisfying the following properties:
1. $\mu(\emptyset) = 0$ (null empty set)
2. If $\{E_i\}_{i=1}^\infty$ is a countable collection of pairwise disjoint members of $\Sigma$ then $\mu(\cup_{i=1}^\infty E_i) = \sum_{i=1}^\infty \mu(E_i)$ (σ-additivity).

The σ-additivity condition also holds for finite collections: let $\{E_1, \cdots, E_k\}$ be a set of pairwise disjoint members of $\Sigma$, and define $E_j = \emptyset$ for all $j > k$. Then $\{E_i\}_{i=1}^\infty$ is a countable collection of pairwise disjoint members (yes, $\emptyset$ and $\emptyset$ are disjoint). Therefore, $\mu(\cup_{i=1}^k E_i) = \mu(\cup_{i=1}^\infty E_i) = \sum_{i=1}^\infty \mu(E_i) = \sum_{i=1}^k \mu(E_i)$, where the last equality uses $\mu(\emptyset) = 0$.

A measure is called finite if $\forall E \in \Sigma, \mu(E) < \infty$. A member $E \in \Sigma$ with $\mu(E) = 0$ is called a null set. Intuitively, null sets are the negligibly small sets in the σ-algebra.

Example 4: If $X$ is finite and the σ-algebra is the power set then $\mu(E) = |E|$ is a measure. This measure is called the counting measure.
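The counting measure makes (finite) additivity easy to verify directly; this is a trivial check of mine:

```python
def counting_measure(E):
    """mu(E) = |E| on the power set of a finite X."""
    return len(E)

E1, E2, E3 = {1, 2}, {3}, {4, 5, 6}                  # pairwise disjoint
assert counting_measure(E1 | E2 | E3) == \
       counting_measure(E1) + counting_measure(E2) + counting_measure(E3)
assert counting_measure(set()) == 0                  # null empty set
print(counting_measure(E1 | E2 | E3))                # 6
```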

Measures are intuitively supposed to generalize the common notions of length, area, and volume. In general, however, constructing a useful measure such as the Lebesgue measure is quite challenging. See Hunter’s lecture notes on measure theory for a full treatment.

Definition (Probability Measure): Let $X$ be a set and $\Sigma$ be a σ-algebra on $X$. A probability measure $P$ is a measure with the unitarity property: $P(X) = 1$.

When talking about probability measures, we understand $X$ as the set of elementary outcomes and $\Sigma$ as a collection of events. For example, say we flip a coin two times then an elementary outcome is $[H, H]$ indicating two heads, and an event would be outcomes with at least one heads.

The need for σ-algebra: The definition of a σ-algebra looks opaque. Why can we not simply use the power set as the domain of our measure? The answer has an interesting history, arising from the discovery of non-measurable sets at the beginning of the 20th century. Mathematicians found that if we allow all possible subsets of $\mathbb{R}$ in the domain of a measure, then no function exists that satisfies the above properties and generalizes the notion of the length of an interval, i.e., $\mu([a, b]) = b-a$! Interesting non-measurable sets like the Vitali set were discovered. The concept of a σ-algebra was therefore introduced to narrow down the choice of subsets while maintaining useful properties. For example, if we talk about the probability of an event $E$ then we may also want to consider the probability of it not taking place, i.e., $E^c$; hence our σ-algebra must contain $E^c$.

Definition (Measure Space and Probability Space): A measure space is a tuple $(X, \Sigma, \mu)$ where $(X, \Sigma)$ is a measurable space and $\mu$ is a measure on $\Sigma$. A measure space is called a probability space if $\mu$ is a probability measure i.e., $\mu(X) = 1$.

Definition (Random Variables): A real-valued random variable $g: (X, \Sigma, P) \rightarrow (\mathbb{R}, \mathbb{B})$ is a measurable function from a probability space  $(X, \Sigma, P)$ to the measurable space $(\mathbb{R}, \mathbb{B})$.

When considering the measurability of the random variable we ignore the probability measure $P$ and consider $(X, \Sigma)$ as our input measurable space. Further, even though we consider Borel σ-algebra  as our choice of σ-algebra  on $\mathbb{R}$ we can use any other σ-algebra.

Cumulative Distribution Function: Given a random variable $g: (X, \Sigma, P) \rightarrow (\mathbb{R}, \mathbb{B})$, the cumulative distribution function (CDF) is the function $G: \mathbb{R} \rightarrow [0, 1]$ defined as $G(z) = P(g^{-1}((-\infty, z]))$.

This definition is a bit involved. First, observe that the closed ray $(-\infty, z] \in \mathbb{B}$ for any value of $z \in \mathbb{R}$. Therefore, as a random variable is a measurable function, we have $g^{-1}((-\infty, z]) \in \Sigma$. As $P$ is a probability measure on $\Sigma$, $P(g^{-1}((-\infty, z])) \in [0, 1]$.

It is called a CDF because we can think of it as the total accumulated probability measure assigned to the outcomes whose value is less than or equal to $z$, as we vary $z$ from $-\infty$ to $+\infty$. It is straightforward to verify that a CDF is a non-decreasing function. Further, it can be shown that $\lim_{z \rightarrow -\infty} G(z) = 0$ and $\lim_{z \rightarrow \infty} G(z) = 1$.
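A tiny discrete example of mine makes the definition concrete: take a fair six-sided die as the probability space and the identity map as the random variable, then tabulate $G(z) = P(g^{-1}((-\infty, z]))$ at integer points:

```python
from fractions import Fraction

# Probability space: X = {1, ..., 6} with uniform P; random variable g(x) = x.
P = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(z):
    """G(z) = P({x : g(x) <= z}) with g the identity map."""
    return sum((p for x, p in P.items() if x <= z), Fraction(0))

assert cdf(0) == 0 and cdf(6) == 1          # the limiting values are attained here
assert cdf(3) == Fraction(1, 2)
values = [cdf(z) for z in range(0, 8)]
assert values == sorted(values)             # non-decreasing
print(values[-1])                           # 1
```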

We are now ready to state the main theorem that we prove in the next post.

Theorem: Let $f: (X_1, \Sigma_1, P_1) \rightarrow (\mathbb{R}, \mathbb{B})$ and $g: (X_2, \Sigma_2, P_2) \rightarrow (\mathbb{R}, \mathbb{B})$ be two random variables with CDFs $F, G$ respectively. If $F(x) = G(x), \,\, \forall x \in \mathbb{R}$ then $P_1(f^{-1}(A))=P_2(g^{-1}(A)), \,\, \forall A \in \mathbb{B}(\mathbb{R})$.

We do not even require the two random variables to have the same input space or σ-algebra. The theorem states that if two real-valued random variables have the same CDF then they induce the same distribution over the real number line.