Machine Learning

A sane introduction to maximum likelihood estimation (MLE) and maximum a posteriori (MAP)

It is frustrating to learn about principles such as maximum likelihood estimation (MLE), maximum a posteriori (MAP) and Bayesian inference in general. The main reason behind this difficulty, in my opinion, is that many tutorials assume previous knowledge, use implicit or inconsistent notation, or even address a completely different concept, thus overloading these principles.

These issues make it very confusing for newcomers to understand the concepts, and I’m often confronted by people who were unfortunately misled by such tutorials. For that reason, I decided to write a sane introduction to these concepts, elaborating more on their relationships and hidden interactions while trying to explain every step of the formulations. I hope to bring something new to help people understand these principles.

Maximum Likelihood Estimation

Maximum likelihood estimation is a method, or principle, used to estimate the parameters of a model given observations. It is abbreviated as MLE and is also known as the method of maximum likelihood. As the name suggests, this principle works by maximizing the likelihood; therefore, the key to understanding maximum likelihood estimation is to first understand what a likelihood is and why someone would want to maximize it in order to estimate model parameters.

Let’s start with the definition of the likelihood function for the continuous case:

$$\mathcal{L}(\theta | x) = p_{\theta}(x)$$

The left term means “the likelihood of the parameters \(\theta\), given data \(x\)”. Now, what does that mean? It means that, in the continuous case, the likelihood of the model \(p_{\theta}(x)\) with the parametrization \(\theta\) and data \(x\) is the probability density function (pdf) of the model with that particular parametrization.

Although this is the most used likelihood representation, you should pay attention that the notation \(\mathcal{L}(\cdot | \cdot)\) in this case doesn’t mean the same as the conditional notation, so be careful with this overload, because it is always implicitly stated and it is also often a source of confusion. Another representation of the likelihood that is often used is \(\mathcal{L}(x; \theta)\), which is better in the sense that it makes it clear that it’s not a conditional, however, it makes it look like the likelihood is a function of the data and not of the parameters.

The model \(p_{\theta}(x)\) can be any distribution, and to make things concrete, let’s assume that the data generating distribution is a univariate Gaussian distribution, which we define below:

$$
\begin{align}
x & \sim \mathcal{N}(\mu, \sigma^2) \\
p(x; \mu, \sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}} \exp{ \bigg[-\frac{1}{2}\bigg( \frac{x-\mu}{\sigma}\bigg)^2 \bigg] }
\end{align}
$$

If you plot this probability density function with different parametrizations, you’ll get something like the plots below, where the red distribution is the standard Gaussian \(\mathcal{N}(0, 1)\):

A selection of Normal Distribution Probability Density Functions (PDFs). Both the mean, μ, and variance, σ², are varied. The key is given on the graph. Source: Wikimedia Commons.

As you can see in the probability density function (pdf) plot above, the density of \(x\) at various realizations is shown on the y-axis. Another source of confusion here is that people usually take these values as probabilities, because they usually see plots of normals like these where the curve is always below 1; however, the probability density function doesn’t give you probabilities but densities. The constraint on the pdf is that it must integrate to one:

$$\int_{-\infty}^{+\infty} f(x)dx = 1$$

So, it is completely normal to have densities larger than 1 at many points for many different distributions. Take for example the pdf of the Beta distribution below:

Probability density function for the Beta distribution. Source: Wikimedia Commons.

As you can see, the pdf shows densities above one for many parametrizations of the distribution, while still integrating to 1 and respecting the second axiom of probability: the unit measure.
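To make this concrete, here is a minimal numerical check, a sketch assuming numpy and scipy are available: the Beta(2, 5) pdf rises well above 1 and still integrates to 1.

import numpy as np
from scipy import stats
from scipy.integrate import quad

dist = stats.beta(2, 5)
xs = np.linspace(0.01, 0.99, 99)
print(dist.pdf(xs).max())       # ~2.46: a density above 1, not a probability
area, _ = quad(dist.pdf, 0, 1)
print(area)                     # ~1.0: the unit measure axiom still holds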

So, returning to our original principle of maximum likelihood estimation, what we want is to maximize the likelihood \(\mathcal{L}(\theta | x)\) for our observed data. In practical terms, this means that we want to find the parameters \(\theta\) of our model for which the likelihood that this model generated our data is maximized; in other words, we want to find the parameters of this model that are most plausible to have generated the observed data, or that make this sample most probable.

For the case of our univariate Gaussian model, what we want is to find the parameters \(\mu\) and \(\sigma^2\), which for convenient notation we collapse into a single parameter vector:

$$\theta = \begin{bmatrix}\mu \\  \sigma^2\end{bmatrix}$$

because these are the statistics that completely define our univariate Gaussian model. So let’s formulate the problem of the maximum likelihood estimation:

$$
\begin{align}
\hat{\theta} &= \mathrm{arg}\max_\theta \mathcal{L}(\theta | x) \\
&= \mathrm{arg}\max_\theta p_{\theta}(x)
\end{align}
$$

This says that we want to obtain the maximum likelihood estimate \(\hat{\theta}\) that makes \(p_{\theta}(x)\) approximate the underlying “true” distribution \(p_{\theta^*}(x)\) by maximizing the likelihood of the parameters \(\theta\) given data \(x\). You shouldn’t confuse the maximum likelihood estimate \(\hat{\theta}(x)\), which is a realization of the maximum likelihood estimator for the data \(x\), with the maximum likelihood estimator \(\hat{\theta}\), so pay attention to disambiguate them in your head.

However, we need to incorporate multiple observations in this formulation, and by adding multiple observations we end up with a complex joint distribution:

$$\hat{\theta} = \mathrm{arg}\max_\theta p_{\theta}(x_1, x_2, \ldots, x_n)$$

that needs to take into account the interactions between all observations. And here is where we make a strong assumption: we state that the observations are independent. For independent random variables, the following holds:

$$p_{\theta}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p_{\theta}(x_i)$$

Which means that since \(x_1, x_2, \ldots, x_n\) don’t contain information about each other, we can write the joint probability as a product of their marginals.

Another assumption that is made is that these random variables are identically distributed, which means that they came from the same generating distribution, allowing us to model them with the same distribution parametrization.

Given these two assumptions, which together are known as IID (independent and identically distributed), we can formulate our maximum likelihood estimation problem as:

$$\hat{\theta} = \mathrm{arg}\max_\theta \prod_{i=1}^{n} p_{\theta}(x_i)$$

Note that MLE doesn’t require you to make these assumptions; however, many problems will appear if you don’t, such as different distributions for each sample or having to deal with a complex joint probability.

Given that the densities we multiply can be very small, the product above can end up with vanishingly small values that underflow numerically. Here is where the logarithm makes its way into the likelihood. The log function is strictly monotonically increasing, which preserves the location of the extrema, and it has a very nice property:

$$\log ab = \log a  + \log b $$

where the logarithm of a product is the sum of the logarithms, which is very convenient for us, so we’ll apply the logarithm to the likelihood and maximize what is called the log-likelihood:

$$
\begin{align}
\hat{\theta} &= \mathrm{arg}\max_\theta \prod_{i=1}^{n} p_{\theta}(x_i) \\
&= \mathrm{arg}\max_\theta \sum_{i=1}^{n} \log p_{\theta}(x_i) \\
\end{align}
$$

As you can see, we went from a product to a summation, which is much more convenient. Another reason for applying the logarithm is that we often take the derivative and solve for the parameters, and it is much easier to work with a summation than with a product.
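To make this concrete, below is a minimal sketch of the Gaussian MLE under the IID assumption (the true parameters and sample size are arbitrary choices for illustration). The closed-form maximizer is the sample mean and the biased sample variance, and nudging either parameter away from it can only decrease the log-likelihood sum:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # data from the "true" model

def log_likelihood(x, mu, sigma2):
    # sum_i log p(x_i; mu, sigma^2) for the univariate Gaussian
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (x - mu) ** 2 / (2 * sigma2))

mu_hat, sigma2_hat = x.mean(), x.var()            # closed-form MLE
print(mu_hat, sigma2_hat)                         # ~2.0 and ~2.25
print(log_likelihood(x, mu_hat, sigma2_hat)
      >= log_likelihood(x, mu_hat + 0.1, sigma2_hat))   # True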

We can also conveniently average the log-likelihood (given that we’re just multiplying by a constant):

$$
\begin{align}
\hat{\theta} &= \mathrm{arg}\max_\theta \sum_{i=1}^{n} \log p_{\theta}(x_i) \\
&= \mathrm{arg}\max_\theta \frac{1}{n} \sum_{i=1}^{n} \log p_{\theta}(x_i) \\
\end{align}
$$

This is also convenient because it removes the dependency on the number of observations. We also know, through the law of large numbers, that the following holds as \(n\to\infty\):

$$
\frac{1}{n} \sum_{i=1}^{n} \log \, p_{\theta}(x_i) \approx \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]
$$

As you can see, we’re approximating the expectation with the empirical expectation defined by our dataset \(\{x_i\}_{i=1}^{n}\). This is an important point and it is usually implicitly assumed.

The weak law of large numbers can be proved using a Chebyshev bound, and if you are interested in concentration inequalities, I’ve written an article about them here where I discuss the Chebyshev bound.

To finish our formulation, given that we usually minimize objectives, we can formulate the same maximum likelihood estimation as the minimization of the negative of the log-likelihood:

$$
\hat{\theta} = \mathrm{arg}\min_\theta -\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]
$$

which is exactly the same thing: the negation just turns the maximization problem into a minimization problem.
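The same estimate can also be recovered by minimizing this average negative log-likelihood numerically, which is how the objective is typically handled in practice. A small sketch, assuming scipy is available:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=5_000)

def nll(params):
    mu, log_sigma = params              # optimize log(sigma) so sigma stays positive
    sigma2 = np.exp(2 * log_sigma)
    return np.mean(0.5 * np.log(2 * np.pi * sigma2)
                   + (x - mu) ** 2 / (2 * sigma2))

res = minimize(nll, x0=np.array([0.0, 0.0]))
print(res.x[0], np.exp(res.x[1]))       # ~2.0 and ~1.5, matching the closed form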

The relation of maximum likelihood estimation with the Kullback–Leibler divergence from information theory

It is well known that maximizing the likelihood is the same as minimizing the Kullback-Leibler divergence, also known as the KL divergence. This is very interesting because it links a measure from information theory with the maximum likelihood principle.

The KL divergence is defined as:

$$
\begin{equation}
D_{KL}( p || q)=\int p(x)\log\frac{p(x)}{q(x)} \ dx
\end{equation}
$$

There are many intuitions for understanding the KL divergence; I personally like the perspective based on likelihood ratios. However, there is plenty of material about it that you can easily find, and it’s out of the scope of this introduction.

The KL divergence is basically the expectation of the log-likelihood ratio under the \(p(x)\) distribution. What we’re doing below is just rephrasing it by using some identities and properties of the expectation:

$$
\begin{align}
D_{KL}[p_{\theta^*}(x) \, \Vert \, p_\theta(x)] &= \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \frac{p_{\theta^*}(x)}{p_\theta(x)} \right] \\
\label{eq:logquotient}
&= \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta^*}(x) - \log \, p_\theta(x) \right] \\
\label{eq:linearization}
&= \underbrace{\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta^*}(x) \right]}_{\text{Negative entropy of } p_{\theta^*}(x)} - \underbrace{\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]}_{\text{Expected log-likelihood}}
\end{align}
$$

In the formulation above, we first use the fact that the logarithm of a quotient is equal to the difference of the logs of the numerator and denominator (equation \(\ref{eq:logquotient}\)). After that, we use the linearity of the expectation (equation \(\ref{eq:linearization}\)), which tells us that \(\mathbb{E}\left[X + Y\right] = \mathbb{E}\left[X\right]+\mathbb{E}\left[Y\right]\). In the end, we are left with two terms: the first one is the negative of the entropy of \(p_{\theta^*}(x)\), and the one on the right you can recognize as the expected log-likelihood that we saw earlier.

If we want to minimize the KL divergence with respect to \(\theta\), we can ignore the first term, since it doesn’t depend on \(\theta\) in any way, and in the end we have exactly the same maximum likelihood formulation that we saw before:

$$
\begin{eqnarray}
\require{cancel}
\hat{\theta} &=& \mathrm{arg}\min_\theta \cancel{\mathbb{E}_{x \sim p_{\theta^*}(x)} \left[\log \, p_{\theta^*}(x) \right]} - \mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]\\
&=& \mathrm{arg}\min_\theta -\mathbb{E}_{x \sim p_{\theta^*}(x)}\left[\log \, p_{\theta}(x) \right]
\end{eqnarray}
$$
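A quick Monte Carlo sketch of this identity, using a unit-variance Gaussian and a grid of candidate means (both arbitrary choices): the estimated KL divergence and the negative log-likelihood differ only by the constant entropy term, so they share the same minimizer.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=50_000)             # samples from p_{theta*}

def log_p(x, mu):
    # log density of N(mu, 1) evaluated at x
    return -0.5 * np.log(2 * np.pi) - (x - mu) ** 2 / 2

mus = np.linspace(0.0, 4.0, 81)
nll = np.array([-np.mean(log_p(x, mu)) for mu in mus])
entropy_term = np.mean(log_p(x, 2.0))             # E[log p*], constant in mu
kl = entropy_term + nll                           # KL(p* || p_mu), up to MC error
print(mus[nll.argmin()], mus[kl.argmin()])        # same argmin, ~2.0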

The conditional log-likelihood

A very common scenario in Machine Learning is supervised learning, where we have data points \(x_n\) and their labels \(y_n\) building up our dataset \( D = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \} \), and where we’re interested in estimating the conditional probability of \(\textbf{y}\) given \(\textbf{x}\), or more precisely \( P_{\theta}(Y \vert X) \).

To extend the maximum likelihood principle to the conditional case, we just have to write it as:

$$
\hat{\theta} = \mathrm{arg}\min_\theta -\mathbb{E}_{(x, y) \sim p_{\theta^*}(x, y)}\left[\log \, p_{\theta}(y | x) \right]
$$

And it can then be easily specialized to formulate linear regression:

$$
y \mid x \sim \mathcal{N}(x^T \theta, \sigma^2) \\
\log p_{\theta}(y | x) = -n \log \sigma - \frac{n}{2} \log{2\pi} - \sum_{i=1}^{n}{\frac{( x_i^T \theta - y_i )^2}{2\sigma^2}}
$$

In that case, you can see that we end up with a sum of squared errors whose optimum is at the same location as the optimum of the mean squared error (MSE). So you can see that minimizing the MSE is equivalent to maximizing the likelihood under a Gaussian noise model.
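Here is a minimal sketch of that equivalence on synthetic data (dimensions and noise level are arbitrary): the ordinary least-squares solution, which minimizes the MSE, coincides with the maximizer of the Gaussian conditional log-likelihood, given by the normal equations.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.3, size=500)

theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the sum of squared errors

# The Gaussian log-likelihood depends on theta only through the squared
# errors, so its gradient vanishes at the same point: the normal equations.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_ols, theta_mle))            # True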

Remarks on the maximum likelihood

The maximum likelihood estimation has very interesting properties, but it gives us only point estimates, which means that we cannot reason about the distribution of these estimates. In contrast, Bayesian inference can give us a full distribution over the parameters, and therefore allows us to reason about the posterior distribution.

I’ll write more about Bayesian inference and sampling methods, such as the ones from the Markov Chain Monte Carlo (MCMC) family, in another article; for now, I’ll continue by showing the relationship of the maximum likelihood estimator with the maximum a posteriori (MAP) estimator.

Maximum a posteriori

Although the maximum a posteriori, also known as MAP, also provides us with a point estimate, it is a Bayesian concept that incorporates a prior over the parameters. We’ll also see that the MAP has a strong connection with the regularized MLE estimation.

We know from the Bayes rule that we can get the posterior from the product of the likelihood and the prior, normalized by the evidence:

$$
\begin{align}
p(\theta \vert x) &= \frac{p_{\theta}(x) p(\theta)}{p(x)} \\
\label{eq:proport}
&\propto p_{\theta}(x) p(\theta)
\end{align}
$$

In equation \(\ref{eq:proport}\), since we’re concerned with optimization, we drop the normalizing evidence \(p(x)\) and keep a proportional posterior, which is very convenient because the marginalization involved in computing \(p(x)\) requires integration and is intractable in many cases.

$$
\begin{align}
\theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} p_{\theta}(x) \, p(\theta) \\
&= \mathop{\rm arg\,max}\limits_{\theta} \left[ \prod_{i=1}^{n} p_{\theta}(x_i) \right] p(\theta) \\
&= \mathop{\rm arg\,max}\limits_{\theta} \sum_{i=1}^{n} \underbrace{\log p_{\theta}(x_i)}_{\text{Log-likelihood}} + \underbrace{\log p(\theta)}_{\text{Log-prior}}
\end{align}
$$

In the formulation above, we just followed the same steps described earlier for the maximum likelihood estimator: we assume independence and an identical distributional setting, and then apply the logarithm to switch from a product to a summation. As you can see in the final formulation, this is equivalent to the maximum likelihood estimation with the addition of the log of the prior term.

We can also easily recover the exact maximum likelihood estimator by using a uniform prior \(\theta \sim \textbf{U}(\cdot, \cdot)\). This means that every possible value of \(\theta\) is equally weighted, so in log space the prior contributes just an additive constant:

$$
\begin{align}
\theta_{MAP} &= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log p_{\theta}(x_i) + \log p(\theta) \\
&= \mathop{\rm arg\,max}\limits_{\theta} \sum_i \log p_{\theta}(x_i) + \text{constant} \\
&= \underbrace{\mathop{\rm arg\,max}\limits_{\theta} \sum_i \log p_{\theta}(x_i)}_{\text{Equivalent to maximum likelihood estimation (MLE)}} \\
\end{align}
$$

And there you are: the MAP with a uniform prior is equivalent to MLE. It is also easy to show that a Gaussian prior recovers the L2-regularized MLE, which is quite interesting, given that it can provide insights and a new perspective on the regularization terms that we usually use.
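As a sketch of that Gaussian-prior claim for linear regression (all values below are illustrative): with a zero-mean Gaussian prior \(p(\theta) = \mathcal{N}(0, \tau^2 I)\) and Gaussian noise of variance \(\sigma^2\), maximizing the log-posterior is exactly ridge regression with \(\lambda = \sigma^2 / \tau^2\).

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

sigma2, tau2 = 0.25, 1.0          # noise variance and prior variance
lam = sigma2 / tau2               # the equivalent L2 regularization strength

# Closed-form ridge / MAP estimate: (X^T X + lambda I)^{-1} X^T y
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# The same point, found by directly minimizing the negative log-posterior
neg_log_post = lambda t: (np.sum((X @ t - y) ** 2) / (2 * sigma2)
                          + np.sum(t ** 2) / (2 * tau2))
print(np.allclose(theta_map, minimize(neg_log_post, np.zeros(5)).x, atol=1e-4))   # True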

I hope you liked this article ! The next one will be about Bayesian inference with posterior sampling, where we’ll show how we can reason about the posterior distribution and not only on point estimates as seen in MAP and MLE.

– Christian S. Perone

Cite this article as: Christian S. Perone, "A sane introduction to maximum likelihood estimation (MLE) and maximum a posteriori (MAP)," in Terra Incognita, 02/01/2019, https://blog.christianperone.com/2019/01/mle/.
Machine Learning

Introducing EuclidesDB – A machine learning feature database

Last week I released the first public version of EuclidesDB. EuclidesDB is a multi-model machine learning feature database that is tightly coupled with PyTorch and provides a backend for including and querying data on the model feature space.

For more information, see the GitHub repository or the documentation.

Some features of EuclidesDB are listed below:

  • Written in C++ for performance;
  • Uses protobuf for data serialization;
  • Uses gRPC for communication;
  • LevelDB integration for database serialization;
  • Many indexing methods implemented (Annoy, Faiss, etc.);
  • Tight PyTorch integration through libtorch;
  • Easy integration for new custom fine-tuned models;
  • Easy client language binding generation;
  • Free and open-source with permissive license;

And here is a diagram of the overall architecture:

Machine Learning, Python

Análise bayesiana dos microdados do ENEM do Rio Grande do Sul

* This post is in Portuguese. It’s a Bayesian analysis of a Brazilian national exam. The main focus of the analysis is to understand the underlying factors impacting the participants’ performance on the ENEM.

This tutorial presents a brief analysis of the ENEM microdata for Rio Grande do Sul for the year 2017. The main goal is to understand the factors that impact participants’ performance on the ENEM, given factors such as family income and school type. Two models are presented in this tutorial: linear regression and hierarchical linear regression.

Machine Learning

Benford’s law emerges from deep language model

I was experimenting with the digit distribution of a pre-trained Transformer language model (LM) (weights from the OpenAI repository), and I found a very interesting correlation between Benford’s law and the digit distribution of the language model after conditioning it with some particular phrases.

Below is the correlation between Benford’s law and the language model after conditioning on the phrase (shown in the figure):

Machine Learning, Python

PyTorch 1.0 tracing JIT and LibTorch C++ API to integrate PyTorch into NodeJS

Update 28 Feb 2019: I added a new blog post with a slide deck containing the presentation I did for PyData Montreal.

Today, at the PyTorch Developer Conference, the PyTorch team announced the plans and the release of the PyTorch 1.0 preview, with many nice features such as a JIT for model graphs (with and without tracing), as well as LibTorch, the PyTorch C++ API, in my opinion one of the most important release announcements made today.

Given the huge interest in understanding how this new API works, I decided to write this article showing an example of many opportunities that are now open after the release of the PyTorch C++ API. In this post, I’ll integrate PyTorch inference into native NodeJS using NodeJS C++ add-ons, just as an example of integration between different frameworks/languages that are now possible using the C++ API.

Below you can see the final result:

As you can see, the integration is seamless and I could use a traced ResNet as the computational graph model and feed any tensor to it to get the output predictions.

Introduction

Simply put, libtorch is a library version of PyTorch. It contains the underlying foundation that is used by PyTorch, such as ATen (the tensor library), which contains all the tensor operations and methods. Libtorch also contains the autograd, which is the component that adds automatic differentiation to the ATen tensors.

A word of caution for those who are starting now: be careful with the tensors that can be created both from ATen and autograd, and do not mix them. ATen will return plain tensors (when you create them using the at namespace), while the autograd functions (from the torch namespace) will return Variable, adding its automatic differentiation mechanism on top.

For a more extensive tutorial on how PyTorch internals work, please take a look on my previous tutorial on the PyTorch internal architecture.

Libtorch can be downloaded from the PyTorch website, and it is only available as a preview for a while. You can also find the documentation on this site, which is mostly Doxygen-rendered documentation. I found the library pretty stable, and it makes sense because it is actually exposing the stable foundations of PyTorch; however, there are some issues with headers and some minor problems concerning the library organization that you might find when starting to work with it (which will hopefully be fixed soon).

For NodeJS, I’ll use the Native Abstractions library (nan), which is the most recommended library (it is basically a header-only library) to create NodeJS C++ add-ons, and cmake-js, because libtorch already provides the cmake files that make our building process much easier. However, the focus here will be on the C++ code and not on the building process.

The flow for the development, tracing, serializing and loading the model can be seen in the figure on the left side.

It starts with the development process and tracing being done in PyTorch (Python domain) and then the loading and inference on the C++ domain (in our case in NodeJS add-on).

Wrapping the Tensor

In NodeJS, to create an object as a first-class citizen of the JavaScript world, you need to inherit from the ObjectWrap class, which will be responsible for wrapping a C++ component.

#ifndef TENSOR_H
#define TENSOR_H

#include <nan.h>
#include <torch/torch.h>

namespace torchjs {

class Tensor : public Nan::ObjectWrap {
 public:
  static NAN_MODULE_INIT(Init);

  void setTensor(at::Tensor tensor) {
    this->mTensor = tensor;
  }

  torch::Tensor getTensor() {
    return this->mTensor;
  }

  static v8::Local<v8::Object> NewInstance();

 private:
  explicit Tensor();
  ~Tensor();

  static NAN_METHOD(New);
  static NAN_METHOD(toString);
  static Nan::Persistent<v8::Function> constructor;

 private:
  torch::Tensor mTensor;

};

}  // namespace torchjs

#endif

As you can see, most of the code for the definition of our Tensor class is just boilerplate. The key point here is that the torchjs::Tensor will wrap a torch::Tensor and we added two special public methods (setTensor and getTensor) to set and get this internal torch tensor.

I won’t show all the implementation details because most parts of it are NodeJS boilerplate code to construct the object, etc. I’ll focus on the parts that touch the libtorch API, like in the code below where we are creating a small textual representation of the tensor to show on JavaScript (toString method):

NAN_METHOD(Tensor::toString) {
  Tensor* obj = ObjectWrap::Unwrap<Tensor>(info.Holder());
  std::stringstream ss;

  at::IntList sizes = obj->mTensor.sizes();
  ss << "Tensor[Type=" << obj->mTensor.type() << ", ";
  ss << "Size=" << sizes << std::endl;

  info.GetReturnValue().Set(Nan::New(ss.str()).ToLocalChecked());
}

What we are doing in the code above, is just getting the internal tensor object from the wrapped object by unwrapping it. After that, we build a string representation with the tensor size (each dimension sizes) and its type (float, etc).

Wrapping Tensor-creation operations

Let’s now create a wrapper code for the torch::ones function, which is responsible for creating a tensor of any defined shape filled with constant 1’s.

NAN_METHOD(ones) {
  // Sanity checking of the arguments
  if (info.Length() < 2)
    return Nan::ThrowError(Nan::New("Wrong number of arguments").ToLocalChecked());

  if (!info[0]->IsArray() || !info[1]->IsBoolean())
    return Nan::ThrowError(Nan::New("Wrong argument types").ToLocalChecked());

  // Retrieving parameters (require_grad and tensor shape)
  const bool require_grad = info[1]->BooleanValue();
  const v8::Local<v8::Array> array = info[0].As<v8::Array>();
  const uint32_t length = array->Length();

  // Convert from v8::Array to std::vector
  std::vector<int64_t> dims;
  for(uint32_t i=0; i<length; i++)
  {
    // Each shape entry arrives as a JavaScript number (double)
    const int64_t d = static_cast<int64_t>(array->Get(i)->NumberValue());
    dims.push_back(d);
  }

  // Call the libtorch and create a new torchjs::Tensor object
  // wrapping the new torch::Tensor that was created by torch::ones
  at::Tensor v = torch::ones(dims, torch::requires_grad(require_grad));
  auto newinst = Tensor::NewInstance();
  Tensor* obj = Nan::ObjectWrap::Unwrap<Tensor>(newinst);
  obj->setTensor(v);

  info.GetReturnValue().Set(newinst);
}

So, let’s go through this code. We are first checking the arguments of the function; for this function, we’re expecting a tuple (a JavaScript array) for the tensor shape and a boolean indicating whether we want to compute gradients for this tensor node. After that, we’re converting the parameters from the V8 JavaScript types into native C++ types. As soon as we have the required parameters, we then call the torch::ones function from the libtorch; this function will create a new tensor, which we wrap using the torchjs::Tensor class that we created earlier.

And that’s it, we just exposed one torch operation that can be used as native JavaScript operation.

Intermezzo for the PyTorch JIT

The introduced PyTorch JIT revolves around the concept of the Torch Script. A Torch Script is a restricted subset of the Python language and comes with its own compiler and transform passes (optimizations, etc).

This script can be created in two different ways: by using a tracing JIT or by providing the script itself. In the tracing mode, your computational graph nodes will be visited and operations recorded to produce the final script, while the scripting is the mode where you provide this description of your model taking into account the restrictions of the Torch Script.

Note that if you have branching decisions in your code that depend on external factors or data, tracing won’t work as you expect, because it will record only that particular execution of the graph, hence the alternative option of providing the script. However, in most cases, tracing is what we need.

To understand the differences, let’s take a look at the Intermediate Representation (IR) from the script module generated both by tracing and by scripting.

import torch

@torch.jit.script
def happy_function_script(x):
    ret = torch.rand(0)
    if True == True:
        ret = torch.rand(1)
    else:
        ret = torch.rand(2)
    return ret

def happy_function_trace(x):
    ret = torch.rand(0)
    if True == True:
        ret = torch.rand(1)
    else:
        ret = torch.rand(2)
    return ret

traced_fn = torch.jit.trace(happy_function_trace,
                            (torch.tensor(0),),
                            check_trace=False)

In the code above, we’re providing two functions: one uses the @torch.jit.script decorator, and it is the scripting way to create a Torch Script, while the second function is being used by the tracing function torch.jit.trace. Note that I intentionally added a “True == True” decision in the functions (which will always be true).

Now, if we inspect the IR generated by these two different approaches, we’ll clearly see the difference between the tracing and scripting approaches:

# 1) Graph from the scripting approach
graph(%x : Dynamic) {
  %16 : int = prim::Constant[value=2]()
  %10 : int = prim::Constant[value=1]()
  %7 : int = prim::Constant[value=1]()
  %8 : int = prim::Constant[value=1]()
  %9 : int = aten::eq(%7, %8)
  %ret : Dynamic = prim::If(%9)
    block0() {
      %11 : int[] = prim::ListConstruct(%10)
      %12 : int = prim::Constant[value=6]()
      %13 : int = prim::Constant[value=0]()
      %14 : int[] = prim::Constant[value=[0, -1]]()
      %ret.2 : Dynamic = aten::rand(%11, %12, %13, %14)
      -> (%ret.2)
    }
    block1() {
      %17 : int[] = prim::ListConstruct(%16)
      %18 : int = prim::Constant[value=6]()
      %19 : int = prim::Constant[value=0]()
      %20 : int[] = prim::Constant[value=[0, -1]]()
      %ret.3 : Dynamic = aten::rand(%17, %18, %19, %20)
      -> (%ret.3)
    }
  return (%ret);
}

# 2) Graph from the tracing approach
graph(%0 : Long()) {
  %7 : int = prim::Constant[value=1]()
  %8 : int[] = prim::ListConstruct(%7)
  %9 : int = prim::Constant[value=6]()
  %10 : int = prim::Constant[value=0]()
  %11 : int[] = prim::Constant[value=[0, -1]]()
  %12 : Float(1) = aten::rand(%8, %9, %10, %11)
  return (%12);
}

As we can see, the IR is very similar to the LLVM IR. Note that in the tracing approach, the recorded trace contains only one path from the code, the true branch, while in the scripting approach we have both branching alternatives. However, even in scripting, the always-false branch can be optimized and removed with a dead code elimination transform pass.

PyTorch JIT has a lot of transformation passes that are used to do loop unrolling, dead code elimination, etc. You can find these passes here. Note that conversion to other formats such as ONNX can be implemented as a pass on top of this intermediate representation (IR), which is quite convenient.

Tracing the ResNet

Now, before implementing the Script Module in NodeJS, let’s first trace a ResNet network using PyTorch (using just Python):

import torch
import torchvision

traced_net = torch.jit.trace(torchvision.models.resnet18(),
                             torch.rand(1, 3, 224, 224))
traced_net.save("resnet18_trace.pt")

As you can see from the code above, we just have to provide a tensor example (in this case, a batch of a single image with 3 channels and size 224×224). After that, we just save the traced network into a file called resnet18_trace.pt.

Now we’re ready to implement the Script Module in NodeJS in order to load this file that was traced.

Wrapping the Script Module

This is now the implementation of the Script Module in NodeJS:

// Class constructor
ScriptModule::ScriptModule(const std::string filename) {
  // Load the traced network from the file
  this->mModule = torch::jit::load(filename);
}

// JavaScript object creation
NAN_METHOD(ScriptModule::New) {
  if (info.IsConstructCall()) {
    // Get the filename parameter
    v8::String::Utf8Value param_filename(info[0]->ToString());
    const std::string filename = std::string(*param_filename);

    // Create a new script module using that file name
    ScriptModule *obj = new ScriptModule(filename);
    obj->Wrap(info.This());
    info.GetReturnValue().Set(info.This());
  } else {
    v8::Local<v8::Function> cons = Nan::New(constructor);
    info.GetReturnValue().Set(Nan::NewInstance(cons).ToLocalChecked());
  }
}

As you can see from the code above, we’re just creating a class that will call the torch::jit::load function passing a file name of the traced network. We also have the implementation of the JavaScript object, where we convert parameters to C++ types and then create a new instance of the torchjs::ScriptModule.

The wrapping of the forward pass is also quite straightforward:

NAN_METHOD(ScriptModule::forward) {
  ScriptModule* script_module = ObjectWrap::Unwrap<ScriptModule>(info.Holder());
  Nan::MaybeLocal<v8::Object> maybe = Nan::To<v8::Object>(info[0]);

  Tensor *tensor =
    Nan::ObjectWrap::Unwrap<Tensor>(maybe.ToLocalChecked());
  
  torch::Tensor torch_tensor = tensor->getTensor();
  torch::Tensor output = script_module->mModule->forward({torch_tensor}).toTensor();

  auto newinst = Tensor::NewInstance();
  Tensor* obj = Nan::ObjectWrap::Unwrap<Tensor>(newinst);
  obj->setTensor(output);
  info.GetReturnValue().Set(newinst);
}

As you can see, in this code we just receive a tensor as an argument, get the internal torch::Tensor from it, and then call the forward method of the script module; we wrap the output in a new torchjs::Tensor and then return it.

And that’s it, we’re ready to use our built module in native NodeJS as in the example below:

var torchjs = require("./build/Release/torchjs");
var script_module = new torchjs.ScriptModule("resnet18_trace.pt");
var data = torchjs.ones([1, 3, 224, 224], false);
var output = script_module.forward(data);

I hope you enjoyed it ! Libtorch opens the door for the tight integration of PyTorch in many different languages and frameworks, which is quite exciting and a huge step toward production deployment.

– Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch 1.0 tracing JIT and LibTorch C++ API to integrate PyTorch into NodeJS," in Terra Incognita, 02/10/2018, https://blog.christianperone.com/2018/10/pytorch-1-0-tracing-jit-and-libtorch-c-api-to-integrate-pytorch-into-nodejs/.
Machine Learning, Math, Probability

Concentration inequalities – Part I

Introduction

Concentration inequalities, or probability bounds, are very important tools for the analysis of Machine Learning algorithms or randomized algorithms. In statistical learning theory, we often want to show that random variables, given some assumptions, are close to their expectations with high probability. This article provides an overview of the most basic inequalities in the analysis of these concentration measures.

Markov’s Inequality

Markov’s inequality is one of the most basic bounds, and it assumes almost nothing about the random variable. The assumptions that Markov’s inequality makes are that the random variable \(X\) is non-negative \((X \geq 0)\) and has a finite expectation \(\mathbb{E}\left[X\right] < \infty\). Markov’s inequality is given by:

$$\underbrace{P(X \geq \alpha)}_{\text{Probability of being greater than constant } \alpha} \leq \underbrace{\frac{\mathbb{E}\left[X\right]}{\alpha}}_{\text{Bounded above by expectation over constant } \alpha}$$

What this means is that the probability that the random variable \(X\) is at least \(\alpha\) is bounded above by the expectation of \(X\) divided by the constant \(\alpha\). What is remarkable about this bound is that it holds for any distribution with non-negative values and it doesn’t depend on any feature of the probability distribution; it only requires some weak assumptions and its first moment, the expectation.

Example: A grocery store sells an average of 40 beers per day (it’s summer !). What is the probability that it will sell 80 or more beers tomorrow ?

$$
\begin{align}
P(X \geq \alpha) & \leq\frac{\mathbb{E}\left[X\right]}{\alpha} \\\\
P(X \geq 80) & \leq\frac{40}{80} = 0.5 = 50\%
\end{align}
$$
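A quick simulation makes the looseness of the bound visible; note that the Poisson demand model below is an illustrative assumption, not something Markov’s inequality requires.

import numpy as np

rng = np.random.default_rng(4)
sales = rng.poisson(lam=40, size=1_000_000)   # hypothetical daily beer sales
print((sales >= 80).mean())                   # empirical P(X >= 80): essentially 0
print(sales.mean() / 80)                      # Markov's bound: ~0.5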

Markov’s inequality doesn’t depend on any property of the random variable’s probability distribution, so it’s obvious that there are better bounds to use if information about the probability distribution is available.

Chebyshev’s Inequality

When we have information about the underlying distribution of a random variable, we can take advantage of properties of this distribution to know more about the concentration of this variable. Let’s take for example a normal distribution with mean \(\mu = 0\) and unit standard deviation \(\sigma = 1\) given by the probability density function (PDF) below:

$$ f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} $$

Integrating from -1 to 1, \(\int_{-1}^{1} \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \, dx \approx 0.68\), we know that 68% of the data is within \(1\sigma\) (one standard deviation) of the mean \(\mu\), and 95% is within \(2\sigma\) of the mean. However, when it’s not possible to assume normality, the amount of data concentrated within \(1\sigma\) or \(2\sigma\) can be very different.

Chebyshev’s inequality provides a way to get a bound on the concentration for any distribution, without assuming any underlying property except a finite mean and variance. Chebyshev’s inequality also holds for any random variable, not only for non-negative variables as in Markov’s inequality.

Chebyshev’s inequality is given by the following relation:

$$
P( \mid X - \mu \mid \geq k\sigma) \leq \frac{1}{k^2}
$$

that can also be rewritten as:

$$
P(\mid X - \mu \mid < k\sigma) \geq 1 - \frac{1}{k^2}
$$

For the concrete case of \(k = 2\), Chebyshev’s inequality tells us that at least 75% of the data is concentrated within 2 standard deviations of the mean. And this holds for any distribution.

Now, when we compare this result for \( k = 2 \) with the 95% concentration of the normal distribution within \(2\sigma\), we can see how conservative Chebyshev’s bound is. However, one must not forget that this holds for any distribution, not only for a normally distributed random variable, and all that Chebyshev’s inequality needs are the first and second moments of the data. Something important to note is that in the absence of more information about the random variable, this bound cannot be improved.

Chebyshev’s Inequality and the Weak Law of Large Numbers

Chebyshev’s inequality can also be used to prove the weak law of large numbers, which says that the sample mean converges in probability towards the true mean.

That can be done as follows:

  • Consider a sequence of i.i.d. (independent and identically distributed) random variables \(X_1, X_2, X_3, \ldots\) with mean \(\mu\) and variance \(\sigma^2\);
  • The sample mean is \(M_n = \frac{X_1 + \ldots + X_n}{n}\) and the true mean is \(\mu\);
  • For the expectation of the sample mean we have: $$\mathbb{E}\left[M_n\right] = \frac{\mathbb{E}\left[X_1\right] + \ldots +\mathbb{E}\left[X_n\right]}{n} = \frac{n\mu}{n} = \mu$$
  • For the variance of the sample mean we have: $$Var\left[M_n\right] = \frac{Var\left[X_1\right] + \ldots +Var\left[X_n\right]}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
  • By the application of Chebyshev’s inequality we have: $$ P(\mid M_n - \mu \mid \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2}$$ for any (fixed) \(\epsilon > 0\); as \(n\) increases, the right side of the inequality goes to zero. Intuitively, this means that for a large \(n\), the distribution of \(M_n\) concentrates around \(\mu\), as the simulation sketch below illustrates.
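A small simulation sketch of this behavior (the exponential distribution below is an arbitrary choice with mean and variance both equal to 1): the Chebyshev bound on the sample mean shrinks as \(1/n\), and the empirical deviation probability shrinks with it.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, eps = 1.0, 1.0, 0.1          # Exp(1) has mean 1 and variance 1

for n in [100, 1_000, 10_000]:
    M_n = rng.exponential(scale=1.0, size=(1_000, n)).mean(axis=1)
    empirical = np.mean(np.abs(M_n - mu) >= eps)
    chebyshev = min(sigma2 / (n * eps ** 2), 1.0)
    print(n, empirical, chebyshev)       # the bound decays as 1/n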

Improving on Markov’s and Chebyshev’s with Chernoff Bounds

Before getting into the Chernoff bound, let’s understand the motivation behind it and how one can improve on Chebyshev’s bound. To understand it, we first need to understand the difference between pairwise independence and mutual independence. For pairwise independence of A, B, and C, we have the following:

$$
P(A \cap B) = P(A)P(B) \\
P(A \cap C) = P(A)P(C) \\
P(B \cap C) = P(B)P(C)
$$

which means that any pair (any two events) is independent, but not necessarily that:

$$
P(A \cap B\cap C) = P(A)P(B)P(C)
$$

which is called “mutual independence” and is a stronger notion of independence. By definition, mutual independence implies pairwise independence, but the opposite isn’t always true. And this is the case where we can improve on Chebyshev’s bound, as the improvement is not possible without making these further assumptions (stronger assumptions lead to stronger bounds).
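A classic counterexample makes the distinction concrete: let A and B be independent fair coin flips and C = A XOR B. Any two of the three are independent, but the triple is not, as this small enumeration check shows.

from itertools import product

# The four equally likely outcomes (a, b, a XOR b), each with probability 1/4
outcomes = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

def prob(pred):
    return sum(1 for o in outcomes if pred(o)) / len(outcomes)

# Pairwise independence: P(A=1, C=1) equals P(A=1) * P(C=1)
print(prob(lambda o: o[0] == 1 and o[2] == 1),
      prob(lambda o: o[0] == 1) * prob(lambda o: o[2] == 1))   # 0.25 0.25

# But not mutual independence: P(A=1, B=1, C=1) is 0, not 1/8
print(prob(lambda o: o[0] == 1 and o[1] == 1 and o[2] == 1))   # 0.0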

We’ll talk about the Chernoff bounds in the second part of this tutorial !

Cite this article as: Christian S. Perone, "Concentration inequalities – Part I," in Terra Incognita, 23/08/2018, https://blog.christianperone.com/2018/08/concentration-inequalities-part-i/.
Article, Machine Learning, Philosophy

NLP word representations and the Wittgenstein philosophy of language

I made an introductory talk on word embeddings in the past and this write-up is an extended version of the part about philosophical ideas behind word vectors. The aim of this article is to provide an introduction to Ludwig Wittgenstein’s main ideas on linguistics that are closely related to techniques that are distributional (I’ll talk what this means later) by design, such as word2vec [Mikolov et al., 2013], GloVe [Pennington et al., 2014], Skip-Thought Vectors [Kiros et al., 2015], among others.

One of the most interesting aspects of Wittgenstein is perhaps the fact that he developed two very different philosophies during his life, each of which had great influence. Something quite rare for someone who spent so much time working on these ideas, even retracting them after the major influence they exerted, especially in the Vienna Circle. A true lesson of intellectual honesty and, in my opinion, one important legacy.

Wittgenstein was an avid reader of Schopenhauer’s philosophy, and in the same way that Schopenhauer inherited his philosophy from Kant, especially regarding the division between what can be experienced (phenomena) and what cannot (noumena), contrasting things as they appear to us with things as they are in themselves, Wittgenstein concluded that Schopenhauer’s philosophy was fundamentally right. He believed that in the noumenal realm we have no conceptual understanding, and therefore we will never be able to say anything about it (without becoming nonsense), in contrast to the phenomenal realm of our experience, where we can indeed talk about things and try to understand them. By adding secure foundations, such as logic, to the phenomenal world, he was able to reason about how the world is describable by language and thus map the limits of how and what can be expressed in language or in conceptual thought.

The first main theory of language from Wittgenstein, described in his Tractatus Logico-Philosophicus, is known as the “Picture theory of language” (aka Picture theory of meaning). This theory is based on an analogy with painting: Wittgenstein realized that a painting is something very different from the natural landscape it depicts, yet a skilled painter can still represent the real landscape by placing patches or strokes corresponding to the natural landscape’s reality. Wittgenstein gave the name “logical form” to this set of relationships between the painting and the natural landscape. This logical form, the set of internal relationships common to both representations, is why the painter was able to represent reality (here I call both “representations” to be coherent with Schopenhauer’s and Kant’s terms, because reality is also a representation for us, to distinguish it from the thing-in-itself).

This theory was important, especially in our context (NLP), because Wittgenstein realized that the same thing happens with language: we are able to assemble words in sentences to match the same logical form of what we want to describe. The logical form was the core idea that made us able to talk about the world. However, Wittgenstein later realized that he had just picked a single task, out of the vast number of tasks that language can perform, and created a whole theory of meaning around it.

The fact is, language can do many other tasks besides representing (picturing) reality. With language, as Wittgenstein noticed, we can give orders, and we can’t say that an order is a picture of something. As soon as he realized these counter-examples, Wittgenstein abandoned the picture theory of language and adopted a much more powerful metaphor: language as a tool. And here we’re approaching the modern view of meaning in language, as well as the main foundational idea behind many modern Machine Learning techniques for word/sentence representations that work quite well. Once you realize that language works as a tool, if you want to understand its meaning, you just need to understand all the possible things you can do with it. And if you take, for instance, a word or concept in isolation, its meaning is the sum of all of its uses, and this meaning is fluid and can have many different faces. This important thought can be summarized in the well-known quote below:

The meaning of a word is its use in the language.

(…)

One cannot guess how a word functions. One has to look at its use, and learn from that.

– Ludwig Wittgenstein, Philosophical Investigations

And indeed it makes complete sense, because once you exhaust all the uses of a word, there is nothing left of it. Reality is also by far more fluid than usually thought, because:

Our language can be seen as an ancient city: a maze of little streets and squares, of old and new houses, and of houses with additions from various periods (…)

– Ludwig Wittgenstein, Philosophical Investigations

John R. Firth was a linguist known for popularizing the idea of this context-dependent nature of meaning; he also used Wittgenstein’s Philosophical Investigations as a recourse to emphasize the importance of context for meaning, which I quote below:

The placing of a text as a constituent in a context of situation contributes to the statement of meaning since situations are set up to recognize use. As Wittgenstein says, ‘the meaning of words lies in their use.’ (Phil. Investigations, 80, 109). The day-to-day practice of playing language games recognizes customs and rules. It follows that a text in such established usage may contain sentences such as ‘Don’t be such an ass !’, ‘You silly ass !’, ‘What an ass he is !’ In these examples, the word ass is in familiar and habitual company, commonly collocated with you silly-, he is a silly-, don’t be such an-. You shall know a word by the company it keeps ! One of the meanings of ass is its habitual collocation with such other words as those above quoted. Though Wittgenstein was dealing with another problem, he also recognizes the plain face-value, the physiognomy of words. They look at us ! ‘The sentence is composed of words and that is enough’.

– John R. Firth

This idea of learning the meaning of a word by the company it keeps is exactly what word2vec (and other count-based methods based on co-occurrence as well) is doing by means of data, learning in an unsupervised fashion with a supervised task that was by design built to predict context (or vice-versa, depending on whether you use skip-gram or CBOW), and it was also a source of inspiration for the Skip-Thought Vectors. Nowadays, this idea is also known as the “Distributional Hypothesis”, which is also being used in fields other than linguistics.

Now, it is quite amazing that if we look at the work by Neelakantan et al., 2015, called “Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space“, where they mention an important deficiency of word2vec, namely that each word type has only one vector representation, you’ll see that this has deep philosophical motivations if we relate it to the ideas of Wittgenstein and Firth, because, as Wittgenstein noticed, the meaning of a word is unlikely to wear a single face, and word2vec seems to converge to an approximation of the average meaning of a word instead of capturing the polysemy inherent in language.

A concrete example of the multi-faceted nature of words can be seen in the word “evidence”, whose meaning can be quite different to a historian, a lawyer, and a physicist: hearsay cannot count as evidence in a court, while it is often the only evidence that a historian has, and it doesn’t even arise in physics. Recent works such as ELMo [Peters, Matthew E. et al. 2018], which used different levels of features from an LSTM trained with a language model objective, are also a very interesting direction, with excellent results, toward incorporating context-dependent semantics into word representations and breaking with the tradition of shallow representations as seen in word2vec.

We’re in an exciting time, where it is really amazing to see how many deep philosophical foundations are actually hidden in Machine Learning techniques. It is also very interesting that we’re learning a lot of linguistic lessons from Machine Learning experimentation, which we can see as an important means of discovery, forming an amazing virtuous circle. I think we have never been as self-conscious and concerned with language as in the past few years.

I really hope you enjoyed reading this !

– Christian S. Perone

Cite this article as: Christian S. Perone, "NLP word representations and the Wittgenstein philosophy of language," in Terra Incognita, 23/05/2018, https://blog.christianperone.com/2018/05/nlp-word-representations-and-the-wittgenstein-philosophy-of-language/.

References

Magee, Bryan. The history of philosophy. 1998.

Mikolov, Tomas et al. Efficient Estimation of Word Representations in Vector Space. 2013. https://arxiv.org/abs/1301.3781

Pennington, Jeffrey et al. GloVe: Global Vectors for Word Representation. 2014. https://nlp.stanford.edu/projects/glove/

Kiros, Ryan et al. Skip-Thought Vectors. 2015. https://arxiv.org/abs/1506.06726

Neelakantan, Arvind et al. Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space. 2015. https://arxiv.org/abs/1504.06654

Léon, Jacqueline. Meaning by collocation. The Firthian filiation of Corpus Linguistics. 2007.

Machine Learning, Programming, Python

PyTorch – Internal Architecture Tour

Update 28 Feb 2019: I added a new blog post with a slide deck containing the presentation I did for PyData Montreal.

Introduction

This post is a tour around the PyTorch codebase; it is meant to be a guide for the architectural design of PyTorch and its internals. My main goal is to provide something useful for those who are interested in understanding what happens beyond the user-facing API, and to show something new beyond what was already covered in other tutorials.

Note: PyTorch build system uses code generation extensively so I won’t repeat here what was already described by others. If you’re interested in understanding how this works, please read the following tutorials:

Short intro to Python extension objects in C/C++

As you probably know, you can extend Python using C and C++ and develop what is called an “extension”. All the PyTorch heavy work is implemented in C/C++ instead of pure Python. To define a new Python object type in C/C++, you define a structure like the example below (which is the base for the autograd Variable class):

// Python object that backs torch.autograd.Variable
struct THPVariable {
    PyObject_HEAD
    torch::autograd::Variable cdata;
    PyObject* backward_hooks;
};

As you can see, there is a macro at the beginning of the definition, called PyObject_HEAD. This macro’s goal is the standardization of Python objects, and it will expand to another structure that contains a pointer to a type object (which defines initialization methods, allocators, etc.) and also a field with a reference counter.

There are two extra macros in the Python API called Py_INCREF() and Py_DECREF(), which are used to increment and decrement the reference counter of Python objects. Multiple entities can borrow or own a reference to other objects (the reference counter is increased), and only when this reference counter reaches zero (when all references get destroyed), Python will automatically delete the memory from that object using its garbage collector.

You can read more about Python C/C++ extensions here.

Funny fact: it is very common in many applications to use small integer numbers as indexing, counters, etc. For efficiency, the official CPython interpreter caches the integers from -5 up to 256. For that reason, the statement a = 200; b = 200; a is b will be True, while the statement a = 300; b = 300; a is b will be False.

Zero-copy PyTorch Tensor to Numpy and vice-versa

PyTorch has its own Tensor representation, which decouples PyTorch’s internal representation from external representations. However, as it is very common to have Numpy arrays everywhere, especially when data is loaded from a variety of sources, we really need to make conversions between Numpy and PyTorch tensors. For that reason, PyTorch provides two methods called from_numpy() and numpy(), which convert a Numpy array to a PyTorch tensor and vice-versa, respectively. If we look at the code that is called to convert a Numpy array into a PyTorch tensor, we can get more insights about PyTorch’s internal representation:

at::Tensor tensor_from_numpy(PyObject* obj) {
  if (!PyArray_Check(obj)) {
    throw TypeError("expected np.ndarray (got %s)", Py_TYPE(obj)->tp_name);
  }

  auto array = (PyArrayObject*)obj;
  int ndim = PyArray_NDIM(array);
  auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
  auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));
  // NumPy strides use bytes. Torch strides use element counts.
  auto element_size_in_bytes = PyArray_ITEMSIZE(array);
  for (auto& stride : strides) {
    stride /= element_size_in_bytes;
  }

  // (...) - omitted for brevity

  void* data_ptr = PyArray_DATA(array);
  auto& type = CPU(dtype_to_aten(PyArray_TYPE(array)));
  Py_INCREF(obj);
  return type.tensorFromBlob(data_ptr, sizes, strides, [obj](void* data) {
    AutoGIL gil;
    Py_DECREF(obj);
  });
}

(code from tensor_numpy.cpp)

As you can see from this code, PyTorch is obtaining all the information (array metadata) from the Numpy representation and then creating its own. However, as you can note from the PyArray_DATA(array) call, PyTorch is getting a pointer to the internal Numpy array raw data instead of copying it. This means that PyTorch will create a reference for this data, sharing the same memory region with the Numpy array object for the raw Tensor data.

There is also an important point here: when the Numpy array object goes out of scope and gets a zero reference count, it will be garbage collected and destroyed. That’s why there is an increment in the reference counting of the Numpy array object in the Py_INCREF(obj) call.

After this, PyTorch will create a new Tensor object from this Numpy data blob, and in the creation of this new Tensor it passes the borrowed memory data pointer, together with the memory size and strides as well as a function that will be used later by the Tensor Storage (we’ll discuss this in the next section) to release the data by decrementing the reference counting to the Numpy array object and let Python take care of this object life cycle.
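This zero-copy behavior is easy to verify from Python; a small demonstration:

import numpy as np
import torch

np_array = np.ones((2, 2), dtype=np.float32)
torch_tensor = torch.from_numpy(np_array)   # no copy, just a new reference

np_array[0, 0] = 42.0                       # mutate through the Numpy side
print(torch_tensor[0, 0])                   # tensor(42.): same underlying memory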

The tensorFromBlob() method will create a new Tensor, but only after creating a new “Storage” for this Tensor. The storage is where the actual data pointer will be stored (and not in the Tensor structure itself). This takes us to the next section about Tensor Storages.

Tensor Storage

The actual raw data of the Tensor is not directly kept in the Tensor structure, but on another structure called Storage, which in turn is part of the Tensor structure.

As we saw in the previous code from tensor_from_numpy(), there is a call for tensorFromBlob() that will create a Tensor from the raw data blob. This last function will call another function called storageFromBlob() that will, in turn, create a storage for this data according to its type. In the case of a CPU float type, it will return a new CPUFloatStorage instance.

The CPUFloatStorage is basically a wrapper with utility functions around the actual storage structure called THFloatStorage that we show below:

typedef struct THStorage
{
    real *data;
    ptrdiff_t size;
    int refcount;
    char flag;
    THAllocator *allocator;
    void *allocatorContext;
    struct THStorage *view;
} THStorage;

(code from THStorage.h)

As you can see, the THStorage holds a pointer to the raw data, its size, flags, and also an interesting field called allocator that we’ll soon discuss. It is also important to note that there is no metadata regarding how to interpret the data inside the THStorage; this is due to the fact that the storage is “dumb” regarding its contents, and it is the Tensor’s responsibility to know how to “view” or interpret this data.

From this, you already probably realized that we can have multiple tensors pointing to the same storage but with different views of this data, and that’s why viewing a tensor with a different shape (but keeping the same number of elements) is so efficient. This Python code below shows that the data pointer in the storage is being shared after changing the way Tensor views its data:

>>> tensor_a = torch.ones((3, 3))
>>> tensor_b = tensor_a.view(9)
>>> tensor_a.storage().data_ptr() == tensor_b.storage().data_ptr()
True

As we can see in the example above, the data pointer of the storage of both Tensors is the same, but the Tensors represent different interpretations of the storage data.
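And since they share the same storage, an in-place write through one view is immediately visible through the other:

>>> tensor_b[0] = 42.0
>>> float(tensor_a[0, 0])
42.0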

Now, as we saw in the THStorage structure above, there is a pointer to a THAllocator structure. This is very important because it brings flexibility regarding the allocator that can be used to allocate the storage data. This structure is represented by the following code:

typedef struct THAllocator
{
  void* (*malloc)(void*, ptrdiff_t);          // allocate: (context, size) -> ptr
  void* (*realloc)(void*, void*, ptrdiff_t);  // reallocate: (context, ptr, size) -> ptr
  void (*free)(void*, void*);                 // free: (context, ptr)
} THAllocator;

(code from THAllocator.h)

As you can see, there are three function pointer fields in this structure that define what an allocator means: a malloc, a realloc and a free. For CPU-allocated memory, these functions will, of course, relate to the traditional malloc/realloc/free POSIX functions; however, when we want a storage allocated on GPUs, we’ll end up using CUDA allocators such as cudaMallocHost(), as we can see in the THCudaHostAllocator malloc function below:

static void *THCudaHostAllocator_malloc(void* ctx, ptrdiff_t size) {
  void* ptr;
  if (size < 0) THError("Invalid memory size: %ld", size);
  if (size == 0) return NULL;
  // allocates pinned (page-locked) host memory for fast host<->GPU transfers
  THCudaCheck(cudaMallocHost(&ptr, size));
  return ptr;
}

(code from THCAllocator.c)
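From Python, this pinned (page-locked) host allocation is what you get when you pin a Tensor’s memory; a small sketch, assuming a CUDA-enabled build:

>>> t = torch.ones((3, 3)).pin_memory()  # host memory allocated via cudaMallocHost
>>> t.is_pinned()
True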

You have probably noticed a pattern in the repository organization by now; these naming conventions are important to keep in mind when navigating the repository, and they are summarized below (taken from the PyTorch lib readme):

  • TH = TorcH
  • THC = TorcH Cuda
  • THCS = TorcH Cuda Sparse
  • THCUNN = TorcH CUda Neural Network
  • THD = TorcH Distributed
  • THNN = TorcH Neural Network
  • THS = TorcH Sparse

This convention is also present in the function/class names and other objects, so it is important to always keep these patterns in mind. While you can find CPU allocators in the TH code, you’ll find CUDA allocators in the THC code.

Finally, we can see the composition of the main THTensor structure:

typedef struct THTensor
{
    int64_t *size;            // size of each dimension
    int64_t *stride;          // stride (in elements) of each dimension
    int nDimension;           // number of dimensions
    THStorage *storage;       // storage holding the actual data
    ptrdiff_t storageOffset;  // offset (in elements) into the storage
    int refcount;             // reference counter
    char flag;                // status flags
} THTensor;

(Code from THTensor.h)

And as you can see, the main THTensor structure holds the sizes, strides, number of dimensions and storage offset, as well as the storage (THStorage) for the Tensor data.
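You can inspect these fields from Python; a quick sketch (output shown for a recent PyTorch version, the exact repr may differ):

>>> t = torch.ones((3, 4))
>>> t.stride()               # contiguous layout: a row step is 4 elements
(4, 1)
>>> t.t().stride()           # transposing only swaps the strides, no data moves
(1, 4)
>>> t[1:].storage_offset()   # slicing just moves the offset into the same storage
4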

We can summarize all of the structures we just saw in the diagram below:

Now, once we have requirements such as multi-processing, where we want to share tensor data among multiple different processes, we need a shared-memory approach. Otherwise, every time another process needs a tensor, or when you want to implement a Hogwild training procedure where all the different processes write to the same memory region (where the parameters are), you’d need to make copies between processes, and this is very inefficient. Therefore, we’ll discuss in the next section a special kind of storage for Shared Memory.

Shared Memory

Shared memory can be implemented in many different ways depending on the platform support. PyTorch supports some of them, but for the sake of simplicity, I’ll talk here about what happens on macOS using the CPU (instead of the GPU). Since PyTorch supports multiple shared-memory approaches, this part is a little tricky to grasp, since it involves more levels of indirection in the code.

PyTorch provides a wrapper around the Python multiprocessing module that can be imported from torch.multiprocessing. The changes implemented in this wrapper around the official Python multiprocessing ensure that every time a tensor is put on a queue or shared with another process, only a handle to the shared memory is sent, instead of an entire new copy of the Tensor.
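A minimal sketch of this behavior (the worker function and the tensor shape here are just illustrative):

import torch
import torch.multiprocessing as mp

def worker(q):
    t = q.get()      # only a shared-memory handle travels through the queue
    t.fill_(42.0)    # in-place write, lands in the shared region

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    t = torch.zeros((3, 3))
    q.put(t)         # the storage is moved to shared memory before being sent
    p.join()
    print(t)         # all 42s: the child wrote into the same memory region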

Now, many people aren’t aware of a Tensor method in PyTorch called share_memory_(); this function is what triggers an entire rebuild of the storage memory for that particular Tensor. What this method does is create a region of shared memory that can be used among different processes. This function will, in the end, call the function below:

static THStorage* THPStorage_(newFilenameStorage)(ptrdiff_t size)
{
  int flags = TH_ALLOCATOR_MAPPED_SHAREDMEM | TH_ALLOCATOR_MAPPED_EXCLUSIVE;
  std::string handle = THPStorage_(__newHandle)();
  auto ctx = libshm_context_new(NULL, handle.c_str(), flags);
  return THStorage_(newWithAllocator)(size, &THManagedSharedAllocator, (void*)ctx);
}

(Code from StorageSharing.cpp)

And as you can see, this function will create another storage using a special allocator called THManagedSharedAllocator. It first defines some flags, then creates a handle, which is a string in the format /torch_[process id]_[random number], and after that it creates a new storage using the special THManagedSharedAllocator. This allocator holds function pointers to an internal PyTorch library called libshm, which implements Unix Domain Socket communication to share the shared-memory region handles. This allocator is actually a special case: it is a kind of “smart allocator” because it contains the communication control logic, and it uses another allocator called THRefcountedMapAllocator that is responsible for creating the actual shared-memory region and calling mmap() to map this region into the process virtual address space.

Note: when a method ends with an underscore in PyTorch, such as the method called share_memory_(), it means that this method has an in-place effect: it will change the current object instead of creating a new one with the modifications.
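A tiny illustration of this convention:

>>> t = torch.zeros(3)
>>> t2 = t.add(1)   # out-of-place: returns a new Tensor, t is unchanged
>>> t.add_(1)       # in-place: modifies (and returns) t itself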

I’ll now show a Python example of one process using the data from a Tensor that was allocated in another process, by manually exchanging the shared-memory handle:

This is executed in the process A:

>>> import torch
>>> tensor_a = torch.ones((5, 5))
>>> tensor_a

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

>>> tensor_a.is_shared()
False
>>> tensor_a = tensor_a.share_memory_()
>>> tensor_a.is_shared()
True
>>> tensor_a_storage = tensor_a.storage()
>>> tensor_a_storage._share_filename_()
(b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)

In this code, executed in process A, we create a new 5×5 Tensor filled with ones. After that, we make it shared and print the tuple with the Unix Domain Socket address as well as the handle. Now we can access this memory region from another process B, as shown below:

Code executed in the process B:

>>> import torch
>>> tensor_a = torch.Tensor()
>>> tuple_info = (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
>>> storage = torch.Storage._new_shared_filename(*tuple_info)
>>> tensor_a = torch.Tensor(storage).view((5, 5))
>>> tensor_a

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

As you can see, using the tuple information about the Unix Domain Socket address and the handle, we were able to access the Tensor storage from another process. If you change the tensor in process B, you’ll also see the change reflected in process A, because these Tensors are sharing the same memory region.

DLPack: a hope for the Deep Learning frameworks Babel

Now I would like to talk about something recent in the PyTorch code base, called DLPack. DLPack is an open standardization of an in-memory tensor structure that allows the exchange of tensor data between frameworks. What is quite interesting is that, since this memory representation is standardized and very similar to the memory representation already in use by many frameworks, it allows zero-copy data sharing between frameworks, which is a quite amazing initiative given the variety of frameworks we have today without inter-communication among them.

This will certainly help to overcome the “island model” that we have today between tensor representations in MXNet, PyTorch, etc., and will allow developers to mix operations between frameworks, with all the benefits that standardization can bring.

The core of DLPack is a very simple structure called DLTensor, shown below:

/*!
 * \brief Plain C Tensor object, does not manage memory.
 */
typedef struct {
  /*!
   * \brief The opaque data pointer points to the allocated data.
   *  This will be CUDA device pointer or cl_mem handle in OpenCL.
   *  This pointer is always aligned to 256 bytes as in CUDA.
   */
  void* data;
  /*! \brief The device context of the tensor */
  DLContext ctx;
  /*! \brief Number of dimensions */
  int ndim;
  /*! \brief The data type of the pointer*/
  DLDataType dtype;
  /*! \brief The shape of the tensor */
  int64_t* shape;
  /*!
   * \brief strides of the tensor,
   *  can be NULL, indicating tensor is compact.
   */
  int64_t* strides;
  /*! \brief The offset in bytes to the beginning pointer to data */
  uint64_t byte_offset;
} DLTensor;

(code from dlpack.h)

As you can see, there is a data pointer for the raw data, as well as shape/stride/offset/device (GPU vs. CPU) and other metadata about the data that the DLTensor is pointing to.

There is also a managed version of the tensor, called DLManagedTensor, where the framework can provide a context and also a “deleter” function that can be called by the framework that borrowed the Tensor to inform the lender that the resources are no longer required.

In PyTorch, if you want to convert to or from the DLTensor format, you can find both C/C++ methods for doing that, and you can also do it in Python, as shown below:

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
dl = dlpack.to_dlpack(t)

This Python function will call the toDLPack function from ATen, shown below:

DLManagedTensor* toDLPack(const Tensor& src) {
  ATenDLMTensor * atDLMTensor(new ATenDLMTensor);
  atDLMTensor->handle = src;
  atDLMTensor->tensor.manager_ctx = atDLMTensor;
  atDLMTensor->tensor.deleter = &deleter;
  atDLMTensor->tensor.dl_tensor.data = src.data_ptr();
  int64_t device_id = 0;
  if (src.type().is_cuda()) {
    device_id = src.get_device();
  }
  atDLMTensor->tensor.dl_tensor.ctx = getDLContext(src.type(), device_id);
  atDLMTensor->tensor.dl_tensor.ndim = src.dim();
  atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src.type());
  atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(src.sizes().data());
  atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(src.strides().data());
  atDLMTensor->tensor.dl_tensor.byte_offset = 0;
  return &(atDLMTensor->tensor);
}

As you can see, it’s a pretty simple conversion: it casts the metadata from the PyTorch format to the DLPack format and assigns a pointer to the internal Tensor data representation.
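For completeness, the reverse direction is also exposed in Python; here is a minimal round-trip sketch (note that a DLPack capsule can only be consumed once):

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
capsule = dlpack.to_dlpack(t)      # wrap the Tensor in a DLPack capsule
t2 = dlpack.from_dlpack(capsule)   # rebuild a Tensor from the capsule

print(t.data_ptr() == t2.data_ptr())  # True: zero-copy in both directions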

I really hope that more frameworks adopt this standard, which will certainly benefit the whole ecosystem. A potential integration with Apache Arrow, for instance, would be amazing.

That’s it, I hope you liked this long post!

– Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch – Internal Architecture Tour," in Terra Incognita, 12/03/2018, https://blog.christianperone.com/2018/03/pytorch-internal-architecture-tour/.