<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Jacob Reinhold]]></title><description><![CDATA[probability | programming]]></description><link>https://www.jcreinhold.com/</link><image><url>https://www.jcreinhold.com/favicon.png</url><title>Jacob Reinhold</title><link>https://www.jcreinhold.com/</link></image><generator>Ghost 5.79</generator><lastBuildDate>Fri, 03 Apr 2026 20:44:44 GMT</lastBuildDate><atom:link href="https://www.jcreinhold.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Image enhancement and the data processing inequality]]></title><description><![CDATA[On the ability and limitations of the data processing inequality to answer questions about image enhancement]]></description><link>https://www.jcreinhold.com/image-enhancement-and-the-data-processing-inequality/</link><guid isPermaLink="false">622e3dc7e647261e657d5914</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Information Theory]]></category><dc:creator><![CDATA[Jacob Reinhold]]></dc:creator><pubDate>Fri, 23 Oct 2020 22:36:40 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1539186607619-df476afe6ff1?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1539186607619-df476afe6ff1?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Image enhancement and the data processing inequality"><p>Information theory studies the communication of information with the language of probability theory and statistics. Claude Shannon laid out the foundations of the field with his work <em><a href="http://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf?ref=jcreinhold.com">A Mathematical Theory of Communication</a></em> in 1948. The paper defines a reasonable measure of information (more specifically, a measure of <em>average</em> information), and proves some results regarding how efficiently you can communicate over a noiseless and noisy channel (e.g., radio over the air or storage of data in an imperfect medium like a hard drive). Since that time, information theory has continued to be studied and is partially behind many of the advances in telecommunications, and it provides some useful methods of analysis for statistics and machine learning.</p><p>While many results of information theory have practical importance, there is one inequality that is often misguidedly invoked; namely, the <em>data processing inequality.</em> We&apos;ll quickly overview the basics of information theory so that we can properly motivate the inequality, and then we&apos;ll discuss its ability and its limitations in illuminating the impact of data processing. 
Specifically, we&apos;ll discuss how it applies to image enhancement such as super-resolution; we&apos;ll show that it is both an essential component of analysis because it shows that image enhancement cannot create new information, and that it is useless in practice because it is the wrong way to analyze image enhancement methods.</p><h2 id="information-and-entropy">Information and entropy</h2><p>Mathematically, information is defined as the logarithm of the reciprocal of a probability mass, i.e., for a (discrete) random variable $x\sim p(x)$, the information gained by observing $x=a$ for $a\in\mathcal{X}$ is</p><p>$$h(x=a)=\log_2 \frac{1}{p(a)}.$$</p><p>The heuristic argument for this quantity is that rare events should communicate more information than common events&#x2014;the more surprising an event, the more you update your beliefs upon observing the event&#x2014;which is why the reciprocal of the probability mass is used. The intuition behind the logarithm is more complex. I&apos;ll motivate it in the sense of optimal code-lengths.</p><p>First we need to define a (source) <em>code</em>, which is a mapping from some set of events (i.e., whatever $\mathcal{X}$ represents) to a set of strings of symbols (e.g., strings of 0s and 1s). Then the log of the reciprocal of the probability mass of an event is equal to a (competitively) optimal code-length that describes the event, where &quot;optimality&quot; is in the sense of maximal average compression of a sequence of transmitted symbols.</p><p>For example, if some event $x=b$ has $p(b)=1/2$, then the optimal (source) code-length is 1. If some event $x=c$ has $p(c)=1/256$, then the optimal code-length is 8. Why is encoding in this way optimal? The intuition is that common events should be encoded with short codes because they are frequently encountered, and short codes reduce the total number of symbols that you need to transmit to communicate a message. The opposite holds for rare events; you pay little penalty for encoding rare sequences with long codes because they are communicated infrequently. In the end, you&apos;ll reduce the average number of symbols used to communicate a string of events if you encode in this way.</p><p>To show this mathematically, let $f:\mathcal{X}\to\{0,1\}^m$ describe the encoder&#x2014;i.e., the map from events $\mathcal{X}$ to strings of symbols $\{0,1\}^m$ (where $m$ can possibly vary)&#x2014;and let $\ell:\{0,1\}^m\to\mathbb{N}$ give the length of a code, so that $\ell(f(x))$ is the length of the code associated with event $x\in\mathcal{X}$. Then the average length of this encoding scheme (using the optimal code-lengths described above) is:</p><p>$$\mathbb{E}[\ell(f(X))]=\sum_{x\in\mathcal{X}}p(x)\ell(f(x))=\sum_{x\in\mathcal{X}}p(x)\log_2\frac{1}{p(x)}=H(X),$$</p><p>where the right-hand side of the equation is equal to the entropy of the distribution $p(x)$. Thus the entropy&#x2014;a measure of uncertainty&#x2014;is both the average information content of a random variable and the average length of the shortest description of a random variable. In fact these two views are equivalent, and the entropy lower bounds the minimum description length.</p><p>We can extend ideas of information to two or more variables, e.g., quantifying the amount of (average) information a random variable $X$ contains about another random variable $Y$. This is represented by <em>mutual information</em>. 
Formally, the mutual information between $X$ and $Y$ is</p><p>$$I(X;Y) = H(X)-H(X\mid Y).$$</p><p>That is, it&apos;s the reduction of uncertainty in $X$ due to the knowledge of $Y$.</p><p>Before we continue to the data processing inequality, I need to define a Markov chain, which simply describes a way to factorize a joint probability distribution. Specifically, a Markov chain implies that, for a set of ordered random variables $X_1,X_2,\ldots,X_n$, knowledge of $X_i$ only relies on $X_{i-1}$ and nothing preceding that value. Formally,</p><p>\begin{align*}p(x_1,x_2,\ldots,x_n)&amp;=p(x_1)p(x_2|x_1)p(x_3|x_2,x_1)\cdots p(x_n|x_{n-1},\ldots,x_2,x_1)\\ &amp;=p(x_1)p(x_2|x_1)p(x_3|x_2)\cdots p(x_n|x_{n-1})\end{align*}</p><p>where the first equality is true for any probability distribution, and the second equality is due to the set of conditional independencies implied by the Markov property.</p><h2 id="data-processing-inequality">Data processing inequality</h2><p>This brings us to the topic of this article: the <em>data processing inequality</em>. This inequality states that processing a signal&#x2014;of any kind&#x2014;cannot increase the information it carries. Formally, the inequality states that for a Markov chain $X \to Y \to Z$, we have</p><p>$$ I(X;Y) \ge I(X;Z), \quad\text{or equivalently,}\quad H(X\mid Y) \le H(X\mid Z).$$</p><p>That is, the mutual information of $X$ and $Y$ is at least that of $X$ and $Z$; equivalently, the uncertainty about $X$ given $Y$ is no greater than the uncertainty about $X$ given $Z$.</p><p>To motivate what this means, simply think of $X$ as a message, $Y$ as the message transmitted over a channel, and $Z$ as the received message. That is, $Y=f(X)$ and $Z=g(Y)$ are each functions of only the previous step in the chain of events. Given this setup, the inequality states that $Y$ contains at least as much information about $X$ as $Z$ does.</p><p>This inequality should intuitively make sense. When you process data, you are often applying a non-invertible transformation (e.g., filtering the signal to remove noise). As implied by &quot;non-invertible&quot;, this is a one-way process. Once you&apos;ve applied a non-invertible transformation, you cannot recover the original signal. You can make orange juice with an orange, but you can&apos;t make an orange with orange juice. Similarly, in processing data, the information that is removed is lost to oblivion, which is what is conveyed by the data processing inequality.</p><h2 id="does-modern-image-enhancement-break-the-data-processing-inequality">Does modern image enhancement break the data processing inequality?</h2><p>Following Betteridge&apos;s law of headlines, the answer is <em>no</em>.</p><p>When we look at image enhancement methods like de-noising and super-resolution, the output $Z$ looks like it better reflects what the true image $X$ would be if it were observed under better conditions as compared to the observed noisy or low-resolution image $Y$. 
Mathematically, it appears like the following statement is true: $I(X;Y)\le I(X;Z)$&#x2014;a reversal of the inequality.</p><p>State-of-the-art machine learning methods&#x2014;most frequently deep neural networks&#x2014;create images that closely resemble realistic images without noise (inverting the noise process), or images that have a greater resolution than what the measurements allow with respect to the Nyquist rate.</p><p>An argument can be made that because the images are passed through a set of filter banks whose parameters are learned (i.e., a neural network), the data processing inequality doesn&apos;t apply. Vaguely, the hope is that $Z$ is not only a function of $Y$ but something like $Z=f(Y,g(X))$. In what follows, I argue that this is a flimsy hope that doesn&apos;t withstand serious scrutiny.</p><h3 id="applicability-of-the-inequality">Applicability of the inequality</h3><p>As stated before, the data-processing inequality still applies to de-noising and super-resolution. These methods cannot invert an un-invertible function and add lost information. Learning the filters doesn&apos;t change the fact that a series of filters&#x2014;in the case of neural networks&#x2014;are applied to the image. I&apos;ll provide a couple of thought experiments to show why this is the case.</p><p>Consider the case where the neural network was randomly initialized to the exact weights that were learned after training. After passing an image through this randomly initialized network, would you say that information was added? The resulting image would be exactly the same as the image generated by the network that arrived at the parameters due to training.</p><p>If you think that the learned weights add information but the randomly initialized weights don&apos;t, why would one case add information but not the other if the result is the same? What is special about trained parameters $\theta_1$ compared to randomly initialized parameters $\theta_{2}$, when $\theta_1=\theta_2$?</p><p>If you think that both sets of parameters add information, would you argue that the data processing inequality is broken when the weights are learned and also with some probability $p &gt; 0$ (the case where the weights were randomly initialized to the learned weights)?</p><p>I think most reasonable people would see that answering in the affirmative to either argument is absurd, and consequently neither set of parameters adds information.</p><p>Admittedly, the above situation would only happen with a vanishingly small (but potentially strictly positive!) probability, and it would only happen with certain initialization schemes. To give a more realistic example, let&apos;s compare machine learning methods to plain ol&apos; engineering.</p><p>Consider the case of a radio engineer. They inject their <em>learned</em> knowledge into a system by <em>hand-crafting</em> the modules of the system to accomplish some task. For example, they will multiply a received signal with a sinusoid to baseband the signal, and then low-pass filter the signal to remove harmonics and noise before using it for some other task. I doubt that there are many&#x2014;if any&#x2014;information theorists or signal processing experts that would consider this to be a case where the data processing inequality doesn&apos;t apply.</p><p>For a more traditional image enhancement step, consider the case where you observe an image with salt-and-pepper noise and you apply a median filter&#x2014;the optimal filter for that type of noise. 
Does the filtered image now have more information than the original image? Most image processing experts would agree that would not be the case.</p><p>If neither of the above cases adds information, why would a neural network add information? As mentioned before, when the engineer applies their craft, they are using their learned knowledge to enhance the signal or image. What is so materially different about learned engineering knowledge and learned neural network parameters that one adds information and the other doesn&apos;t? What could be so sacred and magical about <em>data-driven</em> methods? (As if engineering knowledge wasn&apos;t at least partially data-driven.)</p><p>To be clear, I am <em>not</em> arguing that the learned parameters of a deep neural network aren&apos;t related to the observed data. Of course the parameters are related to the data, otherwise the network would do nothing useful. My argument is much more limited: the learned parameters do not add information that was not already contained in the data. There is no guarantee that the de-noised or super-resolved image matches what would have been generated if the image were acquired under less noisy conditions or at a higher resolution.</p><p>But if information is not added, what&apos;s going on? The de-noised and super-resolved images can often be shown to have lower average error in experiments when compared to more traditional image processing methods. The dilemma lies in the very limited definition of information that is implied in the data processing inequality.</p><h3 id="limitations-of-the-inequality">Limitations of the inequality</h3><p>Let me provide one more example which will more clearly outline the limitations of the definition of information used in the data processing inequality.</p><p>Let&apos;s consider the case of the information in light, with and without glasses, before the light enters the eye. Loosely speaking, let&apos;s describe the case as follows: let $X$ be a vector of the characteristics (e.g., color) of some point on an object; let $Y$ be the (amplitude, frequency, phase, trajectory of the light, etc.) of some light reflected from that point $X$; and let $Z$ be the (amplitude, frequency, phase, trajectory of the light, etc.) of the light $Y$ after having passed through a lens in a pair of glasses. Then we can model $Y = f(X)$ and $Z = g(Y)$ for appropriate functions $f$ and $g$. That is, we have the Markov chain $X \to Y \to Z$.</p><p>Then, again, we have a situation where the data processing inequality applies. The data processing inequality then states that $Z$ has no more information about $X$ than $Y$ does. For concreteness, let&apos;s say $I(X; Y) = n$ bits and $I(X; Z) = m$ bits for $m \le n$.</p><p>Are all the poor simpletons that wear glasses simply misguided due to their lack of education in elementary information theory? I, for one, will keep on wearing my glasses despite the information loss.</p><!--kg-card-begin: markdown--><p>This is a confusing result because, as someone with myopia, I perceive much more information in the light that passes through my glasses than without my glasses<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. This is because my glasses alter the trajectory of the light such that my focal length is different. As a result, the light through my glasses is not aggregated sub-optimally onto a collection of retinal cells&#x2014;the sub-optimal aggregation that causes a perceived blur when viewing distant objects. 
However, the light $Y$, before it enters my eye, contains what information it contains about $X$ ($n$ bits in this case) and no more. My glasses, the function $g$, cannot add more information to $Y$ about $X$ with $Y$ alone, and $Y$, the light, is the only input to my glasses.</p>
<p>A naive application of the data processing inequality can produce this kind of result. Unless you&apos;re going to do a thorough information-theoretic analysis of each problem&#x2014;proving that the problem meets all the criteria that the relevant theorems require&#x2014;a more useful way to think about the effect of processing a signal is to answer the following question: What utility does this processing provide to achieve our goal?</p>
<p>The question that leads people to process signals usually isn&apos;t strictly about the amount of information in a signal; it is about how humans or machines can efficiently use the information contained in the signal for some other purpose. Clearly the reason people wear glasses is because they are useful; you probably wouldn&apos;t want drivers to forego wearing their glasses just so they could have &quot;more information&quot; according to the (naive) information-theoretic definition.</p>
<p>In the radio example I gave previously, the non-baseband, non-filtered signal is useless. It is full of noise and would require extremely expensive equipment to process the signal at the carrier frequency (notwithstanding the noise). Likewise with the salt-and-pepper image example; if the image is meant for human viewing, why not apply the optimal filter? If the image is an artistic photo, the filtered image could be more pleasant. If the image is for some scientific purpose, perhaps the filtered image will be more interpretable.</p>
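<p>To make the utility argument concrete, here is a rough sketch of the salt-and-pepper case (my own illustration, not from the original analysis; it uses an arbitrary synthetic image and the median filter from <code>scipy</code>). The filter adds no information in the data-processing-inequality sense, yet it reduces the error with respect to the clean image, which is the utility that matters.</p><pre><code class="language-python">import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)

# a toy "image": a smooth gradient corrupted by salt-and-pepper noise
image = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
noisy = image.copy()
mask = rng.random(image.shape)
noisy[mask &lt; 0.05] = 0.0  # "pepper"
noisy[mask &gt; 0.95] = 1.0  # "salt"

# the median filter suppresses the impulsive noise without adding information
filtered = median_filter(noisy, size=3)

print(np.abs(noisy - image).mean())     # error of the observed image
print(np.abs(filtered - image).mean())  # error of the filtered image (lower)
</code></pre>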
<h2 id="takeaways">Takeaways</h2>
<p>While I gave several examples of the general uselessness of the data processing inequality in practical settings, there is a major caveat to this uselessness. The data processing inequality does simply and clearly state that image enhancement techniques like de-noising and super-resolution (outside of cases of multiple measurements or strong assumptions known to be true) cannot recover lost information.</p>
<p>While the resulting images may look better, the image enhancement steps cannot guarantee recovery of the lost information. That is why care must be taken when using these methods in high-stakes applications like medical imaging. It would be reprehensible if de-noising or super-resolution resulted in the addition or subtraction of a treatable tumor in a clinical image&#x2014;when normal imaging would not have&#x2014;and as a result either induced unnecessary treatment (e.g., shooting high-energy particles at the tumor area until the cells die) or left a real tumor ignored so long as to condemn the person to death.</p>
<p>Image enhancement techniques need to be motivated by their utility, not their magical abilities. Image enhancement is an important research area because it can be very useful&#x2014;not because it can accomplish the image processing equivalent of transmuting lead into gold. In natural images, de-noised and super-resolved images can be more visually pleasing, as previously mentioned, and can be used for that purpose.</p>
<p>In medical imaging, they can be used by researchers to collect more stable measurements extracted from images both across subjects and longitudinally (e.g., better whole-brain segmentation). When large sample sizes are used, the individual mistakes due to the methods&#x2014;if present&#x2014;can be washed out (assuming no bias). Or, more simply, the methods can allow a study to happen at all.</p>
<p>For example, suppose you are a scientist doing research on a disease that is apparent in medical images, and you only have access to a set of clinical images that are very noisy and have a resolution of $1\times 1\times 3 \text{ mm}^3$. You want to segment the diseased portion of the tissue and either do not have the data or technical skill to implement a segmentation deep neural network to do the task. However, there is a publicly-available deep neural network that was trained for this segmentation task on nearly noise-free data at $1 \text{ mm}^3$. In this case, it would be beneficial to first de-noise and super-resolve the images, otherwise the network may perform poorly because of domain shift.</p>
<p>The data processing inequality is a mixed bag. It can be illuminating under certain conditions, but the limited definition of information central to the inequality can be deceiving.</p>
<p><strong>Addendum</strong>: Side information can be provided about $X$ if you have some independent $T(X)$ or have access to $W$ such that $X = h(W)$. But then we&apos;re analyzing a different problem because $X \to Y \to Z$ (assuming you include $T(X)$ or $W$ to create $Z$) doesn&apos;t form a Markov chain, so the data processing inequality doesn&apos;t apply.</p>
<p>Bayesianism won&apos;t save you here. A good prior on $X$ won&apos;t add information about $X$ if you only have access to $Y$; a Markov chain $X\to Y \to Z$ is simply a joint distribution that factorizes as $p(x,y,z) = p(x)p(y|x)p(z|y)$. The exact, true prior of $X$ is implicit in the proof of the theorem!</p>
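<p>To see the inequality in action on such a factorized chain, here is a minimal numerical sketch (my own, not from the original post): we simulate a discrete Markov chain $X \to Y \to Z$ with arbitrary alphabet sizes and channels and estimate the mutual informations from samples via plug-in entropy estimates.</p><pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Markov chain X -&gt; Y -&gt; Z: X is uniform on {0,...,7}, Y is a noisy
# observation of X, and Z is a lossy "processing" of Y alone
x = rng.integers(0, 8, size=n)
y = (x + rng.integers(0, 2, size=n)) % 8  # noisy channel
z = y // 2                                # non-invertible processing

def entropy(counts):
    # plug-in entropy (in bits) from a table of counts
    p = counts[counts &gt; 0] / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_information(a, b):
    # I(A;B) = H(A) + H(B) - H(A,B), estimated from samples
    joint = np.histogram2d(a, b, bins=(a.max() + 1, b.max() + 1))[0]
    return (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
            - entropy(joint.ravel()))

print(mutual_information(x, y))  # I(X;Y)
print(mutual_information(x, z))  # I(X;Z), never larger than I(X;Y)
</code></pre>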
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Your brain will actually receive more information for technical reasons, but I&apos;m narrowly focusing on the information content of the light itself before it enters the eye. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Learning to read from memory with a neural network]]></title><description><![CDATA[Adding a fundamental operation to feedforward neural networks]]></description><link>https://www.jcreinhold.com/reading-from-memory-with-a-feedforward-deep-neural-network/</link><guid isPermaLink="false">622e3dc7e647261e657d5913</guid><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Jacob Reinhold]]></dc:creator><pubDate>Wed, 09 Sep 2020 17:30:30 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1580432522609-d073f3f4b4f9?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1580432522609-d073f3f4b4f9?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Learning to read from memory with a neural network"><p>To compute complex functions, computers rely on the ability to read from and write to memory. This ability is missing, however, from standard deep neural networks (DNNs), and <a href="https://www.nature.com/articles/nature20101?curator=TechREDEF&amp;ref=jcreinhold.com">research has shown</a> that reading and writing to external memory facilitates certain types of computation. But how can we train a DNN to access memory? Both reading and writing are non-differentiable functions&#x2014;as devised in digital computers&#x2014;so they are incompatible with backpropagation, which is the standard approach to training a DNN. There has been a substantial amount of work that infuses recurrent neural networks with external memory; however, in this post, we&#x2019;ll make use of the Gumbel-softmax reparameterization to allow a feedforward DNN to read from an external memory bank.</p><h2 id="integers-and-indexing">Integers and indexing</h2><p>A real-world digital computer has a finite bank of memory which is addressed with a binary number. When written to, the memory location associated with that number stores a value that the computer can load for later processing. This ability to store and load data to memory enables computers to compute a large class of functions (that is, the set of functions that are computable by a Turing machine or any equivalent model of computation). Theoretically, recurrent neural networks <a href="https://www.sciencedirect.com/science/article/pii/089396599190080F?ref=jcreinhold.com">are capable</a> of simulating a Turing machine and, consequently, can compute the same set of functions. However, to the best of my knowledge, no such proof has been shown for feedforward networks (especially ones of finite depth and width).</p><p>Let&apos;s briefly consider how memory works in a real-world computer. Suppose you have a computer with 16-bit memory. You might have an instruction located at memory address (hexadecimal) <code>0x0000</code> and a datum, necessary for some computation, stored at <code>0x1000</code>. Suppose the computer has an instruction at <code>0x0000</code> saying to load the datum at <code>0x1000</code> to a register. 
When the computer is started&#x2014;in this simplistic example&#x2014;a piece of circuitry called a program counter will configure the state of the circuitry in the CPU to load the instruction at <code>0x0000</code>, which configures the state of the circuitry to load the datum at <code>0x1000</code> into a specified register to be used for the further computation (that is, the instructions that follow <code>0x0000</code>).</p><p>Standard feedforward DNNs cannot learn to do this type of loading or any type of loading similar to this from an external memory source. DNNs have real-valued (really, floating-point) parameters and treat input and output as real-valued arrays of numbers, even if the input and output are integer-valued (or, as in the case of categorical variables, can be mapped to the integers). For example, 2D natural images are often composed of integers between 0 and 255; however, this property is ignored in standard DNNs and all inputs are cast to real numbers. In classification tasks, the output of the network&#x2014;before any user-defined thresholding or argmax operation&#x2014;represents a probability and is correspondingly real-valued. This transformation from integers to real-valued numbers is necessary for backpropagation to update the weights of the DNN.</p><p>But what if you are in a situation where you require integer values in the middle of the network? A naive solution would be to use rounding. This, however, is non-differentiable (or at least has a zero gradient almost everywhere). So if rounding is used, the DNN cannot update its weights.</p><p>Let&#x2019;s assume that we are working with an external memory bank that is an array of numbers (a tensor, if you like), and the job of the network is to calculate the (one-hot) index location of a value in that memory bank. A way to do this is to allow for some fuzziness and take a weighted average of several locations (using the softmax function on what represents the memory indices) as in the <a href="https://arxiv.org/abs/1410.5401?ref=jcreinhold.com">Neural Turing Machine</a>. However, I&#x2019;d argue that soft indexing is not as interpretable as simply addressing <em>one</em> location from memory.</p><p>As a silly example to illustrate the point, let&#x2019;s say you want to train one end-to-end DNN to compare user-input pictures of dogs and cats to a prototypical picture of a dog and cat based on whatever class of image the user input (e.g., compare a user-input picture of a cat, in some way, to the prototypical picture of a cat). If you use soft indexing, then the comparison section of the network will receive a pixelwise weighted average of the prototypical cat and dog images to compare to the user image. The more desirable function of the DNN would be to use only the prototypical cat image if the user image is of a cat and likewise for a dog.</p><h3 id="hard-indexing-with-the-gumbel-softmax-reparameterization">Hard indexing with the Gumbel-softmax reparameterization</h3><p>We can create hard indices&#x2014;true one-hot index vectors&#x2014;through a trick called the <a href="https://arxiv.org/abs/1611.01144?ref=jcreinhold.com">Gumbel-softmax (GS) reparameterization</a>. 
The GS relaxes the hard indexing into a soft indexing problem that can be (asymptotically) viewed as an argmax operation.</p><p>Similar to the Gaussian reparameterization discussed in the <a href="https://arxiv.org/abs/1312.6114?ref=jcreinhold.com">variational autoencoder</a>, the GS allows backpropagation to work with a sampling step in the middle of a DNN&#x2014;a normally non-differentiable operation. The GS works by changing the sampling operation in such a way that all of the component operations are differentiable. While there are other ways that you can estimate the gradient of a function of integer-valued variables, the GS reparameterization provides a better estimate of the gradient (in the sense that the gradient estimate has lower variance).</p><p>While the theoretical construction of the GS is outside the scope of this post, I&apos;ll give a high-level overview of how the GS works and provide working code in PyTorch. Suppose we have a neural network $f(\cdot)$ with a hidden layer $f_i(\cdot)$ producing a representation $h \in \mathbb{R}^n$ that we intend to use as memory indices.</p><p>To use $h$ to create a one-hot vector that indexes a memory location, we:</p><!--kg-card-begin: markdown--><ol>
<li>Generate a sample $g\in\mathbb{R}^n$ from a Gumbel distribution,</li>
<li>Add $h$ and $g$,</li>
<li>Divide the result by a <em>temperature</em> value $\tau &gt; 0$,</li>
<li>Take softmax of the result.</li>
</ol>
<!--kg-card-end: markdown--><p>The temperature value $\tau$, as it goes to zero, makes the softmax operation functionally equivalent to an argmax operation. An example of the result is shown in Fig. 1, where an example distribution is given in the left-most image and the result of the GS sampling process is given in the four plots to the right, each with different values of $\tau$. We see that, as $\tau$ gets close to 0, the sampling does function as an approximate argmax operation. As $\tau$ grows, the operation looks less and less like an argmax. In practice, setting $\tau$ very close to zero causes exploding gradients, which make training unstable, so we set $\tau$ reasonably low or scale it down as training progresses.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jcreinhold.com/content/images/2023/04/temperature.png" class="kg-image" alt="Learning to read from memory with a neural network" loading="lazy" width="930" height="264" srcset="https://www.jcreinhold.com/content/images/size/w600/2023/04/temperature.png 600w, https://www.jcreinhold.com/content/images/2023/04/temperature.png 930w" sizes="(min-width: 720px) 720px"><figcaption><strong>Fig. 1</strong>: Example of a distribution and the resulting sampling using the Gumbel-softmax trick with varying levels of temperature ($\tau\in\{0.1,0.25,0.5,2.0\}$)</figcaption></figure><p>Because we cannot set $\tau$ to zero, we will have non-zero values in other indices&#x2014;a result that we want to avoid according to the problem setup. We will use the <em>straight-through</em> GS trick to create a true one-hot vector. Let $y$ be the output of the GS and let $y_{\mathrm{oh}}$ be the one-hot construction of the argmax of $y$. Then the straight-through GS simply does the following:</p><p>$$ y_{\mathrm{st}} = \mathrm{detach}(y_{\mathrm{oh}} - y) + y,$$</p><p>where $\mathrm{detach}(\cdot)$ removes its argument from the computation graph that keeps track of the gradients for backpropagation. Addition is differentiable, and the resulting vector $y_{\mathrm{st}}$ has the same values as $y_{\mathrm{oh}}$ but is still on the computation graph (with the gradients of $y$).</p><p>Implementing the above in PyTorch is simple, and the code is below, where the <code>gumbel_softmax</code> function takes in the hidden representation (e.g., $h$) and outputs the straight-through GS result.</p><pre><code class="language-python">import torch
import torch.nn.functional as F
from torch import Tensor, nn  # used by the MemoryTensor module below

def sample_gumbel(logits, eps=1e-8):
    # sample Gumbel(0, 1) noise via inverse transform sampling
    U = torch.rand_like(logits)
    return -torch.log(-torch.log(U + eps) + eps)

def sample_gumbel_softmax(logits, temperature):
    # perturb the logits with Gumbel noise and apply a tempered softmax
    y = logits + sample_gumbel(logits)
    return F.softmax(y / temperature, dim=-1)

def gumbel_softmax(logits, temperature=0.67):
    # straight-through GS: one-hot values forward, soft gradients backward
    y = sample_gumbel_softmax(logits, temperature)
    shape = y.size()
    _, ind = y.max(dim=-1)
    y_hard = torch.zeros_like(y).view(-1, shape[-1])
    y_hard.scatter_(1, ind.view(-1, 1), 1)
    y_hard = y_hard.view(*shape)
    return (y_hard - y).detach() + y
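
# illustrative usage (my addition, not in the original post): logits for a
# batch of 4 examples over 5 memory slots; each row of the output is exactly
# one-hot, yet gradients still flow to the logits through the straight-through
# estimator
logits = torch.randn(4, 5, requires_grad=True)
one_hot = gumbel_softmax(logits)
print(one_hot.sum(dim=-1))  # a tensor of ones: one location chosen per row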
</code></pre><h2 id="learning-to-address-memory">Learning to address memory</h2><p>I&#x2019;ll show a very basic example of a network that can learn to address a memory location using the Gumbel-softmax. (A Jupyter notebook with the full implementation can be found <a href="https://nbviewer.org/gist/jcreinhold/c04d2dd682e1f0f9af6ff63839e49438?ref=jcreinhold.com">here</a>.)</p><p>We&#x2019;ll have a feedforward neural network fit a (noisy) quadratic function using a memory bank of 5 values uniformly spaced in the interval $[0, 1]$. If the network learns to pick the correct values of the memory bank, we should see the resulting estimator get close to the best piecewise constant estimate of the true quadratic function, where the constants are those values in the memory bank.</p><p>The training data&#x2014;shown in Fig. 2&#x2014;will consist of values between -1 and 1, representing the independent variable $x$, with corresponding dependent variables $y=x^2+\varepsilon$ where $\varepsilon \stackrel{iid}{\sim} \mathcal N(0,\sigma^2)$.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jcreinhold.com/content/images/2023/04/training-data.png" class="kg-image" alt="Learning to read from memory with a neural network" loading="lazy" width="372" height="264"><figcaption><strong>Fig. 2</strong>: Training data for the experiment</figcaption></figure><p>The network will consist of a few fully-connected layers using batch norm and ReLU activations (but no ReLU or batch norm on the final layer).</p><p>The output of these several layers will be fed into the <code>gumbel_softmax</code> function, as described in the previous section, and the resulting one-hot vector will be used to address the memory bank by simply matrix multiplying (or taking the inner product, if you prefer) the one-hot vector with the memory bank. An example implementation of external memory for our simple setup is shown in PyTorch below, where the the instantiation argument is a tensor $\mathrm{memory} \in \mathbb{R}^m$ and $\mathrm{idx} \in \mathbb{R}^{n\times m}$ where $m$ is the number of memory elements and $n$ is the batch size.</p><pre><code class="language-python">class MemoryTensor(nn.Module):
    def __init__(self, memory: Tensor):
        super().__init__()
        memory.unsqueeze_(1)  # shape: (m,) -> (m, 1)
        self.memory = memory

    def __getitem__(self, idx: Tensor):
        if self.training:
            # differentiable hard indexing via the straight-through GS
            idx = gumbel_softmax(idx)  # rows of idx become one-hot
            out = idx @ self.memory    # select one stored value per row
        else:
            # deterministic lookup (no sampling) in production
            idx = torch.argmax(idx, dim=1)
            out = self.memory[idx]
        return out</code></pre><p>The <code>MemoryTensor</code> can then be indexed in the network with something like the following (where <code>net</code> is some neural network that outputs a tensor in $\mathbb{R}^{n\times 3}$; this illustration is not specifically related to the experiment).</p><pre><code class="language-python">memory = MemoryTensor(torch.tensor([1.,2.,3.]))
idx = net(x)
y = memory[idx]</code></pre><p>Note that the <code>MemoryTensor</code> uses the GS during training and argmax in production. The argmax is often more appropriate in production because it removes the random sampling; in some cases, the random sampling may be desired, but not in this application.</p><p>Because only one entry of the index vector is 1 and all other entries are 0 (in both training and production), the result will consist only of the value from one memory location; that is, the network will have functionally addressed a memory location and read its value.</p><p>We train the network for several hundred iterations with the Adam optimizer using MSE as a loss function and get the result seen in Fig. 3 in the dashed blue &quot;Fit&quot; line. The optimal fit&#x2014;the best piecewise constant estimate of the quadratic&#x2014;using the values in memory is shown in the solid green line labeled &quot;Best Fit.&quot; The two are (qualitatively) nearly identical, and we can say that this simple experiment was successful.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jcreinhold.com/content/images/2023/04/fit.png" class="kg-image" alt="Learning to read from memory with a neural network" loading="lazy" width="483" height="466"><figcaption><strong>Fig. 3</strong>: Resulting output of the neural network trained to use the values stored in memory (Fit) compared to the optimal fit using those same memory values (Best Fit).</figcaption></figure><p>Caveat: This method will not know how to use additional memory locations. If you doubled the resolution of the memory bank, the network would fail to use that extra information. So this form of memory addressing is brittle. But you could swap the memory out for another set of variables if the mapping was the same. For example, if the quadratic was changed to $2x^2$, then you could swap the memory with the values multiplied by two and expect the same performance, which is something you can&apos;t do with a standard neural network.</p><h2 id="takeaways">Takeaways</h2><p>We showed a method to read from external memory inside a feedforward neural network, which is normally an operation that prevents the network from being trained. We used the Gumbel-softmax reparameterization to create true one-hot vectors that index one location in memory, and we walked through a toy experiment showing that the proposed method to read from memory does work as expected.</p><p>While the toy experiment was too simplistic to showcase the possibilities of this framework, you can, for example, easily extend this method to train a network to read from a bank of images. Like the cat and dog example discussed in the <em>Integers and Indexing</em> section, it seems plausible that there are scenarios where you want to use some prototypical or ideal examples of a class of images inside the network to be used for further processing or improve the performance on some task. 
To be more concrete, training a network end-to-end with an external memory bank of images could be used to improve classification (e.g., find a nearest example) or improve multi-atlas segmentation by choosing the best image to register to another image.</p><p>Regardless of whether you believe that this method can improve performance in the concrete examples above, the research points to the fact that reading from external memory does facilitate certain types of computation, and this method is one simple way to do so&#x2014;potentially increasing the space of functions practically approximated by a feedforward neural network.</p>]]></content:encoded></item><item><title><![CDATA[Aleatory or epistemic? WTF are those?]]></title><description><![CDATA[Defining and quantifying aleatory and epistemic uncertainty in deep neural networks]]></description><link>https://www.jcreinhold.com/aleatory-or-epistemic/</link><guid isPermaLink="false">622e3dc7e647261e657d5911</guid><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Jacob Reinhold]]></dc:creator><pubDate>Fri, 31 Jul 2020 02:10:50 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1570303345338-e1f0eddf4946?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1570303345338-e1f0eddf4946?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="Aleatory or epistemic? WTF are those?"><p><a href="https://www.deeplearningbook.org/?ref=jcreinhold.com">Deep neural networks</a> (DNNs) are easy-to-implement, versatile machine learning models that can achieve state-of-the-art performance in many domains (for example, <a href="https://arxiv.org/pdf/2004.08955v1.pdf?ref=jcreinhold.com">computer vision</a>, <a href="https://arxiv.org/abs/2005.14165?ref=jcreinhold.com">natural language processing</a>, <a href="https://arxiv.org/pdf/2005.09629v1.pdf?ref=jcreinhold.com">speech recognition</a>, <a href="https://dl.acm.org/doi/abs/10.1145/3285029?ref=jcreinhold.com">recommendation systems</a>). DNNs, however, are not perfect. You can read any <a href="https://www.economist.com/technology-quarterly/2020/06/11/an-understanding-of-ais-limitations-is-starting-to-sink-in?ref=jcreinhold.com">number</a> <a href="https://arxiv.org/abs/1604.00289?ref=jcreinhold.com">of</a> <a href="https://www.aaai.org/ojs/index.php/aimagazine/article/view/2756?ref=jcreinhold.com">articles</a>, <a href="https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7?ref=jcreinhold.com">blog posts</a>, and <a href="http://rebooting.ai/?ref=jcreinhold.com">books</a> discussing the various problems with supervised deep learning. In this article we&apos;ll focus on a (relatively) narrow but major issue: the inability for a standard DNN to reliably show when it is uncertain about a prediction. For a <a href="https://www.youtube.com/watch?v=GiPe1OiKQuk&amp;ref=jcreinhold.com">Rumsfeldian</a> take on it: The inability of DNNs to know &quot;known unknowns.&quot;</p><p>As a simple example of this failure mode in DNNs, consider training a DNN for a binary classification task. 
You might reasonably presume that the softmax (or sigmoid) output of a DNN could be used to measure how certain or uncertain the DNN is in its prediction; you would expect that seeing a softmax output close to 0 or 1 would indicate certainty, and an output close to 0.5 would indicate uncertainty. In reality, <a href="https://arxiv.org/pdf/1706.04599.pdf?ref=jcreinhold.com">the softmax outputs are rarely close to 0.5</a> and are, more frequently than not, close to 0 or 1 regardless of whether the DNN is making a correct prediction. Unfortunately, this fact makes naive uncertainty estimates unreliable (for instance, <a href="https://en.wikipedia.org/wiki/Entropy_%28information_theory%29?ref=jcreinhold.com">entropy</a> over the softmax outputs).</p><p>To be fair, uncertainty estimates are not needed for every application of a DNN. If a social media company uses a DNN to detect faces in images so that its users can more easily tag their friends, and the DNN fails, then the failure of the method is nearly inconsequential. A user might be slightly inconvenienced, but in low-stakes environments like social media or advertising, uncertainty estimates aren&apos;t vital to creating value from a DNN.</p><p>In high-stakes environments, however, like self-driving cars, health care, or military applications, a measure of how uncertain the DNN is in its prediction could be vital. Uncertainty measurements can reduce the risk of deploying a model because they can alert a user to the fact that a scenario is either inherently difficult to do prediction in, or the scenario has not been seen by the model before.</p><p>In a self-driving car, it seems plausible that a DNN should be more uncertain about predictions at night (at least in the measurements coming from optical cameras) because of the lower signal-to-noise ratio. In health care, a DNN that diagnoses skin cancer should be more uncertain if it were shown a particularly blurry image, especially if the model had not seen such blurry examples in the training set. In a model to segment satellite imagery, a DNN should be more uncertain if an adversary changed how they disguise certain military installations. If the uncertainty inherent in these situations were relayed to the user, the information could be used to change the behavior of the system in a safer way.</p><p>In this article, we explore how to estimate two types of statistical uncertainty alongside a prediction in a DNN. We first discuss the definition of both types of uncertainty, and then we highlight one popular and easy-to-implement technique to estimate these types of uncertainty. Finally, we show and implement some examples for both classification and regression that make use of these uncertainty estimates.</p><p>For those who are most interested in looking at code examples, here are two Jupyter Notebooks: one with a <a href="https://nbviewer.jupyter.org/gist/jcreinhold/cddf290b1d3722b0c88bbc3c82df38a3?ref=jcreinhold.com">toy regression example</a> and the other with a <a href="https://nbviewer.jupyter.org/gist/jcreinhold/745385337944dbcc98b47578b5a769f9?ref=jcreinhold.com">toy classification example</a>. 
There are also PyTorch-based code snippets in the &quot;Examples and Applications&quot; section below.</p><h3 id="what-do-we-mean-by-uncertainty-">What do we mean by &apos;uncertainty&apos;?</h3><p>Uncertainty is <a href="https://dictionary.cambridge.org/dictionary/english/uncertainty?ref=jcreinhold.com">defined by the Cambridge Dictionary</a> as: &quot;a situation in which something is not known.&quot; There are several reasons why something may not be known, and&#x2014;taking a statistical perspective&#x2014;we will discuss two types of uncertainty called <em>aleatory</em> (sometimes referred to as <em>aleatoric</em>) and <em>epistemic</em> uncertainty.</p><p>Aleatory uncertainty relates to an objective or physical concept of uncertainty&#x2014;it is a type of uncertainty that is intrinsic to the data-generating process. Since aleatory uncertainty has to do with an intrinsic quality of the data, we presume it cannot be decreased by collecting more data; that is, it is <em>irreducible</em>.</p><p>Aleatory uncertainty can be explained best with a simple example: Suppose we have a coin which has some positive probability of being heads or tails. Then, even if the coin is biased, we cannot predict&#x2014;with certainty&#x2014;what the next toss will be, regardless of how many observations we make. (For instance, if the coin is biased such that heads turn up with probability 0.9, we might reasonably guess that heads will show up in the next toss, but we cannot be certain that it will happen.)</p><p>Epistemic uncertainty relates to a subjective or personal concept of uncertainty&#x2014;it is a type of uncertainty due to knowledge or ignorance of the true data-generating process. Since this type of uncertainty has to do with knowledge, we presume that it can be decreased (for example, when more data has been collected and used for training); that is, it is <em>reducible</em>.</p><p>Epistemic uncertainty can be explained with a regression example. Suppose we are fitting a linear regression model and we have independent variables $x$ between -1 and 1, and corresponding dependent variables $y$ for all $x$. Suppose we chose a linear model because we believe that when $x$ is between -1 and 1, the model is linear. We don&apos;t, however, know what happens when a test sample $x&apos;$ is far outside this range; say at $x&apos;=100$. So, in this scenario, there is uncertainty about the model specification (for example, the true function may be quadratic) and there is uncertainty because the model hasn&apos;t seen data in the range of the test sample. These uncertainties can be bundled into uncertainty regarding the knowledge of the true data-generating distribution, which is epistemic uncertainty.</p><p>The terms <em>aleatory</em> and <em>epistemic</em>, with regards to probability and uncertainty, seem to have been brought into the modern lexicon by Ian Hacking in his book &quot;<a href="https://www.goodreads.com/book/show/1446901.The_Emergence_of_Probability?ref=jcreinhold.com">The Emergence of Probability</a>,&quot; which discusses the history of probability from 1600&#x2013;1750. The terms are not clear for the uninitiated reader, but their definitions are related to the deepest question in the foundations of probability and statistics: <a href="https://plato.stanford.edu/entries/probability-interpret/?ref=jcreinhold.com">What does probability mean</a>? 
If you are familiar with the terms <em>frequentist</em> and <em>Bayesian</em>, then you will see the respective relationship between aleatory (objective) and epistemic (subjective) uncertainty. I&apos;m not about to solve this philosophical issue in this blog post, but know that the definitions of aleatory and epistemic uncertainty are nuanced, and what falls into which category is debatable. For a more comprehensive (but still applied) review of these terms, take a look at the article: &quot;<a href="https://www.sciencedirect.com/science/article/pii/S0167473008000556?casa_token=yWEYH7OP70sAAAAA%3AFnmF4crmHui1-NCqSx99tkamdl-ITjwAQ7TSlxlHFTpoYp_9xPtAVqtw0CH4ByYr7-gRC71c5Q&amp;ref=jcreinhold.com">Aleatory or Epistemic? Does it matter?</a>&quot;</p><p>Why is it important to distinguish between aleatory and epistemic uncertainty? Suppose we are developing a self-driving car, and we take a prototype that was trained on normal roads and have it drive through the <a href="https://en.wikipedia.org/wiki/Autodromo_Nazionale_di_Monza?ref=jcreinhold.com">Monza racing track</a>, which has extremely banked turns.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://www.jcreinhold.com/content/images/2023/04/monza.jpeg" class="kg-image" alt="Aleatory or epistemic? WTF are those?" loading="lazy" width="1024" height="686" srcset="https://www.jcreinhold.com/content/images/size/w600/2023/04/monza.jpeg 600w, https://www.jcreinhold.com/content/images/size/w1000/2023/04/monza.jpeg 1000w, https://www.jcreinhold.com/content/images/2023/04/monza.jpeg 1024w"><figcaption><strong>Fig. 1</strong>: Banked turns on the Monza Racing Track</figcaption></figure><p>Since the car hasn&apos;t seen the situation before, we would expect the image segmentation DNN in the self-driving car, for example, to be uncertain because it has never seen the sky nearly to the left of the ground. In this case, the uncertainty would be classified as epistemic because the DNN doesn&apos;t have knowledge of roads like this.</p><p>Suppose instead that we take the same self-driving car and take it for a drive on a rainy day; assume that the DNN has been trained on lots of rainy-day conditions. In this situation, there is more uncertainty about objects on the road simply due to lower visibility. In this case, the uncertainty would be classified as aleatory because there is inherently more randomness in the data.</p><p>These two situations should be dealt with differently. On the race track, the uncertainty could tell the developers that they need to gather a particular type of training data to make the model more robust, or the car could use the uncertainty to safely maneuver to a location where it can hand off control to the driver. In the rainy-day situation, the uncertainty could alert the system to simply slow down or enable certain safety features.</p><h3 id="estimating-uncertainty-in-dnns">Estimating uncertainty in DNNs</h3><p>There has been a <a href="https://arxiv.org/abs/1505.05424?ref=jcreinhold.com">cornucopia</a> of <a href="http://proceedings.mlr.press/v37/hernandez-lobatoc15.pdf?ref=jcreinhold.com">proposed</a> <a href="http://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf?ref=jcreinhold.com">methods</a> to estimate uncertainty in DNNs in recent years. Generally, uncertainty estimation is formulated in the context of Bayesian statistics. 
In a standard DNN for classification, we are implicitly training a <a href="https://en.wikipedia.org/wiki/Discriminative_model?ref=jcreinhold.com">discriminative</a> model where we obtain maximum-likelihood estimates of the neural network weights (depending on the loss function chosen to train the network). This point-estimate of the network weights is not amenable to understanding what the model knows and does not know. If we instead find a distribution over the weights, as opposed to the point-estimate, we can sample network weights with which we can compute corresponding outputs.</p><p>Intuitively, this sampling of network weights is like creating an ensemble of networks to do the task: We sample a set of &quot;experts&quot; to make a prediction. If the experts are inconsistent, there is high epistemic uncertainty. If the experts think it is too difficult to make an accurate prediction, there is high aleatory uncertainty.</p><p>In this article, we&apos;ll take a look at a popular and easy-to-implement method to estimate uncertainty in DNNs by <a href="https://arxiv.org/abs/1506.02142?ref=jcreinhold.com">Yarin Gal and Zoubin Ghahramani</a>. They showed that <a href="http://jmlr.org/papers/v15/srivastava14a.html?ref=jcreinhold.com">dropout</a> can be used to learn an approximate distribution over the weights of a DNN (as previously discussed). Then, during prediction, dropout is used to sample weights from this fitted approximate distribution&#x2014;akin to creating the ensemble of experts.</p><p>Epistemic uncertainty is estimated by taking the sample variance of the predictions from the sampled weights. The intuition behind relating sample variance to epistemic uncertainty is that the sample variance will be low when the model predicts nearly identical outputs, and it will be high when the model makes inconsistent predictions; this is akin to when the set of experts consistently makes a prediction and when they do not, respectively.</p><p>Simultaneously, aleatory uncertainty is estimated by modifying a DNN to have a second output, as well as using a modified loss function. Aleatory uncertainty will correspond to the <a href="https://arxiv.org/abs/1703.04977?ref=jcreinhold.com">estimated variance</a> <a href="https://ieeexplore.ieee.org/abstract/document/374138?ref=jcreinhold.com">of the output</a>. This predicted variance has to do with an intrinsic quantity of the data, which is why it is related to aleatory uncertainty; this is akin to when the set of experts judges the situation too difficult to make a prediction.</p><p>Altogether the final network structure is something like what is shown in Fig. 2. There is an input <strong><em>x</em></strong> which is fed to a DNN with dropout after every layer (dropout after every layer is what is originally specified, but&#x2014;in practice&#x2014;dropout after every layer often makes training too difficult). The output of this DNN is an estimated target $\hat{\mathbf{y}}$ and an estimated variance or scale parameter $\hat{\sigma}^2$.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jcreinhold.com/content/images/2023/04/bnn.png" class="kg-image" alt="Aleatory or epistemic? WTF are those?" 
loading="lazy" width="1620" height="896" srcset="https://www.jcreinhold.com/content/images/size/w600/2023/04/bnn.png 600w, https://www.jcreinhold.com/content/images/size/w1000/2023/04/bnn.png 1000w, https://www.jcreinhold.com/content/images/size/w1600/2023/04/bnn.png 1600w, https://www.jcreinhold.com/content/images/2023/04/bnn.png 1620w" sizes="(min-width: 720px) 720px"><figcaption><strong>Fig. 2</strong>: Example of DNN architecture with capability to estimate aleatory and epistemic uncertainty</figcaption></figure><p>This DNN is trained with a loss function like:</p><!--kg-card-begin: markdown--><p>$$  \mathcal{L}(\mathbf{y}, (\hat{\mathbf{y}},\hat{\sigma}^2)) = \frac{1}{M} \sum_{i=1}^M \frac{1}{2} \hat{{\sigma}}_i^{-2} \lVert \mathbf{y}_i - \mathbf{\hat{y}}_i \rVert_2^2 + \frac{1}{2} \log \hat{\sigma}_i^2 $$</p>
<!--kg-card-end: markdown--><p>or</p><!--kg-card-begin: markdown--><p>$$ \mathcal{L}(\mathbf{y}, (\hat{\mathbf{y}}, \hat{\mathbf{b}})) =  \frac{1}{M} \sum_{i=1}^M \hat{\mathbf{b}}_i^{-1} \lVert \mathbf{y}_i - \hat{\mathbf{y}}_i \rVert_1 + \log \hat{\mathbf{b}}_i $$</p>
<!--kg-card-end: markdown--><p>The above losses apply if the network is being trained for a regression task. The first loss function shown above is an MSE variant with uncertainty, whereas the second is an L1 variant. These are derived from assuming a Gaussian and <a href="https://en.wikipedia.org/wiki/Laplace_distribution?ref=jcreinhold.com">Laplace</a> distribution for the <a href="https://en.wikipedia.org/wiki/Posterior_probability?ref=jcreinhold.com#Definition">likelihood</a>, respectively, where each component is independent and the variance (or scale parameter) is estimated and fitted by the network.</p><p>As mentioned above, these loss functions have mathematical derivations, but we can intuit why this variance parameter captures a type of uncertainty: The variance parameter provides a trade-off between the variance and the MSE or L1 loss term. If the DNN can easily estimate the true value of the target (that is, get <strong><em>&#x177;</em></strong> close to the true <strong><em>y</em></strong>), then the DNN should estimate a low variance for that component so as to minimize the loss. If, however, the DNN cannot estimate the true value of the target (for example, there is low signal-to-noise ratio), then the network can minimize the loss by estimating a high variance. This will reduce the MSE or L1 loss term because that term will be divided by the variance; however, the network should not always do this because of the log variance term, which penalizes high variance estimates.</p><p>If the network is being trained for a classification (or segmentation) task, the loss would look something like this two-part loss function:</p><!--kg-card-begin: markdown--><p>$$ \hat{\mathbf{x}}_t = \hat{\mathbf{y}} + \varepsilon_t \qquad \varepsilon_t \sim \mathcal{N}(\mathbf{0},\mathrm{diag}(\hat{\sigma}^2)) $$</p>
<p>$$ \mathcal{L}(\mathbf{x}, \hat{\mathbf{x}}) = \frac{1}{T} \sum_{t=1}^T \mathrm{Cross\ entropy}(\mathbf{x}, \hat{\mathbf{x}}_t) $$</p>
<!--kg-card-end: markdown--><p>The intuition behind this loss function is as follows: When the DNN can easily estimate the right class of a component, the logit $\hat{\mathbf{y}}$ will be high for that class, and the DNN should estimate a low variance so as to minimize the added noise (so that all samples will be concentrated around the correct class). If, however, the DNN cannot easily estimate the class of the component, the logit $\hat{\mathbf{y}}$ will be low, and adding noise can, by chance, push some samples toward the correct class, which lowers the loss on average. (See <a href="https://alexgkendall.com/media/papers/alex_kendall_phd_thesis_compressed.pdf?ref=jcreinhold.com">Pg. 41 of Alex Kendall&apos;s thesis</a> for more discussion on this loss function.)</p><p>Finally, in testing, the network is sampled <em>T</em> times to create <em>T</em> estimated targets and <em>T</em> estimated variance outputs. These <em>T</em> outputs are then combined in various ways to make the final estimated target and uncertainty estimates, as shown in Fig. 3.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jcreinhold.com/content/images/2023/04/uncertainty.png" class="kg-image" alt="Aleatory or epistemic? WTF are those?" loading="lazy" width="1454" height="966" srcset="https://www.jcreinhold.com/content/images/size/w600/2023/04/uncertainty.png 600w, https://www.jcreinhold.com/content/images/size/w1000/2023/04/uncertainty.png 1000w, https://www.jcreinhold.com/content/images/2023/04/uncertainty.png 1454w" sizes="(min-width: 720px) 720px"><figcaption><strong>Fig. 3</strong>: DNN output to final estimated target and uncertainty estimates</figcaption></figure><p>Mathematically, the epistemic and aleatory uncertainty are (for the MSE regression variant):</p><!--kg-card-begin: markdown--><p>$$ \mathrm{Aleatory\ uncertainty} = \frac{1}{T} \sum_{t=1}^T \hat{\sigma}^2_t$$</p>
<p>$$ \mathrm{Epistemic\ uncertainty} = \frac{1}{T} \sum_{t=1}^T (\mathbf{\hat{y}}_t - \bar{\mathbf{y}})^2 $$</p>
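<!--kg-card-end: markdown--><p>To make the sampling step concrete, below is a minimal sketch of test-time sampling with dropout (the helper name and output shapes here are illustrative, not from the notebooks). It assumes a PyTorch model whose forward pass returns a tuple of the estimated target and the log of the variance, and it re-enables only the dropout layers so that layers like batch norm stay in evaluation mode.</p><pre><code class="language-python">import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model:nn.Module, x:torch.Tensor, n_samples:int=50):
    &quot;&quot;&quot; stack n_samples stochastic forward passes with dropout kept active &quot;&quot;&quot;
    model.eval()
    # turn dropout back on, module by module, leaving everything else in eval mode
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train()
    outs = [model(x) for _ in range(n_samples)]  # each assumed to be (yhat, log-variance)
    yhat = torch.stack([o[0] for o in outs])  # sample dimension first: (T, ...)
    s = torch.stack([o[1] for o in outs])
    return yhat, s</code></pre><p>The stacked outputs, with the sample dimension first, are what the uncertainty functions implemented in the examples below consume.</p><!--kg-card-begin: markdown-->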
<!--kg-card-end: markdown--><p>There are various interpretations of epistemic uncertainty for the classification case: <a href="https://arxiv.org/abs/1808.01200?ref=jcreinhold.com">entropy</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/29259224/?ref=jcreinhold.com">sample variance</a>, <a href="https://arxiv.org/abs/1703.02910?ref=jcreinhold.com">mutual information</a>. Each has been shown to be useful in its own right, and the choice among them will be application dependent.</p><h3 id="examples-and-applications">Examples and applications</h3><p>To make the theory more concrete, we&apos;ll go through two toy examples for estimating uncertainty with DNNs in regression and classification tasks with PyTorch. The code blocks below are excerpts from full implementations, which are available in Jupyter notebooks (mentioned at the beginning of the next two subsections). Finally, we&apos;ll discuss calculating uncertainty in a real-world data example with medical images.</p><h4 id="regression-example">Regression example</h4><p>In the <a href="https://nbviewer.jupyter.org/gist/jcreinhold/cddf290b1d3722b0c88bbc3c82df38a3?ref=jcreinhold.com">regression notebook</a>, we fit a very simple neural network&#x2014;consisting of two fully-connected layers with dropout on the hidden layer&#x2014;to one-dimensional input and output data with the MSE variant of the uncertainty loss (implemented below).</p><pre><code class="language-python">class ExtendedMSELoss(nn.Module):
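    # excerpt; assumes the notebook&apos;s imports: torch, torch.nn as nn, torch.nn.functional as F, typing.Tuple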
    &quot;&quot;&quot; modified MSE loss for variance fitting &quot;&quot;&quot;
    def forward(self, out:Tuple[torch.Tensor, torch.Tensor], y:torch.Tensor) -&gt; torch.Tensor:
        yhat, s = out
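        # s is the predicted log-variance, so exp(-s) is the precision 1/sigma^2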
        loss = torch.mean(0.5 * (torch.exp(-s) * F.mse_loss(yhat, y, reduction=&apos;none&apos;) + s))
        return loss</code></pre><p>Note that instead of fitting the variance term directly, we fit the <em>log of the variance</em> term for numerical stability.</p><p>In the regression scenario, we could also use the L1 variant of the uncertainty loss which is in the notebook and implemented below.</p><pre><code class="language-python">class ExtendedL1Loss(nn.Module):
    &quot;&quot;&quot; modified L1 loss for scale param. fitting &quot;&quot;&quot;
    def forward(self, out:Tuple[torch.Tensor, torch.Tensor], y:torch.Tensor) -&gt; torch.Tensor:
        yhat, s = out
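        # here s is the log of the Laplace scale parameter b, so exp(-s) = 1/b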
        loss = torch.mean((torch.exp(-s) * F.l1_loss(yhat, y, reduction=&apos;none&apos;)) + s)
        return loss</code></pre><p>Sometimes using L1 loss instead of MSE loss results in <a href="https://ieeexplore.ieee.org/abstract/document/7797130?ref=jcreinhold.com">better performance</a> for regression tasks, although this is application dependent.</p><p>The aleatory and epistemic uncertainty estimates in this scenario are then computed as in the implementation below (see the notebook for more context).</p><pre><code class="language-python">def regression_uncertainty(yhat:torch.Tensor, s:torch.Tensor, mse:bool=True) -&gt; Tuple[torch.Tensor, torch.Tensor]:
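    # excerpt; assumes the notebook&apos;s &quot;from typing import Tuple&quot;; yhat and s stack the T sampled outputs along dim 0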
    &quot;&quot;&quot; calculate epistemic and aleatory uncertainty quantities based on whether MSE or L1 loss used &quot;&quot;&quot;
    # variance over samples (dim=0), mean over channels (dim=1, after reduction by variance calculation)
    epistemic = torch.mean(yhat.var(dim=0, unbiased=True), dim=1, keepdim=True)
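    # a Laplace(mu, b) distribution has variance 2*b^2, hence the 2*exp(s)**2 in the L1 branch below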
    aleatory = torch.mean(torch.exp(s), dim=0) if mse else torch.mean(2*torch.exp(s)**2, dim=0)
    return epistemic, aleatory</code></pre><p>In Fig. 4, we visualize the fit function and the uncertainty results. In the plot to the far right, we show the thresholded epistemic uncertainty, which demonstrates the ability of uncertainty estimates to detect out-of-distribution data (at least in this toy scenario).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jcreinhold.com/content/images/2023/04/uncertainty-simple-function.png" class="kg-image" alt="Aleatory or epistemic? WTF are those?" loading="lazy" width="938" height="482" srcset="https://www.jcreinhold.com/content/images/size/w600/2023/04/uncertainty-simple-function.png 600w, https://www.jcreinhold.com/content/images/2023/04/uncertainty-simple-function.png 938w" sizes="(min-width: 720px) 720px"><figcaption><strong>Fig. 4</strong>: Various types of uncertainty for a regression example. Original training data in orange. The two plots on the left show the function fit by the neural network in blue with aleatory and epistemic uncertainty in the first and second plot, respectively. The plot on the far right shows a thresholded epistemic uncertainty. See the Jupyter Notebook for the full implementation.</figcaption></figure><h4 id="classification-example">Classification example</h4><p>In the <a href="https://nbviewer.jupyter.org/gist/jcreinhold/745385337944dbcc98b47578b5a769f9?ref=jcreinhold.com">classification notebook</a>, we again fit a neural network composed of two fully-connected layers with dropout on the hidden layer. In this case, we are trying to do binary classification. Consequently, the loss function is as implemented below.</p><pre><code class="language-python">class ExtendedBCELoss(nn.Module):
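    # excerpt; assumes the same notebook imports as above
    # NOTE: the second output (sigma) is treated as a log-standard-deviation: exp(sigma) scales the logit noise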
    &quot;&quot;&quot; modified BCE loss for variance fitting &quot;&quot;&quot;
    def forward(self, out:Tuple[torch.Tensor, torch.Tensor], y:torch.Tensor, n_samp:int=10) -&gt; torch.Tensor:
        logit, sigma = out
        dist = torch.distributions.Normal(logit, torch.exp(sigma))
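        # rsample uses the reparameterization trick, so gradients flow through the sampled logits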
        mc_logs = dist.rsample((n_samp,))
        loss = 0.
        for mc_log in mc_logs:
            loss += F.binary_cross_entropy_with_logits(mc_log, y)
        loss /= n_samp
        return loss</code></pre><p>There are numerous uncertainty estimates we could compute in this scenario. In the implementation below, we calculate epistemic, entropy, and aleatory uncertainty. Entropy could reasonably be argued to belong to either aleatory or epistemic uncertainty, but below it is separated out so that aleatory and epistemic uncertainty are calculated as previously described.</p><pre><code class="language-python">def classification_uncertainty(logits:torch.Tensor, sigmas:torch.Tensor, eps:float=1e-6) -&gt; Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
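    # excerpt; logits and sigmas stack the T sampled outputs along dim 0 (assumed shape (T, batch))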
    &quot;&quot;&quot; calculate epistemic, entropy, and aleatory uncertainty quantities &quot;&quot;&quot;
    probits = torch.sigmoid(logits)
    epistemic = probits.var(dim=0, unbiased=True)
    probit = probits.mean(dim=0)
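    # binary entropy of the mean predicted probability; eps guards against log2(0)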
    entropy = -1 * (probit * (probit + eps).log2() + ((1 - probit) * (1 - probit + eps).log2()))
    aleatory = torch.exp(sigmas).mean(dim=0)
    return epistemic, entropy, aleatory</code></pre><p>In Fig. 5, we visualize the resulting epistemic and aleatory uncertainty, as well as entropy, over the training data. As we can see, the training data classes overlap near zero, and the uncertainty measures peak there. In this toy example, all three measures of uncertainty are highly correlated. Discussion as to why is provided in the notebook for the interested reader.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.jcreinhold.com/content/images/2023/04/uncertainty-histograms-1.png" class="kg-image" alt="Aleatory or epistemic? WTF are those?" loading="lazy" width="1144" height="424" srcset="https://www.jcreinhold.com/content/images/size/w600/2023/04/uncertainty-histograms-1.png 600w, https://www.jcreinhold.com/content/images/size/w1000/2023/04/uncertainty-histograms-1.png 1000w, https://www.jcreinhold.com/content/images/2023/04/uncertainty-histograms-1.png 1144w" sizes="(min-width: 720px) 720px"><figcaption><strong>Fig. 5</strong>: Various measures of uncertainty for a binary classification example. See the <a href="https://nbviewer.jupyter.org/gist/jcreinhold/745385337944dbcc98b47578b5a769f9?ref=jcreinhold.com">Jupyter Notebook</a> for the full implementation.</figcaption></figure><h4 id="medical-image-example">Medical image example</h4><p>In this last example, I&apos;ll show some results and applications of uncertainty in a real-world example published as a conference paper (<a href="https://arxiv.org/abs/2002.04626?ref=jcreinhold.com">pre-print here</a>). The task explored is an image-to-image translation task, akin to the notable <a href="https://phillipi.github.io/pix2pix/?ref=jcreinhold.com">pix2pix</a> example, but with medical images. In this case, we wanted to make a computed tomography (CT) image of the brain look like the corresponding magnetic resonance (MR) image of the brain. This is a regression task, and we used the MSE variant of the uncertainty loss to train a <a href="https://arxiv.org/abs/1505.04597?ref=jcreinhold.com">U-Net</a> modified to have <a href="https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Tompson_Efficient_Object_Localization_2015_CVPR_paper.pdf?ref=jcreinhold.com">spatial dropout</a> (see <a href="https://towardsdatascience.com/dropout-on-convolutional-layers-is-weird-5c6ab14f19b2?ref=jcreinhold.com">here</a> for a discussion as to why <em>spatial</em> dropout) after every layer, and to output two images instead of only one; one output is the estimated MR image and the other is the pixel-wise variance.</p><p>Example inputs and outputs are shown in Fig. 6. The CT image on the far left has an anomaly in the left hemisphere of the occipital lobe (lower-left of the brain in the image; it is more easily visualized in the corresponding MR image to the right). The DNN was only trained on healthy images, so the DNN should be ignorant of such anomalous data, and it should reflect this&#x2014;according to the theory of epistemic uncertainty as previously discussed&#x2014;by having high sample variance (that is, high epistemic uncertainty) in that region.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://www.jcreinhold.com/content/images/2023/04/mr-images-uncertainty.png" class="kg-image" alt="Aleatory or epistemic? WTF are those?"
loading="lazy" width="1725" height="411" srcset="https://www.jcreinhold.com/content/images/size/w600/2023/04/mr-images-uncertainty.png 600w, https://www.jcreinhold.com/content/images/size/w1000/2023/04/mr-images-uncertainty.png 1000w, https://www.jcreinhold.com/content/images/size/w1600/2023/04/mr-images-uncertainty.png 1600w, https://www.jcreinhold.com/content/images/2023/04/mr-images-uncertainty.png 1725w" sizes="(min-width: 1200px) 1200px"><figcaption><strong>Fig. 6</strong>: From left to right &#x2013; CT image is input to the network, MR image is target in training (shown here because anomaly visible in the occipital lobe in the left hemisphere), pixel-wise epistemic uncertainty, pixel-wise aleatory uncertainty, and the pixel-wise ratio of epistemic over aleatory is shown under the title &quot;Scibilic&quot; which clearly highlights the anomalous region.</figcaption></figure><p>When this image was input to the network, we calculated the epistemic and aleatory uncertainty. The anomaly is clearly highlighted in the epistemic uncertainty, but there are many other regions which are also predicted to have high epistemic uncertainty. If we take the pixel-wise ratio of epistemic and aleatory uncertainty, we get the image shown on the far-right, labeled &quot;Scibilic&quot; (which is discussed more in the pre-print). This image is easily thresholded to predict the anomaly (the out-of-distribution region of the image).</p><p>This method of anomaly detection is by no means foolproof. It is quite fickle actually, but it shows a way to apply this type of uncertainty estimation for real-world data.</p><h3 id="takeaways">Takeaways</h3><p>Uncertainty estimates in machine learning have the potential to reduce the risk of deploying models in high-stakes scenarios. Aleatory and epistemic uncertainty estimates can show the user or developer different information about the performance of a DNN and can be used to modify the system for better safety. We discussed and implemented one approach to uncertainty estimation with dropout. The approach is not perfect, dropout-based uncertainty provides a way to get some&#x2014;often reasonable&#x2014;measure of uncertainty. Whether the measure is trustworthy enough to be used in deployment is another matter. The question practitioners should ask themselves when implementing this method is whether the resulting model with uncertainty estimates is more useful&#x2014;for example, safer&#x2014;than a model without uncertainty estimates.</p>]]></content:encoded></item><item><title><![CDATA[New website and blog]]></title><description><![CDATA[<p> I&apos;ve been getting into the habit of regularly publishing my writing online. Writing for a broader audience than academic papers has been a great exercise in clarifying my thinking on a subject and improving my writing in general. 
</p><p>Previously I had a personal website using GitHub pages (along</p>]]></description><link>https://www.jcreinhold.com/new-webiste-and-blog/</link><guid isPermaLink="false">622e3dc7e647261e657d5912</guid><dc:creator><![CDATA[Jacob Reinhold]]></dc:creator><pubDate>Wed, 29 Jul 2020 16:38:00 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1557079310-f6e639f0d069?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1557079310-f6e639f0d069?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=2000&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjExNzczfQ" alt="New website and blog"><p> I&apos;ve been getting into the habit of regularly publishing my writing online. Writing for a broader audience than academic papers has been a great exercise in clarifying my thinking on a subject and improving my writing in general. </p><p>Previously I had a personal website using GitHub pages (along with an attached Jekyll blog which served as a repository for URLs and notes), and I separately posted longer-form articles on <a href="https://medium.com/@jcreinhold?ref=jcreinhold.com">Medium</a>. (I also occasionally write longer-form articles for <a href="https://innolitics.com/articles/?ref=jcreinhold.com">Innolitics</a>.) My personal website was bare-bones to the point that it wasn&apos;t responsive to screen sizes. And I wasn&apos;t happy with the experience of writing technical content on Medium; Medium doesn&apos;t support <a href="https://www.mathjax.org/?ref=jcreinhold.com">MathJax</a> or syntax highlighting which makes writing about machine learning arduous.</p><p>Those problems plus some inspiration from <a href="https://zalberico.com/essay/2020/07/14/the-serfs-of-facebook.html?ref=jcreinhold.com">other blog posts</a> motivated me to create a new website and blog using the <a href="https://ghost.org/?ref=jcreinhold.com">Ghost</a> platform. This website is the result, and I think it is nicer looking than my old personal website, responsive, and supports technical writing much better than Medium.</p><p>As far as the content I plan to post, I&apos;ll continue to focus on topics in medical image analysis and machine learning (which is the focus of my PhD). You can follow me here using an RSS feed or on Twitter at <a href="https://twitter.com/JacobCReinhold?ref=jcreinhold.com">@JacobCReinhold</a>.</p>]]></content:encoded></item></channel></rss>