# Image enhancement and the data processing inequality

Information theory studies the communication of information with the language of probability theory and statistics. Claude Shannon laid out the foundations of the field with his work *A Mathematical Theory of Communication* in 1948. The paper defines a reasonable measure of information (more specifically, a measure of *average* information), and proves some results regarding how efficiently you can communicate over noiseless and noisy channels (e.g., radio over the air, or storage of data in an imperfect medium like a hard drive). Since that time, information theory has continued to develop; it is partially behind many of the advances in telecommunications, and it provides useful methods of analysis for statistics and machine learning.

While many results of information theory have practical importance, there is one inequality that is often misguidedly invoked; namely, the *data processing inequality.* We'll quickly overview the basics of information theory so that we can properly motivate the inequality, and then we'll discuss its power and its limitations in illuminating the impact of data processing. Specifically, we'll discuss how it applies to image enhancement such as super-resolution; we'll show that it is both an essential component of analysis—because it shows that image enhancement cannot create new information—and nearly useless in practice, because it is the wrong way to analyze image enhancement methods.

## Information and entropy

Mathematically, information is defined as the logarithm of the reciprocal of a probability mass, i.e., for a (discrete) random variable $x\sim p(x)$, the information gained by observing $x=a$ for some $a\in\mathcal{X}$ is

$$h(x=a)=\log_2 \frac{1}{p(a)}.$$

The heuristic argument for this quantity is that rare events should communicate more information than common events—the more surprising an event, the more you update your beliefs upon observing the event—which is why the reciprocal of the probability mass is used. The intuition behind the logarithm is more complex. I'll motivate it in the sense of optimal code-lengths.

First we need to define a (source) *code* which is a mapping from some set of events (i.e., whatever $\mathcal{X}$ represents) to a set of strings of symbols (e.g., strings of 0s and 1s). Then the log of the reciprocal of the probability mass of an event is equal to a (competitively) optimal code-length that describes the event, where "optimality" is in the sense of maximal average compression of a sequence of transmitted symbols.

For example, if some event $x=b$ has $p(b)=1/2$, then the optimal (source) code-length is 1. If some event $x=c$ has $p(c)=1/256$, then the optimal code-length is 8. Why is encoding in this way optimal? The intuition is that common events should be encoded with short codes because they are frequently encountered, and short codes reduce the total number of symbols that you need to transmit to communicate a message. The opposite holds for rare events; you pay little penalty for encoding rare sequences with long codes because they are communicated infrequently. In the end, you'll reduce the average number of symbols used to communicate a string of events if you encode in this way.
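The numbers above can be checked with a minimal Python sketch of the self-information formula (the function name is illustrative):

```python
import math

def self_information(p: float) -> float:
    """Information (in bits) gained by observing an event of probability p."""
    return math.log2(1 / p)

# Rare events carry more information than common ones.
print(self_information(1 / 2))    # 1.0 bit  -> a 1-symbol code
print(self_information(1 / 256))  # 8.0 bits -> an 8-symbol code
```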

To show this mathematically, let $f:\mathcal{X}\to\{0,1\}^m$ describe the encoder—i.e., the map from events $\mathcal{X}$ to strings of symbols $\{0,1\}^m$ (where $m$ can vary by event)—and let $\ell:\{0,1\}^m\to\mathbb{N}$ give the length of a codeword, so that $\ell(f(x))$ is the length of the code associated with event $x\in\mathcal{X}$. Then the average length of this encoding scheme is:

$$\mathbb{E}[\ell(f(X))]=\sum_{x\in\mathcal{X}}p(x)\ell(f(x)).$$

If we choose the code-lengths $\ell(f(x))=\log_2\frac{1}{p(x)}$ (ignoring the constraint that lengths must be integers), the average length becomes

$$\sum_{x\in\mathcal{X}}p(x)\log_2\frac{1}{p(x)}=H(X),$$

which is the entropy of the distribution $p(x)$. Thus the entropy—a measure of uncertainty—is both the average information content of a random variable and the average length of the shortest description of a random variable. In fact these two views are equivalent, and the entropy lower bounds the average length of any uniquely decodable code.
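As a quick sanity check, here's a small Python sketch that computes the entropy of a discrete distribution (the function name is illustrative):

```python
import math

def entropy(pmf):
    """Shannon entropy (bits) of a distribution given as a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

# A fair coin is maximally unpredictable for two outcomes: 1 bit.
print(entropy([0.5, 0.5]))  # 1.0
# A biased coin is more predictable, so it carries less average information.
print(entropy([0.9, 0.1]))  # ~0.469
# A uniform distribution over 8 outcomes needs 3 bits on average.
print(entropy([1 / 8] * 8))  # 3.0
```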

We can extend ideas of information to two or more variables, e.g., quantifying the amount of (average) information a random variable $X$ contains about another random variable $Y$. This is represented by *mutual information*. Formally, the mutual information between $X$ and $Y$ is

$$I(X;Y) = H(X)-H(X\mid Y).$$

That is, it's the reduction of uncertainty in $X$ due to the knowledge of $Y$.
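The same quantity can be written as $I(X;Y)=H(X)+H(Y)-H(X,Y)$, which is convenient to compute directly from a joint distribution. A minimal Python sketch (function names are illustrative):

```python
import math

def H(pmf):
    """Shannon entropy (bits) of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint pmf given as a 2-D list."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    pxy = [p for row in joint for p in row]     # flattened joint
    return H(px) + H(py) - H(pxy)

# Perfectly correlated: knowing Y removes all uncertainty about X -> 1 bit.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
# Independent: Y tells us nothing about X -> 0 bits.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```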

Before we continue to the data processing inequality, I need to define a Markov chain, which simply describes a way to factorize a joint probability distribution. Specifically, a Markov chain implies that for a set of ordered random variables $X_1,X_2,\ldots,X_n$, the distribution of $X_i$ depends only on $X_{i-1}$ and on nothing preceding it. Formally,

\begin{align*}p(x_1,x_2,\ldots,x_n)&=p(x_1)p(x_2|x_1)p(x_3|x_2,x_1)\cdots p(x_n|x_{n-1},\ldots,x_2,x_1)\\ &=p(x_1)p(x_2|x_1)p(x_3|x_2)\cdots p(x_n|x_{n-1})\end{align*}

where the first equality is true for any probability distribution, and the second equality is due to the set of conditional independencies implied by the Markov property.
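A small NumPy sketch can verify this factorization and the conditional independence it implies, using a toy binary chain (the distributions here are made up for illustration):

```python
import numpy as np

# A toy Markov chain X1 -> X2 -> X3 over binary variables, defined by an
# initial distribution and two (made-up) transition matrices.
p1 = np.array([0.6, 0.4])                   # p(x1)
T12 = np.array([[0.9, 0.1], [0.2, 0.8]])    # rows: p(x2 | x1)
T23 = np.array([[0.7, 0.3], [0.5, 0.5]])    # rows: p(x3 | x2)

# The Markov factorization p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2):
joint = p1[:, None, None] * T12[:, :, None] * T23[None, :, :]
assert np.isclose(joint.sum(), 1.0)  # a valid joint distribution

# The conditional independence it implies: p(x3 | x1, x2) = p(x3 | x2),
# i.e., additionally conditioning on x1 changes nothing.
p_x3_given_x12 = joint / joint.sum(axis=2, keepdims=True)
for x1 in range(2):
    assert np.allclose(p_x3_given_x12[x1], T23)
```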

## Data processing inequality

This brings us to the topic of this article: the *data processing inequality*. This inequality states that processing a signal—of any kind—cannot increase the information it carries about its source. Formally, the inequality states that for a Markov chain $X \to Y \to Z$, we have

$$ I(X;Y) \ge I(X;Z), \quad\text{or equivalently,}\quad H(X\mid Y) \le H(X\mid Z).$$

That is, the mutual information between $X$ and $Y$ is at least as large as that between $X$ and $Z$; equivalently, the uncertainty about $X$ given $Y$ is no greater than the uncertainty about $X$ given $Z$.

To motivate what this means, think of $X$ as a message, $Y$ as the message after transmission over a channel, and $Z$ as the received, post-processed message. That is, $Y=f(X)$ and $Z=g(Y)$ are functions of only the previous step in the chain of events. Given this setup, the inequality states that $Y$ contains at least as much information about $X$ as $Z$ does.

This inequality should intuitively make sense. When you process data, you are often applying a non-invertible transformation (e.g., filtering the signal to remove noise). As implied by "non-invertible", this is a one-way process. Once you've applied a non-invertible transformation, you cannot recover the original signal. You can make orange juice with an orange, but you can't make an orange with orange juice. Similarly, in processing data, the information that is removed is lost to oblivion, which is what is conveyed by the data processing inequality.
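The inequality can be checked numerically on a toy Markov chain in which each processing step deterministically discards a bit (a sketch; the helper names are illustrative):

```python
import math
from collections import Counter

def H(counts):
    """Shannon entropy (bits) of an empirical distribution given as a Counter."""
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def mutual_information(pairs):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from an exhaustive list of (x, y) pairs."""
    xs = Counter(x for x, _ in pairs)
    ys = Counter(y for _, y in pairs)
    xy = Counter(pairs)
    return H(xs) + H(ys) - H(xy)

# X uniform over 8 values; each processing step throws away one bit.
X = list(range(8))
Y = [x >> 1 for x in X]   # f: drop the lowest bit
Z = [y >> 1 for y in Y]   # g: drop another bit

i_xy = mutual_information(list(zip(X, Y)))  # 2.0 bits
i_xz = mutual_information(list(zip(X, Z)))  # 1.0 bit
assert i_xy >= i_xz  # the data processing inequality holds
```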

## Does modern image enhancement break the data processing inequality?

Following Betteridge's law of headlines, the answer is *no*.

When we look at image enhancement methods like de-noising and super-resolution, the output $Z$ looks like it better reflects what the true image $X$ would be if it were observed under better conditions as compared to the observed noisy or low-resolution image $Y$. Mathematically, it appears like the following statement is true: $I(X;Y)\le I(X;Z)$—a reversal of the inequality.

The state-of-the-art machine learning methods—most frequently deep neural networks—create images that closely resemble realistic images without noise (apparently inverting the noise process), or images that have a greater resolution than the measurements allow with respect to the Nyquist rate.

An argument can be made that because the images are passed through a set of filter banks whose parameters are learned (i.e., a neural network), the data processing inequality doesn't apply. Vaguely, the hope is that $Z$ is not only a function of $Y$ but something like $Z=f(Y,g(X))$. In what follows, I argue that this is a flimsy hope that doesn't withstand serious scrutiny.

### Applicability of the inequality

As stated before, the data processing inequality still applies to de-noising and super-resolution. These methods cannot invert a non-invertible function and restore lost information. Learning the filters doesn't change the fact that a series of filters—in the case of neural networks—is applied to the image. I'll provide a couple of thought experiments to show why this is the case.

Consider the case where the neural network was randomly initialized to the exact weights that were learned after training. After passing an image through this randomly initialized network, would you say that information was added? The resulting image would be exactly the same as the image generated by the network that arrived at the parameters due to training.

If you think that the learned weights add information but the randomly initialized weights don't, why would one case add information but not the other if the result is the same? What is special about trained parameters $\theta_1$ compared to randomly initialized parameters $\theta_{2}$, when $\theta_1=\theta_2$?

If you think that both sets of parameters add information, would you argue that the data processing inequality is broken when the weights are learned and also—with some probability $p > 0$—when the weights happen to be randomly initialized to the learned values?

I think most reasonable people would see that answering in the affirmative to either argument is absurd, and consequently neither set of parameters adds information.

Admittedly, the above situation would only happen with a vanishingly small (but potentially strictly positive!) probability, and it would only happen with certain initialization schemes. To give a more realistic example, let's compare machine learning methods to plain ol' engineering.

Consider the case of a radio engineer. They inject their *learned* knowledge into a system by *hand-crafting* the modules of the system to accomplish some task. For example, they will multiply a received signal with a sinusoid to baseband the signal, and then low-pass filter the signal to remove harmonics and noise before using it for some other task. I doubt that there are many—if any—information theorists or signal processing experts that would consider this to be a case where the data processing inequality doesn't apply.

For a more traditional image enhancement step, consider the case where you observe an image with salt-and-pepper noise and you apply a median filter—a standard and highly effective filter for that type of noise. Does the filtered image now have more information than the original image? Most image processing experts would agree that it does not.
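As a sketch of this example—assuming NumPy and SciPy are available—the following corrupts a synthetic image with salt-and-pepper noise and applies a 3×3 median filter. The filtered image has lower mean absolute error than the noisy one, even though no information about the clean image was added:

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)

# A smooth synthetic "image" (a horizontal gradient) with salt-and-pepper noise.
clean = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
noisy = clean.copy()
mask = rng.random(clean.shape) < 0.05             # corrupt ~5% of pixels
noisy[mask] = rng.choice([0.0, 1.0], mask.sum())  # set them to black or white

# A 3x3 median filter replaces each pixel with the median of its neighborhood,
# which discards the impulsive outliers.
filtered = median_filter(noisy, size=3)

print(np.abs(noisy - clean).mean())     # error before filtering
print(np.abs(filtered - clean).mean())  # error after filtering (smaller)
```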

If neither of the above cases add information, why would a neural network add information? As mentioned before, when the engineer applies their craft, they are using their learned knowledge to enhance the signal or image. What is so materially different about learned engineering knowledge and learned neural network parameters that one adds information and the other doesn't? What could be so sacred and magical about *data-driven* methods? (As if engineering knowledge wasn't at least partially data-driven.)

To be clear, I am *not* arguing that the learned parameters of a deep neural network aren't related to the observed data. Of course the parameters are related to the data, otherwise the network would do nothing useful. My argument is much more limited: the learned parameters do not add information that was not already contained in the data. There is no guarantee that the de-noised or super-resolved image matches what would have been generated if the image had been acquired under less noisy conditions or at a higher resolution.

But if information is not added, what's going on? The de-noised and super-resolved images can often be shown to have lower average error in experiments when compared to more traditional image processing methods. The dilemma lies in the very limited definition of information that is implied in the data processing inequality.

### Limitations of the inequality

Let me provide one more example which will more clearly outline the limitations of the definition of information used in the data processing inequality.

Let's consider the case of the information in light, with and without glasses, before the light enters the eye. Loosely speaking, let's describe the case as follows: let $X$ be a vector of the characteristics (e.g., color) of some point on an object; let $Y$ be the characteristics (amplitude, frequency, phase, trajectory, etc.) of some light reflected from that point; and let $Z$ be the characteristics of the light $Y$ after it has passed through a lens in a pair of glasses. Then we can model $Y = f(X)$ and $Z = g(Y)$ for appropriate functions $f$ and $g$. That is, we have the Markov chain $X \to Y \to Z$.

Then, again, we have a situation where the data processing inequality applies. The data processing inequality then states that $Z$ has less information about $X$ than $Y$. For concreteness, let's say $I(X; Y) = n$ bits and $I(X; Z) = m$ bits for $m \le n$.

Are all the poor simpletons that wear glasses simply misguided due to their lack of education in elementary information theory? I, for one, will keep on wearing my glasses despite the information loss.

This is a confusing result because, as someone with myopia, I perceive much more information in the light that passes through my glasses than without my glasses[^1]. This is because my glasses alter the trajectory of the light so that it focuses at the correct distance; without them, the light is aggregated sub-optimally across my retinal cells, causing a perceived blur when viewing distant objects. However, the light $Y$, before it enters my eye, contains what information it contains about $X$ ($n$ bits in this case) and no more. My glasses—the function $g$—cannot add information about $X$ beyond what $Y$ carries, and $Y$, the light, is the only input to my glasses.

A naive application of the data processing inequality leads to this kind of result. Unless you're going to do a thorough information-theoretic analysis of each problem—proving that it meets all the criteria the relevant theorems require—a more useful way to think about the effect of processing a signal is to answer the following question: what utility does this processing provide toward achieving our goal?

The question that leads people to process signals usually isn't strictly about the amount of information in a signal; it is about how humans or machines can efficiently use the information contained in the signal for some other purpose. Clearly the reason people wear glasses is because they are useful; you probably wouldn't want drivers to forego wearing their glasses just so they could have "more information" according to the (naive) information-theoretic definition.

In the radio example I gave previously, the non-basebanded, unfiltered signal is useless: it is full of noise and would require extremely expensive equipment to process at the carrier frequency (to say nothing of the noise). Likewise with the salt-and-pepper image example; if the image is meant for human viewing, why not apply the median filter? If the image is an artistic photo, the filtered image could be more pleasant. If the image is for some scientific purpose, perhaps the filtered image will be more interpretable.

## Takeaways

While I gave several examples of the general uselessness of the data processing inequality in practical settings, there is a major caveat in this uselessness. The data processing inequality does simply and clearly state that image enhancement techniques like de-noising and super-resolution (outside of cases of multiple measurements or strong assumptions known to be true) cannot recover lost information.

While the resulting images may look better, the image enhancement steps cannot guarantee recovery of the lost information. That is why care must be taken when using these methods in high-stakes applications like medical imaging. It would be reprehensible if de-noising or super-resolution added a treatable tumor to—or removed one from—a clinical image when standard imaging would not have, and as a result either induced unnecessary treatment (e.g., shooting high-energy particles at the tumor area until the cells die) or let a real tumor go ignored long enough to condemn the person to death.

Image enhancement techniques need to be motivated by their utility, not their magical abilities. Image enhancement is an important research area because it can be very useful—not because it can accomplish the image processing equivalent of transmuting lead into gold. In natural images, de-noised and super-resolved images can be more visually pleasing, as previously mentioned, and can be used for that purpose.

In medical imaging, these methods can be used by researchers to collect more stable measurements extracted from images, both across subjects and longitudinally (e.g., better whole-brain segmentation). When large sample sizes are used, the individual mistakes due to the methods—if present—can be washed out (assuming no bias). Or the methods can simply allow a study to happen at all.

For example, suppose you are a scientist doing research on a disease that is apparent in medical images, and you only have access to a set of clinical images that are very noisy and have a resolution of $1\times 1\times 3 \text{ mm}^3$. You want to segment the diseased portion of the tissue and either do not have the data or technical skill to implement a segmentation deep neural network to do the task. However, there is a publicly-available deep neural network that was trained for this segmentation task on nearly noise-free data at $1 \text{ mm}^3$. In this case, it would be beneficial to first de-noise and super-resolve the images, otherwise the network may perform poorly because of domain shift.

The data processing inequality is a mixed bag. It can be illuminating under certain conditions, but the limited definition of information central to the inequality can be deceiving.

**Addendum**: Side information can be provided about $X$ if you have some independent $T(X)$ or have access to $W$ such that $X = h(W)$. But then we're analyzing a different problem because $X \to Y \to Z$ (assuming you include $T(X)$ or $W$ to create $Z$) doesn't form a Markov chain, so the data processing inequality doesn't apply.

Bayesianism won't save you here. A good prior on $X$ won't add information about $X$ if you only have access to $Y$; a Markov chain $X\to Y \to Z$ is simply a joint distribution that factorizes as $p(x,y,z) = p(x)p(y|x)p(z|y)$. The exact, true prior of $X$ is implicit in the proof of the theorem!

[^1]: Your brain will actually receive more information for technical reasons, but I'm narrowly focusing on the information content of the light itself before it enters the eye.