Fun with statistics – transformations of random variables part 1

Here is another tool I learned about and have found very useful ever since then: transformations of random variables. The basic idea is this: Given a random variable $X$ from which we know how it is distributed, i.e., either its probability density function (PDF) $f_X(x)$ or its cumulative distribution function (CDF) $F_X(x)$, find the distribution of a new random variable $Y$ formed as $Y=g(X)$, where $y = g(x)$ is a function from $\real$ to $\real$. For example, we may wonder what is the distribution of a squared normal random variable or of the logarithm of a uniform random variable. Tasks like this appear suprisingly often in engineering. Communications are a good example: Your signal of interest is a random process containing random desired information, as well as random undesired perturbations like noise. To find an optimal detector for your information of interest you need to know exactly how it is distributed. Now what happens if your receiver ist best described by some nonlinear function? How do the known (or assumed) distributions of your desired signal and your perturbation change?

Fortunately, the basic procedure of answering the above problem is strikingly simple. There are two basic approaches: one based on the CDF and one based on the PDF. Let us start with the CDF. To find the CDF of $Y$ we need to think what it is: the probability that our random variable $Y$ is less than or equal to some given $y$. Inserting $Y=g(X)$ we can try to solve this for the CDF of $X$:

$$F_Y(y) = {\rm Pr}[Y \leq y] = {\rm Pr}[g(X) \leq y] = {\rm Pr}[X \leq g^{-1}(y)] = F_X(g^{-1}(y))$$.

Really? So simple? Well, not quite. The above step only works if $g(x) = y$ is bijective, i.e., uniquely invertible, and monotonically increasing. The tricky step if of course rewriting the condition $g(X) \leq y$ in terms of a condition on $X$. Here, care needs to be taken. For example, say $g(x) = a\cdot x + b$. Then $g(X) \leq y$ becomes $X \leq (y-b)/a$ only for $a>0$, for $a<0$ we have $X \geq (y-b)/a$. Of course this still admits a simple solution in terms of the CDF since ${\rm Pr}[X \geq (y-b)/a] = 1-{\rm Pr}[X \leq (y-b)/a] + {\rm Pr}[X=(y-b)/a]$ which becomes $1-F_X((y-b)/a)$ if $X$ is continuous ($F_X(x)$ is differentiable).

For other functions there may be multiple solutions. For instance, consider $g(x) = y^2$. This gives $F_Y(y) = {\rm Pr}[X^2 \leq y] = {\rm Pr}[-\sqrt{y} \leq X \leq +\sqrt{y}] = F_X(\sqrt{y}) – F_X(-\sqrt{y})$.

What if we want to work with the PDF instead of the CDF? Sometimes the CDF is not available or very complicated to work with. In this case, we can find a similar process for PDFs. It is easily derived from the previous approach on the CDF by taking its derivative and using the chain rule twice. We find that the PDF of $Y$ can be directly computed as

$$f_Y(y) = \sum_{i=1}^n \frac{1}{|g'(x_i(y))|} f_X(x_i(y)),$$

where $x_i(y)$ are all solutions of $y=g(x)$ in terms of $y$ for $i=1, 2, \ldots, N$. Furthermore, $g'(x)$ is the first derivative of $g(x)$ with respect to $x$. Example. Consider $y=g(x) = x^2$. It has two solutions in $x$: $x_1(y) =+\sqrt{y}$ and $x_2(y) = -\sqrt{y}$. We find $g'(x) = 2x$ and inserting $x_i(y)$ we have $g'(x_i(y)) = \pm 2 \sqrt{y}$. Therefore, the final solution for $f_Y(y)$ becomes

$$f_Y(y) = \frac{1}{|2\sqrt{y}|}f_X(\sqrt{y}) + \frac{1}{|-2\sqrt{y}|} f_X(-\sqrt{y})= \frac{1}{2\sqrt{y}}\left(f_X(\sqrt{y})+f_X(-\sqrt{y})\right).$$

Of course, the entire process has some limitations, the main one being that we need to be able to solve $g(x) = y$ for $x$. This becomes tedious if there are infinitely many solutions, as for instance for $g(x) = \sin(x)$. Even more so, it becomes impossible for many functions. Try $g(x) = {\rm sinc}(x) = \sin(x)/x$ for instance. Even the number of solutions $N$ depends on $y$ and finding $x_i(y)$ explicitly is not really possible.

Still for a wide range of problems this is a quite handy tool.

More algebra fun

I recently posted on algebraic rules for rewriting linear and quadratic forms and showed how Kronecker products come into play when expanding products of matrices.

Likewise, we might come into a situation where parameters of interest are inside a Kronecker product which we want to “pull out”. As the operation is still linear this should be an easy step. Strangely at the time I needed it I could not find the algebraic rules for it anywhere. I found one by trial and error for a special case and my colleague Martin Weis came up with the much more general case. Here it is for your convenience:

Let $\ma{A} \in \compl^{M \times N}$ and $\ma{B} \in \compl^{P \times Q}$. Then, $\ma{A} \otimes \ma{B}$ is a matrix with depends on the elements of $\ma{A}$ as well as the elements of $\ma{B}$ linearly. Consequently, ${\rm vec}(\ma{A} \otimes \ma{B})$ must be linear in ${\rm vec}(\ma{A})$ and ${\rm vec}(\ma{B})$. Indeed it is and the coefficient matrices are found as:

$${\rm vec}(\ma{A} \otimes \ma{B}) = \left( \ma{I}_N \otimes \left[ \begin{array}{c} \ma{I}_M \otimes \ma{b}_1 \\ \ma{I}_M \otimes \ma{b}_2 \\ \vdots \\ \ma{I}_M \otimes \ma{b}_Q \end{array} \right] \right) \cdot {\rm vec}(\ma{A}) = \left[ \begin{array}{c} \ma{I}_Q \otimes \ma{a}_1 \otimes \ma{I}_P \\ \ma{I}_Q \otimes \ma{a}_2 \otimes \ma{I}_P \\ \vdots \\ \ma{I}_Q \otimes \ma{a}_N \otimes \ma{I}_P \end{array} \right] \cdot {\rm vec}(\ma{B}).$$

Special cases where either of $M, N, P, Q$ are equal to one (i.e., one or two matrices are row or column vectors) are easily found from this, e.g.,
${\rm vec}(\ma{a} \otimes \ma{B}) = \left[\ma{I}_Q \otimes \ma{a} \otimes \ma{I}_P\right] \cdot {\rm vec}(\ma{B})$ or ${\rm vec}(\ma{A} \otimes \ma{b}) = \left[\ma{I}_{MN} \otimes \ma{b} \right] \cdot {\rm vec}(\ma{A})$.

A related important observation is that the matrix $\ma{A} \otimes \ma{B}$ and the matrix ${\rm vec}(\ma{B}) \cdot {\rm vec}(\ma{A})^{\rm T}$ contain exactly the same elements. Therefore, you can permute a given Kronecker product so that it becomes a rank one matrix, where each column is a scaled version of ${\rm vec}(\ma{B})$ and each row a scaled version of ${\rm vec}(\ma{A})^{\rm T}$. This is particularly relevant when we have an estimate of a matrix that should be a Kronecker product and want to find a “Least-Squares Kronecker factorization”. By performing the above mentioned rearrangement, this problem is transformed into finding a Least-Squared rank-one approximation which is readily solved by a truncated Singular Value Decomposition (SVD) via the Eckart-Young theorem.

Starting from this I wondered what happens if we consider a rank higher than one. This allows us to decompose a given matrix into the sum of Kronecker products and find exact as well as approximate low-rank decompositions with a Least-Squares optimality guarantee. Unfortunately, I found out later that others had this idea before. A decomposition like that is known as Kronecker Product SVD. Here is a slide set by C. Van Loan on it.