
Dreaming of Tensors

In machine learning, the backpropagation algorithm requires us to calculate the gradient of a loss or cost function $c:\R^{m\x n}\to \R$. In doing so, the chain rule is applied and derivatives of intermediate matrix functions $f:\R^{m\x n}\to \R^{m\x n}$ have to be evaluated. This article aims to showcase a practical approach to evaluating such expressions and to further an understanding of the derivative and its use in backpropagation.

The derivative in higher dimensions

one dimension


In the 1-dimensional example above, it is common to use the words derivative and slope interchangeably.
For example, by calculating the affine linear approximation $h$ (also known as the first-order Taylor expansion of $f$ at the point $a$), we may write

$$h(x) = f(a) + f'(a)(x-a)$$

or by substitution:

$$h(a + t) = f(a) + f'(a)t\tag{1}$$

and multiply our direction $t$ with the function's slope at the point $a$.

For a single input and a single output the rate of change of the function directly corresponds to the slope of the graph.
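To see this numerically, here is a minimal Python sketch (the function $f=\sin$, the point $a$ and the step sizes are arbitrary choices for illustration) comparing $f(a+t)$ with the affine approximation $h(a+t) = f(a) + f'(a)t$:

```python
import math

# Example function and its known derivative (chosen purely for illustration).
f = math.sin
df = math.cos

a = 1.0
for t in (0.1, 0.01, 0.001):
    exact = f(a + t)
    approx = f(a) + df(a) * t          # h(a + t) = f(a) + f'(a) t
    print(f"t={t:7.3f}  f(a+t)={exact:.6f}  h(a+t)={approx:.6f}  error={exact - approx:.2e}")
```

The error shrinks roughly quadratically as $t$ decreases, which is exactly what a first-order approximation promises.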

In the multivariable scalar-field case, where a scalar output depends on multiple inputs, we require a direction in which to measure the slope of the graph.

In order to further extend the notion of a derivative to higher dimensions, the multiplication in $(1)$ could be seen as a linear function:

$$h(a+t) = f(a) + l(t)$$

where $l(t) = f'(a)t$ is a linear mapping from $\R$ to $\R$. Linear mappings from $\R$ to $\R$ are denoted by $\L(\R,\R)$.
The derivative $f'$ is then a function $\R \to \L(\R,\R)$, which takes as input an arbitrary point $a$ and yields the linear approximation $l:\R \to \R$ of $f$ at the point $a$.

scalar-by-vector

Given a scalar field, i.e. a function $f:\R^d \to \R$, the derivative at a point $x$ is the best linear approximation $L:\R^d \to \R$ of $f$ at the point $x$, denoted by $Df(x)$.

$$Df(x)(v) = \sum_i\pd{f(x)}{x_i}v_i = (\nabla f(x))^\top v$$
where

$$\nabla f(x) = \pmatrix{\pd{f(x)}{x_1} \\ \vdots \\ \pd{f(x)}{x_d}}$$

Reminder:

$$\begin{aligned} Df:\R^d &\rightarrow \mathcal L (\R^d,\R) \\ x &\mapsto Df(x) \end{aligned}$$

$$\underset{\in \L(\R^d,\R)}{Df(x)} \neq \underset{\in \R^d}{\nabla f(x)}$$

$Df(x)$ is actually not $\nabla f(x)$, since the former is a linear function $\in \L(\R^d,\R)$ (an element of the dual of $\R^d$) while the latter is a vector, an element of $\R^d$. Because of the natural isomorphism $\L(\R^d,\R) \cong \R^d$ we may identify a linear map to $\R$ with a vector (by making use of the scalar product). I will commit the occasional sin of mixing these notions, for example when writing $\nabla f(x) = Df(x)^\top$, but it should be pointed out that these are essentially two different types of objects.

As an interesting side note, ...

the Riesz representation theorem, a fundamental theorem in functional analysis (a generalization of this idea), states that for any Hilbert space $\H$ (a vector space which is complete with respect to the norm induced by its scalar product), each dual element $l \in \H^* = \L(\H, \R)$ (the space of continuous linear functionals) is representable by an element $y_l \in \H$ of the space itself (even in the infinite-dimensional setting). Applying the functional is equivalent to taking the scalar product with said element: $l(x) = \langle y_l, x\rangle_\H$. In the finite-dimensional setting, every inner product space is complete and any linear function is continuous.

As before, via Taylor expansion, we can compute the first-order approximation of $f$ by

$$f(x+tv) \approx f(x) + Df(x)(v)t$$
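As a quick numerical sanity check (a sketch; the scalar field, the point $x$ and the direction $v$ are arbitrary choices), the finite difference $(f(x+tv) - f(x))/t$ should approach the directional derivative $(\nabla f(x))^\top v$:

```python
import numpy as np

def f(x):                      # example scalar field f: R^3 -> R
    return np.sum(x ** 2) + np.sin(x[0])

def grad_f(x):                 # its gradient, computed by hand
    g = 2 * x
    g[0] += np.cos(x[0])
    return g

x = np.array([0.5, -1.0, 2.0])
v = np.array([1.0, 0.3, -0.7])

t = 1e-6
finite_diff = (f(x + t * v) - f(x)) / t
directional = grad_f(x) @ v    # Df(x)(v) = (grad f(x))^T v
print(finite_diff, directional)
```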

Now you might be asking yourself whether to think of the derivative as a gradient or as a linear transformation. The idea of a gradient does not extend well to the vector-valued case, i.e. a function $f:\R^d \to \R^d$ with multiple outputs, as the derivative is no longer an object of the input space. The notion of a linear transformation, however, still holds. I will sometimes simply refer to derivatives as tensors, leaving open the question of what type of object they are.

vector-by-vector

In the multivariate case of $f:\R^d \to \R^n, \; f(x_1,..,x_d) = \begin{pmatrix}f_1(x_1,..,x_d) \\ \vdots \\ f_n(x_1,..,x_d)\end{pmatrix}$ we are looking for a function $Df$ that takes a vector as input and returns a linear transformation $\mathcal L(\R^d, \R^n)$, which is representable by a matrix in $\R^{n\times d}$; applying the transformation is then the equivalent of performing a matrix multiplication. $Df \cong J_f$, the Jacobian of the function.

$$J_f(x) = \jacobian{f(x)}{x}{n}{d} \in \R^{n\times d}$$

$$\forall x \in \R^d, v \in \R^d: Df(x)(v) = J_f(x)v$$

In this case, $Df$ can equivalently be seen as a mapping $\R^d \to \R^{n\times d}, \; x \mapsto J_f(x)$.

Some examples:

$$\begin{aligned} f(x_1, x_2) &= \pmatrix{x_1+x_2 \\ x_2^2 \\ e^{x_1}} \\ J_f(x) &= \bmatrix{1 & 1 \\ 0 & 2x_2\\ e^{x_1} & 0} \\ J_f{(0,2)} &= \bmatrix{1 & 1 \\ 0 & 4 \\ 1 & 0} \end{aligned}$$

This points out that $J_f$ first needs to be evaluated before obtaining the linear approximation.
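A small numpy sketch that reproduces this Jacobian column by column via finite differences (the helper `jacobian_fd` and the step size are my own, purely illustrative choices):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([x1 + x2, x2 ** 2, np.exp(x1)])

def jacobian_fd(f, x, eps=1e-6):
    """Approximate J_f(x) column by column via finite differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - fx) / eps
    return J

x = np.array([0.0, 2.0])
print(jacobian_fd(f, x))       # ~ [[1, 1], [0, 4], [1, 0]]
```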
Other examples include:

matrix multiplication:

$$\begin{aligned} f(x) = Ax &= \pmatrix{a_{11}x_1 + .. + a_{1d}x_d \\ \vdots \\ a_{n1}x_1 + .. + a_{nd}x_d} \\ J_f(x) &= \matrixfill{a}{n}{d} = A \end{aligned}$$

Here the derivative is constant and the matrix does not have to be evaluated at a point (the derivative of a linear function is the same for all inputs $x$), and

$$Df(x)(v) = Av$$

activation functions:

$$f(x) = \pmatrix{\sigma (x_1) \\ \vdots \\ \sigma (x_d)}, \quad J_f(x) = \bmatrix{\sigma '(x_1) & 0 & \cdots & 0 \\ 0 & \sigma '(x_2) & & \vdots\\ \vdots& & \ddots & \\ 0 & \cdots & & \sigma '(x_d)}$$

$$Df(x)(h) = \pmatrix{\sigma '(x_1)h_1 \\ \vdots \\ \sigma '(x_d)h_d} = \pmatrix{\sigma '(x_1) \\ \vdots \\ \sigma '(x_d)} \odot \pmatrix{h_1 \\ \vdots \\ h_d}$$

Thus, if an activation function is applied element-wise, the resulting derivative can also be computed element-wise. The application of the derivative can be written using the Hadamard (element-wise) product $\odot$.
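A short numpy sketch illustrating this (using the sigmoid as an example activation; the inputs are arbitrary): applying the diagonal Jacobian to $h$ gives the same result as the Hadamard product $\sigma'(x) \odot h$.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)

x = np.array([0.2, -1.5, 3.0])
h = np.array([1.0, 0.5, -2.0])

# Full (diagonal) Jacobian applied to h ...
Jv = np.diag(dsigma(x)) @ h
# ... equals the Hadamard product of sigma'(x) and h.
hadamard = dsigma(x) * h
print(np.allclose(Jv, hadamard))   # True
```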

With the stage set for the basic derivatives, we can have a look at the idea behind the backpropagation algorithm.

The idea behind backpropagation

When modelling a neural network, for example, we always need a loss function which we can minimize. The loss function is essentially a function $f:\R^d \to \R$: it takes an input $x$ and returns a scalar, the loss.

Note that in the case of a neural network we might be minimizing over the weights $W$; then $f$ is a function $\R^{h_2\x h_1} \to \R$, but this is practically the same as $\R^{h_2\cdot h_1} \to \R$. More on this later…

If $f$ is differentiable, a gradient-descent step (with the gradient supplied by backpropagation) is performed by computing $x_{new}$ using the update rule $x_{new} = x - \lambda \nabla f(x)$.

By Taylor expansion, we know that:

$$f(x+tv) = f(x) + Df(x)(v)t + O(t^2)$$

where $O(t^2)$ is an error term that shrinks quadratically in $t$ (this means that for very small $t$ this term will be of ‘lesser importance’ and the linear term will outweigh the quadratic $t^2$-term).
Evaluating for xnewx_{new} gives us:

$$\begin{aligned} f(x_{new}) &= f(x-\lambda \nabla f(x)) \\&= f(x) - \lambda (\nabla f(x))^\top \nabla f(x) + O(\lambda ^2) \\&= f(x) -\lambda \| \nabla f(x)\|^2 + O(\lambda ^2) \end{aligned}$$

$$\Rightarrow f(x_{new}) - f(x) = -\lambda \| \nabla f(x)\|^2 + O(\lambda ^2)$$

This means that, since the squared norm $\|\nabla f(x)\|^2$ is positive (assuming $\nabla f(x) \neq 0$), if we choose $\lambda$ small enough, then the right-hand side $-\lambda \| \nabla f(x)\|^2 + O(\lambda ^2)$ will be negative and thus

$$\begin{aligned}f(x_{new}) - f(x) &\leq 0 \\ \Rightarrow f(x_{new}) &\leq f(x) \end{aligned}$$
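A minimal sketch of a single update step (the quadratic $f(x) = \frac{1}{2}\|x\|^2$ and the step sizes are my own choices for illustration), confirming that $f(x_{new}) \leq f(x)$ for small $\lambda$:

```python
import numpy as np

def f(x):
    return 0.5 * np.sum(x ** 2)

def grad_f(x):
    return x

x = np.array([3.0, -4.0])
for lam in (1e-3, 1e-2, 1e-1):
    x_new = x - lam * grad_f(x)     # update rule x_new = x - lambda * grad f(x)
    print(lam, f(x_new) - f(x))     # negative for these step sizes
```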

In neural networks one might have $g:\R^d\to \R^m$ as an intermediate function and $f:\R^m \to \R$, with the cost being $c = f(g(x))$. Using the chain rule, we obtain:

$$\begin{aligned} \nabla c(x) &= \nabla (f\circ g)(x) = J_{f\circ g}(x)^\top \\ &=\left (J_f({g(x)})J_g(x) \right)^\top = J_g(x)^\top J_f({g(x)})^\top \\ &= J_g(x)^\top \nabla f(g(x)) = J_g(x)^\top \nabla c(g(x))\end{aligned}$$

$$\begin{aligned} &\R^d& &\overset{g}\to& &\R^m& &\overset{f}\to& \R \\ &x& &\mapsto& &g& &\mapsto& c \\ &\nabla _x c& &\overset{J_g(x)^\top}{\leftmapsto}& &\nabla c(g)& \end{aligned}$$

which illustrates how the gradient is propagated from the output back to the input. Note that, since the derivatives have to be evaluated at their respective positions, this also explains why we would want to store intermediate values such as $g$ in a cache, so as not to recompute them every time.
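The backward pass in the diagram can be written directly as a Jacobian-transpose-times-gradient product. Below is a minimal sketch (with $g$, $f$ and their derivatives chosen purely for illustration) in which the intermediate value $g(x)$ is cached during the forward pass and reused in the backward pass:

```python
import numpy as np

# g: R^2 -> R^2 and f: R^2 -> R, chosen for illustration.
def g(x):
    return np.array([x[0] * x[1], np.sin(x[1])])

def J_g(x):                           # Jacobian of g at x, a 2x2 matrix
    return np.array([[x[1], x[0]],
                     [0.0, np.cos(x[1])]])

def f(u):
    return u[0] ** 2 + 3.0 * u[1]

def grad_f(u):
    return np.array([2.0 * u[0], 3.0])

x = np.array([1.0, 2.0])

# Forward pass: cache the intermediate value g(x).
u = g(x)
c = f(u)

# Backward pass: grad c(x) = J_g(x)^T grad f(g(x)), reusing the cached u.
grad_c = J_g(x).T @ grad_f(u)
print(c, grad_c)
```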

scalar-by-matrix

One might think that a derivative of a scalar by a matrix would surely be a matrix as well. However, following our definition of the derivative as a linear transformation, given $f:\R^{m\x n} \to \R$, we expect the derivative $Df(X)$ to be $\in \L(\R^{m\x n}, \R)$: a linear function that takes a matrix as input and outputs a scalar.

Yet, using standard notation, there is no matrix that converts another matrix into a scalar, as applying a matrix ‘from the right’ is just matrix multiplication and yields another matrix. A way to get around this problem would be to vectorize our input, keeping the order of the elements:

$$X \in \R^{m\x n}= \matrixfill{x}{m}{n} \cong \pmatrix{x_{11} \\ \vdots \\ x_{1n} \\ \vdots \\ x_{m1} \\ \vdots \\ x_{mn}} = \vec X \in \R^{m\cdot n}$$

$$f(X) = \tilde f(\vec{X})$$

In this case, we are looking at a function $\tilde f:\R^{m\cdot n} \to \R$, which we already covered in the $\pd{scalar}{vec}$ case. The derivative then is
$Df(X) \cong \nabla \tilde f(\vec{X}) \in \R^{m\cdot n}$.

$$D\tilde f \cong \pmatrix{\pd{f}{x_{11}} \cdots \pd{f}{x_{1n}} \cdots \pd{f}{x_{m1}} \cdots \pd{f}{x_{mn}}}$$

Applying this derivative to a matrix $H$:

$$Df(X)(H) = \nabla \tilde f(\vec{X}) \cdot \vec{H} = \sum_{ij} \pd{f(X)}{x_{ij}}h_{ij}$$

where $\cdot$ is the vector scalar product.
Or simply

$$\pd{f(X)}{x_{ij}}h_{ij}$$

using the Einstein summation convention where corresponding upper and lower indices are summed. For example

$$\sum_{i}f_{i}e^{i} \quad \text{and} \quad \sum_{j}f_{ij}e^{jk}$$

become

$$f_{i}e^{i} \quad \text{and} \quad f_{ij}e^{jk}$$

We might still decide that we want to write our derivative as a matrix. This is useful in the case that we want to calculate the ‘gradient’ for an update rule or when using the chain rule.

$$\pd{f(X)}{X}^\top \cong \pdmatrix{f(X)}{x}{m}{n} \in \R^{m\x n}$$

Thus far the matrix numerator layout was used. Following this layout, the derivative would have to be an $n\x m$ (transposed) matrix. To keep the notation consistent (which is important when applying the chain rule), I will refer to the gradient as $\pd{f}{X}^\top$, which has the same dimensions as $X$ ($m\x n$).

In the following, I will assume that all derivatives are written in their matrix/tensorial form, evaluated at their respective inputs, and I will refer to the gradient (the transposed tensorial form) as

$$\pd{f(X)}{X}^\top \text{ or } \pd{f}{X}^\top \in \R^{m\x n}$$

$$\begin{aligned} Df(X)(H) &= \pd{f}{X}^\top \bullet H \\ &= \sum \left(\pdmatrix{f}{x}{m}{n} \odot \matrixfill{h}{m}{n}\right) \\ &= \sum \matrixeval{\pd{f}{x_{\i\j}}h_{\i\j}}{m}{n} \\ &= \pd{f}{x_{ij}}h_{ij} \end{aligned}$$

where “$\bullet$” means we calculate the element-wise product and then sum all entries. “$\bullet$” is basically the dot (scalar) product if we were to vectorize the matrices.

example of a scalar-matrix derivative:

An example is the (squared and rescaled) Frobenius matrix norm:

$$f:\R^{m\x n}\to \R, \quad X \mapsto \frac{1}{2}\|X\|_F^2 = \frac{1}{2}\sum_{ij}x_{ij}^2$$

$$\pd{f(X)}{X}^\top \cong \matrixfill{x}{m}{n} = X$$

$$Df(\mathbb{1}_{m\x n})(Y) = \mathbb{1}_{m\x n} \bullet Y = \sum_{i,j} y_{ij}$$
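A quick numerical check of this example (a sketch; the test matrices are random): the gradient of $\frac{1}{2}\|X\|_F^2$ is $X$ itself, and applying the derivative to $H$ amounts to the “$\bullet$” contraction, i.e. summing the element-wise product.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
H = rng.standard_normal((3, 4))

def f(X):
    return 0.5 * np.sum(X ** 2)       # 0.5 * ||X||_F^2

# Df(X)(H) as the "bullet" contraction: sum over the element-wise product X ⊙ H.
analytic = np.sum(X * H)

# Finite-difference check: (f(X + tH) - f(X)) / t for small t.
t = 1e-6
numeric = (f(X + t * H) - f(X)) / t
print(analytic, numeric)
```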

Here, I also want to note that, depending on how we wish to contract the tensor

$$\pd{f}{x_{ij}} \cong \R^{m\x n}$$

it can be seen as an element $\in \L(\R^n,\R^m)$, which corresponds to normal matrix multiplication:

$$\pd{f}{x_{ij}}a_j \quad \text{where } a \in \R^n$$

an element $\in \L(\R^m,\R^n)$ (transposed matrix multiplication):

$$b_i\pd{f}{x_{ij}} \quad \text{where } b \in \R^m$$

or an element $\in \L(\R^{m\x n},\R)$, as we have seen:

$$\pd{f}{x_{ij}}h_{ij} \quad \text{where } H \in \R^{m\x n}$$

just by noting the indices along which we wish to contract a tensor. In this sense, the Einstein notation proves to be very elegant.
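In numpy, `np.einsum` expresses exactly these choices of contraction indices. A sketch with a random $m\x n$ array `G` standing in for $\pd{f}{x_{ij}}$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
G = rng.standard_normal((m, n))     # stands in for the array df/dx_ij
a = rng.standard_normal(n)
b = rng.standard_normal(m)
H = rng.standard_normal((m, n))

print(np.einsum('ij,j->i', G, a))   # matrix multiplication G a, an element of R^m
print(np.einsum('i,ij->j', b, G))   # transposed multiplication b^T G, an element of R^n
print(np.einsum('ij,ij->', G, H))   # full contraction, a scalar
```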

matrix-by-matrix

Assume we have a function $A:\R^{p\x q} \to \R^{m\x n}$ and $c:\R^{m\x n} \to \R$ (this could be an activation layer and a cost function in a neural network).

$$\begin{aligned} &\R^{p\x q}& &\to& &\R^{m\x n}& &\to& \R \\ &X& &\mapsto& &A& &\mapsto& c \end{aligned}$$

Applying the chain rule:

$$\pd{c}{x_{kl}} = \pd{c}{a_{ij}}\pd{a_{ij}}{x_{kl}}$$

We eventually come across the term

$$\pd{a_{ij}}{x_{kl}}$$

which, as the 4 indices indicate, is a rank-4 tensor.

Since we are interested in the gradient, we wish to calculate (again, sorry for the abuse of notation)

$$\pd{c}{X}^\top \in \R^{p\x q}, \quad \pd{c}{X}^\top = \left(\pd{c}{A}\pd{A}{X}\right)^\top = \pd{A}{X}^\top\pd{c}{A}^\top$$

where

$$\pd{c}{A}^\top = \pdmatrix{c}{a}{m}{n} \in \R^{m\x n}$$

$$\pd{A}{X}^\top = \pdmatrix{A}{x}{p}{q} \in \R^{(p\x q)\x(m\x n)}$$

And $\R^{(p\x q)\x(m\x n)}$ is seen as a linear transformation $\L(\R^{m\x n},\R^{p\x q})$.

The single entries of the gradient are

$$\pd{c}{x_{kl}} = \pd{a_{ij}}{x_{kl}}\pd{c}{a_{ij}}$$

Note: this formula encapsulates everything one needs to know about how to calculate such a derivative. However, keeping track of all those indices can get confusing. One way to evaluate such an expression while keeping the spatial structure of the matrix is as follows.

Using the recently introduced notation:

$$\pd{A}{x_{kl}} \bullet \pd{c}{A}^\top = \sum \pd{A}{x_{kl}} \odot \pd{c}{A}^\top$$

where we sum over all entries (the indices $i,j$ of $A$).
This means we can expand the tensor along the $p\x q$ dimensions and contract the other dimensions by taking the Hadamard product and summing. In almost the same fashion that we calculate each entry of a matrix product by overlaying row and column vectors and then summing, here we overlay the matrix $\pd{c}{A}^\top$ with each $\pd{A}{x_{kl}}$ (also a matrix) and then sum.
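In numpy this “expand along $p\x q$, contract along $m\x n$” pattern is a single `einsum` over the rank-4 tensor. A sketch with random arrays standing in for $\pd{A}{X}$ and $\pd{c}{A}^\top$ (shapes only, no particular function):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p, q = 2, 3, 4, 5

dA_dX = rng.standard_normal((m, n, p, q))   # entry [i, j, k, l] = da_ij / dx_kl
dc_dA = rng.standard_normal((m, n))         # entry [i, j]       = dc / da_ij

# Contract over i, j; the free indices k, l give the p x q gradient.
grad_X = np.einsum('ijkl,ij->kl', dA_dX, dc_dA)
print(grad_X.shape)   # (4, 5)
```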

In total, we then obtain:

$$\pd{c}{X}^\top = \pd{A}{X}^\top \pd{c}{A}^\top = \matrixeval{\pd{a_{ij}}{x_{\i\j}} \pd{c}{a_{ij}}}{p}{q}$$

$$= \matrixplace{ \color{darkviolet}{\pd{A}{x_{11}}} \color{black}{\bullet}\color{dodgerblue}{\pd{c}{A}^\top} }{ \pd{A}{x_{1q}} \bullet \color{dodgerblue}{\pd{c}{A}^\top} }{ \pd{A}{x_{p1}} \bullet \color{dodgerblue}{\pd{c}{A}^\top} }{ \pd{A}{x_{pq}} \bullet \color{dodgerblue}{\pd{c}{A}^\top} }$$

$$= \matrixplace{ \sum\matrixeval{ \color{darkviolet}{ \pd{a_{\i\j}}{x_{11}} } \color{dodgerblue}{ \pd{c}{a_{\i\j}} } }{m}{n} }{ \sum\matrixeval{ \pd{a_{\i\j}}{x_{1q}} \color{dodgerblue}{ \pd{c}{a_{\i\j}} } }{m}{n} }{ \sum\matrixeval{ \pd{a_{\i\j}}{x_{p1}} \color{dodgerblue}{ \pd{c}{a_{\i\j}} } }{m}{n} }{ \sum\matrixeval{ \pd{a_{\i\j}}{x_{pq}} \color{dodgerblue}{ \pd{c}{a_{\i\j}} } }{m}{n} }$$

An example might clear things up.


Given an affine linear function

$$Z(W) = WX + b$$

where $X \in \R^{q\x n}$:

$$\begin{aligned} &\R^{p\x q}& &\to& &\R^{p\x n}& \\ &W& &\mapsto& &Z& \end{aligned}$$

The gradient can be written as

$$\pd{Z}{W}^\top = \pdmatrix{Z}{w}{p}{q} \in \R^{(p\x q)\x(p\x n)}$$

which is seen as a linear transformation $\L(\R^{p\x n},\R^{p\x q})$.

The constant part $b$ is not relevant when calculating the derivative and can be ignored.

$$\begin{aligned} WX &= \matrixfill{w}{p}{q} \matrixfill{x}{q}{n} \\ &= \matrixeval{\sum_k w_{\i k}x_{k\j}}{p}{n} \end{aligned}$$

Evaluating $\pd{Z}{w_{ij}}$ gives

$$\begin{matrix} \pd{Z}{w_{11}} = \bmatrix{ x_{11} & x_{12} & \cdots & x_{1n} \\ 0 & 0 & & 0 \\ \vdots & & \ddots & 0 \\ 0 & 0 & 0 & 0 \\ } & \pd{Z}{w_{12}} = \bmatrix{ x_{21} & x_{22} & \cdots & x_{2n} \\ 0 & 0 & & 0 \\ \vdots & & \ddots & 0 \\ 0 & 0 & 0 & 0 \\ } & \cdots \\ \\ \pd{Z}{w_{21}} = \bmatrix{ 0 & 0 & & 0 \\ x_{11} & x_{12} & \cdots & x_{1n} \\ \vdots & & \ddots & 0 \\ 0 & 0 & 0 & 0 \\ } & \pd{Z}{w_{22}} = \bmatrix{ 0 & 0 & & 0 \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & & \ddots & 0 \\ 0 & 0 & 0 & 0 \\ } \\ \vdots \end{matrix}$$

Recalling how to apply the derivative $\pdt Z W$ to a matrix $H$ (when training neural nets, this is typically $dZ$), we multiply each of the above submatrices with $H$ element-wise and then sum all entries.

$$\pdt Z W (H) = \matrixeval{\sum_k h_{\i k}x_{\j k}}{p}{q} = HX^\top$$
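A numerical sanity check of this formula (a sketch with random matrices; the cost $c(W) = \sum (WX + b) \odot H$ is an arbitrary choice whose gradient with respect to $Z$ is exactly $H$):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, n = 3, 4, 5
W = rng.standard_normal((p, q))
X = rng.standard_normal((q, n))
b = rng.standard_normal((p, n))
H = rng.standard_normal((p, n))     # plays the role of dZ

def c(W):
    return np.sum((W @ X + b) * H)  # linear cost whose gradient w.r.t. Z is H

grad_analytic = H @ X.T             # the derivative of Z applied to H, i.e. H X^T

# Finite-difference check of one entry, e.g. dc/dW[0, 1].
eps = 1e-6
E = np.zeros_like(W)
E[0, 1] = eps
print(grad_analytic[0, 1], (c(W + E) - c(W)) / eps)
```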

The next example was the motivation for writing this article.

cross-entropy loss & softmax

Given functions $A:\R^{m\x n} \to \R^{m\x n}$ and $c:\R^{m\x n} \to \R$

$$\begin{aligned} &\R^{m\x n}& &\to& &\R^{m\x n}& &\to& \R \\ &X& &\mapsto& &A& &\mapsto& c \end{aligned}$$

where $c$ is the cross-entropy loss:

$$c(A) = -\frac{1}{n}\sum_{ij}y_{ij}\ln(a_{ij})$$

and $A$ is the softmax activation:

$$A(X) = \matrixeval{\frac{e^{x_{\i\j}}}{\sum_ze^{x_{z\j}}}}{m}{n}$$

where each column is normalized by its sum, so that the entries of each column add up to 1.
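A column-wise softmax in numpy might look like the sketch below (the helper name `softmax_columns` is my own; the max subtraction is only for numerical stability and does not change the result):

```python
import numpy as np

def softmax_columns(X):
    """Column-wise softmax: each column of the result sums to 1."""
    shifted = X - X.max(axis=0, keepdims=True)   # stability shift per column
    E = np.exp(shifted)
    return E / E.sum(axis=0, keepdims=True)

X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])
A = softmax_columns(X)
print(A.sum(axis=0))   # [1. 1.]
```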

Using our chain rule for the gradient:

$$\pd{c}{X}^\top = \pd{A}{X}^\top \pd{c}{A}^\top$$

$$\begin{aligned} \pd{c}{A}^\top &= \matrixeval{\pd{c}{a_{\i\j}}}{m}{n} \\ &= -\frac{1}{n} \matrixeval{ \frac{y_{\i\j}}{a_{\i\j}} }{m}{n} = -\frac{1}{n} Y \oslash A \end{aligned}$$

As $\odot$ is the operator for element-wise multiplication, $\oslash$ will denote element-wise division.

$$\pd{A}{X}^\top = \pdmatrix{A}{x}{m}{n} \in \R^{(m\x n)\x(m\x n)}$$

As evaluating

$$\pd{}{x_{kl}} \left( \frac{e^{x_{ij}}}{\sum_ze^{x_{zj}}} \right) = 0 \quad \text{if } l \neq j$$

becomes a headache with all those indices, I will go through the first few examples. The pattern will then become clear.

$$\begin{aligned} \pd{a_{11}}{x_{11}} &= \pd{}{x_{11}} \left( \frac{e^{x_{11}}}{\sum_ze^{x_{z1}}} \right) \\ &= \frac{e^{x_{11}}}{\sum_ze^{x_{z1}}} - e^{x_{11}} \frac{1}{\left( \sum_ze^{x_{z1}} \right)^2}e^{x_{11}} \\ &= \frac{e^{x_{11}}}{\sum_ze^{x_{z1}}}\left(1 - \frac{e^{x_{11}}}{\sum_ze^{x_{z1}}}\right) \\ &= a_{11}(1-a_{11}) \\ \pd{a_{21}}{x_{11}} &= \pd{}{x_{11}} \left( \frac{e^{x_{21}}}{\sum_ze^{x_{z1}}} \right) \\ &= 0 - e^{x_{21}} \frac{1}{\left( \sum_ze^{x_{z1}} \right)^2}e^{x_{11}} \\ &= -a_{11}a_{21} \\ \pd{a_{31}}{x_{11}} &= -a_{11}a_{31} \\ &\vdots \end{aligned}$$

and if the columns are not aligned:

$$\pd{a_{12}}{x_{11}} = 0$$

In full, we have:

$$\begin{aligned} \pd{A}{x_{11}} &= \bmatrix{ -a_{11}(a_{11}-1) & 0 & & 0 \\ -a_{11}a_{21} & 0 & \cdots & 0 \\ -a_{11}a_{31} & 0 & & 0 \\ \vdots & & \ddots & \vdots \\ -a_{11}a_{m1} & 0 & \cdots & 0 \\ } \\ &= -A \odot \bmatrix{ (a_{11}-1) & 0 & & 0 \\ a_{11} & 0 & \cdots & 0 \\ a_{11} & 0 & & 0 \\ \vdots & & \ddots & \vdots \\ a_{11} & 0 & \cdots & 0 \\ } \end{aligned}$$

for all entries:

$$\begin{matrix} \pd{A}{x_{11}} = -A \odot \bmatrix{ (a_{11}-1) & 0 & & 0 \\ a_{11} & 0 & \cdots & 0 \\ a_{11} & 0 & & 0 \\ \vdots & & \ddots & \vdots \\ a_{11} & 0 & \cdots & 0 \\ } & \pd{A}{x_{12}} = -A \odot \bmatrix{ 0 & (a_{12}-1) & 0 & & 0 \\ 0 & a_{12} & 0 & \cdots & 0 \\ 0 & a_{12} & 0 & & 0 \\ & \vdots & & \ddots & \vdots \\ 0 & a_{12} & 0 & \cdots & 0 \\ } & \cdots \\ & \\ \pd{A}{x_{21}} = -A \odot \bmatrix{ a_{21} & 0 & & 0 \\ (a_{21}-1) & 0 & \cdots & 0 \\ a_{21} & 0 & & 0 \\ \vdots & & \ddots & \vdots \\ a_{21} & 0 & \cdots & 0 \\ } & \pd{A}{x_{22}} = -A \odot \bmatrix{ 0 & a_{22} & 0 & & 0 \\ 0 & (a_{22}-1) & 0 & \cdots & 0 \\ 0 & a_{22} & 0 & & 0 \\ & \vdots & & \ddots & \vdots \\ 0 & a_{22} & 0 & \cdots & 0 \\ } & \cdots \\ \vdots \end{matrix}$$

Now, when contracting each of these matrices with $\pd{c}{A}^\top = -\frac{1}{n}Y\oslash A$:

For each entry, we ‘overlay’ the matrices and sum (overlaying means applying the element-wise product).

$$\begin{aligned} \left[\pd{c}{X}\right]^\top_{11} &= \sum -A \odot \bmatrix{ (a_{11}-1) & 0 & & 0 \\ a_{11} & 0 & \cdots & 0 \\ a_{11} & 0 & & 0 \\ \vdots & & \ddots & \vdots \\ a_{11} & 0 & \cdots & 0 \\ } \odot -\frac{1}{n}Y\oslash A \\ &= \frac{1}{n} \sum \bmatrix{ y_{11}(a_{11}-1) & 0 & & 0 \\ y_{21}a_{11} & 0 & \cdots & 0 \\ y_{31}a_{11} & 0 & & 0 \\ \vdots & & \ddots & \vdots \\ y_{m1}a_{11} & 0 & \cdots & 0 \\ } \\ &= \frac{1}{n}\left(a_{11}\left(\sum_i y_{i1}\right) - y_{11}\right) \end{aligned}$$

As all operations are element-wise, we can rearrange them just like with ordinary multiplication.

If the columns of $Y$ are probability distributions (e.g. one-hot labels), then they each sum to one: $\sum_i y_{i1} = 1$, and we obtain $\pd{c}{x_{11}} = \frac{1}{n}(a_{11} - y_{11})$.

In total we have:

$$\pd{c}{X}^\top = \bmatrix{ \frac{1}{n}(a_{11} - y_{11}) &\frac{1}{n}(a_{12} - y_{12}) & \cdots \\ & \\ \frac{1}{n}(a_{21} - y_{21}) &\frac{1}{n}(a_{22} - y_{22}) \\ & \\ \vdots & & \ddots }$$

Thus the gradient simplifies to:

$$\pd{c}{X}^\top = \frac{1}{n}( A - Y)$$
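To close, a finite-difference check of this formula (a sketch; `softmax_columns` and the random one-hot matrix $Y$ are my own illustrative choices, chosen so that each column of $Y$ sums to 1):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 6
X = rng.standard_normal((m, n))

# Random one-hot labels: each column of Y sums to 1.
Y = np.zeros((m, n))
Y[rng.integers(0, m, size=n), np.arange(n)] = 1.0

def softmax_columns(X):
    E = np.exp(X - X.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def cost(X):
    A = softmax_columns(X)
    return -np.sum(Y * np.log(A)) / n

A = softmax_columns(X)
grad_analytic = (A - Y) / n

# Check one entry against a finite difference.
eps = 1e-6
P = np.zeros_like(X)
P[1, 2] = eps
print(grad_analytic[1, 2], (cost(X + P) - cost(X)) / eps)
```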