JAX Fundamentals (Part \(1\))

JAX_tutorial

Monte Carlo Methods

Problem: Distinguish between Las Vegas methods and Monte Carlo methods.

Solution: Both are umbrella terms referring to broad classes of methods that draw (repeatedly) from (not necessarily i.i.d.) random variables to compute the value of some deterministic variable. Here, “compute” means “find that deterministic value with certainty” in the case of a Las Vegas method, whereas “compute” means “(point) estimate” in the case of a Monte Carlo method. Thus, their key difference lies in how they trade off time \(t\) vs. accuracy \(\alpha\): a Las Vegas method runs for a random amount of time but returns an exact answer, whereas a Monte Carlo method runs within a fixed computational budget but returns an answer that is only approximately correct (with the approximation improving as the budget grows).

Problem: Given a scalar continuous random variable with probability density function \(p(x)\), how can one sample faithfully from such an \(x\sim p(x)\)?

Solution: Fundamentally, a computer can (approximately) draw from a uniform random variable \(u\in [0,1]\) (using seed-based pseudorandom number generators for instance). In order to convert this \(u\to x\), the standard way is to use the cumulative distribution function of \(x\):

\[\int_{-\infty}^xdx’p(x’):=u\]

Then, as the probability density of \(u\) is uniform \(p(u)=1\), the probability density of \(x\) will be \(p(x)=\frac{du}{dx}p(u)\). As the CDF is a monotonically increasing bijection, there always exists an \(x=x(u)\) for each \(u\in[0,1]\). Intuitively, this whole sampling scheme makes sense: regions where \(p(x)\) is large are exactly where the CDF rises steeply, so they absorb a correspondingly large fraction of the uniform draws \(u\).
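For concreteness, here is a minimal NumPy sketch of the inverse CDF trick, using the exponential density \(p(x)=\lambda e^{-\lambda x}\) (an illustrative choice for which the CDF inverts in closed form, \(x=-\ln(1-u)/\lambda\)) as the target:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam, n_samples):
    """Inverse-CDF sampling for p(x) = lam * exp(-lam * x), x >= 0.

    The CDF is F(x) = 1 - exp(-lam * x), so solving F(x) = u for a
    uniform u in [0, 1) gives x = -ln(1 - u) / lam.
    """
    u = rng.uniform(size=n_samples)   # the uniform draws a computer can actually do
    return -np.log1p(-u) / lam        # log1p(-u) = log(1 - u), more stable for u near 0

samples = sample_exponential(lam=2.0, n_samples=100_000)
print(samples.mean())   # should be close to 1 / lam = 0.5
```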

Problem: So, it seems like the above “inverse CDF trick” has solved the problem of sampling from an arbitrary random variable \(x\) (and it generalizes to the case of a higher-dimensional random vector \(\mathbf x\in\mathbf R^n\) by sampling conditionally). So how come in many practical applications this “sampling” problem can still be difficult?

Solution: In practice, one may not have access to \(p(\mathbf x)\) itself, but only the general shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) of it (e.g. in Bayesian inference). One might object that \(Z\) is known, namely \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\) can be written explicitly as an \(n\)-dimensional integral, so wherein lies the difficulty? The difficulty lies in evaluating the integral, which suffers from the curse of dimensionality if \(n\gg 1\) is large (just imagine approximating the integral by a sum, and then realizing that the number of terms \(N\) in such a series would grow as \(N\sim O(\exp n)\)). The inverse CDF trick thus fails because it requires one to have \(p(\mathbf x)\) itself, not merely \(\tilde p(\mathbf x)\).

Problem: Describe the Monte Carlo method known as importance sampling for estimating the expectation \(\langle f(\mathbf x)\rangle\) of a random variable \(f(\mathbf x)\).

Solution: Importance sampling consists of a deliberate reframing of the integrand of the expectation inner product integral:

\[\langle f(\mathbf x)\rangle=\int d^n\mathbf x p(\mathbf x)f(\mathbf x)=\int d^n\mathbf x q(\mathbf x)\frac{p(\mathbf x)f(\mathbf x)}{q(\mathbf x)}\]

in other words, replacing \(p(\mathbf x)\mapsto q(\mathbf x)\) and \(f(\mathbf x)\mapsto p(\mathbf x)f(\mathbf x)/q(\mathbf x)\). There are \(2\) cases in which importance sampling can be useful. The first case is related to the previous problem, namely when sampling directly from \(p(\mathbf x)\) is too difficult, in which case \(q(\mathbf x)\) should be chosen to be easier to both sample from and evaluate than \(p(\mathbf x)\). The second case is, even if sampling directly from \(p(\mathbf x)\) is feasible, it can still be useful to instead sample from \(q(\mathbf x)\) as a means of reducing the variance of the Monte Carlo estimator. That is, whereas the original Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws \(\mathbf x_1,…,\mathbf x_N\) from \(p(\mathbf x)\) is \(\sum_{i=1}^Nf(\mathbf x_i)/N\) with variance \(\sigma^2_{f(\mathbf x)}/N\), the new Monte Carlo estimator of the expectation involving \(N\) i.i.d. draws from \(q(\mathbf x)\) is \(\sum_{i=1}^Np(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)N\) with variance \(\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}/N\); for importance sampling to be useful, \(q(\mathbf x)\) should be chosen such that:

\[\sigma^2_{p(\mathbf x)f(\mathbf x)/q(\mathbf x)}<\sigma^2_{f(\mathbf x)}\]

One can explicitly find the function \(q(\mathbf x)\) that minimizes the variance functional of \(q(\mathbf x)\) (using a Lagrange multiplier to enforce normalization \(\int d^n\mathbf x q(\mathbf x)=1\)):

\[q(\mathbf x)\sim p(\mathbf x)|f(\mathbf x)|\]

In and of itself, this \(q(\mathbf x)\) cannot be used because it contains the difficult piece \(p(\mathbf x)\). Nevertheless, motivated by this result, one can heuristically importance sample using a \(q(\mathbf x)\) which ideally is large at \(\mathbf x\in\mathbf R^n\) where \(p(\mathbf x)|f(\mathbf x)|\) is large. Intuitively, this is because that’s the integrand in the expectation inner product integral, so locations where it’s large will contribute most to the integral, hence are the most “important” regions to sample from with \(q(\mathbf x)\).

One glaring flaw with the above discussion is that if \(Z\) is not known, then the importance weights \(p(\mathbf x_i)/q(\mathbf x_i)\) cannot even be evaluated (let alone could one sample directly from \(p(\mathbf x)\)). Instead, suppose as before that only the shape \(\tilde p(\mathbf x)=Zp(\mathbf x)\) is known, with \(Z=\int d^n\mathbf x\tilde p(\mathbf x)\). Thus, the expectation is:

\[\frac{\int d^n\mathbf x\tilde{p}(\mathbf x)f(\mathbf x)}{\int d^n\mathbf x\tilde{p}(\mathbf x)}\]

The key is to notice that the denominator is just the \(f(\mathbf x)=1\) special case of the numerator, hence the importance sampling trick can be applied to both numerator and denominator, thereby obtaining the (biased but consistent, i.e. asymptotically unbiased) Monte Carlo estimator:

\[\frac{\sum_{i=1}^N\tilde p(\mathbf x_i)f(\mathbf x_i)/q(\mathbf x_i)}{\sum_{i=1}^N\tilde p(\mathbf x_i)/q(\mathbf x_i)}\]

for the expectation.
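For concreteness, here is a minimal NumPy sketch of this self-normalized importance sampling estimator, taking as an (arbitrary, illustrative) example the unnormalized Gaussian shape \(\tilde p(x)=e^{-x^2/2}\) (whose normalization \(Z=\sqrt{2\pi}\) the code never uses), a wider Gaussian proposal \(q(x)\), and \(f(x)=x^2\) so that the true expectation is \(1\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized target shape: a standard Gaussian with its 1/sqrt(2*pi) dropped.
p_tilde = lambda x: np.exp(-0.5 * x**2)

# Proposal q: a wider Gaussian, easy both to sample from and to evaluate.
q_sigma = 3.0
q_pdf = lambda x: np.exp(-0.5 * (x / q_sigma)**2) / (q_sigma * np.sqrt(2 * np.pi))

f = lambda x: x**2            # want <f(x)> = 1 under the (normalized) target

N = 200_000
x = rng.normal(scale=q_sigma, size=N)      # i.i.d. draws from q
w = p_tilde(x) / q_pdf(x)                  # unnormalized importance weights

estimate = np.sum(w * f(x)) / np.sum(w)    # self-normalized estimator
print(estimate)                            # ~ 1.0
```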

Problem: Describe what Markov chain Monte Carlo (MCMC) methods are about.

Solution: Whereas in “naive” Monte Carlo methods the random vector draws \(\mathbf x_1,…,\mathbf x_N\) are often i.i.d., here the key innovation is that the sequence of random vectors is no longer i.i.d.: \(\mathbf x_t\) depends directly on \(\mathbf x_{t-1}\) (and, conditional on \(\mathbf x_{t-1}\), is independent of all the earlier \(\mathbf x_{<t-1}\)). In other words, the sequence \(\mathbf x_1,…,\mathbf x_N\) forms a (discrete-time) Markov chain, i.e. there exists a transition kernel with “matrix elements” \(p(\mathbf x’|\mathbf x)\) for any pair of states \(\mathbf x,\mathbf x’\in\mathbf R^n\) (assuming the DTMC is \(t\)-independent/homogeneous). The goal of an MCMC method is to find a suitable \(p(\mathbf x’|\mathbf x)\) such that the stationary \(N\to\infty\) distribution of the chain matches \(p(\mathbf x)\). Thus, it’s basically an “MCMC sampler”.

Problem: What does it mean to burn in an MCMC run? What does it mean to thin an MCMC run?

Solution: Given a DTMC \(\mathbf x_1,…,\mathbf x_N\), the idea of burning-in the chain is to discard the first few samples \(\mathbf x_1,\mathbf x_2,…\) as they’re correlated with the (arbitrary) choice of starting \(\mathbf x_1\) and so are not representative of \(p(\mathbf x)\) (equivalently, one hopes to burn-in for the first \(N_b\) samples where \(N_b\) is around the mixing time of the Markov chain). Burn-in therefore gives the MCMC run a chance to find the high-probability regions of \(p(\mathbf x)\).

Thinning an MCMC run means keeping only e.g. every \(10^{\text{th}}\) point of the chain and discarding the rest; this reduces the correlation between the retained samples, since the goal is to emulate (approximately) independent draws from \(p(\mathbf x)\).

Problem: The Metropolis-Hastings method is a general template for a fairly broad family of MCMC methods. Explain how it works.

Solution: In order for \(p(\mathbf x)\) to be the stationary distribution of a DTMC, by definition one requires it to be an eigenvector (with eigenvalue \(1\)) of the transition kernel:

\[\int d^n\mathbf xp(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x’)\]

A sufficient (though not necessary) condition for this is if the chain is in detailed balance:

\[p(\mathbf x’|\mathbf x)p(\mathbf x)=p(\mathbf x|\mathbf x’)p(\mathbf x’)\]

for all \(\mathbf x,\mathbf x’\in\mathbf R^n\). The idea of Metropolis-Hastings is to conceptually decompose the transition from state \(\mathbf x\to\mathbf x’\) into \(2\) simpler steps, namely:

  1. Propose the transition \(\mathbf x\to\mathbf x’\) with proposal probability \(p_p(\mathbf x’|\mathbf x)\) (this distribution \(p_p\) needs to be easy to sample \(\mathbf x’\) from for any \(\mathbf x\)!)
  2. Accept/reject the proposed transition \(\mathbf x\to\mathbf x’\) with an acceptance probability \(p_a(\mathbf x’|\mathbf x)\) (this distribution also needs to be easy to sample from!)

Thus, \(p(\mathbf x’|\mathbf x)=p_a(\mathbf x’|\mathbf x)p_p(\mathbf x’|\mathbf x)\). The ratio of acceptance probabilities from detailed balance is therefore:

\[\frac{p_a(\mathbf x’|\mathbf x)}{p_a(\mathbf x|\mathbf x’)}=\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\]

The ratio on the RHS is known (furthermore, because it’s a ratio, neither \(p_p\) nor \(p\) need be normalized! In other words, one can replace \(p_p\mapsto \tilde p_p\) and \(p\mapsto\tilde p\)). The task remains to specify the form of \(p_a(\mathbf x’|\mathbf x)\). The detailed balance condition doesn’t give a unique solution for \(p_a\), but heuristically one can get a unique solution by further insisting that the walker accept moves as often as possible. That is, w.l.o.g. suppose the RHS is computed to be \(>1\), so that \(p_a(\mathbf x’|\mathbf x)>p_a(\mathbf x|\mathbf x’)\). This implies that the transition \(\mathbf x\to\mathbf x’\) is in some sense more favourable than the reverse transition \(\mathbf x’\to\mathbf x\), so if at time \(t\) the MCMC walker were at \(\mathbf x_t=\mathbf x\) and the state \(\mathbf x’\) were proposed, then the walker should definitely accept \(\mathbf x_{t+1}:=\mathbf x’\). In other words, one would like to set \(p_a(\mathbf x’|\mathbf x):=1\) and hence, by detailed balance, \(p_a(\mathbf x|\mathbf x’)=\frac{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}\). But this fixes the form of the function \(p_a\)! Specifically, to evaluate \(p_a(\mathbf x’|\mathbf x)\), one should set it to \(1\) if the RHS ratio is \(>1\), whereas if the RHS ratio is \(<1\), then \(p_a\) should be set to the RHS ratio itself. An elegant way to encode all of this into a single compact expression is just:

\[p_a(\mathbf x’|\mathbf x)=\min\left(1,\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}\right)\]

Practically, to actually implement this Metropolis-Hastings acceptance probability \(p_a\), one can essentially flip a “biased” coin with probability \(p_a(\mathbf x’|\mathbf x)\) of landing “heads” (meaning to accept the transition \(\mathbf x\to\mathbf x’\)), e.g. randomly picking a number \(u\in[0,1]\) and declaring acceptance iff \(u\leq p_a(\mathbf x’|\mathbf x)\).

(aside: if the proposal probability \(p_p(\mathbf x’|\mathbf x)\sim e^{-|\mathbf x’-\mathbf x|^2/2\sigma^2}\) is taken to be normally and isotropically distributed about \(\mathbf x\), then it is symmetric \(p_p(\mathbf x’|\mathbf x)=p_p(\mathbf x|\mathbf x’)\) and the MH acceptance probability simplifies \(\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(\mathbf x’)}{p(\mathbf x)}\) (in which case this is sometimes just called the Metropolis method). Another tip is that the burn-in period of the MCMC walk can often be used to tune the variance \(\sigma^2\) of the above Gaussian \(p_p\) distribution, namely if the burn-in period lasts for \(N_b\) samples, then look at the number of times \(N_a\) the MCMC walker accepted a proposed state from \(p_p\); a rule-of-thumb is that in high-dimensions \(n\gg 1\), one should tune \(\sigma^2\) such that the fraction of accepted proposals \(N_a/N_b\approx 23.4\%\), see “Weak convergence and optimal scaling of random walk Metropolis algorithms” by Roberts, Gelman, and Gilks (\(1997\)). Finally, one more MCMC trick worth mentioning is that one can always run e.g. \(\sim 100\) MCMC sequences in parallel and sample randomly from them as an alternative to thinning).
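Here is a minimal NumPy sketch of a random-walk Metropolis sampler along the lines just described; the target is an arbitrary unnormalized correlated \(2\)D Gaussian chosen purely for illustration, only \(\ln\tilde p\) is ever evaluated, and the acceptance test is done in log-space to avoid overflow:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_p_tilde(x):
    """Unnormalized log-density of a correlated 2D Gaussian (Z is never needed)."""
    cov_inv = np.array([[1.0, -0.8], [-0.8, 1.0]]) / (1 - 0.8**2)
    return -0.5 * x @ cov_inv @ x

def metropolis(log_p_tilde, x0, n_steps, sigma=0.5):
    x = np.array(x0, dtype=float)
    chain, n_accept = [], 0
    for _ in range(n_steps):
        x_prop = x + sigma * rng.normal(size=x.shape)   # symmetric Gaussian proposal
        # log of the Metropolis acceptance ratio; the normalizing constant cancels
        if np.log(rng.uniform()) < log_p_tilde(x_prop) - log_p_tilde(x):
            x, n_accept = x_prop, n_accept + 1
        chain.append(x.copy())
    return np.array(chain), n_accept / n_steps

chain, accept_rate = metropolis(log_p_tilde, x0=[5.0, -5.0], n_steps=50_000)
burned = chain[5_000::10]                 # burn in, then thin
print(accept_rate, burned.mean(axis=0), np.cov(burned.T))
```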

Problem: Explain how Gibbs sampling works.

Solution: Whereas a Metropolis-Hastings sampler moves around in state space like a queen in chess, a Gibbs sampler moves around like a rook. That is, it only ever moves along the \(n\) coordinate directions in \(\mathbf x=(x_1,…,x_n)\). Indeed, the Gibbs sampler can be viewed as a special case of a Metropolis-Hastings sampler in which the proposal distribution \(p_p\) is given by:

\[p_p(\mathbf x’|\mathbf x):=p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)\]

where \(i=1,…,n\) is cycled through the \(n\) coordinate directions. For each \(1\leq i\leq n\), the MH acceptance probability is:

\[\frac{p_p(\mathbf x|\mathbf x’)p(\mathbf x’)}{p_p(\mathbf x’|\mathbf x)p(\mathbf x)}=\frac{p(x_i|x’_1,…,x’_{i-1},x’_{i+1},…,x’_n)\delta(x_1-x’_1)…\delta(x_{i-1}-x’_{i-1})\delta(x_{i+1}-x’_{i+1})…\delta(x_n-x’_n)p(\mathbf x’)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)\delta(x’_1-x_1)…\delta(x’_{i-1}-x_{i-1})\delta(x’_{i+1}-x_{i+1})…\delta(x’_n-x_n)p(\mathbf x)}\]

\[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x’_i,…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_n)}\]

\[=\frac{p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}{p(x’_i|x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_1,…,x_{i-1},x_{i+1},…,x_n)p(x_i|x_1,…,x_{i-1},x_{i+1},…,x_n)}\]

\[=1\]

so \(\text{min}(1,1)=1\); in other words, the Gibbs sampler always accepts because the proposal is coming from the conditional probability slices themselves (assumed to be known even if the full joint distribution \(p(\mathbf x)\) is unknown), so they must be authentic samples in a sense. For a Gibbs sampler, only after cycling through all \(n\) coordinate directions \(i=1,…,n\) does it count as \(1\) MCMC iteration.

(aside: a common variant of the basic Gibbs sampling formalism described above is block Gibbs sampling in which one may choose to update \(2\) or more features simultaneously, conditioning on the distribution of those features given all other features fixed; this would enable diagonal movements within that subspace which, depending how correlated those features are, could significantly increase the effective sample size (ESS)).
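As a minimal Python sketch of a Gibbs sampler, consider the simplest nontrivial example (an illustrative choice): a bivariate standard Gaussian with correlation \(\rho\), whose conditional slices are the known \(1\)D Gaussians \(x_1|x_2\sim\mathcal N(\rho x_2,1-\rho^2)\) and vice versa:

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8                    # correlation of the target bivariate standard Gaussian

def gibbs_bivariate_normal(n_steps, x0=(0.0, 0.0)):
    x1, x2 = x0
    chain = np.empty((n_steps, 2))
    for t in range(n_steps):
        # one full sweep over both coordinate directions = one MCMC iteration
        x1 = rng.normal(loc=rho * x2, scale=np.sqrt(1 - rho**2))  # draw from p(x1 | x2)
        x2 = rng.normal(loc=rho * x1, scale=np.sqrt(1 - rho**2))  # draw from p(x2 | x1)
        chain[t] = (x1, x2)
    return chain

chain = gibbs_bivariate_normal(100_000)[1_000:]   # drop a short burn-in
print(np.corrcoef(chain.T))                       # off-diagonal entry ~ rho
```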

Problem: Explain how the Hamiltonian Monte Carlo method works.

Solution: The idea is to augment the state \(\mathbf x\) with an auxiliary momentum variable \(\mathbf p\sim\mathcal N(\mathbf 0,M)\) and to sample from the joint distribution \(\propto e^{-H(\mathbf x,\mathbf p)}\) with Hamiltonian \(H(\mathbf x,\mathbf p)=-\ln\tilde p(\mathbf x)+\frac{1}{2}\mathbf p\cdot M^{-1}\mathbf p\); that is, the “potential energy” is the negative log of the (unnormalized) target and the “kinetic energy” comes from the momentum. Each MCMC iteration resamples the momentum \(\mathbf p\), numerically integrates Hamilton’s equations for \(L\) leapfrog steps of size \(\varepsilon\) (the leapfrog integrator being volume-preserving and time-reversible, it defines a valid symmetric proposal), and then accepts or rejects the trajectory’s endpoint with the Metropolis probability \(\min(1,e^{-\Delta H})\), which corrects for the discretization error of the integrator. Because the proposal follows the gradient \(\frac{\partial\ln\tilde p}{\partial\mathbf x}\) over long trajectories, HMC can make distant moves that are still accepted with high probability, and so it typically explores high-dimensional targets far more efficiently than a random-walk proposal.
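A minimal JAX sketch of such a leapfrog-based HMC iteration is below; the \(2\)D correlated Gaussian target, the step size and the number of leapfrog steps are all arbitrary illustrative choices, and the mass matrix is taken to be the identity:

```python
import jax
import jax.numpy as jnp

def log_p_tilde(x):
    """Unnormalized log-target: a correlated 2D Gaussian."""
    cov_inv = jnp.array([[1.0, -0.8], [-0.8, 1.0]]) / (1 - 0.8**2)
    return -0.5 * x @ cov_inv @ x

grad_log_p = jax.grad(log_p_tilde)

def hmc_step(key, x, step_size=0.15, n_leapfrog=20):
    key_p, key_u = jax.random.split(key)
    p = jax.random.normal(key_p, x.shape)                 # resample momentum ~ N(0, I)
    h0 = -log_p_tilde(x) + 0.5 * p @ p                    # initial Hamiltonian
    x_new, p_new = x, p
    p_new = p_new + 0.5 * step_size * grad_log_p(x_new)   # initial half kick
    for _ in range(n_leapfrog - 1):
        x_new = x_new + step_size * p_new                 # drift
        p_new = p_new + step_size * grad_log_p(x_new)     # full kick
    x_new = x_new + step_size * p_new
    p_new = p_new + 0.5 * step_size * grad_log_p(x_new)   # final half kick
    h1 = -log_p_tilde(x_new) + 0.5 * p_new @ p_new
    accept = jnp.log(jax.random.uniform(key_u)) < h0 - h1  # Metropolis correction
    return jnp.where(accept, x_new, x)

key = jax.random.PRNGKey(0)
x = jnp.array([5.0, -5.0])
samples = []
for _ in range(5_000):
    key, subkey = jax.random.split(key)
    x = hmc_step(subkey, x)
    samples.append(x)
print(jnp.stack(samples)[1_000:].mean(axis=0))   # ~ (0, 0) after burn-in
```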


Differential Geometry

Problem: What does it mean for a topological space \(X\) to be locally homeomorphic to a topological space \(Y\)? Hence, what does it mean for a topological space \(X\) to be locally Euclidean?

Solution: \(X\) is said to be locally homeomorphic to \(Y\) iff every point \(x\in X\) has a neighbourhood homeomorphic to an open subset of \(Y\). In particular, \(X\) is said to be locally Euclidean iff it is locally homeomorphic to Euclidean space \(Y=\mathbf R^n\) for some \(n\in\mathbf N\) called the dimension of \(X\).

Explicitly, this means every point \(x\in X\) is associated to at least one neighbourhood \(U_x\) along with at least one homeomorphism \(\phi_x:U_x\to\phi_x(U_x)\subset\mathbf R^n\); the ordered pair \((U_x,\phi_x)\) is an example of a chart on \(X\), and any collection of charts covering \(X\) (such as \(\{(U_x,\phi_x):x\in X\}\)) is called an atlas for \(X\). Hence, \(X\) is locally Euclidean iff there exists an atlas for \(X\).

(caution: there is a distinct concept of a local homeomorphism from \(X\to Y\); existence of such implies that \(X\) is locally homeomorphic to \(Y\), but the converse is false).

Problem: Give an example of topological spaces \(X\) and \(Y\) such that \(X\) is locally homeomorphic to \(Y\) but not vice versa.

Solution: \(X=(0,1)\) and \(Y=[0,1)\). Thus, be wary that local homeomorphicity is not a symmetric relation of topological spaces.

Problem: Let \(X\) be an \(n\)-dimensional locally Euclidean topological space. What additional topological constraints are typically placed on \(X\) in order for it to qualify as a topological \(n\)-manifold?

Solution:

  1. \(X\) is Hausdorff.
  2. \(X\) is second countable.

That being said, conceptually the most important criterion to remember is the local Euclideanity of \(X\). Topological manifolds are the most basic/general/fundamental of manifolds, onto which additional structures may be attached.

Problem: Let \(X\) be a topological \(n\)-manifold. What additional piece of structure should be attached to \(X\) so as to consider it a \(C^k\)-differentiable \(n\)-manifold?

Solution: The additional piece of structure is a \(C^k\)-atlas. A \(C^k\)-atlas for \(X\) is just an atlas for \(X\) with the property that all overlapping charts are compatible with each other. This means that if \((U_1,\phi_1),(U_2,\phi_2)\) are \(2\) charts of a \(C^k\)-atlas for \(X\) such that \(U_1\cap U_2\neq\emptyset\), then the transition map \(\phi_2\circ\phi_1^{-1}:\phi_1(U_1\cap U_2)\subset\mathbf R^n\to\phi_2(U_1\cap U_2)\subset\mathbf R^n\) is \(C^k\)-differentiable (equivalent to \(\phi_1\circ\phi_2^{-1}\) being \(C^k\)-differentiable).

An important point to mention is that a given topological \(n\)-manifold may have multiple different \(C^k\)-atlases. According to the above definition, that would seem to give rise to \(2\) distinct \(C^k\)-manifolds…but intuitively that would be undesirable; they should be viewed as really the same \(C^k\) manifold. This motivates imposing an equivalence relation on \(C^k\)-manifolds by asserting that two \(C^k\)-manifolds are equivalent iff their corresponding \(C^k\)-atlases are compatible…which just means that the union of their \(C^k\)-atlases is itself a \(C^k\)-atlas, or equivalently every chart in one \(C^k\)-atlas is compatible with every chart in the other \(C^k\)-atlas. This is sometimes formalized by introducing a so-called “maximal \(C^k\)-atlas”, but the basic idea should be clear.

Finally, note that topological manifolds are just \(C^0\)-manifolds because homeomorphisms are continuous. In physics, the most common case is to deal with \(C^{\infty}\)-manifolds (also called smooth manifolds), such as spacetime in GR, classical configuration/phase spaces in CM, or state space in thermodynamics.

Problem: Let \(S^1\) be the circle, a \(1\)-dimensional smooth manifold which is regarded as being embedded in \(\mathbf R^2\) via the set of points \(x^2+y^2=1\). Explain why \(S^1\) is locally (but not globally) homeomorphic to the real line \(\mathbf R\). Furthermore, explain whether or not the pair \((S^1,\theta)\) (where \(\theta:(\cos\theta,\sin\theta)\in S^1\mapsto\theta\in[0,2\pi)\) is the obvious angular coordinate mapping) is a chart on \(S^1\) or not.

Solution: \(S^1\) is locally homeomorphic to \(\mathbf R\) because, roughly speaking, every little strip of open arc in \(S^1\), upon zooming in, looks like a strip of straight line in \(\mathbf R\). However, \(S^1\) is not globally homeomorphic to \(\mathbf R\) because \(S^1\) is compact whereas \(\mathbf R\) is unbounded. The pair \((S^1,\theta)\) is not a chart on \(S^1\) because its image \([0,2\pi)\) isn’t open in \(\mathbf R\).

Problem: Hence, demonstrate an example of an atlas for \(S^1\).

Solution: To circumvent the problem above, one requires \(2\) charts to cover \(S^1\), thereby forming an atlas for \(S^1\). For example, a chart \(\theta_1:S^1-\{(1,0)\}\to (0,2\pi)\) similar to the above but with the point \((1,0)\) removed, and a second chart \(\theta_2:S^1-\{(-1,0)\}\to (-\pi,\pi)\) with the antipodal point \((-1,0)\) removed. The domains of these \(2\) charts overlap on the disjoint union of the open upper and lower semicircles, and the transition function on this overlapping domain is:

\[\theta_2(\theta_1^{-1}(\theta))=\theta[\theta\in(0,\pi)]+(\theta-2\pi)[\theta\in(\pi,2\pi)]\]

which is indeed \(C^{\infty}\).

Problem: Define the real associative algebra \(C^k(X\to\mathbf R)\).

Solution: This is simply the space of all \(C^k\)-differentiable scalar fields on the \(C^k\)-manifold \(X\). To be precise, \(f\in C^k(X\to\mathbf R)\) iff for all charts \((U,\phi)\) in the \(C^k\)-atlas defining the differential structure of \(X\), the map \(f\circ\phi^{-1}:\phi(U)\to\mathbf R\) is \(C^k\)-differentiable.

Problem: What is a tangent vector to a \(C^1\)-manifold \(X\) at a point \(x\in X\)?

Solution: In the algebraic formulation, a tangent vector is simply any \(\mathbf R\)-derivation on \(C^1(X\to\mathbf R)\) at a specific point \(x\in X\) on the manifold. Concretely, this means that a tangent vector at \(x\in X\) is any real-valued function \(\partial_{\mu}|_x:C^1(X\to\mathbf R)\to\mathbf R\) which is linear and satisfies the product rule:

\[\partial_{\mu}|_x(\alpha f+\beta g)=\alpha\partial_{\mu}|_x(f)+\beta\partial_{\mu}|_x(g)\]

\[\partial_{\mu}|_x(fg)=f(x)\partial_{\mu}|_x(g)+\partial_{\mu}|_x(f)g(x)\]

A more intuitive but equivalent formulation of what “tangent vector” means is to consider trajectories \(\gamma(t)\in X\) through the manifold \(X\) passing through the point \(x\in X\) at time \(t\). Then, choosing an arbitrary chart \((U,\phi)\) in a neighbourhood of \(x\in U\), one considers the coordinatized version of \(\gamma(t)\in U\subset X\), namely \(\phi(\gamma(t))\in\mathbf R^n\) which will be differentiable with some derivative \(\frac{d}{dt}\phi(\gamma(t))\in\mathbf R^n\). All curves \(\gamma(t)\) with the same derivative at \(x\) are considered to form an equivalence class, and the value of that derivative is then effectively the tangent vector. Unlike the algebraic formulation, here one has to check that this definition is independent of the choice of neighbourhood chart \((U,\phi)\) of \(x\in U\subset X\).

Problem: What is the tangent space \(T_x(X)\) to a \(C^1\)-manifold \(X\) at a point \(x\in X\)?

Solution: \(T_x(X)\) is simply the set of all tangent vectors to \(X\) at \(x\in X\). As the name of its elements (i.e. “tangent vectors“) suggests, \(T_x(X)\) is a real, \(n\)-dimensional vector space. By choosing a neighbourhood chart \((U,\phi)\) containing \(x\in U\), this provides \(n\) local coordinates \(\phi(x)=(x^0(x),…,x^{n-1}(x))\) on \(U\). An explicit basis for \(T_x(X)\) then consists of the \(n\) partial derivative tangent vector operators \(\partial_{\mu}|_x\in T_x(X)\) for \(\mu=0,…,n-1\) defined by:

\[\partial_{\mu}|_x(f):=\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}\]

(which is well-defined because \(f\in C^1(X\to\mathbf R)\)). Any tangent vector \(v_x\in T_x(X)\) at \(x\in X\) can then be written as a linear combination of these basis tangent vectors:

\[v_x=v^{\mu}\partial_{\mu}|_x\]

with real coefficients \(v^{\mu}\in\mathbf R\) called the contravariant components of \(v_x\in T_x(X)\) in the coordinate basis \(\partial_{\mu}|_x\).

Problem: A given tangent vector \(v_x\in T_x(X)\) may be written with contravariant components in \(2\) distinct coordinate bases:

\[v_x=v^{\mu}\partial_{\mu}|_x=v’^{\nu}\partial’_{\nu}|_x\]

simply from picking \(2\) different charts \((U,\phi),(U’,\phi’)\) in the \(C^1\)-atlas of \(X\) containing \(x\in U\cap U’\). Describe how the contravariant components \(v’^{\nu}\) may be obtained from the contravariant components \(v^{\mu}\).

Solution: Act on an arbitrary \(f\in C^1(X\to\mathbf R)\) to obtain:

\[v_x(f)=v^{\mu}\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=v’^{\nu}\frac{\partial (f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

Now insert a “resolution of the identity” \(f\circ\phi’^{-1}\circ\phi’\circ\phi^{-1}\) and because \(X\) is a \(C^1\)-manifold, the transition map \(\phi’\circ\phi^{-1}\) will be differentiable and in particular, by the chain rule:

\[\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}\biggr|_{\phi(x)}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}\frac{\partial(f\circ\phi’^{-1})}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\]

where \(\phi'(x)=(x’^0(x),…,x’^{n-1}(x))\). This simple chain rule identity can by itself already be viewed as a passive change of \(T_x(X)\)-basis:

\[\frac{\partial}{\partial x’^{\nu}}\biggr|_{x}=\frac{\partial x^{\mu}}{\partial x’^{\nu}}\biggr|_{\phi'(x)}\frac{\partial}{\partial x^{\mu}}\biggr|_x\]

Or equivalently, as an active change of the contravariant components via the Jacobian matrix:

\[v’^{\nu}=\frac{\partial x’^{\nu}}{\partial x^{\mu}}\biggr|_{\phi(x)}v^{\mu}\]
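This transformation law is easy to check numerically; the following JAX sketch uses a hypothetical, purely illustrative pair of charts on (an open subset of) \(\mathbf R^2\), namely polar coordinates \((r,\theta)\) for the unprimed chart and Cartesian coordinates \((X,Y)\) for the primed chart, and also verifies that the scalar \(v_x(f)\) is chart-independent:

```python
import jax
import jax.numpy as jnp

def cartesian_from_polar(x):
    """Transition map phi' o phi^{-1}: (r, theta) -> (X, Y)."""
    r, theta = x
    return jnp.array([r * jnp.cos(theta), r * jnp.sin(theta)])

x_polar = jnp.array([2.0, jnp.pi / 6])   # the point x, in polar coordinates
v_polar = jnp.array([1.0, 0.5])          # contravariant components (v^r, v^theta)

# Jacobian matrix dx'^nu / dx^mu evaluated at the point
J = jax.jacfwd(cartesian_from_polar)(x_polar)
v_cartesian = J @ v_polar                # v'^nu = (dx'^nu/dx^mu) v^mu
print(v_cartesian)

# chart-independence check: v_x(f) agrees whether computed in polar or Cartesian coords
f_cart = lambda X: X[0]**2 * X[1]
f_polar = lambda x: f_cart(cartesian_from_polar(x))
print(jax.grad(f_cart)(cartesian_from_polar(x_polar)) @ v_cartesian,
      jax.grad(f_polar)(x_polar) @ v_polar)
```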

Problem: What is a vector field \(v\) on a \(C^1\)-manifold \(X\)?

Solution: The more intuitive way to understand it is as a map \(v:X\to TX\), where \(TX:=\bigsqcup_{x\in X}T_x(X)\) is the so-called tangent bundle of \(X\). This basically corresponds to assigning a tangent vector \(v_x\in T_x(X)\) rooted at each point \(x\in X\) across the manifold \(X\). Alternatively, a more algebraic definition of a vector field is as a map from scalar fields to scalar fields \(v:C^1(X\to\mathbf R)\to C^1(X\to\mathbf R)\), defined so that for a scalar field \(f:X\to\mathbf R\) on the manifold \(X\), the new scalar field \(v(f):X\to\mathbf R\) can be thought of as a global directional derivative map \(v(f)(x):=v_x(f)\) giving the directional derivative \(v_x(f)\in\mathbf R\) of the scalar field \(f\) at each point \(x\in X\) according to some assignment \(v_x\in T_x(X)\) of tangent vectors (in the spirit of the first definition).

Problem: Given a \(C^1\)-manifold \(X\), and a neighbourhood of a point \(x\in X\) with local coordinates \((x^0,…,x^{n-1})\), explain how a vector field \(v\) may be expressed in this coordinate basis over said neighbourhood.

Solution: \(v=v^{\mu}\partial_{\mu}\) is essentially a directional derivative operator. Just to be clear, there is a standard abuse of notation going on here where the underlying chart \(\phi\) defining the local coordinates \((x^0,…,x^{n-1})\) is swept under the rug, so for instance \(v(f)(x)=v^{\mu}(x)\frac{\partial f}{\partial x^{\mu}}(x)\) where the shorthand \(\frac{\partial f}{\partial x^{\mu}}(x)=\frac{\partial (f\circ\phi^{-1})}{\partial x^{\mu}}(\phi(x))\) is being used.

Problem: Given \(2\) vector fields \(v,w\) on the same \(C^2\)-manifold \(X\), explain why the composition \(v\circ w\) (and hence also \(w\circ v\)) is not a vector field on \(X\).

Solution: Although it’s linear, it fails to satisfy the product rule: indeed, \((v\circ w)(fg)=f(v\circ w)(g)+g(v\circ w)(f)+v(f)w(g)+v(g)w(f)\), and the two extra cross terms spoil the Leibniz property (the composition is a second-order rather than first-order differential operator).

Problem: Explain how the above situation can be remedied.

Solution: Consider instead the commutator \([v,w]:C^2(X\to\mathbf R)\to C^2(X\to\mathbf R)\). This is still linear, but importantly it also obeys the product rule:

\[[v,w](fg)=f[v,w](g)+g[v,w](f)\]

Indeed, as with any commutator this makes the space of all vector fields not only a real vector space but also a Lie algebra. In a common coordinate basis \((x^0,…,x^{n-1})\) where \(v=v^{\mu}\partial_{\mu}\) and \(w=w^{\nu}\partial_{\nu}\), the commutator has the representation:

\[[v,w]=\left(v^{\mu}\partial_{\mu}w^{\nu}-w^{\mu}\partial_{\mu}v^{\nu}\right)\partial_{\nu}\]
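This coordinate formula can be checked numerically; the following JAX sketch computes \([v,w]\) for two arbitrarily chosen vector fields on \(\mathbf R^2\) (a rotation field and a dilation field), whose bracket happens to vanish:

```python
import jax
import jax.numpy as jnp

# component functions of two vector fields on R^2 (in a single global chart)
v = lambda x: jnp.array([x[1], -x[0]])    # rotation field
w = lambda x: jnp.array([x[0], x[1]])     # dilation field

def lie_bracket(v, w):
    """Components [v, w]^nu = v^mu d_mu w^nu - w^mu d_mu v^nu."""
    def bracket(x):
        Jv = jax.jacfwd(v)(x)   # Jv[nu, mu] = d_mu v^nu
        Jw = jax.jacfwd(w)(x)
        return Jw @ v(x) - Jv @ w(x)
    return bracket

x = jnp.array([1.0, 2.0])
print(lie_bracket(v, w)(x))   # rotations commute with dilations: [v, w] = 0
```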


Support Vector Machines

Problem: Explain how a hard-margin support vector machine would perform binary classification.

Solution: Conceptually, it’s simple. Given a training set of \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) each associated with some binary target label \(y_1,…,y_N\in\{-1,1\}\) (notice the \(2\) binary classes are \(y=-1\) and \(y=1\) rather than the more conventional \(y=0\) and \(y=1\); this is simply a matter of convenience), the goal of an SVM is to compute the unique hyperplane in \(\mathbf R^n\) that will separate the \(y=-1\) and \(y=1\) classes as “best” as possible. For \(n=2\) this can be visualized as trying to find the “widest possible street” between \(2\) neighborhoods:

Mathematically, if the hyperplane decision boundary \(\mathbf w\cdot\mathbf x+b=0\) is defined by a normal vector \(\mathbf w\in\mathbf R^n\) such that the SVM classifies points with \(\mathbf w\cdot\mathbf x+b\geq 0\) as \(y=1\) and \(\mathbf w\cdot\mathbf x+b\leq 0\) as \(y=-1\) (more compactly, \(\hat y_{\text{SVM}}(\mathbf x|\mathbf w,b)=\text{sgn}(\mathbf w\cdot\mathbf x+b)\)), and the overall normalization of \(\mathbf w\) and \(b\) is fixed by stipulating that the “gutters” of the street are defined by the hyperplanes \(\mathbf w\cdot\mathbf x+b=\pm 1\) passing through the support (feature) vectors (drawn as bigger red/green dots in the diagram), then the “width of the street” is the distance between these \(2\) “gutter hyperplanes”, namely \(2/|\mathbf w|\). Since one would like to maximize this margin \(2/|\mathbf w|\), this is equivalent to minimizing the quadratic \(|\mathbf w|^2/2\). Thus, one can formulate this “hard-margin SVM” algorithm as the programming problem:

\[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R:\forall i=1,…,N,y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1}\frac{|\mathbf w|^2}{2}\]

where the \(N\) constraints \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1\) (or vectorially \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1\)) are just saying that all \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\) in the training set should lie on the correct side of the decision boundary hyperplane \(\mathbf w\cdot\mathbf x+b=0\).

Problem: What is the glaring problem with a hard-margin SVM? How does a soft-margin SVM help to address this issue?

Solution: The basic problem is that the hard-margin SVM is way too sensitive to outliers, e.g. a single \(y=-1\) feature vector amidst the \(y=1\) neighborhood would cause the hard-margin SVM to be unable to find a solution (because indeed there would be no hyperplane that perfectly separates all the training data into their correct classes). Intuitively, it seems like in such a case the best way to deal with this bias-variance tradeoff is to shrug and ignore the outliers:

Mathematically, the usual way to encode this is by introducing \(N\) “slack variables” \(\xi_1,…,\xi_N\geq 0\) such that the earlier \(N\) constraints are relaxed to \(y_i(\mathbf w\cdot\mathbf x_i+b)\geq 1-\xi_i\). However, to avoid cutting too much slack (i.e. incurring too much misclassification on the training set), their use should be penalized by the \(\ell^1\)-norm \(|\boldsymbol{\xi}|_1:=\boldsymbol{\xi}\cdot\mathbf{1}=\sum_{i=1}^N\xi_i\). To this effect, the programming problem is modified to that of soft-margin SVM:

\[\text{argmin}_{\mathbf w\in\mathbf R^n,b\in\mathbf R,\boldsymbol{\xi}\in\mathbf R^N:\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi},\boldsymbol{\xi}\geq\mathbf 0}\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1\]

where \(C\geq 0\) is a “strictness” hyperparameter that must be selected ahead of time to balance hard vs. soft SVM.
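As a small illustration (using scikit-learn, with synthetic blob data standing in for a real training set), one can watch the margin \(2/|\mathbf w|\) shrink as the strictness hyperparameter \(C\) grows:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# synthetic 2D data with overlapping classes, so a hard margin would not even exist
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):        # small C = very soft, large C ~ nearly hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    street_width = 2.0 / np.linalg.norm(clf.coef_)   # width of the "street", 2/|w|
    print(f"C={C}: margin={street_width:.3f}, support vectors per class={clf.n_support_}")
```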

Problem: The above represents the primal formulation of the programming problem in terms of so-called primal variables \(\mathbf w,b,\boldsymbol{\xi}\). Show how the KKT conditions enable one to obtain a corresponding dual formulation of the programming problem in terms of a dual variable \(\boldsymbol{\lambda}\):

\[\text{argmax}_{\boldsymbol{\lambda}\in\mathbf R^N:\boldsymbol{\lambda}\cdot\mathbf y=0,\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1}|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)X^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

Solution: The idea is to combine the \(2\) constraints \(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})\geq\mathbf 1-\boldsymbol{\xi}\) and \(\boldsymbol{\xi}\geq\mathbf 0\) using \(2\) KKT multipliers \(\boldsymbol{\lambda},\boldsymbol{\mu}\) into the Lagrangian:

\[L(\mathbf w,b,\boldsymbol{\xi},\boldsymbol{\lambda},\boldsymbol{\mu})=\frac{|\mathbf w|^2}{2}+C|\boldsymbol{\xi}|_1-\boldsymbol{\lambda}\cdot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})-\boldsymbol{\mu}\cdot\boldsymbol{\xi}\]

The KKT necessary conditions assert:

\[\frac{\partial L}{\partial\mathbf w}=\mathbf 0\Rightarrow\mathbf w=X(\boldsymbol{\lambda}\odot\mathbf y)\]

\[\frac{\partial L}{\partial b}=0\Rightarrow\boldsymbol{\lambda}\cdot\mathbf y=0\]

\[\frac{\partial L}{\partial\boldsymbol{\xi}}=\mathbf 0\Rightarrow C\mathbf 1=\boldsymbol{\lambda}+\boldsymbol{\mu}\]

in addition to the \(2\) original constraints on the primal variables (primal feasibility), and dual feasibility \(\boldsymbol{\lambda},\boldsymbol{\mu}\geq\mathbf 0\) on both dual variables (as a corollary, \(\mathbf 0\leq\boldsymbol{\lambda}\leq C\mathbf 1\)), and complementary slackness \(\boldsymbol{\lambda}\odot(\mathbf y\odot(X^T\mathbf w+b\mathbf{1})-\mathbf 1+\boldsymbol{\xi})=\boldsymbol{\mu}\odot\boldsymbol{\xi}=\mathbf 0\). The idea of the support vectors is clearly seen in the \(1^{\text{st}}\) of these complementary slackness conditions.

By substituting the stationary conditions found back into the Lagrangian, one can eliminate all the primal variables (and even eliminate one of the dual variables \(\boldsymbol{\mu}\)) to obtain the claim:

\[L(\boldsymbol{\lambda})=|\boldsymbol{\lambda}|_1-\frac{1}{2}(\boldsymbol{\lambda}\odot\mathbf y)X^TX(\boldsymbol{\lambda}\odot\mathbf y)\]

which is a standard quadratic programming exercise with efficient solutions for \(\boldsymbol{\lambda}\).

(aside: the primal problem is a convex quadratic program whose constraints admit a strictly feasible point, so Slater’s condition holds and hence so does strong duality; this is what justifies interchanging the \(\min\) over the primal variables with the \(\max\) over the dual variables, so that solving the dual programming problem above recovers the same optimum as the primal.)

Problem: Although soft-margin SVM is a significant improvement over hard-margin SVM, it too has a glaring issue. What is that issue and how can it be addressed?

Solution: Whether hard-margin or soft-margin, SVMs are ultimately linear binary classifiers; they can only directly learn a hyperplane decision boundary. Therefore, if the data is simply linearly inseparable, then even soft-margin SVM would be unwise to apply directly.

However, there is a creative solution; just take the \(N\) feature vectors \(\mathbf x_1,…,\mathbf x_N\in\mathbf R^n\) and apply a nonlinear transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) from the current feature space \(\mathbf R^n\) to some higher-dimensional feature space \(\mathbf R^{>n}\) (essentially a kind of feature engineering), thereby obtaining \(N\) transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\). Ideally, if the nonlinear transformation \(\prime\) is well-chosen, then the transformed feature vectors \(\mathbf x’_1,…,\mathbf x’_N\) will become linearly separable in the higher-dimensional feature space \(\mathbf R^{>n}\), in which case the usual soft-margin SVM may be applied in \(\mathbf R^{>n}\). Then, applying the inverse nonlinear mapping \(\prime^{-1}:\mathbf R^{>n}\to\mathbf R^n\) back to the original feature space \(\mathbf R^n\), the SVM (soft) maximal-margin hyperplane in \(\mathbf R^{>n}\) would be transformed into some nonlinear decision boundary back in \(\mathbf R^n\), thereby effectively extending the applicability of the SVM method to linearly inseparable data! See this YouTube video for an example.

Problem: Continuing off the previous problem, suppose one is already in the transformed space with feature vectors \(\mathbf x’_1,…,\mathbf x’_N\in\mathbf R^{>n}\), and one has solved the dual programming problem in this space to obtain some optimal \(\boldsymbol{\lambda}\in\mathbf R^N\). How is binary classification thus performed?

Solution: Based on the previous analysis, the SVM binary classifier in this transformed feature space \(\mathbf R^{>n}\) may be written:

\[\hat y_{\text{SVM}}(\mathbf x’)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_i\mathbf x’\cdot\mathbf x’_i+y_S-\sum_{i=1}^N\lambda_iy_i\mathbf x’_S\cdot\mathbf x’_i\right)\]

where \(\mathbf x’_S\) is any support feature vector lying exactly on its gutter (i.e. one with \(0<\lambda_S<C\)). Note by complementary slackness that most of the \(\lambda_i\) vanish (namely those corresponding to non-support feature vectors \(\mathbf x_i\)), so the sums \(\sum_{i=1}^N\) may also be written as sums over only the support vectors.

Problem: Explain how the expression above can be simplified by means of the kernel trick.

Solution: Basically, the feature space transformation \(\prime:\mathbf R^n\to\mathbf R^{>n}\) is a complicated mapping of vectors to vectors. By contrast, only their scalar dot products \(\mathbf x’\cdot\mathbf x’_i\) appear in the SVM classifier \(\hat y_{\text{SVM}}(\mathbf x)\), and scalars are simpler than vectors. The kernel trick essentially invites one to forget about the details of the transformation \(\prime\) and just directly specify an analytical expression for the dot product in the transformed feature space \(K(\mathbf x,\tilde{\mathbf x}):=\mathbf x’\cdot\tilde{\mathbf x}’\), where here \(K:\mathbf R^n\times\mathbf R^n\to\mathbf R\) is called the kernel function. Effectively, this means one never has to actually visit the transformed feature space \(\mathbf R^{>n}\) since the kernel is perfectly happy to just take the feature vectors in the original space \(\mathbf R^n\). Thus, one can write:

\[\hat y_{\text{SVM}}(\mathbf x)=\text{sgn}(\mathbf w\cdot\mathbf x’+b)=\text{sgn}\left(\sum_{i=1}^N\lambda_iy_iK(\mathbf x,\mathbf x_i)+y_S-\sum_{i=1}^N\lambda_iy_iK(\mathbf x_S,\mathbf x_i)\right)\]
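As a sanity check of this formula (using scikit-learn, whose fitted SVC object stores the products \(\lambda_iy_i\) for the support vectors in its dual_coef_ attribute), one can rebuild the decision function by hand from the dual coefficients and the kernel:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # linearly inseparable toy data
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# decision function rebuilt by hand: sum_i (lambda_i y_i) K(x, x_i) + b
X_test = X[:5]
K = rbf_kernel(X_test, clf.support_vectors_, gamma=0.5)
manual = K @ clf.dual_coef_.ravel() + clf.intercept_

print(np.allclose(manual, clf.decision_function(X_test)))     # True
```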

Problem: What is the Gaussian (radial basis function) kernel \(K_{\sigma}(\mathbf x,\tilde{\mathbf x})\)? Sketch why it’s a valid kernel function.

Solution: This is by far the most popular kernel for nonlinear SVM classification, involving a hyperparameter \(\sigma\):

\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x-\tilde{\mathbf x}|^2/2\sigma^2}\]

algebraically, it is clearly equivalent to:

\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}e^{\mathbf x\cdot\tilde{\mathbf x}/\sigma^2}\]

Taylor expanding the last factor:

\[K_{\sigma}(\mathbf x,\tilde{\mathbf x})=e^{-|\mathbf x|^2/2\sigma^2}e^{-|\tilde{\mathbf x}|^2/2\sigma^2}\sum_{d=0}^{\infty}\frac{(\mathbf x\cdot\tilde{\mathbf x})^d}{\sigma^{2d}d!}\]

so it’s clearly a (countably-infinite-dimensional!) inner product because of the presence of the degree-\(d\) polynomial kernels \(K_d(\mathbf x,\tilde{\mathbf x})=(\mathbf x\cdot\tilde{\mathbf x})^d\).

Problem: What were the main problems with SVMs that spurred the deep learning renaissance via the development of artificial neural networks?

Solution: The main issue with SVMs is their lack of scalability. Roughly speaking, the computational cost of training an SVM on a dataset of \(N\) training examples scales like \(O(N^2)\) in the best case (essentially because the Gram matrix \(X^TX\) is an \(N\times N\) matrix of all pairwise interactions of features); even with modern tricks like sequential minimal optimization (SMO), this is still unavoidable. Another, smaller issue is that kernel SVMs do not learn their own feature representations: the kernel has to be hand-designed, whereas deep neural networks learn hierarchical features directly from the data.

Problem: Write a short Python program using the scikit-learn library to train a support vector machine on the Kaggle Titanic competition dataset.

Solution:

titanic
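A possible sketch of such a program is below (it assumes the Kaggle train.csv file with the usual Titanic columns is present locally; the particular feature subset and hyperparameters are arbitrary illustrative choices):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# assumes the Kaggle Titanic "train.csv" with its usual columns
df = pd.read_csv("train.csv")
X = df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]
y = df["Survived"]

numeric = ["Age", "SibSp", "Parch", "Fare"]
categorical = ["Pclass", "Sex", "Embarked"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("prep", preprocess),
                  ("svm", SVC(kernel="rbf", C=1.0, gamma="scale"))])

# 5-fold cross-validated accuracy of the RBF-kernel soft-margin SVM
print(cross_val_score(model, X, y, cv=5).mean())
```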

Numerical Computation

Problem: In numerical computation, what are the \(2\) main kinds of rounding error?

Solution: Overflow error (\(N\approx\infty\)) and, perhaps even more dangerous, underflow error (\(\varepsilon\approx 0\)); the two are in some sense inverses of each other:

\[0=\frac{1}{\infty}\]

Problem: Explain how rounding errors can affect evaluation of the softmax activation function and strategies to address it.

Solution: Given a vector \(\mathbf x\in\mathbf R^n\), the corresponding softmax probability \(\ell^1\)-unit vector is \(e^{\mathbf x}/|e^{\mathbf x}|_1\in S^{n-1}\). If one of the components of \(\mathbf x\) is very negative, then the corresponding component of \(e^{\mathbf x}\) can underflow, whereas if it is instead very positive, then overflow becomes a possibility.

To address these numerical stability issues, the trick is to notice that softmax is invariant under “diagonal” translations along \(\mathbf 1\in\mathbf R^n\), i.e. \(\mathbf x\mapsto\mathbf x-\lambda\mathbf 1\) leads to \(e^{\mathbf x-\lambda\mathbf 1}=e^{\mathbf x}\odot e^{-\lambda\mathbf 1}=e^{-\lambda}e^{\mathbf x}\) and hence the \(\ell^1\)-norm \(|e^{\mathbf x-\lambda\mathbf 1}|_1=e^{-\lambda}|e^{\mathbf x}|_1\) is also merely scaled by the same factor \(e^{-\lambda}\). Choosing \(\lambda:=\text{max}(\mathbf x)\) to be the maximum component of \(\mathbf x\) (which coincides with the \(\ell^{\infty}\)-norm if \(\mathbf x\geq 0\) is componentwise nonnegative) would eliminate any overflow error that may have been present in evaluating the numerator \(e^{\mathbf x}\) by instead evaluating \(e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}\). For the same reason, the shifted denominator \(|e^{\mathbf x-\text{max}(\mathbf x)\mathbf 1}|_1\) is also immune to overflow error. Furthermore, the denominator is also resistant to underflow error because at least one term in the series for \(|e^{\mathbf x-\lambda\mathbf 1}|_1=\sum_{i}e^{x_i-\lambda}\) will be \(e^0=1\gg 0\), namely term \(i=\text{argmax}(\mathbf x)\). The only possible problem is that the modified numerator could still experience an underflow error; this would be bad if subsequently one were to evaluate the information content (loss function) of a softmax outcome, in which case \(\log(0)=-\infty\) would then propagate to nans downstream. For this case, one can apply a similar trick (the log-sum-exp trick) to numerically stabilize the computation of this logarithm.
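A minimal NumPy sketch of both stabilized computations (the shifted softmax and the log-sum-exp trick for its logarithm):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shift by max(x) before exponentiating."""
    z = x - np.max(x)                    # softmax is invariant under this shift
    e = np.exp(z)                        # largest entry is exp(0) = 1, so no overflow
    return e / e.sum()

def log_softmax(x):
    """Stable log-softmax via the log-sum-exp trick (avoids log of an underflowed 0)."""
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([1000.0, 0.0, -1000.0])     # naive exp(x) would both overflow and underflow
print(softmax(x), log_softmax(x))
```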

Problem: Describe the numerical analysis phenomenon of catastrophic cancellation.

Solution: Subtraction \(f(x,y):=x-y\) is ill-conditioned when \(x\approx y\). This is because even if \(x\) and \(y\) have small relative errors, the relative error in their difference \(f(x,y)\) can still be substantial.
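A tiny illustration: computing \(1-\cos x\) for small \(x\) directly versus via the algebraically equivalent, cancellation-free form \(2\sin^2(x/2)\):

```python
import numpy as np

x = 1e-8
naive = 1.0 - np.cos(x)             # two nearly equal numbers subtracted: all digits cancel
stable = 2.0 * np.sin(x / 2.0)**2   # algebraically identical, but no cancellation
print(naive, stable)                # naive gives 0.0, stable gives ~5e-17
```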

Problem: What does the condition number of a matrix \(X\) measure?

Solution: The condition number measures how numerically stable it is to invert \(X\mapsto X^{-1}\) (or, equivalently, to solve the linear system \(X\mathbf v=\mathbf b\)). For a normal (e.g. symmetric) matrix it is given by the ratio of the largest to smallest eigenvalues of \(X\) by absolute magnitude (for a general matrix, one uses the singular values instead):

\[\frac{\text{max}|\text{spec}(X)|}{\text{min}|\text{spec}(X)|}\]

For instance, a non-invertible matrix would have \(\text{min}|\text{spec}(X)|=0\) and thus an infinite condition number, reflecting the fact that inverting it is a hopelessly ill-conditioned task.

Problem: Show that during gradient descent of a scalar field \(f(\mathbf x)\), if at some point \(\mathbf x\) in the optimization landscape, one has \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\) (e.g. such as if the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) is just generally positive-definite or the gradient \(\frac{\partial f}{\partial\mathbf x}\) is an eigenvector of the Hessian \(\frac{\partial^2 f}{\partial\mathbf x^2}\) with positive eigenvalue), then by a line search logic, the most effective local learning rate \(\alpha(\mathbf x)>0\) is given by:

\[\alpha(\mathbf x)=\frac{\left|\frac{\partial f}{\partial\mathbf x}\right|^2}{\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}}\]

Solution: Since gradient descent takes a step \(\mathbf x\mapsto\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\), and the goal is to decrease the value of the function \(f\) as much as possible to get closer to the minimum, one would like to maximize:

\[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\]

as a function of the learning rate \(\alpha>0\). Assuming that a quadratic approximation is valid:

\[f(\mathbf x)-f\left(\mathbf x-\alpha\frac{\partial f}{\partial\mathbf x}\right)\approx \alpha\left|\frac{\partial f}{\partial\mathbf x}\right|^2-\frac{\alpha^2}{2}\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}\]

which is a downward-opening parabola in \(\alpha\) precisely because \(\frac{\partial f}{\partial\mathbf x}\cdot\frac{\partial^2 f}{\partial\mathbf x^2}\frac{\partial f}{\partial\mathbf x}>0\) by assumption; setting its derivative with respect to \(\alpha\) to zero then yields the claimed optimal learning rate \(\alpha(\mathbf x)\).

Problem: Explain why gradient descent behaves like this in this case. What can be done to address that?

Solution: The problem is that the Hessian is ill-conditioned (in this case the condition number is \(5\) everywhere); the curvature in the \((1,1)\) direction is \(5\times\) greater than the curvature in the \((1,-1)\) direction. Gradient descent wastes time repeatedly descending “canyon walls” because it sees those as the steepest features. A \(2^{\text{nd}}\)-order optimization algorithm (e.g. Newton’s method) that uses information from the Hessian would do better.

Problem: Explain how the Karush-Kuhn-Tucker (KKT) conditions generalize the method of Lagrange multipliers.

Solution: The idea is, in addition to equality constraints like \(x+y=1\), one can also allow for inequality constraints like \(x^2+y^2\leq 1\) (indeed, an equality \(a=b\) is just \(2\) inequalities \(a\leq b\) and \(a\geq b\)). Then, constrained optimization of a scalar field \(f(\mathbf x)\) subject to a bunch of equality constraints \(g_1(\mathbf x)=g_2(\mathbf x)=…=0\) and a bunch of inequality constraints \(h_1(\mathbf x),h_2(\mathbf x),…\leq 0\) is equivalent to unconstrained optimization of the Lagrangian:

\[L(\mathbf x,\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h})=f(\mathbf x)+\boldsymbol{\lambda}_{\mathbf g}\cdot\mathbf g(\mathbf x)+\boldsymbol{\lambda}_{\mathbf h}\cdot\mathbf h(\mathbf x)\]

where \(\boldsymbol{\lambda}_{\mathbf g},\boldsymbol{\lambda}_{\mathbf h}\) are called KKT multipliers. The KKT necessary (but not sufficient!) conditions for optimality are then (ignoring technicalities about equating objects of possibly different dimensions):

  1. \(\frac{\partial L}{\partial\mathbf x}=\mathbf 0\) (stationarity)
  2. \(\mathbf g(\mathbf x)=\mathbf 0\geq \mathbf h(\mathbf x)\) (primal feasibility)
  3. \(\boldsymbol{\lambda}_{\mathbf h}\geq 0\) (if the inequalities were instead written as \(\mathbf h(\mathbf x)\geq\mathbf 0\) then this would instead require \(\boldsymbol{\lambda}_{\mathbf h}\leq 0\)) (dual feasibility)
  4. \(\boldsymbol{\lambda}_{\mathbf h}\odot\mathbf h(\mathbf x)=\mathbf 0\) (complementary slackness)

Information Theory & Inference

Problem: Draw a schematic of a binary symmetric channel (BSC) with bit flip probability \(p_f\).

Solution: Classically, the schematic has the two input bits \(0,1\) on the left and the two output bits \(0,1\) on the right, with each input connected to the matching output by an arrow carrying probability \(1-p_f\) and to the opposite output by a crossing arrow carrying probability \(p_f\).

On the other hand, taking a more quantum perspective, in the computational basis \((|0\rangle,|1\rangle)\) one might define a binary symmetric channel as a kind of quantum logic gate:

\[\text{BSC}_{p_f}=\begin{pmatrix}\sqrt{1-p_f}&\sqrt{p_f}\\\sqrt{p_f}&\sqrt{1-p_f}\end{pmatrix}\]

The word “symmetric” in the name emphasizes that the bit flip probability from \(0\to 1\) is identical to the bit flip probability \(1\to 0\).

Problem: Consider the radio communication link between Galileo and Earth, and suppose that the noisy communication channel between them may be modelled as a BSC with bit flip probability \(p_f=0.01\). In a sequence of \(N=10^5\) bit transmissions, what is the probability that the error rate is less than \(1\%\)?

Solution: This means that one requires \(N’<0.01N=10^3\) bit flips. Each such \(N’\) has binomial probability \({{N}\choose{N’}}p_f^{N’}(1-p_f)^{N-N’}\) so the total probability is given by the cumulative distribution function:

\[\sum_{N’=0}^{N’=999}{{N}\choose{N’}}p_f^{N’}(1-p_f)^{N-N’}\]
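Numerically (a one-liner with scipy, for the stated \(N=10^5\) and \(p_f=0.01\)):

```python
from scipy.stats import binom

N, p_f = 10**5, 0.01
print(binom.cdf(999, N, p_f))   # P(N' <= 999): just under 1/2, since the mean is exactly 1000
```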

Problem: What are the \(2\) solutions for addressing noise in communication channels? Which one is preferable?

Solution:

  1. The physical solution: build more reliable hardware, cooling circuits, etc. (basically all the stuff one learns in a typical experimental physics class).
  2. The system solution: accept the noisy communication channel as is, and add communication systems to it so as to detect and correct errors. This is the goal of coding theory.

The system solution is preferable because it only comes at an increased computational cost whereas the former comes at an increased financial cost. But more importantly, the improvements arising from the system solution can be very dramatic, unlike the incremental improvements of the physical solution.

Problem: What is the distinction between information theory and coding theory?

Solution: The difference between information theory and coding theory is analogous to the distinction between theoretical physics and experimental physics; the former is concerned with the theoretical limitations of the aforementioned communication systems (e.g. “what is the best error-correcting performance possible over the space of all possible error-correcting algorithms?”) whereas the latter is interested in specifically designing practical such algorithms (also called codes) for error correction.

Problem: Draw a schematic that depicts the high-level structure of an error-correcting code.

Solution: Coding = Encoding + Decoding:

Here \(\mathbf s\) is the source message vector which is encoded as a transmitted message vector \(\mathbf t=\mathbf s+\textbf{redundancy}\). After transmission across the noisy communication channel, the transmitted message vector has been distorted into \(\tilde{\mathbf t}=\mathbf t+\textbf{noise}\). This is then decoded and received as a message vector \(\tilde{\mathbf s}\) with the hope that \(\tilde{\mathbf s}=\mathbf s\) with probability close to \(1\).

Problem: Consider a first naive attempt at designing an error-correcting code for a BSC, known as a repetition code. Explain how that works with an example.

Solution: A choice of error-correcting code basically amounts to a choice of encoder + decoder, since again coding = encoding + decoding. For the repetition code, the encoder is literally a repetition of each bit a predefined number of times, say \(3\) in the example below. The corresponding optimal decoder turns out to be the obvious “majority-vote” among the triplets (assuming \(p_f<1/2\); if it were the case that \(p_f>1/2\) then instead a “minority-vote” decoder would be better!)

(aside: “optimal” means the decoder should map each \(\tilde{\mathbf t}\) to \(\tilde{\mathbf s}=\text{argmax}_{\mathbf s}p(\mathbf s|\tilde{\mathbf t})\) and in particular as part of computing this conditional probability, another implicit assumption is that both \(p(0)=p(1)=1/2\) have equal prior probabilities)

Here \(\mathbf n\) is a sparse noise vector, each of whose bits is sampled from a \(p_f\)-Bernoulli distribution (where \(0\) means no bit flip and \(1\) means yes bit flip), and \(\tilde{\mathbf t}\equiv\mathbf t+\mathbf n\pmod{2}\), or using the xor notation, \(\tilde{\mathbf t}=\mathbf t\oplus\mathbf n\).

Problem: Show that, compared to no repetition for which the bit flip probability across the BSC is \(p_f<1/2\), the \(3\)-fold repetition code has an effective bit flip probability \(p^{\text{eff}}_f<p_f\) which is reduced compared to the original bit flip probability \(p_f\).

Solution: In order to get a bit flip, either \(2\) or all \(3\) bits in the \(3\)-fold repetition encoded vector \(\mathbf t\) have to be flipped. Thus, the effective bit flip probability \(p^{\text{eff}}_f\) is a cubic function of \(p_f\):

\[p^{\text{eff}}_f=3p_f^2(1-p_f)+p_f^3=3p_f^2-2p_f^3\]

In order to have \(p^{\text{eff}}_f<p_f\), this requires:

\[3p_f-2p_f^2<1\Rightarrow 2(p_f-1/2)(p_f-1)>0\]

which is satisfied for \(p_f<1/2\).

Problem: Instead of \(N=3\) repetitions, generalize the above expression for the effective bit flip probability \(p^{\text{eff}}_f\) to an arbitrary odd number \(N\) of repetitions.

Solution: The effective bit flip probability \(p^{\text{eff}}_f\) is now a degree-\(N\) polynomial in \(p_f\):

\[p^{\text{eff}}_f=\sum_{n=(N+1)/2}^N{N\choose n}p_f^n(1-p_f)^{N-n}\]

which for \(p_f\ll 0.5\) is dominated by the first term \(n=(N+1)/2\).
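A quick scipy sketch of this formula (the value \(p_f=0.1\) is an arbitrary illustrative choice):

```python
from scipy.stats import binom

def p_eff(p_f, N):
    """Effective bit flip probability of the N-fold repetition code (N odd):
    a majority of the N copies must be flipped for the decoder to err."""
    return sum(binom.pmf(n, N, p_f) for n in range((N + 1) // 2, N + 1))

p_f = 0.1
for N in (1, 3, 5, 7, 9):
    print(N, p_eff(p_f, N))   # N = 3 reproduces 3*p^2 - 2*p^3 = 0.028; the price is rate 1/N
```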

Problem: Describe how the \((7,4)\) Hamming error correction code works, and explain why it’s an example of a linear block code.

Solution: The idea is that if the source vector \(\mathbf s\) has length \(L_{\mathbf s}\) and the encoded transmitted vector \(\mathbf t\) has some larger length \(L_{\mathbf t}>L_{\mathbf s}\), then one has the “rate” \(L_{\mathbf s}/L_{\mathbf t}=4/7\). Specifically, for every \(4\) source bits \(\mathbf s=s_1s_2s_3s_4\), the transmitted vector will contain \(7\) bits \(\mathbf t=t_1t_2t_3t_4t_5t_6t_7\) such that \(t_{1,2,3,4}:=s_{1,2,3,4}\) and \(t_{5,6,7}\) are uniquely specified by the requirements that \(t_1+t_2+t_3+t_5\equiv t_2+t_3+t_4+t_6\equiv t_1+t_3+t_4+t_7\equiv 0\pmod 2\); in other words, \(t_{5,6,7}\) are said to be parity-check bits in the sense that they check the parity of a certain subset of the source bits \(s_{1,2,3,4}\). This encoding can be depicted visually as such:

where the parity of each of the \(3\) circles has to be even. For example, if \(\mathbf s=1011\), then the Hamming encoding of this would be \(\mathbf t=1011001\).

As for the \((7,4)\) Hamming decoder, for generic \(p_f<1/2\) it can be a bit tricky but if one assumes \(p_f\ll 1/2\), then it is pretty safe to assume that there will be at most \(1\) bit flip during transmission across the noisy binary symmetric communication channel \(\mathbf t\mapsto\tilde{\mathbf t}\). In that case, one can take a given \(\tilde{\mathbf t}\) and compute its syndrome vector \(\mathbf z\); i.e. in each of the \(3\) circles, has the parity remained even \((0)\) or has it become odd \((1)\)? For instance, getting a syndrome \(\mathbf z=000\) means that most likely no bits were flipped and thus there are no errors! Or, if \(\mathbf z=011\), then under the premise that there was only a single bit flip, it must have been \(s_4\), etc. In general, \(\tilde{\mathbf t}\) can have \(8\) possible syndromes \(\mathbf z\), and each such syndrome \(\mathbf z\) maps uniquely onto the \(7\) possible single bit flips together with the possibility of zero bit flips.

The Hamming code is a block code because the parity-check bits look at whole blocks (or subsets) of the source bitstring \(\mathbf s\), rather than one bit of \(\mathbf s\) at a time (as was essentially done in repetition coding). The Hamming code is also said to be a linear code (mod \(2\)) because the encoding operation is linear in the sense that (assuming \(\mathbf s,\mathbf t\) are both column vectors):

\[\mathbf t=G^T\mathbf s\]

where the \(7\times 4\) generator matrix is given by:

\[G^T=\begin{pmatrix}1&0&0&0\\0&1&0&0\\0&0&1&0\\0&0&0&1\\1&1&1&0\\0&1&1&1\\1&0&1&1\end{pmatrix}\]

Or transposing both sides, \(\mathbf t^T=\mathbf s^TG\), and in some coding theory texts one often takes \(\mathbf s,\mathbf t\) to be row vectors by default in which case this would just be written \(\mathbf t=\mathbf sG\). Of course, the columns of \(G^T\) represent the encodings \(\mathbf t\) of the “basis” bitstrings \(\mathbf s=1000,0100,0010,0001\).
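A minimal NumPy sketch of the encoder \(\mathbf t=G^T\mathbf s\pmod 2\), reproducing the worked example \(\mathbf s=1011\mapsto\mathbf t=1011001\):

```python
import numpy as np

# generator matrix G^T of the (7,4) Hamming code, exactly as written above
G_T = np.array([[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [1, 0, 1, 1]])

def encode(s):
    """t = G^T s (mod 2) for a length-4 source block s."""
    return (G_T @ np.asarray(s)) % 2

print(encode([1, 0, 1, 1]))   # -> [1 0 1 1 0 0 1], matching the worked example above
```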

Problem: The \((7,4)\) Hamming decoder \(\tilde{\mathbf t}\to\mathbf z\to\tilde{\mathbf s}\) uses a syndrome vector \(\mathbf z\) as an intermediate in the decoding process. Show that the map from \(\tilde{\mathbf t}\to\mathbf z\) is linear in the sense that:

\[\mathbf z=H\tilde{\mathbf t}\]

where the parity-check matrix \(H:=(P,1_3)\), and \(P=\begin{pmatrix}1&1&1&0\\0&1&1&1\\1&0&1&1\end{pmatrix}\) also appears in the generator matrix via:

\[G^T=\begin{pmatrix}1_4\\P\end{pmatrix}\]

Show that \(HG^T=0\) and hence if there were no errors such that \(\tilde{\mathbf t}=\mathbf t\), then the corresponding syndrome vector \(\mathbf z=\mathbf 0\Leftrightarrow\mathbf t\in\text{ker}H\). Hence, explain why a maximum-likelihood decoder is equivalent to finding the most probable noise vector \(\mathbf{n}\) for which \(\mathbf z=H\mathbf n\).

Solution: Writing out the three parity checks, \(z_1=\tilde t_1+\tilde t_2+\tilde t_3+\tilde t_5\), \(z_2=\tilde t_2+\tilde t_3+\tilde t_4+\tilde t_6\) and \(z_3=\tilde t_1+\tilde t_3+\tilde t_4+\tilde t_7\pmod 2\), which is precisely \(\mathbf z=H\tilde{\mathbf t}\) with \(H=(P,1_3)\). And clearly \(HG^T=2P=0\) because \(2\equiv 0\pmod 2\). It follows then that \(\mathbf t\in\text{ker}H\) because \(\mathbf t=G^T\mathbf s\). Thus, because \(\tilde{\mathbf t}=\mathbf t+\mathbf n\), acting on both sides with \(H\) results in \(\mathbf z=H\tilde{\mathbf t}=H\mathbf n\), so the observed syndrome depends only on the noise, and the maximum-likelihood decoder just has to find the most probable noise vector \(\mathbf n\) consistent with \(\mathbf z=H\mathbf n\).
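A minimal NumPy sketch of the resulting syndrome decoder (building the \(8\)-entry lookup table from syndrome to single bit flip, under the usual at-most-one-flip assumption):

```python
import numpy as np

P = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]])
H = np.hstack([P, np.eye(3, dtype=int)])      # parity-check matrix H = (P, 1_3)

# lookup table: each of the 7 nonzero syndromes points at a unique single-bit flip
syndrome_to_flip = {}
for i in range(7):
    n = np.zeros(7, dtype=int)
    n[i] = 1
    syndrome_to_flip[tuple((H @ n) % 2)] = i

def decode(t_received):
    """Syndrome decoding under the assumption of at most one bit flip."""
    t = np.array(t_received)
    z = tuple((H @ t) % 2)
    if z != (0, 0, 0):
        t[syndrome_to_flip[z]] ^= 1           # unflip the single most probable culprit
    return t[:4]                              # the first 4 bits are the source bits

t_noisy = np.array([1, 0, 0, 1, 0, 0, 1])     # encoding of s = 1011 with its 3rd bit flipped
print(decode(t_noisy))                        # -> [1 0 1 1], the original source block
```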

Problem: As the \((7,4)\) Hamming code is a block code, given a possibly very long source message, the only way to implement the code is to chop up the message into blocks of length \(L_{\mathbf s}=4\); naturally, one can ask, for any given block in this “assembly line”, what’s the probability \(p(\tilde{\mathbf s}\neq\mathbf s)\) of it being decoded incorrectly? By contrast, a different question one can ask is what is the effective bit flip probability \(p_f^{\text{eff}}\) within one of these blocks? (give both to leading order in \(p_f\))

Solution: First, notice that without the Hamming code, any odd number of bit flips as well as certain combinations of an even number of bit flips would lead very easily to a decoding error \(\tilde{\mathbf s}\neq\mathbf s\); as \(p_f\ll 1/2\), the dominant process is having a single bit flip, so in this case \(p(\tilde{\mathbf s}\neq\mathbf s)={4\choose{1}}p_f(1-p_f)^3+…=4p_f+O(p_f^2)\).

By contrast, with the \((7,4)\) Hamming code, the block error probability is no longer linear in \(p_f\), but quadratic. This is simply because the \((7,4)\) Hamming code is designed to correct all \(1\)-bit flip errors, thereby eradicating the dominant process above. Unfortunately, any \(2\) bit flips (implicitly on distinct bits) in the \(7\)-bit encoding \(\mathbf t\) will not be corrected, leading to block error \(\tilde{\mathbf s}\neq\mathbf s\) (identical remarks apply to \(3,4,5,6,7\) bit flips); after all, no matter what syndrome vector \(\mathbf z\) one finds in \(\tilde{\mathbf t}\), the Hamming decoder tells one to unflip at most \(1\) bit because it operates under the assumption that there was only \(1\) bit flip, so it is literally impossible to fully correct a \(\tilde{\mathbf t}\) containing more than \(1\) bit flip. Thus:

\[p(\tilde{\mathbf s}\neq\mathbf s)=\sum_{n=2}^7{7\choose{n}}p_f^n(1-p_f)^{7-n}=21p_f^2+O(p_f^3)\]

By contrast, the effective bit flip probability is in general:

\[p_f^{\text{eff}}=\frac{1}{L_{\mathbf s}}\sum_{i=1}^{L_{\mathbf s}}p(\tilde s_i\neq s_i)\]

But here all \(L_{\mathbf s}=4\) source bits are indistinguishable (indeed all \(L_{\mathbf t}=7\) transmitted bits are indistinguishable), so one can focus arbitrarily on e.g. \(i=1\) and ask what \(p_f^{\text{eff}}=p(\tilde s_1\neq s_1)\) is. There are then \(2\) distinct ways to obtain the same answer (to leading order); the first is to explicitly enumerate all the possible \(2\)-bit-flip errors in \(\tilde{\mathbf t}\) for which the final decoded source vector \(\tilde{\mathbf s}\) has an error in bit \(1\), i.e. \(\tilde s_1\neq s_1\). There are \(6\) such combinations in which \(s_1\) itself is flipped together with one of the other \(6\) bits (the decoder will then always flip yet another \(3^{\text{rd}}\) bit, so \(s_1\) stays in error). Meanwhile, one can check there are also \(3\) other configurations in which \(s_1\) is not initially flipped, but rather the decoder incorrectly flips it when going from \(\tilde{\mathbf t}\to\tilde{\mathbf s}\):

Since each of these is still a \(2\)-bit-flip process, one thus has:

\[p_f^{\text{eff}}=9p_f^2(1-p_f)^5+…=9p_f^2+O(p_f^3)\]

To leading order, one thus observes that \(p_f^{\text{eff}}=\frac{3}{7}p(\tilde{\mathbf s}\neq\mathbf s)\) (this is the second route to the same answer alluded to above); the \(3/7\) prefactor has a simple interpretation: whenever there were \(2\) bit flips in \(\tilde{\mathbf t}\), the decoder inevitably leaves \(3\) bits in error in the corrected \(7\)-bit block (prior to truncating away the \(3\) parity-check bits), so \(3\) out of the \(7\) bits are in error, and all \(7\) bits are indistinguishable. This thus has a very Bayesian interpretation where \(p(\tilde{\mathbf s}\neq\mathbf s)\) is the prior and \(3/7\) is the conditional probability of a bit like \(s_1\) being flipped given that the whole \(7\)-bit block has suffered a \(3\)-bit error.
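
These leading-order rates are easy to sanity-check numerically. Below is a minimal simulation sketch (in Python/numpy; the matrices follow the \(G^T=(1_4;P)\), \(H=(P,1_3)\) conventions above, while the specific \(p_f\), trial count, and variable names are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generator and parity-check matrices of the (7,4) Hamming code (mod 2), following the
# conventions G^T = (1_4; P) and H = (P, 1_3) used above.
P = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]])
GT = np.vstack([np.eye(4, dtype=int), P])        # 7x4 generator (transposed)
H = np.hstack([P, np.eye(3, dtype=int)])         # 3x7 parity-check
assert np.all(H @ GT % 2 == 0)                   # HG^T = 0 (mod 2)

# Each nonzero syndrome is a column of H and points at the single bit most likely flipped.
syndrome_to_bit = {tuple(H[:, k]): k for k in range(7)}

def decode(t_noisy):
    """Syndrome decoding: unflip at most one bit, then keep the first 4 (source) bits."""
    z = tuple(H @ t_noisy % 2)
    t_hat = t_noisy.copy()
    if any(z):                                   # z = 000 means "assume no errors"
        t_hat[syndrome_to_bit[z]] ^= 1
    return t_hat[:4]

p_f, trials = 0.02, 100_000
block_errors = bit_errors = 0
for _ in range(trials):
    s = rng.integers(0, 2, size=4)
    t = GT @ s % 2
    n = (rng.random(7) < p_f).astype(int)        # binary symmetric channel noise
    s_hat = decode((t + n) % 2)
    block_errors += int(np.any(s_hat != s))
    bit_errors += int(np.count_nonzero(s_hat != s))

print(block_errors / trials, 21 * p_f**2)        # block error rate vs leading-order 21 p_f^2
print(bit_errors / (4 * trials), 9 * p_f**2)     # bit error rate vs leading-order 9 p_f^2
```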

Problem: How many distinct elements are there in \(\ker H\)? What is the interpretation of such an element?

Solution: Since \(\text{rank}\,H=3\), the kernel has dimension \(7-3=4\), so there are \(2^4=16\) elements in \(\ker H\) (these are precisely the \(16\) codewords \(\mathbf t=G^T\mathbf s\)), of which \(15\) are nonzero. Each such nonzero \(\mathbf n\in\ker H\) has the property that the corresponding syndrome is \(\mathbf z=\mathbf 0\) if \(\mathbf n\) were interpreted as a pattern of bit flips/noise that preserves all \(3\) parities, i.e. such noise is invisible to (and hence uncorrectable by) the decoder. For example:

Problem: The \((7,4)\) Hamming code is not the only one; what other possible \((L_{\mathbf t},L_{\mathbf s})\) Hamming codes are possible?

Solution: Any for which \(L_{\mathbf t}-L_{\mathbf s}=\log_2(L_{\mathbf t}+1)\), the RHS being the number of parity-check bits which grows merely logarithmically with the total transmission size \(L_{\mathbf t}\)!

Problem: What are the \(2\) main interpretations of probability?

Solution: There are the frequentist and Bayesian interpretations of probability; the former interprets probabilities as long-run event frequencies while the latter interprets probabilities as degrees of belief (this is possible provided the concept of “degree of belief” satisfies certain reasonable Cox axioms). In general, probability theory is simply an extension of logic to account for uncertainty.

Problem: Whenever dealing with multiple random variables \(X,Y,…\), what is the fundamental object to look at? What can be derived from this fundamental object? (cf. the partition function in statistical mechanics as serving the same sort of “all-encompassing” role).

Solution: The fundamental object which one should prioritize figuring out is the joint probability distribution function \(p_{(X,Y,…)}(x,y,…)\) of all the random variables \((X,Y,…)\). Consider a simple visualization of this concept for just \(2\) real-valued random variables \(X, Y\) and the marginal as well as conditional probability distributions one can obtain from their joint probability distribution \(p_{(X,Y)}(x,y)\equiv p(x,y)\):

Problem: A certain disease is present among \(1\%\) of the world’s population, and a medical test for the disease has a \(95\%\) true positive rate and a \(90\%\) true negative rate. If Joe tests positive for the disease, what are his odds of actually having it?

Solution: For these sorts of problems, there is essentially a \(3\)-step recipe to solving them:

  1. Map out the complete marginal distribution of the prior.
  2. For each value of the prior, “unshovel” your way back to the joint distribution w.r.t. all values of the new data.
  3. Hence, compute whatever you need to.

In this case, the prior \(X\) may be modelled as a Bernoulli random variable taking on the value \(x=0\) if a person doesn’t have the disease and \(x=1\) if they do. Following the steps:
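
Though the remaining arithmetic is simple, here is a minimal numerical rendering of the \(3\) steps (the variable names are just illustrative; the \(1\%/95\%/90\%\) figures are those stated in the problem):

```python
# A minimal numerical version of the 3-step recipe above for the disease-test problem.
p_disease = 0.01                        # step 1: marginal (prior) over x = 1 (disease) vs x = 0
true_positive_rate = 0.95               # p(test + | x = 1)
true_negative_rate = 0.90               # p(test - | x = 0)

# step 2: "unshovel" back to the joint distribution over (x, positive test)
p_joint_pos = {
    1: p_disease * true_positive_rate,             # has disease AND tests positive
    0: (1 - p_disease) * (1 - true_negative_rate)  # no disease AND (false) positive
}

# step 3: condition on the observed positive test
p_pos = sum(p_joint_pos.values())
p_disease_given_pos = p_joint_pos[1] / p_pos
print(f"p(disease | +) = {p_disease_given_pos:.3f}")          # ~0.088
print(f"odds = {p_joint_pos[1]:.4f} : {p_joint_pos[0]:.4f}")  # roughly 1 : 10
```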

Problem: Suppose that on a certain online store, products can be rated either positively or negatively. There are \(3\) different sellers of the same product. Seller \(A\) has \(10\) reviews with \(100\%\) positive ratings, seller \(B\) has \(50\) reviews with \(96\%\) positive ratings, and seller \(C\) has \(200\) reviews with \(93\%\) positive ratings. Which seller should one buy from?

Solution: This is another Bayesian inference problem (draw a picture with \(p_+\) on the horizontal axis and \(N_+\) on the vertical axis!). The model is that the seller has some underlying fixed, but unknown probability \(p_+\) of delivering a positive experience and \(1-p_+\) of delivering a negative experience to a user. A priori, if one had no knowledge about the review information of other customers, then the prior \(p_+\) can be modelled by a uniform distribution \(p(p_+)=1\) for \(p_+\in[0,1]\). But then, equipped with the review info of \(N_+\) positive reviews out of \(N\), Bayesian updating says that this uniform prior should be updated to a beta distributed posterior:

\[p(p_+|N_+,N)=\frac{p(p_+)p(N_+,N|p_+)}{\int_0^1dp_+p(p_+)p(N_+,N|p_+)}=\frac{{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}}{\int_0^1 dp_+{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}}=\frac{(N+1)!p_+^{N_+}(1-p_+)^{N-N_+}}{N_+!(N-N_+)!}\]

where the beta function \(\textrm{B}(z,w):=\int_0^1 dx x^{z-1}(1-x)^{w-1}=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\) normalization has been (optionally) used.

A reasonable goal would be to maximize one’s own probability of a positive experience. This means taking the posterior \(p(p_+|N_+,N)\) computed above and recycling that as one’s prior \(p(p_+|N_+,N)\mapsto p(p_+)\). Then, the probability \(p(+)\) of a positive experience coincides with the expected value of \(p_+\):

\[p(+)=\int_0^1 dp_+p(p_+)p_+=\frac{\textrm{B}(N_++2,N-N_++1)}{\textrm{B}(N_++1,N-N_++1)}=\frac{N_++1}{N+2}\]

this result is called Laplace’s rule of succession, and may be remembered by taking the original \(N\) reviews, and appending \(2\) new fictitious reviews in which one is positive and one is negative. For instance, if a product had no reviews so that \(N=N_+=0\), then reasonably enough Laplace’s rule of succession predicts one to have a \(p(+)=1/2\) probability of a positive experience. Applying it to each of the sellers in the question, one finds that seller \(B\) has the highest probability \(p(+)\), so one should buy from them.
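
As a quick check of this (a trivial snippet; the \((N_+,N)\) pairs are read off from the problem statement):

```python
# Laplace's rule of succession p(+) = (N_+ + 1)/(N + 2) applied to the three sellers.
sellers = {"A": (10, 10), "B": (48, 50), "C": (186, 200)}   # (N_+, N)
for name, (n_pos, n) in sellers.items():
    print(name, (n_pos + 1) / (n + 2))
# A: 0.917, B: 0.942, C: 0.926  ->  seller B maximizes p(+)
```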

Remember that Bayesian inference is always predicated on assumptions, which above meant modelling one’s complete initial ignorance about \(p_+\) by the uniform prior \(p(p_+)=1\), though it has been suggested to use a different prior like \(p(p_+)=\frac{1}{\pi\sqrt{p_+(1-p_+)}}\) (Jeffreys) or \(p(p_+)\sim\frac{1}{p_+(1-p_+)}\) (Haldane) to model one’s ignorance. Of course, Laplace’s rule of succession would look different with each of these priors.

Problem: (for fun!) Prove the earlier beta function relation \(\textrm{B}(z,w):=\int_0^1 dx x^{z-1}(1-x)^{w-1}=\frac{\Gamma(z)\Gamma(w)}{\Gamma(z+w)}\). Show that in the special case of positive integers \(z,w\in\mathbf Z^+\), the result may be derived combinatorially.

Solution:

(loosely speaking, this is reminiscent of the identity for the exponential function \(\exp(z)\exp(w)=\exp(z+w)\) except for the \(\Gamma\) function there is an “extra factor” in the form of the beta function \(B(z,w)\)). In the special case where \(z,w\in\mathbf Z^+\), the initial temptation would be to expand \((1-x)^{w-1}\) using the binomial theorem, but it turns out there is a much cuter way to evaluate the integral by telling a story.

Imagine seeing a random distribution of \(N+1\) balls on the unit interval \([0,1]\), of which \(1\) of them is pink while the other \(N\) are black. There are \(2\) ways they could’ve gotten there. On the one hand, it may be that there were initially \(N+1\) black balls, and then one of them was painted pink at random, and then the balls were thrown altogether onto \([0,1]\). Or it may be that the \(N+1\) black balls were first thrown altogether onto \([0,1]\), and subsequently one of them was painted pink at random. Clearly, these \(2\) processes are equivalent definitions of a discrete random variable \(N_+\) defined as the number of black balls to the left of the pink ball on \([0,1]\). This has probability mass function \(p(N_+)\) in the first story given by:

\[p(N_+)=\int_0^1dp_+p(\text{pink thrown at }p_+)p(N_+|\text{pink thrown at }p_+)\]

\[=\int_0^1 dp_+{N\choose{N_+}}p_+^{N_+}(1-p_+)^{N-N_+}\]

On the other hand, according to the second story, \(N_+=0,1,…,N\) each with uniform probability!

\[p(N_+)=\frac{1}{N+1}\]

Hence it is established that:

\[\int_0^1 dp_+p_+^{N_+}(1-p_+)^{N-N_+}=\frac{N_+!(N-N_+)!}{(N+1)!}\]

which is the \(z=N_++1,w=N-N_++1\in\mathbf Z^+\) special case of the earlier more general result for \(z,w\in\mathbf C\).
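
For the skeptical, here is a quick numerical spot-check of the beta function identity and of its integer special case (using scipy purely as a sanity check; the particular \((z,w)\) and \((N,N_+)\) values are arbitrary):

```python
from math import factorial

from scipy.special import beta, gamma
from scipy.integrate import quad

# Spot-check B(z, w) = Gamma(z)Gamma(w)/Gamma(z+w) for a few (real) z, w...
for z, w in [(2.5, 3.7), (1.0, 5.0), (4.0, 4.0)]:
    integral, _ = quad(lambda x: x**(z - 1) * (1 - x)**(w - 1), 0, 1)
    print(integral, beta(z, w), gamma(z) * gamma(w) / gamma(z + w))

# ...and the integer special case against the ball-counting result N_+!(N-N_+)!/(N+1)!.
N, N_plus = 7, 3
integral, _ = quad(lambda p: p**N_plus * (1 - p)**(N - N_plus), 0, 1)
print(integral, factorial(N_plus) * factorial(N - N_plus) / factorial(N + 1))
```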

Problem: Given a discrete random variable \(X\) with probability mass function \(p(x)\), define the information content \(I(x)\) associated to drawing outcome \(x\) from \(X\). Hence, define the entropy \(S_X\) of \(X\).

Solution: A picture is worth a thousand words:

where \(I(x):=-\log p(x)\) is the surprise associated with drawing \(x\) from \(X\), and \(S_X:=\sum_xp(x)I(x)=-\sum_x p(x)\log p(x)\) (which, in base \(2\), can also be written \(\sum_x I(x)2^{-I(x)}\)) is the average surprise of the discrete random variable \(X\). The choice of base in the \(\log\) defines the unit of information (e.g. bits, nats, etc.).

Problem: Show that for \(X\perp Y\) independent discrete random variables, their joint entropy \(S_{(X,Y)}=S_X+S_Y\).

Solution: This is really rooted in Shannon’s fundamental axioms that the joint information content for independent random variables should be additive \(I_{(X,Y)}(x,y)=I_X(x)+I_Y(y)\):

\[S_{(X,Y)}:=-\sum_{(x,y)}p(x,y)\log p(x,y)=-\sum_{(x,y)}p(x)p(y)\log p(x)-\sum_{(x,y)}p(x)p(y)\log p(y)=S_X+S_Y\]

Problem: Give the intuition behind the Kullback-Leibler divergence \(D_{\text{KL}}(\tilde X|X)\) of a random variable \(\tilde X\) with respect to another random variable \(X\) (both defined over the same sample space).

Solution: Any random variable \(X\) inherently contains some non-negative average level of surprise \(S_X\geq 0\) because of the word “random” in “random variable” (indeed, \(S_X=0\) iff \(p(x)=1\) for some outcome \(x\)). That’s one kind of surprise. But in practice, if one is under the illusion that the outcomes \(x\) are being drawn from a random variable \(\tilde X\) when in fact they are actually following a different random variable \(X\neq\tilde X\), then the perceived degree of surprise \(S_{\tilde X|X}=-\sum_xp_X(x)\log p_{\tilde X}(x)\) (called the cross-entropy of \(\tilde X\) with respect to \(X\)) would intuitively be larger than the average surprise \(S_X\) inherent in the intrinsic randomness of \(X\) itself. The KL divergence of \(\tilde X\) with respect to \(X\) thus measures how surprising the data is purely due to the wrong model \(\tilde X\) being used in lieu of the ground truth \(X\):

\[D_{\textrm{KL}}(\tilde X|X):=S_{\tilde X|X}-S_X\]

thus, it should be intuitively clear that \(D_{\textrm{KL}}(\tilde X|X)\geq 0\), a result known as the Gibbs inequality.
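
To make the definitions concrete, here is a minimal numerical sketch (base-\(2\) logarithms, i.e. bits; the particular distributions \(p,q\) are arbitrary illustrations):

```python
import numpy as np

def entropy(p):
    """Average surprise S = -sum p log2 p in bits (0 log 0 is taken to be 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(q, p):
    """Perceived surprise when data drawn from p is modelled by q (assumes q > 0 wherever p > 0)."""
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.125, 0.125])   # ground truth X
q = np.array([0.25, 0.25, 0.25, 0.25])    # wrong model X~

S_p = entropy(p)                          # 1.75 bits
D_kl = cross_entropy(q, p) - S_p          # KL divergence D(X~ | X) as defined above
print(S_p, cross_entropy(q, p), D_kl)     # D_kl >= 0 (Gibbs inequality); here 0.25 bits
```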

Problem: What is the intuition behind Jensen’s inequality.

Solution: Take any “smiling” curve and hang a bunch of masses on the curve. Then Jensen’s inequality says that their center of mass will also lie above the curve.

More precisely, for a convex function \(f(x)\) (which means any secant line lies above the function’s graph, or algebraically \(f(\lambda x+(1-\lambda)x’)\leq\lambda f(x)+(1-\lambda)f(x’)\) for \(\lambda\in[0,1]\)), one has:

\[\langle f(x)\rangle\geq f(\langle x\rangle)\]

with equality \(\langle f(x)\rangle=f(\langle x\rangle)\) iff \(x\) is (almost surely) constant, or more generally iff \(f\) is affine on the support of \(x\). More explicitly, for any convex linear combination of the form \(p_1x_1+p_2x_2+…+p_Nx_N\) where each \(p_i\geq 0\) and \(\sum_{i=1}^Np_i=1\), one has:

\[\sum_i p_if(x_i)\geq f\left(\sum_ip_ix_i\right)\]

The inequality is reversed for concave functions \(f(x)\).

Problem: Unstable particles are emitted from a source and decay at a distance \(x\geq 0\), an exponentially distributed real random variable with characteristic length \(x_0\). Decay events can only be observed if they occur in a window extending from \(x=1\) to \(x=20\) (arbitrary units). If \(100\) decays are registered at some locations \(1\leq x_1,x_2,…,x_{100}\leq 20\), estimate the characteristic length \(x_0\).

Solution: The prior distribution of \(x\) is, as stated, an exponential decay:

\[p(x|x_0)=\frac{e^{-x/x_0}}{x_0}[x\geq 0]\]

which is normalized \(\int_{-\infty}^{\infty}dxp(x|x_0)=1\). After learning that \(x\in [1,20]\), the posterior distribution of \(x\) becomes:

\[p(x|x\in [1,20],x_0)=\frac{p(x|x_0)p(x\in[1,20]|x, x_0)}{p(x\in[1,20]|x_0)}\]

but if \(x\) is measured, and happens to lie in \(x\in[1,20]\), then \(p(x\in[1,20]|x, x_0)=1\), otherwise if \(x\notin [1,20]\) then \(p(x\in[1,20]|x, x_0)=0\); all in all one has the compact Iverson bracket expression \(p(x\in[1,20]|x, x_0)=[x\in [1,20]]\). The denominator is the integral of the numerator and indeed is just the usual way that probability density functions are meant to be used:

\[p(x\in[1,20]|x_0)=\int_{-\infty}^{\infty}dxp(x|x_0)p(x\in[1,20]|x,x_0)=\int_1^{20}dx\frac{e^{-x/x_0}}{x_0}=e^{-1/x_0}-e^{-20/x_0}\]

Now, using this truncated distribution as the likelihood of each of the \(N\) independent registered decays, one has:

\[p(\{x_1,…,x_N\}|x\in[1,20],x_0)=p(x_1|x\in[1,20],x_0)…p(x_N|x\in[1,20],x_0)=\frac{e^{-(x_1+…+x_N)/x_0}}{x_0^N(e^{-1/x_0}-e^{-20/x_0})^N}\]

But one would instead like to Bayesian-infer \(p(x_0|\{x_1,…,x_N\},x\in[1,20])\) so as to find the most probable value of the characteristic length \(x_0\):

\[p(x_0|\{x_1,…,x_N\},x\in[1,20])=\frac{p(x_0|x\in[1,20])p(\{x_1,…,x_N\}|x\in[1,20],x_0)}{p(\{x_1,…,x_N\}|x\in[1,20])}\]

Clearly, the only missing puzzle piece here is the prior distribution \(p(x_0|x\in[1,20])\) on the characteristic length itself. Even without specifying this, one can still gain a lot of insight by plotting \(p(\{x_1,…,x_N\}|x\in[1,20],x_0)\) as a function of \(x_0\), or more precisely, to get the key idea, by plotting the likelihood \(p(x|x\in[1,20],x_0)\) of \(x_0\) for various fixed \(x\in[1,20]\): for \(x\) well inside the window the likelihood peaks at a finite \(x_0\), whereas for \(x\) near the upper edge of the window it just keeps growing with \(x_0\).
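
To see how this plays out numerically, here is a minimal sketch that maximizes the truncated-exponential likelihood above over \(x_0\) for \(100\) synthetic registered decays (the “true” \(x_0=5\), the seed, and the variable names are of course just illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Draw decays with a "true" x0, keep only those landing in the observation window [1, 20],
# and maximize over x0 the likelihood p(x | x in [1,20], x0) derived above.
x0_true, window = 5.0, (1.0, 20.0)
x = rng.exponential(x0_true, size=100_000)
x = x[(x >= window[0]) & (x <= window[1])][:100]        # 100 registered decay locations

def neg_log_likelihood(x0):
    a, b = window
    log_norm = np.log(np.exp(-a / x0) - np.exp(-b / x0))  # log(e^{-1/x0} - e^{-20/x0})
    return np.sum(x / x0 + np.log(x0) + log_norm)

res = minimize_scalar(neg_log_likelihood, bounds=(0.5, 50.0), method="bounded")
print(res.x, x.mean())   # maximum-likelihood x0 (near 5) vs the naive windowed sample mean
```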

Posted in Blog | Leave a comment

SVD, Pseudoinverse, and PCA

Problem: State the form of the singular value decomposition (SVD) of an arbitrary linear operator \(X:\mathbf C^m\to\mathbf C^n\).

Solution: The SVD of \(X\) is given by:

\[X=U_2\Sigma U^{\dagger}_1\]

where \(U_1\in U(m)\) and \(U_2\in U(n)\) are unitary operators and \(\Sigma:\mathbf C^m\to\mathbf C^n\) is a “Hermitian” positive semi-definite diagonal operator. Strictly speaking, one should prove that such a factorization of “rotate-stretch-rotate” really does exist for any \(X\); the details are omitted here.

Problem: In practice, how does one find the SVD of a linear operator \(X\)?

Solution: The idea is to look at the \(2\) Hermitian, positive semi-definite operators constructed from \(X\):

\[XX^{\dagger}=U_2\Sigma\Sigma^{\dagger}U_2^{\dagger}\]

\[X^{\dagger}X=U_1\Sigma^{\dagger}\Sigma U_1^{\dagger}\]

And in particular, by the spectral theorem, both \(XX^{\dagger}\) and \(X^{\dagger}X\) have an orthonormal eigenbasis (of \(\mathbf C^n\), \(\mathbf C^m\) respectively) corresponding to real, non-negative eigenvalues \(\sigma^2_1\geq\sigma^2_2\geq…\geq\sigma^2_{\text{min}(n,m)}\) (the larger of the two operators simply has additional zero eigenvalues), so since the above matches the standard eigendecomposition template, it’s clear that computing the SVD of \(X\) just amounts to diagonalizing \(XX^{\dagger}\) and \(X^{\dagger}X\) (actually just \(1\) of them has to be diagonalized, namely \(XX^{\dagger}\) if \(n\leq m\) or \(X^{\dagger}X\) if \(m\leq n\); then, because the nonzero eigenvalues \(\sigma^2_1,\sigma^2_2,…,\sigma^2_{\text{min}(n,m)}\) are identical for both \(XX^{\dagger}\) and \(X^{\dagger}X\), the eigenvectors of the other are obtained by applying \(X\) or \(X^{\dagger}\) to the ones already found and normalizing).
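
A quick numerical confirmation of this recipe (a minimal sketch; the random complex \(X\) is just an arbitrary example, and here \(m\leq n\) so it is \(X^{\dagger}X\) that gets diagonalized):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3)) + 1j * rng.normal(size=(5, 3))   # X: C^3 -> C^5 (n=5, m=3)

# The squared singular values of X coincide with the eigenvalues of the smaller of
# X X^dag (5x5) and X^dag X (3x3); here m <= n, so diagonalize the 3x3 X^dag X.
evals = np.linalg.eigvalsh(X.conj().T @ X)[::-1]     # sorted into descending order
svals = np.linalg.svd(X, compute_uv=False)
print(np.allclose(np.sqrt(evals), svals))            # True
```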

Problem: The most important application of SVD is to estimate a complicated linear operator \(X:\mathbf C^m\to\mathbf C^n\) by a low-rank approximation. Explain how this is achieved.

Solution: Order the \(\text{min}(n,m)\) singular values of \(X:\mathbf C^m\to\mathbf C^n\) from greatest to least as before \(\sigma^2_1,\sigma^2_2,…\sigma^2_{\text{min}(n,m)}\). Write the column eigenvectors \(U_1=(\mathbf u_1^{(1)}, \mathbf u_1^{(2)},…,\mathbf u_1^{(m)})\) and \(U_2=(\mathbf u_2^{(1)}, \mathbf u_2^{(2)},…,\mathbf u_2^{(n)})\). Then the singular value decomposition may also be written:

\[X=U_2\Sigma U^{\dagger}_1=\sum_{i=1}^{\text{min}(n,m)}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

(where \(\mathbf u\otimes\mathbf v\) denotes the outer product \(\mathbf u\mathbf v^{\dagger}\)). In particular, the Eckart-Young theorem states that the best rank-\(r\) approximation to \(X\) (for \(r\leq\text{rank}(X)\)) is given by truncating the above series at the first \(r\) terms:

\[X\approx \sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

Here, the word “best” is with respect to the Frobenius norm:

\[\text{argmin}_{\tilde X\in\mathbf C^{n\times m}:\text{rank}(\tilde X)=r}\sqrt{\text{Tr}((X-\tilde X)^{\dagger}(X-\tilde X))}=\sum_{i=1}^{r}\sigma_i\mathbf u_2^{(i)}\otimes\mathbf u_1^{(i)}\]

(insert proof here:)
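
A quick numerical illustration of the Eckart-Young truncation (a minimal sketch with an arbitrary random \(X\); note that numpy’s `svd` returns what is here called \(U_2\), the singular values, and \(U_1^{\dagger}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 40, 5
X = rng.normal(size=(n, m))

# Best rank-r approximation by truncating the SVD; its Frobenius error should equal
# the square root of the sum of the discarded squared singular values.
U2, s, U1h = np.linalg.svd(X, full_matrices=False)   # X = U2 @ diag(s) @ U1h
X_r = U2[:, :r] @ np.diag(s[:r]) @ U1h[:r, :]
err = np.linalg.norm(X - X_r, "fro")
print(np.isclose(err, np.sqrt(np.sum(s[r:]**2))))    # True
```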

Problem: Here is another application of SVD: given a diagonal operator \(\Sigma:\mathbf C^m\to\mathbf C^n\), explain how to compute its (Moore-Penrose) pseudoinverse \(\Sigma^+:\mathbf C^n\to\mathbf C^m\). Hence, how can the pseudoinverse of a general linear operator \(X:\mathbf C^m\to\mathbf C^n\) be computed?

Solution: \(\Sigma^+\) is almost like the usual inverse \(\Sigma^{-1}\) in that one simply inverts all the eigenvalues on the diagonal. But the exception is if \(\Sigma\) has a zero eigenvalue…then ordinarily \(\Sigma^{-1}\) would not exist, so for this case the pseudoinverse is created and the instruction is to simply leave the zeros be (rather than trying to invert them and get \(1/0=\infty\)). Finally, also don’t forget to transpose the result. For a general \(X\), the quickest way to compute the pseudoinverse is to use its SVD:

\[X=U_2\Sigma U^{\dagger}_1\]

Ordinarily trying to take the inverse (even if \(n\neq m\)) would appear to give:

\[X^{-1}=U_1\Sigma^{-1}U^{\dagger}_2\]

But since \(\Sigma^{-1}\) may not exist, the pseudoinverse of \(X\) is thus given by:

\[X^+=U_1\Sigma^+U_2^{\dagger}\]

(of course, if \(X^{-1}\) actually exists then the pseudoinverse is in fact not “pseudo” at all \(X^+=X^{-1}\)).

Problem: In general, given a vector \(\mathbf y\in\mathbf C^n\) and linear operator \(X:\mathbf C^m\to\mathbf C^n\), the equation \(\mathbf y=X\mathbf x\) only has a unique solution for \(\mathbf x\in\mathbf C^m\) provided that \(X\) is invertible, i.e. \(\mathbf x=X^{-1}\mathbf y\). If this isn’t the case however, one can still compute the pseudoinverse \(X^+\); what interpretation does \(X^+\mathbf y\in\mathbf C^m\) have?

Solution: There are \(2\) cases; if \(m\geq n\) (so that, for full-rank \(X\), the system \(\mathbf y=X\mathbf x\) is underdetermined), then there will be many solutions \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\), and in particular \(X^+\mathbf y\) is also an element of this solution space (i.e. \(XX^+\mathbf y=\mathbf y\)) but has the additional property that it is closest to the origin:

\[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m:\mathbf y=X\mathbf x}|\mathbf x|\]

By contrast, if \(m\leq n\) (an overdetermined system), then generically there will be no solutions \(\mathbf x\in\mathbf C^m\) to \(\mathbf y=X\mathbf x\) and in that case of course \(X^+\mathbf y\) is also not a solution, but nevertheless it is the closest one can come to an actual solution in the least-squares sense that:

\[X^+\mathbf y=\text{argmin}_{\mathbf x\in\mathbf C^m}|\mathbf y-X\mathbf x|\]
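
Both cases are easy to exhibit numerically with numpy’s built-in pseudoinverse (a minimal sketch with arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined case (m > n): X^+ y is a genuine, minimum-norm solution of y = X x.
X_wide = rng.normal(size=(3, 6))            # X: C^6 -> C^3, many solutions
y_wide = rng.normal(size=3)
x_min = np.linalg.pinv(X_wide) @ y_wide
print(np.allclose(X_wide @ x_min, y_wide))  # it really solves y = X x

# Overdetermined case (m < n): X^+ y is the least-squares "solution" of y = X x.
X_tall = rng.normal(size=(6, 3))            # X: C^3 -> C^6, generically no solution
y_tall = rng.normal(size=6)
x_ls = np.linalg.pinv(X_tall) @ y_tall
print(np.allclose(x_ls, np.linalg.lstsq(X_tall, y_tall, rcond=None)[0]))  # matches least squares
```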

Problem: Explain the what and how of conducting principal component analysis (PCA) of a bunch of (say \(m\)) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\).

Solution: Earlier, \(X:\mathbf C^m\to\mathbf C^n\) simply had the interpretation of an arbitrary linear operator, in which case the interpretation of \(XX^{\dagger}\) and \(X^{\dagger}X\) were both likewise arbitrary. However, one intuitive interpretation is to view \(X\) as a data matrix; if one takes the \(m\) columns of \(X=(\mathbf x_1,…,\mathbf x_m)\) to represent the \(m\) feature vectors in \(\mathbf C^n\), then the Gram matrix \(X^{\dagger}X\) takes on a statistical interpretation as an autocorrelation matrix:

\[X^{\dagger}X=
\begin{pmatrix} |\mathbf{x}_1|^2 & \cdots & \mathbf{x}_1^{\dagger}\mathbf{x}_m \\
\vdots & \ddots & \vdots \\
\mathbf{x}_m^{\dagger}\mathbf{x}_1 & \cdots & |\mathbf{x}_m|^2
\end{pmatrix}\]

or alternatively, if \(X\mapsto X-\boldsymbol{\mu}\otimes\mathbf{1}_m\) has already been mean-subtracted (where \(\boldsymbol{\mu}=\sum_{i=1}^m\mathbf x_i/m\)) so as to have zero mean, then \(XX^{\dagger}\) is (up to a factor of \(1/m\)) just the covariance matrix of the data, with \(X^{\dagger}X\) its \(m\times m\) Gram-matrix dual sharing the same nonzero eigenvalues. Based on the earlier discussion of SVD, it follows that the columns of \(U_2\in U(n)\) give the principal axes of the covariance matrix \(XX^{\dagger}/m\), and that the corresponding singular values in \(\Sigma\) measure (up to the overall \(1/\sqrt m\)) the standard deviation of the data along those principal axes (as variance is commonly denoted \(\sigma^2\), this also explains the choice of notation in the SVD). Note that in such applications, one is almost always in the regime \(m\gg n\) so that there will be \(n\) singular values \(\sigma_1,…,\sigma_n\). Each principal axis together with its corresponding singular value is said to be a principal component (PC) of the data \(X\), hence the name. The “analysis” part comes from analyzing how the total variance \(\sum_{i=1}^n\sigma_i^2\) of the data \(X\) is explained by each PC; specifically, the fraction of it explained by the \(j^{\text{th}}\) PC is \(\sigma_j^2/\sum_{i=1}^n\sigma_i^2\). If the singular values are ordered \(\sigma_1\geq…\geq\sigma_n\), then PCA also provides a lossy data compression algorithm: in analogy to the low-rank approximation application of SVD, one can pick some \(r< n\) and project all \(m\) feature vectors \(\mathbf x_1,…,\mathbf x_m\in\mathbf C^n\) onto the \(r\)-dimensional subspace spanned by the first \(r\) principal axes, i.e. those associated to the largest \(r\) singular values \(\sigma_1,…,\sigma_r\).

If the data matrix contained the feature vectors as its rows instead, \(X=(\mathbf x_1,…,\mathbf x_m)^{\dagger}\), then repeat the above discussion with the roles of \(XX^{\dagger}\) and \(X^{\dagger}X\) (and of \(U_1,U_2\)) interchanged.
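
Here is a minimal PCA-via-SVD sketch consistent with the columns-as-feature-vectors convention above (the \(2\)-feature toy data and the mixing matrix are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 500                                   # 2 features, 500 samples (as columns)
A = np.array([[2.0, 0.0], [1.0, 0.3]])          # injects correlation between the 2 features
X = A @ rng.normal(size=(n, m))

X = X - X.mean(axis=1, keepdims=True)           # mean-subtract each feature
U2, s, U1h = np.linalg.svd(X, full_matrices=False)

explained = s**2 / np.sum(s**2)                 # fraction of total variance per PC
print(explained)                                # the first PC explains most of the variance

# Project onto the first principal axis (first column of U2) for a 1-component compression.
r = 1
scores = U2[:, :r].T @ X                        # r x m compressed representation
X_approx = U2[:, :r] @ scores                   # reconstruction back in feature space
print(np.linalg.norm(X - X_approx) / np.linalg.norm(X))
```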

Posted in Blog | Leave a comment

Hartree-Fock Method

Problem: The Hamiltonian of an atom may be written in the perturbative form:

\[H=H_0+V\]

what are \(H_0\) and \(V\)?

Solution: If the atom has \(N\) electrons and atomic number \(Z\) (if neutral then \(Z=N\) but here one can also allow the possibility of ions), then the state space of the atom is considered to be \(\bigwedge^NL^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\), i.e. only the \(N\) electrons are considered part of the system, not the nucleus; rather, the nucleus plays an external role, providing the background Coulomb potential in which the electrons move. Thus, \(H_0\) is given by the one-body operator:

\[H_0=\sum_{i=1}^N\frac{|\mathbf P_i|^2}{2m}+V_{\text{ext}}(\mathbf X_i)\]

where \(V_{\text{ext}}(\mathbf x)=-\frac{Z\alpha\hbar c}{|\mathbf x|}\) and the perturbative interaction potential is given by:

\[V=\sum_{1\leq i<j\leq N}\frac{\alpha\hbar c}{|\mathbf X_i-\mathbf X_j|}\]

Problem: What assumptions are implicitly going into the form of \(H\) above?

Solution: Complete disregard of special relativity; specifically, no fine/hyperfine structure effects, and also using the Coulomb potential which is strictly only valid for electrostatics (with relativistic corrections assumed to be small).

Problem: The goal as always in QM is to find the energies \(E\) and corresponding eigenstates of \(H\) in \(\bigwedge^NL^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\). However, for \(N\geq 2\), no exact such \(H\)-eigenstates are known, due to the presence of the repulsive electron-electron interactions in \(V\); if for a moment one assumes \(V=0\), then the exact \(H\)-eigenstates and energies are again known…what are they?

Solution: The ground state consists of occupying (also called filling) the first \(N\) available single-particle eigenstates of the hydrogenic \(Z\)-atom (antisymmetrized appropriately) as per Pauli exclusion; this of course is analogous to the usual Fermi sea picture where \(V_{\text{ext}}=0\) and \(H_0\)-eigenstates are just plane waves \(|\mathbf k\rangle\). Excited states are then analogous to excitations of the Fermi sea.

Problem: What does it actually mean to say that e.g. a certain atom like \(\text{Cr}\) has ground state “electron configuration” \(1s^22s^22p^63s^23p^64s^13d^5\)?

Solution: On the one hand, electron configuration is just a notation for a particular ket in what is almost an occupation number representation for the single-electron basis \(\{|n\ell m_{\ell}m_s\rangle\}\), e.g.

\[1s^22s^1=|N_{|100\uparrow\rangle}=1, N_{|100\downarrow\rangle}=1,N_{|200\uparrow\rangle}=1\rangle\]

though of course the last occupation number might also have been \(N_{|200\downarrow\rangle}=1\); thus, electron configuration is a coarse-graining of the occupation number representation over the quantum numbers \(m_{\ell},m_s\)). I would conjecture then that the ground state electron configuration assigned to a given atom is the one which has the largest overlap with its true ground state in the presence of Coulomb repulsion \(V_{\text{int}}\) (and even including all the relativistic effects). This can be formalized in terms of the Hartree-Fock ansatz explained later.

Problem: As a warmup, show how first-order time-independent perturbation theory may be applied to estimate the ground state energy of a helium atom. How does it compare with the experimental result?

Solution: If the Coulomb repulsion between the two electrons in the helium atom is assumed to be “weak”, then \(V=\frac{\alpha\hbar c}{|\mathbf X_1-\mathbf X_2|}\) may be treated as a perturbation of \(H_0\). The ground state \(|\Psi\rangle\) of \(H_0\) has position-space wavefunction:

\[\langle\mathbf x_1,\mathbf x_2|\Psi\rangle=\psi_{100}(\mathbf x_1)\psi_{100}(\mathbf x_2)\otimes\frac{|\uparrow\downarrow\rangle-|\downarrow\uparrow\rangle}{\sqrt 2}\]

where \(\psi_{100}(r)=\sqrt{\frac{Z^3}{\pi a_0^3}}e^{-Zr/a_0}\). Accordingly, because \(V\) is blind to the spin degrees of freedom, \(1^{\text{st}}\)-order perturbation theory asserts that the \(1^{\text{st}}\)-order correction to the non-interacting ground state energy \(-(Z\alpha)^2mc^2\approx -109\text{ eV}\) is given by:

\[\alpha\hbar c\int d^3\mathbf x_1 d^3\mathbf x_2\frac{|\psi_{100}(\mathbf x_1)|^2|\psi_{100}(\mathbf x_2)|^2}{|\mathbf x_1-\mathbf x_2|}\]

\[=\alpha\hbar c\left(\frac{Z^3}{\pi a_0^3}\right)^2\int d^3\mathbf x_1 e^{-2Zr_1/a_0}\int d^3\mathbf x_2\frac{e^{-2Zr_2/a_0}}{|\mathbf x_1-\mathbf x_2|}\]

First, to evaluate the inner integral, notice that on symmetry grounds one can take \(\mathbf x_1=r_1\hat{\mathbf z}\) without loss of generality, thereby aligning it with the usual spherical coordinates to obtain:

\[\int d^3\mathbf x_2\frac{e^{-2Zr_2/a_0}}{|\mathbf x_1-\mathbf x_2|}=2\pi\int_0^{\infty}dr_2 r_2^2e^{-2Zr_2/a_0}\int_{-1}^1\frac{d\cos\theta}{\sqrt{r_1^2+r_2^2-2r_1r_2\cos\theta}}\]

\[=2\pi\int_0^{\infty}dr_2 r_2^2e^{-2Zr_2/a_0}\frac{r_1+r_2-|r_1-r_2|}{r_1r_2}\]

\[=2\pi\int_0^{r_1}dr_2 r_2^2e^{-2Zr_2/a_0}\frac{2}{r_1}+2\pi\int_{r_1}^{\infty}dr_2 r_2^2e^{-2Zr_2/a_0}\frac{2}{r_2}\]

\[= \pi \left( \frac{a_0^3}{Z^3 r_1} - \left( \frac{a_0^2}{Z^2} + \frac{a_0^3}{Z^3 r_1} \right) e^{-2Zr_1/a_0} \right)\]

using integration by parts. Plugging this result back into the main integral and this time doing the \(\int_0^{\infty}dr_1\) integrals quickly using the \(\Gamma\) function, one eventually ends up with a positive \(1^{\text{st}}\)-order energy correction \(5Z\alpha^2mc^2/8\) (of course it should be positive given that the Coulomb repulsion \(V\) is positive-definite). The total ground state energy is therefore:

\[(-Z^2+5Z/8)\alpha^2mc^2\approx -74.8\text{ eV}\]

which for \(Z=2\) is close to the actual helium ground state energy \(\approx -79\text{ eV}\).

Problem: Show that a variational ansatz:

\[\Psi(\mathbf x_1,\mathbf x_2|Z_{\text{eff}})=\frac{Z_{\text{eff}}^3}{\pi a_0^3}e^{-Z_{\text{eff}}(r_1+r_2)/a_0}\]

with the effective nuclear charge \(Z_{\text{eff}}\) the corresponding variational parameter to be optimized, produces a better estimate of the ground state energy of a helium atom. What is the physical interpretation of \(Z_{\text{eff}}\)?

Solution: The variational ansatz must do at least as well as \(1^{\text{st}}\)-order perturbation theory because setting \(Z_{\text{eff}}:=Z\) would reproduce the perturbative result. Thus, the ground state energy to be minimized is:

\[E_0(Z_{\text{eff}})=\int d^3\mathbf x_1 d^3\mathbf x_2\Psi^{\dagger}(\mathbf x_1,\mathbf x_2|Z_{\text{eff}})H_Z\Psi(\mathbf x_1,\mathbf x_2|Z_{\text{eff}})\]

where \(H=H_Z\) emphasizes that the original helium Hamiltonian \(H\) is being used with \(Z=2\). Of course though, it can be rewritten as:

\[H_Z=H_{Z_{\text{eff}}}+(Z_{\text{eff}}-Z)\alpha\hbar c\left(\frac{1}{|\mathbf X_1|}+\frac{1}{|\mathbf X_2|}\right)\]

So:

\[E_0(Z_{\text{eff}})=(-Z_{\text{eff}}^2+5Z_{\text{eff}}/8)\alpha^2mc^2+2(Z_{\text{eff}}-Z)\alpha\hbar c\int d^3\mathbf x\frac{|\psi_{100}(r|Z_{\text{eff}})|^2}{r}\]

\[=(-Z_{\text{eff}}^2+5Z_{\text{eff}}/8)\alpha^2mc^2+2(Z_{\text{eff}}-Z)\alpha\hbar c\frac{Z_{\text{eff}}^3}{\pi a_0^3}\int_0^{\infty}dr 4\pi r^2\frac{e^{-2Z_{\text{eff}}r/a_0}}{r}\]

\[=(-Z_{\text{eff}}^2+5Z_{\text{eff}}/8)\alpha^2mc^2+2Z_{\text{eff}}(Z_{\text{eff}}-Z)\alpha^2mc^2\]

\[=\left(Z_{\text{eff}}^2+\left(\frac{5}{8}-2Z\right)Z_{\text{eff}}\right)\alpha^2mc^2\]

which is clearly minimized when \(Z_{\text{eff}}=Z-\frac{5}{16}=1.6875\). The phenomenon \(Z_{\text{eff}}<Z\) is a manifestation of screening.
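
A quick numerical check of this minimization (taking \(\alpha^2mc^2\approx 27.2\text{ eV}\)):

```python
from scipy.optimize import minimize_scalar

alpha2_mc2 = 27.211      # Hartree energy alpha^2 m c^2 in eV
Z = 2

def E0(Z_eff):
    # variational energy (Z_eff^2 + (5/8 - 2Z) Z_eff) * alpha^2 m c^2 derived above
    return (Z_eff**2 + (5 / 8 - 2 * Z) * Z_eff) * alpha2_mc2

res = minimize_scalar(E0, bounds=(1.0, 2.0), method="bounded")
print(res.x, Z - 5 / 16)  # optimal Z_eff ~ 1.6875
print(res.fun)            # ~ -77.5 eV: between the perturbative -74.8 eV and the true -79 eV
```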

Problem: What does the Aufbau principle assert?

Solution: Empirically (though with several exceptions) the occupation of single-electron states in the ground states of the atoms (as per their electron configurations) proceeds like:

which of course was subsequently built into the topology of the periodic table.

Problem: Previously, both perturbation theory and the variational method were used to estimate the ground state energy of the helium atom. One can now ask: what about the \(1^{\text{st}}\) excited state?

Solution: From the perspective of \(H_0\), the \(1^{\text{st}}\) excited manifold has \(16\)-fold degeneracy; specifically, by applying the \(2\)-antisymmetrizer \(\sqrt{2}\mathcal A_2\) to any of the following, one obtains an \(H_0\)-eigenstate in the \(1^{\text{st}}\) excited subspace \(\text{ker}(H_0-E_21)\cap\bigwedge^2L^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\):

(aka Slater determinants). Even though this of course provides a basis for the \(16\)-dimensional space \(\text{ker}(H_0-E_21)\cap\bigwedge^2L^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\), some of these Slater determinants (e.g. \(\sqrt{2}\mathcal A_2|100\uparrow,200\downarrow\rangle\)) are not eigenstates of \(V\). Rather, because \([V,\mathbf L^2]=[V,L_z]=0\) is isotropic and \([V,\mathbf S^2]=[V,S_z]=0\) is furthermore spin-blind, where \(\mathbf L=\mathbf L_1+\mathbf L_2\) and \(\mathbf S=\mathbf S_1+\mathbf S_2\), it makes sense to instead work in an unentangled basis, specifically where each basis state factorizes into the product of a spatial wavefunction and a spinorial factor. This is achieved by the usual Clebsch-Gordan decomposition \((1/2)^{\otimes 2}=0\oplus 1\), and means that one should instead hybridize some of the Slater determinants:

so that one is still left with \(16\) states at the end of the day, but now all of them are also simultaneous \((\mathbf S^2,S_3)\)-eigenstates. Now degenerate perturbation theory is very easy. Given \(|\Psi\rangle\in\biggr\{\frac{|100,200\rangle+|200,100\rangle}{\sqrt{2}}|0,0\rangle,\frac{|100,200\rangle-|200,100\rangle}{\sqrt{2}}|1,-1\rangle,…\biggr\}\) where the quantum numbers for \(\mathbf S^2,S_3\) are being used, because there are \(4\) different combinations of \((\ell,m_{\ell})=(0,0),(1,1),(1,0),(1,-1)\), one expects that to \(1^{\text{st}}\)-order the \(1^{\text{st}}\)-excited manifold’s \(16\)-fold degeneracy will decompose into \(4\) submanifolds each of \(4\)-fold degeneracy. Specifically:

\[\langle\Psi|V|\Psi\rangle=J\pm K\]

where \(J\) is called the direct energy and \(K\) the exchange energy (both are defined between the \(2\) particular single-particle \(|n\ell m_{\ell}\rangle\) states involved, and the \(+\) sign is taken if the spatial wavefunction is symmetric, while the \(-\) sign is taken if it’s antisymmetric). For example, for \(|\Psi\rangle=\frac{|100,211\rangle-|211,100\rangle}{\sqrt{2}}|1,1\rangle\), take the \(-\) sign with:

\[J_{100,211}=\langle 100,211|V|100,211\rangle=\alpha\hbar c\int d^3\mathbf x_1 d^3\mathbf x_2\frac{|\psi_{100}(\mathbf x_1)\psi_{211}(\mathbf x_2)|^2}{|\mathbf x_1-\mathbf x_2|}=\frac{59}{243}Z\alpha^2mc^2\]

\[K_{100,211}=\langle 100,211|V|211,100\rangle=\alpha\hbar c\int d^3\mathbf x_1 d^3\mathbf x_2\frac{\psi^{\dagger}_{100}(\mathbf x_1)\psi^{\dagger}_{211}(\mathbf x_2)\psi_{211}(\mathbf x_1)\psi_{100}(\mathbf x_2)}{|\mathbf x_1-\mathbf x_2|}=\frac{112}{6561}Z\alpha^2mc^2\]

And indeed one can make an energy-splitting diagram that gives credence to the Aufbau rule (of course, the energy of all \(H_0\)-eigenstates increases with the electronic Coulomb repulsion \(V\)):

Note if one had instead used the Slater basis, ultimately one would still arrive at the same answer through having to find eigenvalues of block-diagonal matrices with blocks that look like \(\begin{pmatrix} J&-K\\-K&J\end{pmatrix}\).

(aside: the key selection rule \(\Delta s=0\) for \(\text E1\) transitions, combined with the fact that \(s=0\) for the ground state, means that any \(s=1\) excited state should be metastable…and indeed its decay is a forbidden transition, leading to a lifetime of around \(\tau\approx 2.2\text{ h}\). Historically it was thought that \(s=0\) excited states corresponded to “parahelium” while \(s=1\) excited states corresponded to “orthohelium”).

Problem: After all the perturbation theory math in the above problem, what’s the key conceptual takeaway?

Solution: For unentangled states which may be factorized as the product of a spatial wavefunction and a spinorial component, when the spatial wavefunction is antisymmetric, the corresponding fermions are more likely to be found far from each other, and consequently the direct energy would be suppressed by this destructive quantum statistical interference (hence \(J-K\)). Conversely, symmetric spatial wavefunctions are enhanced by a sort of constructive quantum statistical interference (hence \(J+K\)).

Of course, one could also apply a variational approach just like was done for the ground state. This would confirm that electron screening is the key physics at play.

Problem: Show that for the Hartree product ansatz:

\[\Psi(\mathbf x_1,…,\mathbf x_N)=\psi_1(\mathbf x_1)…\psi_N(\mathbf x_N)\]

where \(\psi_1,…,\psi_N\) are \(N\) normalized single-particle wavefunctions, the expected energy \(\langle\Psi|H|\Psi\rangle\) with respect to the usual atomic Hamiltonian \(H\) is given by:

\[\langle\Psi|H|\Psi\rangle=-\frac{\hbar^2}{2m}\sum_{i=1}^N\int d^3\mathbf x\psi^{\dagger}_i(\mathbf x)\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi_i(\mathbf x)-Z\alpha\hbar c\sum_{i=1}^N\int d^3\mathbf x\frac{|\psi_i(\mathbf x)|^2}{|\mathbf x|}\]

\[+\alpha\hbar c\sum_{1\leq i<j\leq N}\int d^3\mathbf x d^3\mathbf x’\frac{|\psi_i(\mathbf x)\psi_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}\]

(the last term is a sum over all pairs of direct energies \(J_{ij}\); there are no exchange energies \(K_{ij}\) yet because the Hartree ansatz \(|\Psi\rangle\) does not live in \(\bigwedge^N L^2(\mathbf R^3\to\mathbf C)\otimes\mathbf C^2\), indeed the ansatz doesn’t have any knowledge of the \(\mathbf C^2\) spinor! These deficiencies will be addressed later by the Hartree-Fock ansatz).

Solution: The key point about the Hartree product ansatz is that it’s separable, so just plug and separate integrals and remember that the \(\psi_i\) are normalized.

Problem: One would like to estimate the ground state wavefunction \(\Psi(\mathbf x_1,…,\mathbf x_N)\) of the \(N\)-electron atom within the Hartree framework (essentially mean-field theory). How can this be achieved?

Solution: By constrained minimization of the energy functional of \(\psi_1,…,\psi_N\):

\[\langle\Psi|H|\Psi\rangle-\sum_{i=1}^NE_i\left(\int d^3\mathbf x|\psi_i(\mathbf x)|^2-1\right)\]

where \(E_1,…,E_N\) are \(N\) Lagrange multipliers enforcing normalization constraints on the \(\psi_i\). Setting the functional derivative with respect to \(\psi^{\dagger}_i\) to zero is the quickest way to get the Schrodinger-equation like answer for each of the \(N\) single-particle wavefunctions \(\psi_i\):

\[\left(-\frac{\hbar^2}{2m}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2-\frac{Z\alpha\hbar c}{|\mathbf x|}+V_i(\mathbf x)\right)\psi_i(\mathbf x)=E_i\psi_i(\mathbf x)\]

where:

\[V_i(\mathbf x)=\alpha\hbar c\sum_{j\neq i}\int d^3\mathbf x’\frac{|\psi_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}\]

is the Coulomb electrostatic potential energy experienced by the \(i^{\text{th}}\) electron due to the other \(N-1\) electrons (cf. the classical Coulomb electrostatic potential \(\phi(\mathbf x)=\frac{1}{4\pi\varepsilon_0}\int d^3\mathbf x’\frac{\rho(\mathbf x’)}{|\mathbf x-\mathbf x’|}\) with the charge density \(\rho(\mathbf x’)=-e|\psi(\mathbf x’)|^2\)). This system of \(N\) coupled nonlinear integro-differential PDEs for \(\psi_1,…,\psi_N\) is known as the Hartree equations.

Problem: How are the Hartree equations solved numerically?

Solution: One can begin with a crude guess for the ground state \(\psi_i\) (such as the usual “Fermi sea” picture with hydrogenic atomic orbitals) and for each \(i=1,…,N\) obtain the interaction potential \(V_i(\mathbf x)\). Then, numerically solve the resulting \(i=1,…,N\) Schrodinger equations to obtain new \(\psi_i\), and repeat this process until hopefully it converges (this is also called the self-consistent field method).

Problem: Once these SCF iterations converge (hopefully) to some optimal set of \(N\) single-particle wavefunctions \(\psi^{\star}_i\), the corresponding Hartree estimate of the ground state wavefunction is of course \(\Psi^{\star}(\mathbf x_1,…,\mathbf x_N)=\psi^{\star}_1(\mathbf x_1)…\psi^{\star}_N(\mathbf x_N)\), but what about its corresponding ground state energy?

Solution: Although on the one hand this is just \(\langle\Psi^{\star}|H|\Psi^{\star}\rangle\), one can actually also write this in terms of the optimal Lagrange multipliers \(E^{\star}_i\), specifically because from the Hartree equations one can isolate:

\[E^{\star}_i=\int d^3\mathbf x\psi^{\star\dagger}_i(\mathbf x)\left(-\frac{\hbar^2}{2m}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2-\frac{Z\alpha\hbar c}{|\mathbf x|}+V^{\star}_i(\mathbf x)\right)\psi^{\star}_i(\mathbf x)\]

So summing \(\sum_{i=1}^NE^{\star}_i\) would almost give \(\langle\Psi^{\star}|H|\Psi^{\star}\rangle\) except that it double-counts the direct energy term \(\sum_{i=1}^N\sum_{j\neq i}=2\sum_{1\leq i<j\leq N}\) so overall:

\[\langle\Psi^{\star}|H|\Psi^{\star}\rangle=\sum_{i=1}^NE^{\star}_i-\alpha\hbar c\sum_{1\leq i<j\leq N}\int d^3\mathbf x d^3\mathbf x’\frac{|\psi^{\star}_i(\mathbf x)\psi^{\star}_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}\]

(more precisely, since this is a variational estimate, the true ground state energy must be less than this).
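
To make the SCF bookkeeping concrete, here is a deliberately crude \(1\)D toy of the Hartree loop described above (not a real atomic solver: soft-Coulomb potentials on a grid, \(2\) “electrons” sharing one orbital, \(\hbar=m=1\), and all parameters chosen purely for illustration), including the double-counting correction just derived:

```python
import numpy as np

# 2 "electrons" sharing one orbital psi, bound by a soft-Coulomb "nucleus" of charge +2
# and repelling each other through a soft-Coulomb interaction.
L, N = 20.0, 400
x = np.linspace(-L / 2, L / 2, N)
dx = x[1] - x[0]

def soft_coulomb(d, q):
    # potential energy of an electron a distance d from a charge q (softened at short range)
    return -q / np.sqrt(d**2 + 1.0)

T = (-0.5 / dx**2) * (np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
                      - 2.0 * np.eye(N))          # finite-difference kinetic energy
V_ext = soft_coulomb(x, 2.0)                      # attraction to the "nucleus"

psi = np.exp(-x**2)                               # crude initial guess for the orbital
psi /= np.sqrt(np.sum(psi**2) * dx)

for it in range(200):
    # Hartree (mean-field) potential felt by each electron due to the *other* electron
    V_H = np.array([np.sum(psi**2 * soft_coulomb(x - xi, -1.0)) * dx for xi in x])
    eps, vecs = np.linalg.eigh(T + np.diag(V_ext + V_H))
    psi_new = vecs[:, 0] / np.sqrt(np.sum(vecs[:, 0]**2) * dx)
    if np.sum(psi_new * psi) < 0:                 # eigenvectors come with an arbitrary sign
        psi_new = -psi_new
    if np.sum((psi_new - psi)**2) * dx < 1e-12:   # self-consistency reached
        psi = psi_new
        break
    psi = 0.5 * psi + 0.5 * psi_new               # damped update for SCF stability
    psi /= np.sqrt(np.sum(psi**2) * dx)

# Total energy = 2*eps_0 minus the double-counted direct (Hartree) energy, as derived above.
J = np.sum(psi**2 * V_H) * dx
print("orbital energy:", eps[0], " total Hartree energy:", 2 * eps[0] - J)
```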

Problem: Qualitatively, why is the ground state electron configuration of \(Z=19\) potassium \(\text K\) given by \(1s^22s^22p^63s^23p^64s^1\) rather than \(1s^22s^22p^63s^23p^63d^1\)?

Solution: Due to screening, a \(3d\) electron would have higher energy than a \(4s\) electron, so as far as the ground state is concerned, the latter is preferable. Hence, the Aufbau principle says \(4s\) is occupied before \(3d\). This can sort of be seen in the Hartree equations where, approximating \(V_i=V_i(r)\) as being approximately isotropic, one can define an effective nuclear charge \(Z_{\text{eff}}(r)\) such that the effective potential experienced by the \(i^{\text{th}}\) electron looks like:

\[-\frac{Z\alpha\hbar c}{r}+V_i(r)=-\frac{Z_{\text{eff}}(r)\alpha\hbar c}{r}\]

where \(\lim_{r\to 0}Z_{\text{eff}}(r)=Z=19\) and \(\lim_{r\to \infty}Z_{\text{eff}}(r)=1\)

Problem: The Hartree product ansatz \(\Psi(\mathbf x_1,…,\mathbf x_N)\) was purely a position space many-body wavefunction that did not involve any mention of spin (if it did, it would have canceled out in \(\langle\Psi|H|\Psi\rangle\) anyways because \(H\) is spin-blind). Even so, it didn’t live in \(\bigwedge^NL^2(\mathbf R^3\to\mathbf C)\). These deficiencies can be remedied via the Hartree-Fock ansatz, which is simply an antisymmetrization of the Hartree ansatz:

\[|\Psi\rangle=\sqrt{N!}\mathcal A_N|1\rangle\otimes…\otimes|N\rangle\]

where each single-particle spin-orbital \(\langle\mathbf x|i\rangle=\psi_i(\mathbf x)|\sigma\rangle\) with \(\sigma\in\{\uparrow,\downarrow\}\). It is often also written as a Slater determinant of the \(N\) single-particle spin-orbitals, but here the assignment of which electron occupies which single-particle spin-orbital \(|i\rangle\) is implicit in the ordering of the tensor product. Repeat everything from the Hartree derivation to derive the analogous Hartree-Fock equations.

Solution: One can show that, compared to the Hartree state, the expected energy of the Hartree-Fock state contains an additional (negative) exchange energy contribution for every pair of electrons with aligned spins \(\sigma_i=\sigma_j\):

\[\langle\Psi|H|\Psi\rangle=-\frac{\hbar^2}{2m}\sum_{i=1}^N\int d^3\mathbf x\psi^{\dagger}_i(\mathbf x)\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2\psi_i(\mathbf x)-Z\alpha\hbar c\sum_{i=1}^N\int d^3\mathbf x\frac{|\psi_i(\mathbf x)|^2}{|\mathbf x|}\]

\[+\alpha\hbar c\sum_{1\leq i<j\leq N}\int d^3\mathbf x d^3\mathbf x’\frac{|\psi_i(\mathbf x)\psi_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}-\alpha\hbar c\sum_{1\leq i<j\leq N}\delta_{\sigma_i\sigma_j}\int d^3\mathbf x d^3\mathbf x’\frac{\psi^{\dagger}_i(\mathbf x)\psi^{\dagger}_j(\mathbf x’)\psi_i(\mathbf x’)\psi_j(\mathbf x)}{|\mathbf x-\mathbf x’|}\]

(aside: the fact that the exchange energy lowers the total energy when spins are aligned means that the spins will prefer to align, and this is the essence of one of Hund’s rules).

Doing the constrained minimization again now yields the Hartree-Fock equations which are basically the Hartree equations with an additional non-local exchange potential:

\[\left(-\frac{\hbar^2}{2m}\biggr|\frac{\partial}{\partial\mathbf x}\biggr|^2-\frac{Z\alpha\hbar c}{|\mathbf x|}+V_{\text{int}}(\mathbf x)\right)\psi_i(\mathbf x)-\int d^3\mathbf x’ V_{\text{ex},i}(\mathbf x,\mathbf x’)\psi_i(\mathbf x’)=E_i\psi_i(\mathbf x)\]

but note that instead of \(\sum_{j\neq i}\), there is an apparent “self-energy” in the sum \(\sum_{j=1}^N\):

\[V_{\text{int}}(\mathbf x)=\alpha\hbar c\sum_{j=1}^N\int d^3\mathbf x’\frac{|\psi_j(\mathbf x’)|^2}{|\mathbf x-\mathbf x’|}\]

(so \(V_{\text{int}}\) is now the same for all electrons \(i=1,…,N\)). The self-energy is artificial though, and cancelled by a corresponding term from the exchange potential for each electron \(i\):

\[V_{\text{ex},i}(\mathbf x,\mathbf x’)=\alpha\hbar c\sum_{j=1}^N\delta_{\sigma_i\sigma_j}\frac{\psi^{\dagger}_j(\mathbf x’)\psi_j(\mathbf x)}{|\mathbf x-\mathbf x’|}\]

Problem: What is the key limitation of the Hartree-Fock method?

Solution: The fact that the Hartree-Fock ansatz only uses a single Slater determinant. Of course, the Hartree ansatz suffers from the same issue, in that both basically assume that \(|\Psi\rangle\) is a non-interacting many-body state. In reality, the true ground state will of course be a superposition of Slater determinants, or in the \(2^{\text{nd}}\)-quantization view, a superposition of Fock states in the \(N\)-particle sector of the fermionic Fock space, and in that case the very language of “filling/occupying” shells/subshells which is familiar becomes unsuitable.

Posted in Blog | Leave a comment

Quantum Mechanics on Phase Space

Problem: Given a function \(f(x,p)\) defined on a single-particle classical phase space \((x,p)\) in \(1\) dimension, define the Weyl transform \(\hat f\) of \(f(x,p)\).

Solution:

\[\hat f=\int\frac{dx’dp’}{h}e^{i(p’X+x’P)/\hbar}\tilde f(x’, p’)\]

where the phase-space Fourier transform \(\tilde f(x’,p’)\) of \(f(x,p)\) is given by:

\[\tilde f(x’,p’)=\int\frac{dxdp}{h}e^{-i(p’x+x’p)/\hbar}f(x,p)\]

Thus, it is actually pretty easy to remember this formula; just take a FT of \(f\), and then to get something like \(f\) back, naturally one would like to take an inverse FT, but in this case because one wants an operator \(\hat f\), replace the corresponding \((x,p)\mapsto (X,P)\).

Problem: Motivate the Weyl transform.

Solution: For a monomial of the form \(f(x,p)=x^np^m\), the Weyl transform \(\hat f\) is a completely symmetrized (hence Hermitian) linear combination of all possible arrangements of products of \(n\) factors of \(X\) and \(m\) factors of \(P\) (more generally, \(f\in\textbf R\Leftrightarrow \hat f^{\dagger}=\hat f\)).

Problem: Show that the matrix elements of the Weyl operator \(\hat f\) in the \(X\)-eigenbasis are given by:

\[\langle x_1|\hat f|x_2\rangle=\int\frac{dp}{h}e^{ip(x_1-x_2)/\hbar}f\left(\frac{x_1+x_2}{2},p\right)\]

Solution: This is just a plug-and-chug drill:

\[\langle x_1|\hat f|x_2\rangle=\int\frac{dx’dp’}{h}\langle x_1|e^{i(p’X+x’P)/\hbar}|x_2\rangle\tilde f(x’, p’)\]

Use the simple case of the BCH formula \(e^{i(p’X+x’P)/\hbar}=e^{ip’X/2\hbar}e^{ix’P/\hbar}e^{ip’X/2\hbar}\) to obtain:

\[=\int\frac{dx’dp’}{h} e^{ip'(x_1+x_2)/2\hbar}\langle x_1|e^{ix’P/\hbar}|x_2\rangle\tilde f(x’, p’)\]

But \(e^{ix’P/\hbar}|x_2\rangle=|x_2-x’\rangle\) is just the translation operator, and \(\langle x_1|x_2-x’\rangle=\delta(x_1-x_2+x’)\) so doing the \(\delta\) gives:

\[=\int\frac{dp’}{h} e^{ip'(x_1+x_2)/2\hbar}\tilde f(x_2-x_1, p’)\]

\[=\int\frac{dp’}{h} e^{ip'(x_1+x_2)/2\hbar}\int\frac{dxdp}{h}e^{-i(p’x+(x_2-x_1)p)/\hbar}f(x,p)\]

\[=\int\frac{dxdp}{h}e^{i(x_1-x_2)p/\hbar}f(x,p)\int\frac{dp’}{h}e^{ip'(x_1+x_2-2x)/2\hbar}\]

\[=\int\frac{dxdp}{h}e^{i(x_1-x_2)p/\hbar}f(x,p)\delta\left(\frac{x_1+x_2}{2}-x\right)\]

\[=\int\frac{dp}{h}e^{ip(x_1-x_2)/\hbar}f\left(\frac{x_1+x_2}{2},p\right)\]

Problem: Hence, show that the Wigner transform inverts the Weyl transform:

\[f(x,p)=\int dx’ e^{-ipx’/\hbar}\langle x+x’/2|\hat f|x-x’/2\rangle\]

Solution: In light of the above result, it’s very natural to let \(x_1:=x+x’/2\) and \(x_2:=x-x’/2\). The result then follows by inverse FT.

Problem: Show that, for any \(2\) arbitrary operators \(\hat f,\hat g\), the trace of their product may be written in terms of their Wigner transforms \(f(x,p),g(x,p)\) via the following trace identity:

\[\text{Tr}(\hat f\hat g)=\int\frac{dxdp}{h}f(x,p)g(x,p)\]

Solution: It seems to be easier to go from RHS to LHS:

Problem: Define the Wigner quasiprobability distribution \(\rho(x,p)\).

Solution: Given a quantum mechanical system whose state space \(\mathcal H\) contains factors of \(L^2(\textbf R^3\to\textbf C)\) (i.e. spatial degrees of freedom), then the Wigner quasiprobability distribution is simply the Wigner transform of the system’s density operator \(\hat{\rho}\) divided by \(h\):

\[\rho(x,p)=\int\frac{dx’}{h} e^{-ipx’/\hbar}\langle x+x’/2|\hat{\rho}|x-x’/2\rangle\]

Problem: For a pure quantum state \(\hat{\rho}=|\psi\rangle\langle\psi|\), what is the Wigner quasiprobability distribution \(\rho(x,p)\)?

Solution: Simply:

\[\rho(x,p)=\int\frac{dx’}{h}e^{-ipx’/\hbar}\psi(x+x’/2)\psi^*(x-x’/2)\]

or, using \(\psi(x)=\int\frac{dp}{\sqrt{h}}e^{ipx/\hbar}\psi(p)\), the result can be written symmetrically in terms of momentum-space wavefunctions:

\[\rho(x,p)=\int\frac{dp’}{h}e^{ip’x/\hbar}\psi(p+p’/2)\psi^*(p-p’/2)\]
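
As a concrete numerical illustration (a minimal sketch with \(\hbar=1\), an arbitrary Gaussian wave packet, and brute-force quadrature), one can tabulate \(\rho(x,p)\) on a grid and check a couple of the properties discussed in the next problem:

```python
import numpy as np

hbar = 1.0
h = 2 * np.pi * hbar
sigma = 1.3                                    # width of an illustrative Gaussian wave packet

def psi(x):
    return (np.pi * sigma**2) ** (-0.25) * np.exp(-x**2 / (2 * sigma**2))

# Evaluate rho(x,p) = (1/h) * integral dx' e^{-ipx'/hbar} psi(x + x'/2) psi*(x - x'/2) on a
# grid; the integrand decays fast enough that a plain Riemann sum over x' suffices.
xprime = np.linspace(-12, 12, 241)
dxp = xprime[1] - xprime[0]
x = np.linspace(-4, 4, 61)
p = np.linspace(-4, 4, 61)
dp = p[1] - p[0]
X, P, XP = np.meshgrid(x, p, xprime, indexing="ij")
integrand = np.exp(-1j * P * XP / hbar) * psi(X + XP / 2) * np.conj(psi(X - XP / 2))
rho = np.sum(integrand, axis=-1).real * dxp / h

# Checks: the p-marginal recovers |psi(x)|^2, and |rho| stays within the 2/h bound
# (which this Gaussian saturates at the phase-space origin).
marginal_x = np.sum(rho, axis=1) * dp
print(np.allclose(marginal_x, np.abs(psi(x))**2, atol=1e-6))
print(np.max(np.abs(rho)), 2 / h)
```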

Problem: Hence, combining the previous results, show that the Wigner quasiprobability distribution \(\rho(x,p)\) has the following properties which warrant its name:

i) For any observable \(\hat H\), the expectation \(\langle \hat H\rangle\) in an arbitrary mixed ensemble with density operator \(\hat{\rho}\) is:

\[\langle \hat H\rangle=\int dxdp\rho(x,p)H(x,p)\]

where \(H(x,p)\) is the Wigner transform of \(\hat H\). In particular, as a corollary:

\[\int dxdp\rho(x,p)=1\]

ii) If \(\hat{\rho}=|\psi\rangle\langle\psi|\) is a pure state, then one can obtain the position space probability density via:

\[|\psi(x)|^2=\int dp\rho(x,p)\]

and similarly, to obtain the momentum space probability density:

\[|\psi(p)|^2=\int dx\rho(x,p)\]

iii) \(\rho(x,p)\in\textbf R\) is real-valued.

iv) The transition probability between two arbitrary pure states \(|\psi_1\rangle,|\psi_2\rangle\) is given by:

\[|\langle\psi_1|\psi_2\rangle|^2=h\int dxdp\rho_1(x,p)\rho_2(x,p)\]

where \(\rho_{1,2}(x,p)\) are the respective Wigner quasiprobability distributions for \(\hat{\rho}_{1,2}:=|\psi_{1,2}\rangle\langle\psi_{1,2}|\).

v) For a normalized pure state, \(|\rho(x,p)|\leq 2/h\) is bounded on phase space \((x,p)\).

Solution:

i) This simply follows from \(\langle\hat H\rangle=\text{Tr}(\hat{\rho}\hat H)\) together with the earlier identity for the trace of a product (one should check that the Wigner transform of the identity operator is indeed \(1\)).

ii) Using the \(2\) different forms of \(\rho(x,p)\) derived earlier for a pure state (one w.r.t. position space wavefunctions, one w.r.t. momentum space wavefunctions) the derivation is very simple:

(more generally, if \(\hat\rho\) is impure, then one can still say \(\int dp\rho(x,p)=\langle x|\hat{\rho}|x\rangle\) and \(\int dx\rho(x,p)=\langle p|\hat{\rho}|p\rangle\)).

iii) This simply stems from the Hermiticity of any density operator \(\hat{\rho}^{\dagger}=\hat\rho\) which is logically equivalent to the reality of \(\rho(x,p)\in\textbf R\).

iv) This follows from the identity \(|\langle\psi_1|\psi_2\rangle|^2=\text{Tr}(\hat\rho_1\hat\rho_2)\). In particular, if \(\langle\psi_1|\psi_2\rangle=0\) are orthogonal, then this shows that, despite being real-valued (as just shown), the Wigner quasiprobability distribution clearly cannot always be positive, even for simple pure states; this is the key difference between it and the positive semi-definiteness of the density operator \(\hat{\rho}\). It is also the reason for the “quasi” prefix, reflecting the fact that although its marginals \(|\psi(x)|^2,|\psi(p)|^2\) do have legitimate probabilistic interpretations, funnily enough the joint distribution \(\rho(x,p)\) does not admit such an interpretation.

v) This follows simply from the Cauchy-Schwarz inequality after doing a substitution \(x’\mapsto x’/2\) in the Wigner integral transform.

Problem: Show that the Wigner quasiprobability \(\rho(x,p,t)\) flows through phase space with respect to a Hamiltonian \(H\) according to the Moyal equation:

\[\dot{\rho}+\{\rho,H\}_{\star}=0\]

where \(i\hbar\{\rho,H\}_{\star}=\rho\star H-H\star\rho\) is the Moyal bracket, and the star product \((f\star g)(x,p)\) of two functions \(f(x,p),g(x,p)\) on phase space is defined by the somewhat peculiar formula:

\[(f\star g)(x,p):=f(x,p)\exp\left(\frac{i\hbar}{2}\left(\overleftarrow{\frac{\partial}{\partial x}}\overrightarrow{\frac{\partial}{\partial p}}-\overleftarrow{\frac{\partial}{\partial p}}\overrightarrow{\frac{\partial}{\partial x}}\right)\right)g(x,p)\]

where the arrows on top of the partial derivative operators denote the direction in which they are meant to act. Note that as a corollary:

\[\{\rho,H\}_{\star}=\frac{2}{\hbar}\rho\sin\left(\frac{\hbar}{2}\left(\overleftarrow{\frac{\partial}{\partial x}}\overrightarrow{\frac{\partial}{\partial p}}-\overleftarrow{\frac{\partial}{\partial p}}\overrightarrow{\frac{\partial}{\partial x}}\right)\right)H=\{\rho,H\}+O(\hbar^2)\]

Solution:

Problem: For the typical case of a Hamiltonian \(H(x,p)=\frac{p^2}{2m}+V_{\text{ext}}(x)\), what does the Moyal bracket simplify to?

Solution:

Problem: The previous results were all for a single particle moving in \(1\)D.

Posted in Blog | Leave a comment

Classical Hydrodynamics (Part 1)

Problem: At a high level, what is the goal of classical hydrodynamics?

Solution: The program of classical hydrodynamics seeks to bridge the physics of a many-body system at different length scales. The idea is to start from microscopics (i.e. Newton’s laws) and derive mesoscopics (i.e. Boltzmann equation), and from there to derive macroscopics (i.e. Navier-Stokes equations). Of course, there is also a field of quantum hydrodynamics, but that’s for another day…

Problem: In light of the above, it is useful to start with Newton’s laws, but in their Hamiltonian formulation. In order to coarse grain from this microscopic description to a mesoscopic description, the natural way to achieve such a coarse graining is to construct (i.e. pull out of thin air!) a probability density function \(\rho(\textbf x_1,…,\textbf x_N,\textbf p_1,…,\textbf p_N,t)\) defined on the joint phase space \(\cong\textbf R^{6N}\) of the \(N\)-body system. Give a more precise interpretation to \(\rho\).

Solution: The precise interpretation is that \(\rho(\textbf x_1,…,\textbf x_N,\textbf p_1,…,\textbf p_N,t)d^3\textbf x_1…d^3\textbf x_Nd^3\textbf p_1…d^3\textbf p_N\) is the probability (purely due to classical ignorance) that the system of \(N\) (in general not necessarily identical) particles is, at some time \(t\), living within an infinitesimal volume \(d^3\textbf x_1…d^3\textbf x_Nd^3\textbf p_1…d^3\textbf p_N\) centered around the (micro)state \((\textbf x_1,…,\textbf x_N,\textbf p_1,…,\textbf p_N)\) in the joint phase space.

Problem: State the equation of motion obeyed by \(\rho\) (i.e. Liouville’s equation).

Solution: Remembering the intuition that \(\{\cdot,H\}\) implements an advective derivative on the joint phase space (the essence of Hamilton’s equations):

\[\left(\frac{D\rho}{Dt}\right)_H=0\Rightarrow\dot{\rho}+\{\rho,H\}=0\]

so probability flows incompressibly throughout phase space under \(H\).

Problem: From this joint probability density function \(\rho\), explain how to marginalize to obtain the single-particle probability density function \(\rho_1(\textbf x,\textbf p,t)\) for say “particle #\(1\)”.

Solution: Simply integrate away the \(6(N-1)\) degrees of freedom of the other \(N-1\) particles:

\[\rho_1(\textbf x,\textbf p,t)=\int d^3\textbf x_2…d^3\textbf x_Nd^3\textbf p_2…d^3\textbf p_N\rho(\textbf x,…\textbf x_N,\textbf p,…,\textbf p_N,t)\]

Thus, \(\rho_1(\textbf x,\textbf p,t)d^3\textbf x d^3\textbf p\) represents the probability (again due to classical ignorance) that “particle \(1\)” will, at some time \(t\), be found in the infinitesimal cell \(d^3\textbf xd^3\textbf p\) centered around the state \((\textbf x,\textbf p)\) in its (single-particle) phase space.

Problem: Henceforth suppose (as is often the case) that the \(N\) particles are identical. Explain what the best way is to exploit this additional assumption of indistinguishability.

Solution: Rather than working with \(\rho_1\), work with \(n_1:=N\rho_1\); this is sort of like a classical analog of the quantum mechanical passage from first quantization (Slater permanents/determinants) to second quantization (occupation numbers) in the many-body quantum mechanics of identical bosons/fermions.

Problem: Derive the analog of the Liouville equation for \(n_1(\textbf x,\textbf p,t)\) (a.k.a. the Boltzmann equation):

\[\left(\frac{Dn_1}{Dt}\right)_{H_1}=(\dot n_1)_c\]

where the collision integral source term \((\dot n_1)_c\) is given by:

\[(\dot n_1)_c=\int d^3\textbf x_2 d^3\textbf p_2\frac{\partial V_{\text{int}}(\textbf x-\textbf x_2)}{\partial\textbf x}\cdot\frac{\partial n_2}{\partial\textbf p}\]

where \(n_2:=N(N-1)\int d^3\textbf x_3…d^3\textbf x_Nd^3\textbf p_3…d^3\textbf p_N\rho\). Specify the actual physics by assuming the Hamiltonian \(H\) takes the generic form:

\[H=\sum_{i=1}^NH_i+\sum_{1\leq i<j\leq N}V_{\text{int}}(\textbf x_i-\textbf x_j)\]

where \(H_i:=\frac{|\textbf p_i|^2}{2m}+V_{\text{ext}}(\textbf x_i)\), and in general for \(i=1\) one writes \(\textbf x:=\textbf x_1\) and \(\textbf p:=\textbf p_1\).

Solution: Using the Liouville equation for \(\rho\) as a springboard, multiply it by \(N\) and integrate over \(d^3\textbf x_2…d^3\textbf x_Nd^3\textbf p_2…d^3\textbf p_N\). The terms of \(\{\rho,H\}\) involving only particles \(2,…,N\) are total phase-space derivatives and integrate to zero (assuming \(\rho\) vanishes at infinity), the terms involving \(H_1\) assemble into the advective derivative \(\left(\frac{Dn_1}{Dt}\right)_{H_1}\), and the \(N-1\) identical interaction terms coupling particle \(1\) to the others survive as the collision integral \((\dot n_1)_c\) quoted above.

Problem: In writing the Boltzmann equation for \(\dot n_1\) above, it was natural to introduce the quantity \(n_2(\textbf x,\textbf x_2,\textbf p,\textbf p_2,t)\). If one were to repeat the above derivation, namely evaluate \(\dot n_2\) using Liouville’s equation for \(\dot{\rho}\) again, what result would one find?

Solution: One would get a result most naturally expressed in terms of a quantity \(n_3:=N(N-1)(N-2)\int d^3\textbf x_4…d^3\textbf x_Nd^3\textbf p_4…d^3\textbf p_N\rho\), and so forth. This ladder of \(N\) coupled equations forms the BBGKY (Bogoliubov-Born-Green-Kirkwood-Yvon) hierarchy:

\[\frac{\partial n_k}{\partial t}=\{H_k,n_k\}+\sum_{i=1}^{k}\int d^3\textbf x_{k+1}d^3\textbf p_{k+1}\frac{\partial V_{\text{int}}(\textbf x_i-\textbf x_{k+1})}{\partial\textbf x_i}\cdot\frac{\partial n_{k+1}}{\partial\textbf p_i}\]

where the \(k\)-body distribution function is:

\[n_k(\textbf x_1,…,\textbf x_k,\textbf p_1,…,\textbf p_k,t)=\frac{N!}{(N-k)!}\int d^3\textbf x_{k+1}…d^3\textbf x_Nd^3\textbf p_{k+1}…d^3\textbf p_N\rho(\textbf x_1,…,\textbf x_N,\textbf p_1,…,\textbf p_N,t)\]

and the \(k\)-particle Hamiltonian \(H_k\) includes both \(V_{\text{ext}}\) and the interactions \(V_{\text{int}}\) among the first \(k\) particles, but ignores their interactions with the other \(N-k\) particles:

\[H_k=\sum_{i=1}^{k}\left(\frac{|\textbf p_i|^2}{2m}+V_{\text{ext}}(\textbf x_i)\right)+\sum_{1\leq i<j\leq k}V_{\text{int}}(\textbf x_i-\textbf x_j)\]
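As a quick consistency check (added here, not spelled out in the original), setting \(k=1\) in the hierarchy gives:

\[\frac{\partial n_1}{\partial t}=\{H_1,n_1\}+\int d^3\textbf x_2d^3\textbf p_2\frac{\partial V_{\text{int}}(\textbf x-\textbf x_2)}{\partial\textbf x}\cdot\frac{\partial n_2}{\partial\textbf p}\]

which, since \(\left(\frac{Dn_1}{Dt}\right)_{H_1}=\frac{\partial n_1}{\partial t}-\{H_1,n_1\}\), is precisely the precursor Boltzmann equation \(\left(\frac{Dn_1}{Dt}\right)_{H_1}=(\dot n_1)_c\) stated above.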

Problem: Define the real space number density \(n(\textbf x,t)\) in terms of \(n_1(\textbf x,\textbf p,t)\) and show that it’s not influenced by the collision integral.

Solution: One has \(n(\textbf x,t):=\int d^3\textbf p n_1(\textbf x,\textbf p,t)\). So:

\[\dot n=\int d^3\textbf p\{H_1,n_1\}+\int d^3\textbf p(\dot n_1)_c\]

but the \(d^3\textbf p\)-integral over the collision integral vanishes for the same kind of reason as above: the integrand is a total derivative in \(\textbf p\), and \(n_2\to 0\) as \(|\textbf p|\to\infty\), so the boundary term is zero. The surviving streaming term gives the continuity equation \(\dot n+\frac{\partial}{\partial\textbf x}\cdot(n\textbf u)=0\) with \(n\textbf u:=\int d^3\textbf p\frac{\textbf p}{m}n_1\).

Problem: What about for the momentum space number density \(n(\textbf p,t):=\int d^3\textbf x n_1(\textbf x,\textbf p,t)\)? What about for the real space momentum density \(\int d^3\textbf p\textbf pn_1\) or the real space kinetic energy density \(\int d^3\textbf p|\textbf p|^2n_1/2m\)?

Solution: In all these cases, the collision integral will in general contribute! For instance, integrating \(\textbf p(\dot n_1)_c\) by parts over \(\textbf p\) leaves \(-\int d^3\textbf x_2d^3\textbf p_2d^3\textbf p\frac{\partial V_{\text{int}}(\textbf x-\textbf x_2)}{\partial\textbf x}n_2\), the mean interparticle force density at \(\textbf x\), which has no reason to vanish.

Problem: Show that a classical elastic \(2\)-body collision between equal-mass particles implies (though is not equivalent to) an \(O(3)\) isometry on their relative momentum.

Solution: Conservation of momentum:

\[\textbf p+\textbf p_2=\textbf p’_1+\textbf p’_2\]

\[|\textbf p+\textbf p_2|^2=|\textbf p’_1+\textbf p’_2|^2\]

\[|\textbf p|^2+|\textbf p_2|^2+2\textbf p\cdot\textbf p_2=|\textbf p’_1|^2+|\textbf p’_2|^2+2\textbf p’_1\cdot\textbf p’_2\]

Using conservation of kinetic energy for equal masses \(|\textbf p|^2+|\textbf p_2|^2=|\textbf p’_1|^2+|\textbf p’_2|^2\):

\[\textbf p\cdot\textbf p_2=\textbf p’_1\cdot\textbf p’_2\]

So:

\[|\textbf p|^2+|\textbf p_2|^2-2\textbf p\cdot\textbf p_2=|\textbf p’_1|^2+|\textbf p’_2|^2-2\textbf p’_1\cdot\textbf p’_2\]

\[|\textbf p-\textbf p_2|^2=|\textbf p’_1-\textbf p’_2|^2\]

\[|\textbf p-\textbf p_2|=|\textbf p’_1-\textbf p’_2|\]
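Conversely, holding the total momentum fixed and applying any \(O(3)\) isometry to the relative momentum produces a valid equal-mass elastic collision. Here is a quick numerical check of this (a sketch of my own, not from the post), using a random orthogonal matrix:

```python
# Equal-mass elastic collision as an O(3) isometry on the relative momentum.
import jax
import jax.numpy as jnp

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
p1 = jax.random.normal(k1, (3,))  # hypothetical incoming momenta
p2 = jax.random.normal(k2, (3,))

# Random orthogonal matrix Q in O(3) via QR decomposition of a Gaussian matrix.
Q, _ = jnp.linalg.qr(jax.random.normal(k3, (3, 3)))

P = p1 + p2             # conserved total momentum
q = (p1 - p2) / 2.0     # half the relative momentum
p1_prime = P / 2.0 + Q @ q
p2_prime = P / 2.0 - Q @ q

print(jnp.allclose(p1 + p2, p1_prime + p2_prime))  # momentum conserved
print(jnp.allclose(jnp.sum(p1**2 + p2**2),
                   jnp.sum(p1_prime**2 + p2_prime**2)))  # kinetic energy conserved
```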

Problem: Strictly speaking, the above is actually a precursor to the actual Boltzmann equation. So then, how does the actual Boltzmann equation arise from the BBGKY hierarchy? Make sure to clearly identify all assumptions.

Solution: Sketching the standard argument, three assumptions are needed to close the hierarchy at the level of \(n_1\). First, assume the gas is dilute and \(V_{\text{int}}\) is short-ranged, so that only \(2\)-body collisions matter (the \(n_3\) term in the \(n_2\) equation can be neglected) and each collision can be treated as a localized scattering event characterized by the differential cross-section \(d\sigma\) of \(V_{\text{int}}\). Second, assume \(n_1\) varies slowly over the length and time scales of an individual collision. Third (and most importantly), assume molecular chaos: particles about to collide are uncorrelated, i.e. \(n_2(\textbf x,\textbf x_2,\textbf p,\textbf p_2,t)\approx n_1(\textbf x,\textbf p,t)n_1(\textbf x,\textbf p_2,t)\) for pre-collision momenta (with \(\textbf x_2\approx\textbf x\) by locality).

One thus obtains the actual Boltzmann (integrodifferential!) equation for \(n_1\), which still asserts that:

\[\left(\frac{Dn_1}{Dt}\right)_{H_1}=(\dot n_1)_c\]

but now with the collision integral expressed solely in terms of \(n_1\):

\[(\dot n_1)_c=\int d^3\textbf p_2d\sigma|\textbf v_{\text{rel}}|(n_1(\textbf x,\textbf p’_1,t)n_1(\textbf x,\textbf p’_2,t)-n_1(\textbf x,\textbf p,t)n_1(\textbf x,\textbf p_2,t))\]

where \(|\textbf v_{\text{rel}}|:=|\textbf p-\textbf p_2|/m=|\textbf p’_1-\textbf p’_2|/m\) is conserved, and in practice one should compute the differential cross-section \(d\sigma=\frac{d\sigma}{d\Omega}d\Omega\) from the interaction potential \(V_{\text{int}}\); both post-collision momenta \(\textbf p’_1,\textbf p’_2\) are then determined from \(\textbf p,\textbf p_2\) and \(d\Omega\) by elasticity.
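As a concrete textbook example (an addition here, not discussed further in this post): for hard spheres of diameter \(d\), the classical differential cross-section is isotropic, \(\frac{d\sigma}{d\Omega}=\frac{d^2}{4}\), so that the total cross-section is \(\sigma=\pi d^2\).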

On dimensional analysis grounds, the way to read the collision integral is as follows:

\[[(\dot n_1)_c]=\text{momentum}^3\times\frac{\text{volume}}{\text{time}}\times\frac{1}{\text{volume}^2\,\text{momentum}^6}=\frac{1}{\text{volume}\times\text{momentum}^3\times\text{time}}\]

Problem: Elaborate more on the molecular chaos assumption and how it resolves Loschmidt’s paradox concerning the arrow of time \(t\).

Solution: The molecular chaos assumption (Stosszahlansatz) is essentially the assertion that the momenta of particles about to collide are uncorrelated, i.e. that \(n_2\) factorizes into a product of \(n_1\)’s for pre-collision momenta. The prefix “pre” is what breaks time-reversal symmetry, despite the underlying Hamiltonian mechanics being time-reversible: collisions build up correlations, so the same factorization is not assumed for post-collision momenta.

Problem: Prove that the quantity:

\[-\frac{S(t)}{k_B}:=\int d^3\textbf xd^3\textbf pn_1\ln n_1\]

never increases with time \(t\).

Solution: Taking the time derivative:

\[-\frac{\dot S}{k_B}=\int d^3\textbf xd^3\textbf p\dot n_1(1+\ln n_1)\]

but \(N=\int d^3\textbf xd^3\textbf pn_1\) is conserved so that \(\dot N=\int d^3\textbf xd^3\textbf p\dot n_1=0\), which kills the “\(1\)” term. Then of course, one should invoke Boltzmann’s equation for \(\dot n_1\):

\[=-\int d^3\textbf xd^3\textbf p\left(\frac{\textbf p}{m}\cdot\frac{\partial n_1}{\partial\textbf x}-\frac{\partial V_{\text{ext}}(\textbf x)}{\partial\textbf x}\cdot\frac{\partial n_1}{\partial\textbf p}\right)\ln n_1\]

\[+\int d^3\textbf xd^3\textbf pd^3\textbf p_2d\sigma|\textbf v_{\text{rel}}|(n_1(\textbf x,\textbf p’_1,t)n_1(\textbf x,\textbf p’_2,t)-n_1(\textbf x,\textbf p,t)n_1(\textbf x,\textbf p_2,t))\ln n_1(\textbf x,\textbf p,t)\]

Now integrate the first row above by parts:

\[\int d^3\textbf xd^3\textbf p\left(\frac{\textbf p}{m}\cdot\frac{\partial n_1}{\partial\textbf x}-\frac{\partial V_{\text{ext}}(\textbf x)}{\partial\textbf x}\cdot\frac{\partial n_1}{\partial\textbf p}\right)\ln n_1\]

\[=\int d^3\textbf p\frac{\textbf p}{m}\cdot\int d^3\textbf x\frac{\partial n_1}{\partial\textbf x}\ln n_1-\int d^3\textbf x\frac{\partial V_{\text{ext}}(\textbf x)}{\partial\textbf x}\cdot\int d^3\textbf p\frac{\partial n_1}{\partial\textbf p}\ln n_1\]

where:

\[\int d^3\textbf x\frac{\partial n_1}{\partial\textbf x}\ln n_1=\oint d^2\textbf x n_1\ln n_1-\int d^3\textbf x\frac{\partial n_1}{\partial\textbf x}=\oint d^2\textbf x n_1(\ln n_1-1)=0\]

and similarly for the other term (the boundary terms vanish because \(n_1\to 0\) sufficiently fast as \(|\textbf x|\to\infty\) and \(|\textbf p|\to\infty\)).

To complete the proof, one just has to show that the collision integral term is \(\leq 0\). This is slightly more delicate, requiring one to first symmetrize between the particles \(\textbf p\Leftrightarrow\textbf p_2\) to obtain:

\[\frac{1}{2}\int d^3\textbf xd^3\textbf pd^3\textbf p_2d\sigma|\textbf v_{\text{rel}}|(n_1(\textbf x,\textbf p’_1,t)n_1(\textbf x,\textbf p’_2,t)-n_1(\textbf x,\textbf p,t)n_1(\textbf x,\textbf p_2,t))(\ln n_1(\textbf x,\textbf p,t)+\ln n_1(\textbf x,\textbf p_2,t))\]

and then further symmetrizing between past and future \((\textbf p,\textbf p_2)\Leftrightarrow(\textbf p’_1,\textbf p’_2)\) (legitimate because the measure \(d^3\textbf pd^3\textbf p_2d\sigma|\textbf v_{\text{rel}}|\) is invariant under the inverse collision) to get:

\[-\frac{1}{4}\int d^3\textbf xd^3\textbf pd^3\textbf p_2d\sigma|\textbf v_{\text{rel}}|(x-y)(\ln x-\ln y)\]

where \(x:=n_1(\textbf x,\textbf p’_1,t)n_1(\textbf x,\textbf p’_2,t)\) and \(y:=n_1(\textbf x,\textbf p,t)n_1(\textbf x,\textbf p_2,t)\). But \((x-y)(\ln x-\ln y)\geq 0\) for all \(x,y>0\) (both factors share the same sign because \(\ln\) is monotonically increasing, with equality iff \(x=y\)), so the collision term above is \(\leq 0\),

and “Boltzmann’s \(H\)-theorem” is proven.
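For what it’s worth, the pointwise inequality \((x-y)(\ln x-\ln y)\geq 0\) is also easy to check numerically; the following short JAX check is my own addition:

```python
# Check (x - y)(ln x - ln y) >= 0 for random positive pairs; equality holds iff x = y.
import jax
import jax.numpy as jnp

x, y = jax.random.uniform(jax.random.PRNGKey(0), (2, 100_000), minval=1e-6, maxval=10.0)
print(jnp.all((x - y) * (jnp.log(x) - jnp.log(y)) >= 0))  # expected: True
```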

Problem: A given single-particle phase space number density distribution \(n_1(\textbf x,\textbf p,t)\) is said to be in detailed balance iff for all locations \(\textbf x\in\textbf R^3\) and momenta \(\textbf p,\textbf p_2,\textbf p’_1,\textbf p’_2\in\textbf R^3\) obeying \(\textbf p+\textbf p_2=\textbf p’_1+\textbf p’_2\) and \(|\textbf p|^2+|\textbf p_2|^2=|\textbf p’_1|^2+|\textbf p’_2|^2\) (i.e. elastic collisions), one has the identity:

\[n_1(\textbf x,\textbf p,t)n_1(\textbf x,\textbf p_2,t)=n_1(\textbf x,\textbf p’_1,t)n_1(\textbf x,\textbf p’_2,t)\]

With this definition, show that the following statements are logically equivalent:

  1. \(n_1\) is in detailed balance.
  2. Collisions produce no entropy.
  3. \(\ln n_1\) is a collisional invariant, i.e.

\[\ln n_1(\textbf x,\textbf p,t)+\ln n_1(\textbf x,\textbf p_2,t)=\ln n_1(\textbf x,\textbf p’_1,t)+\ln n_1(\textbf x,\textbf p’_2,t)\]

  4. \(n_1(\textbf x,\textbf p,t)=n(\beta/2\pi m)^{3/2}\exp(-\beta|\textbf p-m\textbf v|^2/2m)\) for some functions \(n(\textbf x,t),\textbf v(\textbf x,t),\beta(\textbf x,t)\) (this is also called the Maxwell-Boltzmann distribution or, more profoundly, local equilibrium, which reduces to the usual thermodynamic case of global equilibrium iff \(n, \textbf v, \beta\) are constant parameters rather than spatiotemporally varying).

Solution: First, assume that \(n_1\) is in detailed balance (aside: the adjective detailed in this context is roughly synonymous with the adjective pairwise; it emphasizes that the balance doesn’t just come from e.g. \(5\) people standing in a circle and each person passing an apple to the person on their left, but that if Alice gives Bob \(3\) apples then Bob will also give Alice \(3\) apples). Then the integrand \((x-y)(\ln x-\ln y)\) of the collision entropy production vanishes identically (one sits on the \(y=x\) diagonal), so collisions produce no entropy and #1 implies #2. Conversely, #2 implies #1 thanks to the proof of Boltzmann’s \(H\)-theorem: the integrand is \(\geq 0\), so a vanishing integral forces \(x=y\) everywhere. It is also trivial to check that #1 is equivalent to #3 (just take logarithms of the detailed balance identity).

Finally, #3 can be argued to imply #4 by the fact that, since \(\ln n_1\) is a collisional invariant, it must be a linear combination of the basic collisional invariants, \(\ln n_1\in\text{span}_{\textbf R}\{1,\textbf p,|\textbf p|^2\}\), so there exist parameters \(\mu,\beta,\textbf v\) (in general functions of \(\textbf x\) and \(t\)) such that:

\[\ln n_1(\textbf x,\textbf p,t)=\beta\left(\mu+\textbf v\cdot\textbf p-\frac{|\textbf p|^2}{2m}\right)\]

\[n_1(\textbf x,\textbf p,t)=ze^{-\beta(|\textbf p|^2/2m-\textbf v\cdot\textbf p)}\]

where the “fugacity” \(z:=e^{\beta\mu}\) is fixed by the normalization \(n=\int d^3\textbf pn_1\); completing the square, \(|\textbf p|^2/2m-\textbf v\cdot\textbf p=|\textbf p-m\textbf v|^2/2m-m|\textbf v|^2/2\), then yields the Maxwell-Boltzmann form stated in #4. The converse is of course trivial to show.
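To close, a small JAX sketch (my own, not from the post) checking detailed balance of the Maxwell-Boltzmann form directly: for an equal-mass elastic collision built by rotating the relative momentum (as in the earlier check), the product \(n_1n_1\) is unchanged. The values of \(\beta\), \(\textbf v\), \(m\), and the momenta are arbitrary choices.

```python
# Detailed balance of n_1 ∝ exp(-beta |p - m v|^2 / 2m) under an elastic collision:
# n_1(p) n_1(p_2) = n_1(p'_1) n_1(p'_2), checked via the log-product.
import jax
import jax.numpy as jnp

m, beta = 1.0, 2.0
v = jnp.array([0.3, -0.1, 0.5])  # hypothetical local flow velocity

def log_n1(p):
    # log of the Maxwell-Boltzmann form, dropping the p-independent prefactor
    return -beta * jnp.sum((p - m * v) ** 2) / (2.0 * m)

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(1), 3)
p1, p2 = jax.random.normal(k1, (3,)), jax.random.normal(k2, (3,))
Q, _ = jnp.linalg.qr(jax.random.normal(k3, (3, 3)))  # random O(3) element

P, q = p1 + p2, (p1 - p2) / 2.0
p1_prime, p2_prime = P / 2.0 + Q @ q, P / 2.0 - Q @ q

print(jnp.allclose(log_n1(p1) + log_n1(p2),
                   log_n1(p1_prime) + log_n1(p2_prime)))  # expected: True
```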
