Introduction
While reading the proof in Section 10.4.2, “Why does the Metropolis-Hastings algorithm work?”, of Hoff (2009) and Hoff et al. (2022), I ran into a point of confusion, so I decided to write this post to organize my thoughts.
Hoff’s Proof Flow
In Hoff (2009) and Hoff et al. (2022), the proof explaining why the Markov chain generated by the M-H algorithm can approximate the target distribution \(p_0\) proceeds as follows:
1. The M-H algorithm generates a Markov chain that is irreducible, aperiodic, and positive recurrent.
2. By the Ergodic Theorem, there exists a unique \(\pi\) such that
   - \(\mathrm{Pr}(x^{(s)} \in A) \to \pi(A)\) as \(s \to \infty\), for any set \(A\);
   - \(\frac{1}{S} \sum_{s=1}^{S} g(x^{(s)}) \to \int g(x) \pi(x) \, dx\) as \(S \to \infty\).
3. Such a \(\pi\) is called the stationary distribution and has the following property: if \(x^{(s)} \sim \pi\), and \(x^{(s+1)}\) is generated from the Markov chain starting at \(x^{(s)}\), then \(\mathrm{Pr}(x^{(s+1)} \in A) = \pi(A)\). (Once a sample is drawn from the stationary distribution, all subsequent samples are also drawn from the stationary distribution.)
4. \(\pi(x) = p_0(x)\) holds.

From steps 1-4, the Markov chain generated by the M-H algorithm can approximate the target distribution \(p_0\).
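To make the algorithm under discussion concrete, here is a minimal Python sketch of random-walk Metropolis-Hastings. The function name, the target (a standard normal), and the Gaussian proposal are illustrative choices of mine, not taken from the textbook; with a symmetric proposal, the acceptance ratio reduces to \(p_0(x^\ast)/p_0(x^{(s)})\).

```python
import numpy as np

def metropolis_hastings(log_p0, x0, n_samples, proposal_sd=1.0, seed=0):
    """Random-walk M-H sketch targeting exp(log_p0) (illustrative, not Hoff's code).

    With a symmetric proposal, the proposal density cancels and the
    acceptance ratio is p0(x*) / p0(x_s).
    """
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_samples)
    for s in range(n_samples):
        x_star = x + rng.normal(scale=proposal_sd)  # propose x* from a symmetric kernel
        log_r = log_p0(x_star) - log_p0(x)          # log acceptance ratio
        if np.log(rng.uniform()) < log_r:           # accept with probability min(1, r)
            x = x_star
        samples[s] = x                              # on rejection, the chain stays put
    return samples

# Target: standard normal, log density up to an additive constant.
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=50_000)
print(draws[10_000:].mean(), draws[10_000:].std())  # roughly 0 and 1 after burn-in
```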
Point of Confusion
Looking at the proof of step 4 (\(\pi(x) = p_0(x)\)) in the flow above, the textbook states:
Suppose \(x^{(s)}\) is sampled from the target distribution \(p_0\), and assume \(x^{(s+1)}\) is generated using the Metropolis-Hastings algorithm starting from \(x^{(s)}\).
The proof proceeds from this assumption. At this point I thought: is there any guarantee that this assumption holds? After all, what step 2 guarantees is that samples from the Markov chain eventually come from \(\pi\), not necessarily from the target distribution \(p_0\).
The proof of \(\pi(x) = p_0(x)\) therefore seemed to rest on an assumption that had not been shown to hold, leaving it logically disconnected from steps 1-3 of the overall argument. This led me to question whether it actually proves that “the Markov chain generated by the M-H algorithm can approximate the target distribution \(p_0\)”.
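For context, the core computation in Hoff’s step-4 proof, condensed here to the discrete-state case in my own paraphrase, runs as follows. Under the assumption \(x^{(s)} \sim p_0\), one first shows that the joint distribution of two successive states is symmetric,
\[ \mathrm{Pr}(x^{(s)} = x_a, x^{(s+1)} = x_b) = \mathrm{Pr}(x^{(s)} = x_b, x^{(s+1)} = x_a), \]
and then marginalizes over the first coordinate:
\[ \mathrm{Pr}(x^{(s+1)} = x) = \sum_{x_a} \mathrm{Pr}(x^{(s)} = x_a, x^{(s+1)} = x) = \sum_{x_a} \mathrm{Pr}(x^{(s)} = x, x^{(s+1)} = x_a) = \mathrm{Pr}(x^{(s)} = x) = p_0(x). \]
In other words, what step 4 establishes is exactly the property in step 3 with \(\pi\) replaced by \(p_0\).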
How I Came to Understand
The answer was simple: the confusion resolved once I read the property of the stationary distribution in step 3 as its definition.
Initially, I had read the property in step 3 as merely a necessary condition for being a stationary distribution, thinking the converse need not hold, and that was the source of the confusion. Incidentally, the textbook introduces it as a property, as follows:
The distribution \(\pi\) is called the stationary distribution of the Markov chain. This name comes from the following property: If \(x^{(s)} \sim \pi\), and \(x^{(s+1)}\) is generated from the Markov chain starting at \(x^{(s)}\), then \(\mathrm{Pr}(x^{(s+1)} \in A) = \pi(A)\) holds.
This wording implies that if it’s a stationary distribution, it satisfies this property, but it doesn’t explicitly state that satisfying this property makes it a stationary distribution, right?
So, how is a stationary distribution defined in the first place? For example, in 中妻照雄 (2019) and 古澄 (2015), the stationary distribution of a Markov chain is defined as follows:
The stationary distribution of a Markov chain is a distribution \(\bar{f}\) that satisfies \[ \bar{f} = \bar{f} \circ K \] for the transition kernel \(K\).
If we read the statement in step 3 as this definition expressed without the transition kernel, then Hoff’s proof is consistent: step 4 shows that \(p_0\) satisfies the defining property of a stationary distribution, and the uniqueness guaranteed in step 2 then forces \(\pi = p_0\). This cleared things up for me.
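Written out without the composition notation (my own rendering), this definition says
\[ \bar{f}(A) = \int K(x, A) \, \bar{f}(x) \, dx \quad \text{for every set } A, \]
where \(K(x, A)\) is the probability of moving from \(x\) into \(A\) in one step. With \(K\) the M-H transition kernel and \(\bar{f} = \pi\), this is precisely the statement in step 3: if \(x^{(s)} \sim \pi\), then \(\mathrm{Pr}(x^{(s+1)} \in A) = \pi(A)\).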
How Other Textbooks Prove It
Regarding the proof that the Markov chain generated by the M-H algorithm can approximate the target distribution \(p_0\), I found the proof in 中妻照雄 (2019) (by the esteemed head of our Keio Economics Research Department) to be the clearest. Omitting the detailed proof, it proceeds as follows:
- The transition kernel of the Markov chain generated by the M-H algorithm satisfies the detailed balance condition with respect to the target distribution \(p_0\).
- Any distribution that satisfies the detailed balance condition with the transition kernel is an invariant (stationary) distribution of the Markov chain (see the sketch below).
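In my own notation (a density-level sketch rather than a full measure-theoretic statement), these two steps look like this. The detailed balance condition for the M-H kernel \(K\) and target \(p_0\) is
\[ p_0(x) K(x, y) = p_0(y) K(y, x) \quad \text{for all } x, y, \]
and integrating both sides over \(x\) gives
\[ \int p_0(x) K(x, y) \, dx = p_0(y) \int K(y, x) \, dx = p_0(y), \]
so \(p_0\) is an invariant distribution of the chain.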
Although Hoff (2009) doesn’t mention the detailed balance condition, I personally found the proof in 中妻照雄 (2019) more direct and intuitive.
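As a quick sanity check (my own construction, not from either textbook), one can verify detailed balance and invariance numerically for a small discrete state space:

```python
import numpy as np

# Build the M-H transition matrix for a 3-state target with a symmetric
# uniform proposal, then check detailed balance and invariance.
p0 = np.array([0.2, 0.3, 0.5])   # target distribution
J = np.full((3, 3), 1.0 / 3.0)   # symmetric proposal

K = np.zeros((3, 3))
for a in range(3):
    for b in range(3):
        if a != b:
            K[a, b] = J[a, b] * min(1.0, p0[b] / p0[a])  # propose, then accept
    K[a, a] = 1.0 - K[a].sum()   # rejected moves keep the chain in place

# Detailed balance: p0(a) K(a,b) == p0(b) K(b,a) for all a, b.
assert np.allclose(p0[:, None] * K, (p0[:, None] * K).T)
# Invariance follows: p0 K == p0.
assert np.allclose(p0 @ K, p0)
print("detailed balance and invariance hold")
```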
Summary
In this article, I discussed the proof of why the M-H algorithm works as presented in Hoff (2009) and Hoff et al. (2022). I introduced the flow of the proof in the textbook and mentioned the point where I got stuck. I also introduced 中妻照雄 (2019) as an alternative proof method.
If there are any errors in the claims made in this article, I would be very grateful if you could contact me. Thank you for reading to the end.