[Paper Review] Predictive variational inference: Learn the predictively optimal posterior distribution

This post summarizes my reading of the 2024 arXiv paper by Lai, Linero, and Yao,
“Predictive variational inference: Learn the predictively optimal posterior distribution”.
The version referenced here is v3, released on March 30, 2026.

The connections are easier to follow if you first read my earlier posts on Variational Inference
and Generalized Variational Inference.

Introduction

The core message of this paper is very clear.

If the model may be wrong, approximating the exact Bayes posterior well may not be the goal in itself.

Standard VI usually solves

\[q^\star(\theta)= \arg\min_{q\in\mathcal F} \mathrm{KL}\big(q(\theta)\,\|\,p(\theta\mid y)\big)\]

to approximate the Bayesian posterior.

The authors, however, raise a few questions.

Is parameter recovery what we really want?
Or is it a good posterior predictive distribution?
And if the model is misspecified, isn't the exact posterior itself already the wrong target?

Predictive Variational Inference (PVI) starts from these questions.

Instead of matching the posterior itself, PVI learns the $q_\phi(\theta)$ whose posterior predictive distribution

\[q_\phi^Y(\tilde y)= \int_\Theta p(\tilde y\mid \theta)\,q_\phi(\theta)\,d\theta\]

is as close as possible to the true data-generating process.

In short, if VI is closer to

“posterior approximation”,

then PVI is closer to

“predictive distribution optimization”.

How does it differ from standard VI?

VI targets the posterior, PVI targets the predictive

The ELBO of standard VI is

\[\mathrm{ELBO}(q)= \mathbb E_q\left[\sum_{i=1}^n \log p(y_i\mid\theta)\right] -\mathrm{KL}\big(q(\theta)\,\|\,p^{\mathrm{prior}}(\theta)\big).\]

PVI, in contrast, fixes a suitable scoring rule $S$ and then solves

\[\phi^\star= \arg\max_{\phi\in\Phi} \left\{ \sum_{i=1}^n S\big(q_\phi^Y, y_i\big)- \lambda r(\phi) \right\}\]

where $r(\phi)$ is a regularizer and $\lambda \ge 0$ is its weight.

The key point is where the logarithm sits.

VI:
\[\mathbb E_q[\log p(y_i\mid\theta)]\]
Log-score version of PVI:
\[\log \int p(y_i\mid\theta)\,q_\phi(\theta)\,d\theta\]

That is,

VI looks at the likelihood fit of each individual $\theta$, averaged over the posterior
PVI looks at the fit of the predictive distribution obtained after mixing over $\theta$

This is not merely a difference in where a symbol is placed; the quantity being optimized is itself different.
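A one-line worked inequality makes the gap between the two expressions concrete. Because the logarithm is concave, Jensen's inequality gives, for any $q_\phi$ and any observation $y_i$,

\[\log \int p(y_i\mid\theta)\,q_\phi(\theta)\,d\theta \;\ge\; \int q_\phi(\theta)\,\log p(y_i\mid\theta)\,d\theta = \mathbb E_{q_\phi}\big[\log p(y_i\mid\theta)\big],\]

with equality when $p(y_i\mid\theta)$ is constant over the support of $q_\phi$, for example when $q_\phi$ is a point mass. The VI data-fit term is therefore a lower bound on the PVI log score, and a $q_\phi$ that spreads out can raise the PVI term while lowering the VI term.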

Aspect	Standard VI	PVI
Goal	Approximate $p(\theta\mid y)$	Optimize $q_\phi^Y(\tilde y)$
Space	parameter space	outcome space
Criterion	posterior divergence / ELBO	predictive fit based on a scoring rule
Under misspecification	The predictive can be poor even when the exact posterior is approximated well	Targets predictive quality directly

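Defining predictive optimality with a proper scoring rule

PVI defines a “good prediction” through a proper scoring rule.

The basic idea used in the paper is as follows.

Write $S(P,\tilde y)$ for the score of a probabilistic predictive distribution $P$ at outcome $\tilde y$, and $S(P, p_{\mathrm{true}}) = \mathbb E_{\tilde y\sim p_{\mathrm{true}}}[S(P,\tilde y)]$ for its expected score under the true distribution. A scoring rule is proper if this expected score is maximized at $P = p_{\mathrm{true}}$.

Using this, we can define a divergence on the outcome space:

\[D_S(P, p_{\mathrm{true}})= S(p_{\mathrm{true}}, p_{\mathrm{true}})- S(P, p_{\mathrm{true}})\]

So, instead of

matching the posterior in parameter space,
PVI matches the predictive distribution in outcome space.

The paper covers four representative scores.

1. Logarithmic score

\[S(P, y)=\log P(y)\]

Here $P(y)$ denotes the density (or mass) of $P$ at $y$. This is the most standard score.
With it, PVI directly optimizes the predictive log-likelihood.
It connects to the KL view, but it is still different from approximating the Bayes posterior.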
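To spell out the KL connection, read $S(P, p_{\mathrm{true}})$ as the expected score $\mathbb E_{\tilde y\sim p_{\mathrm{true}}}[S(P,\tilde y)]$ and plug the log score into the divergence $D_S$ defined above:

\[D_{\log}(P, p_{\mathrm{true}}) = \mathbb E_{\tilde y\sim p_{\mathrm{true}}}\big[\log p_{\mathrm{true}}(\tilde y)\big] - \mathbb E_{\tilde y\sim p_{\mathrm{true}}}\big[\log P(\tilde y)\big] = \mathrm{KL}\big(p_{\mathrm{true}}\,\|\,P\big).\]

Maximizing the expected log score of $q_\phi^Y$ is therefore the same as minimizing $\mathrm{KL}(p_{\mathrm{true}}\,\|\,q_\phi^Y)$, a KL divergence on the outcome space between the truth and the predictive. Standard VI instead minimizes $\mathrm{KL}(q\,\|\,p(\theta\mid y))$ on the parameter space, which is a different object with a different target.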

2. Quadratic score / Brier score

Suited to categorical outcomes.
It can be more sensitive to differences in calibration than the log score.
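As a small illustration, the quadratic score for a single categorical outcome can be sketched as below. The function name `brier_score` and the sign convention (positively oriented, so that larger is better and it fits the maximization form of the PVI objective) are choices made for this sketch, not something taken from the paper.

```python
import numpy as np

def brier_score(probs, y):
    """Positively oriented quadratic (Brier) score for one categorical outcome.

    probs: predictive probabilities over K classes (should sum to 1).
    y: observed class index in {0, ..., K-1}.
    Returns -(sum_k (probs[k] - 1[y == k])**2), so 0 is the best possible value.
    """
    probs = np.asarray(probs, dtype=float)
    onehot = np.zeros_like(probs)
    onehot[y] = 1.0
    return -float(np.sum((probs - onehot) ** 2))

# The observed class is 0 in all three cases.
confident_right = brier_score([0.9, 0.1], 0)
hedged = brier_score([0.5, 0.5], 0)
confident_wrong = brier_score([0.1, 0.9], 0)
print(confident_right, hedged, confident_wrong)
```

Unlike the log score, this score stays bounded even when the predictive puts almost no mass on the observed class.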

3. Interval score

Looks at the sharpness and the coverage of a prediction interval together.
It evaluates whether the interval is narrow while the data still fall inside it.
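The sharpness-plus-coverage trade-off can be made concrete with the usual interval score for a central $(1-\alpha)$ prediction interval: the interval width plus a penalty of $2/\alpha$ times the distance by which the observation misses the interval. The sketch below assumes this standard form; the function name is illustrative, and the score is negatively oriented (smaller is better), so it would be negated before being used in a maximization objective.

```python
def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) prediction interval [lower, upper].

    Negatively oriented: width of the interval plus a miss penalty scaled by 2 / alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    width = upper - lower
    miss_below = max(lower - y, 0.0)   # how far y falls below the interval
    miss_above = max(y - upper, 0.0)   # how far y falls above the interval
    return width + (2.0 / alpha) * (miss_below + miss_above)

inside = interval_score(0.0, 1.0, 0.5, alpha=0.1)    # covered: only the width counts
outside = interval_score(0.0, 1.0, 2.0, alpha=0.1)   # missed by 1.0: heavy penalty
print(inside, outside)
```

A narrow interval is rewarded only as long as it still covers the data, which is exactly the balance described above.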

4. CRPS

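The continuous ranked probability score (CRPS) assesses the calibration of the whole predictive distribution.
It matters especially in the paper because it can be used with posterior predictive samples alone, even when the likelihood cannot be evaluated.
This is why PVI connects directly to simulation-based inference (SBI) and likelihood-free settings.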
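The reason samples are enough is visible in the kernel form of the CRPS. For a predictive distribution $P$ with finite first moment and independent draws $X, X' \sim P$,

\[\mathrm{CRPS}(P, y) = \mathbb E_{X\sim P}\,|X - y| - \tfrac12\,\mathbb E_{X,X'\sim P}\,|X - X'|.\]

Both expectations can be estimated by averaging over simulated draws, so no density evaluation is needed. In this form the CRPS is negatively oriented (smaller is better), so $S = -\mathrm{CRPS}$ is what would enter an objective that is maximized.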

Regularization: pull toward the prior, or toward the posterior

The paper adds a regularization term to the PVI objective.

There are two choices.

\[r^{\mathrm{prior}}(\phi)= \mathrm{KL}\big(q_\phi(\theta)\,\|\,p^{\mathrm{prior}}(\theta)\big)\]
\[r^{\mathrm{post}}(\phi)= \mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta\mid y)\big)\]

They can be read as follows.

Prior regularization
- Similar to the prior regularization seen in standard VI.
- It looks at predictive fit while keeping $q_\phi$ from drifting too far from the prior.
Posterior regularization
- A regularizer that pulls toward the exact posterior.
- As $\lambda$ grows, PVI moves closer and closer to Bayes/VI.

So, under posterior regularization, PVI can be understood as

pure predictive optimization when $\lambda=0$
moving toward Bayes as $\lambda\to\infty$

What makes this appealing is that PVI can then be seen not as “a method that abandons Bayes altogether” but as

a way to trade off predictive robustness against Bayesian regularization.
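The effect of $\lambda$ can be sketched on a toy conjugate problem. Everything below is an assumption made for illustration: the model $y\mid\theta\sim\mathcal N(\theta,1)$ with a hypothetical prior $\theta\sim\mathcal N(0,\tau^2)$, data drawn with standard deviation 2, a Gaussian $q$ whose mean is held at the exact posterior mean, the log score, posterior regularization, and a plain grid search over the standard deviation of $q$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.normal(0.0, 2.0, size=n)       # data have sd 2, the model assumes sd 1

# Exact conjugate posterior for y | theta ~ N(theta, 1), theta ~ N(0, tau^2).
tau = 10.0                             # hypothetical prior sd
post_var = 1.0 / (n + 1.0 / tau**2)
post_mean = post_var * y.sum()

def log_score_sum(m, s):
    """Sum of log predictive densities; with q = N(m, s^2) the predictive is N(m, 1 + s^2)."""
    v = 1.0 + s**2
    return np.sum(-0.5 * np.log(2 * np.pi * v) - 0.5 * (y - m) ** 2 / v)

def kl_to_posterior(m, s):
    """Closed-form KL( N(m, s^2) || N(post_mean, post_var) )."""
    return (0.5 * np.log(post_var / s**2)
            + (s**2 + (m - post_mean) ** 2) / (2 * post_var) - 0.5)

def fit_sd(lam, grid=np.linspace(0.01, 3.0, 600)):
    """Grid search over the sd of q for a given regularization weight lam."""
    values = [log_score_sum(post_mean, s) - lam * kl_to_posterior(post_mean, s)
              for s in grid]
    return float(grid[int(np.argmax(values))])

sd_by_lambda = {lam: fit_sd(lam) for lam in (0.0, 1.0, 100.0)}
print(sd_by_lambda)
```

In this sketch the fitted standard deviation is widest at $\lambda=0$ and shrinks toward the exact posterior's as $\lambda$ grows, which is the trade-off between predictive fit and Bayesian regularization described above.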

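The most important interpretation: PVI is an implicit hierarchical expansion

The most striking part of the paper is that it reads the non-vanishing variance of the PVI posterior not as “VI approximation error” but as “parameter heterogeneity”.

Normal example

The paper gives a very simple example.

Suppose the model is

\[y\mid \theta \sim \mathcal N(\theta, 1)\]

while the true data-generating process is

\[y \sim \mathcal N(0, 2)\]

Here the second argument of $\mathcal N$ is read as a standard deviation; the numbers below are consistent only under that reading.

As the data grow, the exact Bayes posterior converges to a point mass near $\theta=0$. That is, regular Bayes keeps to the end the complete-pooling view that

“there is one fixed $\theta$”.

PVI behaves differently. According to the paper, in this case the PVI posterior converges to

\[q(\theta)\to \mathcal N(0,\sqrt{3})\]

because mixing over this distribution yields the posterior predictive

\[\tilde y \sim \mathcal N(0, 2)\]

which reproduces the truth exactly: the variances add, $1 + 3 = 4 = 2^2$.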
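The limit can be checked numerically under a few simplifying assumptions: the log score, no regularization ($\lambda=0$), and $q$ restricted to a Gaussian $\mathcal N(m, s^2)$. The predictive is then $\mathcal N(m, 1+s^2)$, so maximizing the summed log score is an ordinary Gaussian maximum-likelihood fit, and the optimum is available in closed form. The function name is illustrative.

```python
import numpy as np

def pvi_gaussian_fit(y):
    """Maximize sum_i log N(y_i; m, 1 + s^2) over (m, s) in closed form.

    The Gaussian MLE sets m to the sample mean and 1 + s^2 to the sample variance,
    so s^2 = var(y) - 1, clipped at zero because a variance cannot be negative.
    """
    m = float(np.mean(y))
    s2 = max(float(np.var(y)) - 1.0, 0.0)
    return m, float(np.sqrt(s2))

rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=2.0, size=100_000)   # true process: mean 0, sd 2
m_hat, s_hat = pvi_gaussian_fit(y)
print(f"m = {m_hat:.3f}, s = {s_hat:.3f}, sqrt(3) = {np.sqrt(3):.3f}")
```

The fitted standard deviation of $q$ settles near $\sqrt 3$ rather than shrinking to zero, however large the sample is.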

So PVI's reading is

“the model is wrong, so the uncertainty should not go to zero”
“rather, this variance may be the variance that explains population-level heterogeneity”

From this viewpoint the authors see PVI as

an implicit hierarchical expansion that learns a population distribution of parameters
without writing down an explicit hierarchical Bayes model.

This is quite a strong interpretation.

standard Bayes: converges to a single parameter value
PVI: converges, when needed, to a distribution over parameters

Under misspecification, then, posterior variance is not “an approximation artifact that should vanish” but “a signal that should remain”.

Heterogeneity Detection

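Put to practical use, this interpretation turns the variance of the PVI posterior into a model-checking tool.

The paper's argument is roughly this.

If the model is well specified, PVI also eventually converges to a narrow posterior
If the model is wrong, or some parameter truly varies across the population, the PVI posterior can stay wide

So

a large PVI variance suggests that the parameter may need to be expanded into a varying effect or a hierarchical effect.

The paper also acknowledges the limits of this.

In finite samples it is hard to tell whether a large variance really comes from misspecification or from PVI's slow rate of convergence.
Heterogeneity detection is therefore a good heuristic, but treating it as an automatic verdict is risky.

Even so, the message “if the posterior stays wide, make the model more flexible” is quite useful in practical modeling.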
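The finite-sample caveat can be seen in a small simulation. The sketch assumes the toy model $y\mid\theta\sim\mathcal N(\theta,1)$, the log score with no regularization, and a Gaussian $q$, for which the fitted standard deviation of $q$ is $\sqrt{\max(\widehat{\mathrm{var}}(y)-1,\,0)}$. It compares well-specified data (sd 1) with misspecified data (sd 2) at two sample sizes; the names are illustrative.

```python
import numpy as np

def pvi_sd(y):
    """Fitted sd of a Gaussian q under the log score with lambda = 0, model N(theta, 1)."""
    return float(np.sqrt(max(float(np.var(y)) - 1.0, 0.0)))

rng = np.random.default_rng(2)
reps = 200
summary = {}
for n in (50, 5000):
    well = [pvi_sd(rng.normal(0.0, 1.0, size=n)) for _ in range(reps)]  # model is correct
    mis = [pvi_sd(rng.normal(0.0, 2.0, size=n)) for _ in range(reps)]   # model is wrong
    summary[n] = {
        "well_q90": float(np.quantile(well, 0.9)),  # upper range when nothing is wrong
        "mis_q10": float(np.quantile(mis, 0.1)),    # lower range under misspecification
    }
    print(n, summary[n])
```

With a small sample, the well-specified fit can still show a clearly non-zero spread purely from sampling noise, and it only tightens as the sample grows. That is why a wide PVI posterior is better read as a prompt to investigate than as proof of heterogeneity.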

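What is new on the implementation side?

The idea of PVI is simple, but computing gradients is not easy in practice.

With the log score in particular, one has to optimize

\[\log \int p(y_i\mid\theta)\,q_\phi(\theta)\,d\theta\]

directly, so the Monte Carlo approximation ends up inside the logarithm and creates a bias problem.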
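The bias is easy to reproduce. The sketch below uses the Gaussian toy model $y\mid\theta\sim\mathcal N(\theta,1)$ with $q=\mathcal N(m,s^2)$, for which the exact log predictive density is $\log\mathcal N(y; m, 1+s^2)$, and compares it with the average of the naive estimator $\log\big(\tfrac1S\sum_{j} p(y\mid\theta_j)\big)$ for several numbers of draws $S$. The specific numbers and function names are illustrative assumptions.

```python
import numpy as np

def normal_logpdf(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def naive_log_predictive(y, m, s, num_draws, rng):
    """log of the Monte Carlo average of p(y | theta_j), theta_j ~ N(m, s^2)."""
    theta = rng.normal(m, s, size=num_draws)
    logp = normal_logpdf(y, theta, 1.0)
    # log-mean-exp, computed stably with the logaddexp ufunc
    return float(np.logaddexp.reduce(logp) - np.log(num_draws))

rng = np.random.default_rng(3)
y_obs, m, s = 3.0, 0.0, np.sqrt(3.0)
exact = float(normal_logpdf(y_obs, m, 1.0 + s**2))

def average_estimate(num_draws, reps=2000):
    """Average the naive estimator over many independent repetitions."""
    return float(np.mean([naive_log_predictive(y_obs, m, s, num_draws, rng)
                          for _ in range(reps)]))

avg_1 = average_estimate(1)
avg_10 = average_estimate(10)
avg_1000 = average_estimate(1000)
print(f"exact={exact:.3f}  S=1: {avg_1:.3f}  S=10: {avg_10:.3f}  S=1000: {avg_1000:.3f}")
```

The estimator is biased downward, as Jensen's inequality predicts, and with $S=1$ its expectation is the VI-style term $\mathbb E_q[\log p(y\mid\theta)]$. The bias shrinks as $S$ grows but never vanishes for finite $S$, which is the motivation for a dedicated gradient estimator.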

The paper makes two points here.

1. Unbiased gradient for the log score via rejection sampling

To deal with the bias that arises with a finite minibatch, the authors propose an unbiased gradient estimator based on rejection sampling.

This matters because the paper is not only about “the philosophy of PVI”; it is

a paper that also designs the gradient estimator needed to make stochastic optimization actually work.
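One way to see how rejection sampling can produce an unbiased gradient is the identity $\nabla_\phi \log\int p(y\mid\theta)\,q_\phi(\theta)\,d\theta = \mathbb E_{\theta\sim \tilde q}\big[\nabla_\phi \log q_\phi(\theta)\big]$, where $\tilde q(\theta)\propto p(y\mid\theta)\,q_\phi(\theta)$. Exact draws from $\tilde q$ can be obtained by proposing from $q_\phi$ and accepting with probability $p(y\mid\theta)/\sup_\theta p(y\mid\theta)$. The sketch below applies this to the Gaussian toy model; it is a construction consistent with the description above, and whether it matches the paper's estimator in detail is an assumption.

```python
import numpy as np

def rejection_gradient(y, m, s, num_proposals, rng):
    """Gradient of log int N(y; theta, 1) N(theta; m, s^2) dtheta w.r.t. (m, s)."""
    theta = rng.normal(m, s, size=num_proposals)      # proposals from q
    accept_prob = np.exp(-0.5 * (y - theta) ** 2)     # p(y|theta) / max_theta p(y|theta)
    keep = theta[rng.uniform(size=num_proposals) < accept_prob]
    # Accepted draws follow the tilted density proportional to p(y|theta) q(theta);
    # averaging the score function of q over them estimates the gradient.
    grad_m = np.mean((keep - m) / s**2)
    grad_s = np.mean(-1.0 / s + (keep - m) ** 2 / s**3)
    return float(grad_m), float(grad_s)

def exact_gradient(y, m, s):
    """Gradient of log N(y; m, 1 + s^2), available in closed form for this toy model."""
    v = 1.0 + s**2
    return (y - m) / v, -s / v + s * (y - m) ** 2 / v**2

rng = np.random.default_rng(5)
estimate = rejection_gradient(1.0, 0.0, 1.5, 200_000, rng)
reference = exact_gradient(1.0, 0.0, 1.5)
print(estimate, reference)
```

No logarithm of a Monte Carlo average appears anywhere, which is what removes the bias; the cost is that the acceptance rate drops when $q_\phi$ places little mass where the likelihood is high.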

2. Likelihood-free optimization with CRPS

CRPS needs only simulation, not density evaluation, so it applies directly to likelihood-free models.

That is, one can

draw posterior samples $\theta \sim q_\phi(\theta)$,
pass them through the simulator to generate $y^{\mathrm{sim}} \sim p(y\mid\theta)$, and
train to reduce the discrepancy between the observed data and the simulations.

This connects naturally to ABC and neural SBI problems as well.
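The three steps above can be sketched end to end on a toy problem. Everything here is a stand-in chosen for illustration: the `simulator` is simply $y=\theta+\text{noise}$ and is treated as a black box, $q_\phi$ is $\mathcal N(0, s^2)$ with only $s$ learned, the CRPS is estimated from samples, and a grid search with common random numbers replaces gradient-based training.

```python
import numpy as np

def sample_crps(sim, obs):
    """Sample-based CRPS, E|X - y| - 0.5 E|X - X'|, averaged over the observations."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    m = sim.size
    term_obs = np.mean(np.abs(sim[:, None] - obs[None, :]))
    pair = np.abs(sim[:, None] - sim[None, :])
    term_pair = pair.sum() / (m * (m - 1))      # average over distinct pairs
    return float(term_obs - 0.5 * term_pair)

def simulator(theta, noise):
    """Toy stand-in for a black-box simulator: only forward simulation is used."""
    return theta + noise

rng = np.random.default_rng(4)
obs = rng.normal(0.0, 2.0, size=2000)           # observed data, sd 2

# Common random numbers keep the loss smooth as a function of s.
z = rng.normal(size=500)
noise = rng.normal(size=500)

grid = np.linspace(0.0, 3.5, 36)                # candidate sds for q = N(0, s^2)
losses = [sample_crps(simulator(s * z, noise), obs) for s in grid]
best_sd = float(grid[int(np.argmin(losses))])
print(f"best sd of q: {best_sd:.2f}  (sqrt(3) = {np.sqrt(3):.2f})")
```

The selected standard deviation should land near $\sqrt 3$, the value that makes the simulated predictive match data with standard deviation 2, and no likelihood is evaluated at any point.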

Experiments

The paper presents three representative experiments.

1. Golf putting: misspecification detection

On the golf putting data of Gelman and Nolan (2002), a logistic regression is compared with a geometric model.

Under the misspecified logistic regression, the Bayes posterior is too narrow and settles on the wrong predictive curve.
PVI widens the posterior and improves the predictive fit.
Conversely, when the model fits well, as with the geometric model, Bayes and PVI reach almost the same conclusion.

In other words, this example shows that

a PVI posterior that stays wide is itself a misspecification signal.

2. U.S. election analysis: model expansion guide

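In a multilevel logistic regression example on U.S. presidential election turnout data, PVI variance is used to diagnose which interactions are additionally needed.

VI keeps concentrating even under the misspecified model.
PVI shows that the variance stays larger for particular states.
The authors take this as grounds for reading “the coefficients for these states should vary across the population”.

So beyond simply improving predictions, PVI serves as

a diagnostic tool that indicates in which direction to expand a hierarchical model.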

3. CryoEM: likelihood-free inference

This is the strongest part of the paper on the applied side.

In cryoEM,

a simulator is available, but
the likelihood is intractable, and
the molecular structure itself is heterogeneous within the population.

In this setting the exact Bayes posterior can collapse to an overconfident point estimate as the number of samples grows.

PVI-CRPS, by contrast,

without computing a likelihood,
using only the simulator,
directly learns the population distribution of the molecule's opening angle.

So in this example $q_\phi(\theta)$ is not a mere variational artifact; it is interpreted as a physically meaningful population distribution.

Difference from Generalized VI

The paper also compares PVI with existing generalized Bayes / generalized VI.

On the surface the two look alike in that both

change the loss,
change the divergence, and
generalize Bayes.

But the paper's claim is clear.

PVI is a framework that differs from generalized VI “in spirit”.

This is because generalized VI still defines the posterior around

a loss on the $(y,\theta)$ space
a divergence on the $\theta$ space

whereas PVI builds its objective around

the posterior predictive distribution
an outcome-space scoring rule

In particular, even with the log score,

generalized VI: $\mathbb E_q[\log p(y\mid\theta)]$
PVI: $\log \int p(y\mid\theta)q(\theta)d\theta$

so the two are not equal in general.

Because of this difference, PVI reads more naturally as

continuous stacking, or
continuous predictive model averaging.

Limitations

The limitations the paper acknowledges are also clear.

The choice of score matters.
- Results change depending on which predictive property is considered important.
- The log score, interval score, and CRPS serve different purposes.
It leans on an IID / exchangeable assumption.
- The paper's development basically rests on a conditionally IID setting.
- Time series, dependent data, and structured data need further extension.
Finite-sample interpretation still calls for care.
- It can be ambiguous whether a large posterior variance should be attributed to misspecification or to slow convergence.
It is closer to predictive truth than to parameter truth.
- The interpretation should therefore center not on “is this posterior the true posterior?” but on “does this posterior help produce a good predictive distribution?”

Summary

To me, the paper comes down to four points.

PVI is predictive optimization, not Bayes posterior approximation.
- Miss this and you miss most of the paper.
Under misspecification, a posterior variance that does not shrink can actually be an advantage.
- This variance reveals latent heterogeneity or the need for model expansion.
PVI can be read as implicit hierarchical Bayes.
- The view is to learn a “parameter population” instead of “a single parameter”.
Thanks to CRPS, it extends naturally to likelihood-free inference.
- This is especially attractive in the SBI context.

Personally, I felt it is more accurate to read this paper not as

“a paper that rebuilds VI from the predictive scoring rule viewpoint”

but as

“a paper that moves the goal of Bayesian inference from posterior accuracy to predictive adequacy”.

In sum, PVI is less

yet another VI variant that approximates the exact posterior
and more a framework that asks again which uncertainty should remain in a misspecified world.

References

Lai, J., Linero, A., & Yao, Y. (2024; revised 2026-03-30). Predictive variational inference: Learn the predictively optimal posterior distribution. arXiv:2410.14843.

Junhee Kim
