Mixture of Experts (MoE)


This post summarizes the concept of Mixture of Experts (MoE) and explains how it differs from a traditional ensemble.

Introduction

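One powerful way to improve performance in deep learning and machine learning is to combine multiple models.
The best-known approach is the ensemble; another is Mixture of Experts (MoE).

Both "use multiple models," but how they work and what they aim for are quite different.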

Mixture of Experts (MoE)

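An MoE consists of two main parts.

Experts
- Several expert models, each learning a different representation.
- Example: $f_1(x), f_2(x), \dots, f_M(x)$
Gating Network
- A module that decides, given the input $x$, which experts to use.
- Usually applies a softmax to produce the weights.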

The final output is computed as follows.

\[y(x) = \sum_{i=1}^M g_i(x) \, f_i(x)\]

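$f_i(x)$ : output of the $i$-th expert
$g_i(x)$ : weight assigned by the gate (conditioned on the input), with $g_i(x) \ge 0$ and $\sum_{i=1}^M g_i(x) = 1$ under a softmax gate

In other words, the key property of MoE is that the contribution of each expert changes with the input. With a plain softmax gate every expert contributes to some degree; sparse MoE variants keep only the top-$k$ weights and set the rest to zero, so only $k$ experts are evaluated.

[Figure: Visualization of the MoE architecture. Source: Link]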
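The weighted sum $y(x) = \sum_i g_i(x) f_i(x)$ can be made concrete with a minimal sketch in plain Python. The three one-dimensional experts and the linear gate parameters below are made-up values for illustration only, not something taken from a trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 1-D experts f_i(x); each one is just a different line.
experts = [
    lambda x: 2.0 * x,         # f_1
    lambda x: -1.0 * x + 3.0,  # f_2
    lambda x: 0.5 * x - 1.0,   # f_3
]

# Hypothetical linear gate: one (slope, bias) pair per expert.
gate_params = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5)]

def gate(x):
    """g(x): input-dependent weights that are non-negative and sum to 1."""
    return softmax([a * x + b for a, b in gate_params])

def moe_output(x):
    """y(x) = sum_i g_i(x) * f_i(x)."""
    return sum(g * f(x) for g, f in zip(gate(x), experts))

# The gate weights shift from expert 2 to expert 1 as x grows.
for x in (-3.0, 0.0, 3.0):
    print(x, [round(g, 3) for g in gate(x)], round(moe_output(x), 3))
```

Because the weights form a convex combination, the output always lies between the smallest and largest expert prediction for that input.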

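Experts are not limited to neural networks

Because of the name, it is easy to think of MoE as "a structure with several neural networks," but in principle an expert can be any kind of model.

Statistical models: linear regression, logistic regression, Gaussian Process
Tree-based models: Decision Tree, Random Forest, Gradient Boosting
Classical ML methods: SVM, kNN, Naive Bayes
Rule-based modules can also serve as experts

What matters is the structure in which the gating network picks experts according to the input; the form of the expert itself is unconstrained. One practical caveat: updating the experts by backpropagation requires them to be differentiable, so experts such as trees or rules are typically fitted separately, with the gate trained on top of them.

📌 Note: an early MoE paper (Jordan & Jacobs, 1994) used simple (generalized) linear models as experts, whereas modern deep learning mostly uses neural-network experts.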
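To illustrate that the combination rule does not care what an expert is, here is a small sketch that mixes a rule, a fixed linear regression, and a 1-nearest-neighbour lookup under one softmax gate. All names, the tiny dataset, and the gate parameters are hypothetical and chosen only for the example.

```python
import math

def rule_expert(x):
    """Rule-based expert: a hand-written threshold rule."""
    return 0.0 if x < 0 else 10.0

def linear_expert(x):
    """Statistical expert: a fixed linear regression y = 1.5x + 2."""
    return 1.5 * x + 2.0

# Tiny memorised dataset for a 1-nearest-neighbour expert.
TRAIN = [(-2.0, -4.0), (0.0, 0.0), (2.0, 4.0)]

def knn_expert(x):
    """Classical ML expert: return the target of the nearest training point."""
    return min(TRAIN, key=lambda pair: abs(pair[0] - x))[1]

EXPERTS = [rule_expert, linear_expert, knn_expert]
GATE_W = [(2.0, 0.0), (0.0, 1.0), (-2.0, 0.0)]  # hypothetical (slope, bias) per expert

def gate(x):
    """Softmax over linear gate logits."""
    logits = [a * x + b for a, b in GATE_W]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def moe(x):
    # The combination rule never inspects what kind of model each expert is.
    return sum(g * f(x) for g, f in zip(gate(x), EXPERTS))

print(moe(5.0), moe(-5.0))
```

With these gate parameters, large positive inputs are handled almost entirely by the rule and large negative inputs by the nearest-neighbour lookup.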

Ensemble

An ensemble trains several models and, at prediction time, averages their outputs or takes a vote.

\[y(x) = \frac{1}{M} \sum_{i=1}^M f_i(x)\]

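Every input passes through every model in the same way.
There is no conditional selection such as a gate; the results are simply aggregated.

In short, an ensemble improves generalization by exploiting the diversity of its models.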
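As a point of comparison with the gated sum, the two aggregation rules mentioned here (averaging and voting) can be sketched in a few lines. The toy regressors and classifiers are hypothetical stand-ins for independently trained models.

```python
from collections import Counter

def average_ensemble(models, x):
    """Regression: y(x) = (1/M) * sum_i f_i(x)."""
    return sum(f(x) for f in models) / len(models)

def majority_vote(classifiers, x):
    """Classification: every model votes and the most common label wins."""
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical models that roughly agree but make slightly different errors.
regressors = [lambda x: 2.0 * x, lambda x: 2.0 * x + 0.6, lambda x: 2.0 * x - 0.3]
classifiers = [lambda x: int(x > 0.0), lambda x: int(x > 0.5), lambda x: int(x > -0.5)]

print(average_ensemble(regressors, 1.0))
print(majority_vote(classifiers, 0.2))
```

Note that both functions call every model on every input and use fixed, input-independent weights, which is exactly what a gate changes.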

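MoE vs Ensemble: Key Differences

Aspect	Ensemble	MoE
Structure	Multiple models + simple averaging/voting	Multiple experts + gating network
How experts are used	Every input goes through every model	Only a subset of experts is selected per input
Gate (conditional selection)	No	Yes
Compute cost	All $M$ models are evaluated → higher cost	Only the selected $k$ experts are evaluated ($k < M$) → lower cost
Goal	Stable predictions and generalization	Specialized handling of different regions of the data distribution
Expert form	Often similar architectures	Anything: neural networks, regression, trees, rules
Interpretability	Unclear which model contributed	The experts chosen by the gate can be inspected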

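Intuition

Ensemble: "the collective intelligence of many models"
MoE: "an organization in which a gate hands work to specialists"

That is, in an ensemble everyone always participates, whereas in MoE only the specialists suited to the situation participate.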

PyTorch Example (MLP Expert Ver.)

Below is a simple MoE implementation that selects experts depending on the input.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim)
        )
    def forward(self, x):
        return self.net(x)

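class MoE(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, n_experts=4, k=1):
        super().__init__()
        self.experts = nn.ModuleList([Expert(in_dim, hidden_dim, out_dim) for _ in range(n_experts)])
        self.gate = nn.Linear(in_dim, n_experts)
        self.out_dim = out_dim
        self.k = k

    def forward(self, x):
        gate_logits = self.gate(x)                                    # (batch, n_experts)
        topk_val, topk_idx = torch.topk(gate_logits, self.k, dim=-1)  # (batch, k)
        # Renormalize over the selected experts only.
        # Note: with k=1 this weight is always 1, so the task loss sends no gradient to the gate.
        weights = F.softmax(topk_val, dim=-1)

        out = x.new_zeros(x.size(0), self.out_dim)  # same device and dtype as x
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[:, slot] == e       # samples routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        # Also return the gate logits so a training loop can compute a balancing loss.
        return out, gate_logits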

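Implementation

Like an ordinary neural network, an MoE can be trained end to end.
That is, a task loss (e.g., CrossEntropy for classification, MSE for regression) is computed on the final output $y(x)$,
and backpropagation updates the gate and expert parameters together.

In practice, however, training raises a few issues, and several techniques are commonly used to address them.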

1. Task Loss

The basic MoE loss is the same as for an ordinary neural network.

\[\mathcal{L}_{task} = \ell(y(x), \; y_{\text{true}})\]

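2. The Hard Routing Problem

MoE usually selects only the top-$k$ experts. This discrete selection is not differentiable: gradients reach only the selected experts and their gate weights, while the decision of which experts to select receives no gradient, which makes it hard to train the router by backpropagation.

Common remedies:

Straight-Through Estimator (STE)
Start with soft routing → gradually switch to hard routing
Noisy gating (stochastic selection)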
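Of the remedies listed, noisy gating is the easiest to show in isolation. The sketch below follows one common formulation, assumed here for illustration: Gaussian noise scaled by a softplus is added to each gate logit, the top-$k$ noisy logits are kept, and a softmax is taken over the survivors. The function names and the example logits are hypothetical; in a real model both sets of logits would come from learned linear layers.

```python
import math
import random

def softplus(v):
    """log(1 + e^v); keeps the noise scale positive."""
    return math.log1p(math.exp(v))

def noisy_top_k_gate(clean_logits, noise_logits, k, rng):
    """Return sparse gate weights: zero outside the top-k noisy logits."""
    noisy = [c + rng.gauss(0.0, 1.0) * softplus(n)
             for c, n in zip(clean_logits, noise_logits)]
    # Indices of the k largest noisy logits.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    # Softmax restricted to the selected experts; the rest get weight 0.
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]

rng = random.Random(0)
clean = [1.0, 0.9, -2.0, 0.2]      # hypothetical gate outputs for one input
noise = [-1.0, -1.0, -1.0, -1.0]   # hypothetical noise-scale logits
counts = [0, 0, 0, 0]
for _ in range(1000):
    weights = noisy_top_k_gate(clean, noise, k=2, rng=rng)
    for i, w in enumerate(weights):
        counts[i] += int(w > 0.0)
print(counts)  # how often each expert was selected across 1000 noisy draws
```

The noise lets experts with slightly lower logits get selected some of the time, so they still receive training signal instead of being starved by a deterministic top-$k$.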

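3. Load Balancing Loss (Preventing Imbalance)

An auxiliary loss is added so that the gate does not route most inputs to only a few experts.

\[\mathcal{L}_{balance} = M \sum_{i=1}^M p_i \cdot f_i\]

$p_i$: mean gate probability of selecting expert $i$
$f_i$: fraction of inputs actually routed to expert $i$ (here $f_i$ is a usage fraction, not the expert output $f_i(x)$ defined earlier)

With perfectly uniform routing ($p_i = f_i = 1/M$) this loss equals 1, and it becomes larger as routing concentrates on fewer experts.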

The final loss combines the two terms, with a coefficient $\lambda$ controlling the strength of the balancing term.

\[\mathcal{L} = \mathcal{L}_{task} + \lambda \, \mathcal{L}_{balance}\]
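A small worked example may help make the balancing term concrete. The sketch below evaluates $M \sum_i p_i f_i$ in plain Python for two hand-written batches of gate probabilities, assuming top-1 routing ($k = 1$) so that $f_i$ is the fraction of rows whose largest probability is at index $i$. The batches and the default `lam=0.01` are illustrative choices.

```python
def load_balance_loss(gate_probs):
    """L_balance = M * sum_i p_i * f_i for a batch of gate probability rows."""
    n_tokens = len(gate_probs)
    n_experts = len(gate_probs[0])
    # p_i: mean gate probability assigned to expert i.
    p = [sum(row[i] for row in gate_probs) / n_tokens for i in range(n_experts)]
    # f_i: fraction of rows whose top-1 expert is i (assumes k = 1).
    f = [0.0] * n_experts
    for row in gate_probs:
        f[max(range(n_experts), key=lambda i: row[i])] += 1.0 / n_tokens
    return n_experts * sum(pi * fi for pi, fi in zip(p, f))

def total_loss(task_loss, balance_loss, lam=0.01):
    """L = L_task + lambda * L_balance."""
    return task_loss + lam * balance_loss

# Each input prefers a different expert: usage is uniform.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Every input prefers expert 0: routing has collapsed.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

print(load_balance_loss(balanced), load_balance_loss(collapsed))
```

For the balanced batch $p_i = f_i = 1/4$, so the loss is $4 \cdot 4 \cdot \tfrac{1}{16} = 1$; for the collapsed batch $f = (1, 0, 0, 0)$ and $p_1 = 0.7$, so the loss is $4 \cdot 0.7 = 2.8$ (up to floating-point rounding). The penalty therefore pushes the gate away from sending everything to one expert.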

4. PyTorch Training Loop Example

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Expert(nn.Module):
    ...  # same as in the example above
class MoE(nn.Module):
    ...  # same as in the example above; forward must return (out, gate_logits)

def load_balance_loss(gate_logits):
    probs = F.softmax(gate_logits, dim=-1)   # (B, M)
    p_i = probs.mean(dim=0)                  # mean gate probability per expert
    # Fraction of samples whose top-1 expert is i (matches k=1 routing; toy example)
    top1 = probs.argmax(dim=-1)
    f_i = F.one_hot(top1, num_classes=probs.size(1)).float().mean(dim=0)
    return (p_i * f_i).sum() * probs.size(1)

model = MoE(in_dim=10, hidden_dim=32, out_dim=2, n_experts=4, k=1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    # Random toy batch: 64 samples, 10 features, binary labels
    x = torch.randn(64, 10)
    y = torch.randint(0, 2, (64,))

    out, gate_logits = model(x)
    task_loss = criterion(out, y)
    lb_loss = load_balance_loss(gate_logits)

    # Total loss = task loss + lambda * balancing loss (lambda = 0.01)
    loss = task_loss + 0.01 * lb_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print(f"[Epoch {epoch+1}] Task Loss={task_loss.item():.4f}, LB Loss={lb_loss.item():.4f}")

Conclusion

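An ensemble is a technique that simply uses several models at once.
MoE is a technique that uses a gating network to select different experts for each input.
Experts need not be neural networks; statistical models and rule-based modules can also serve as experts.

As a result, a sparsely routed MoE can be more compute-efficient than a plain ensemble of the same size, and it is particularly effective on problems where the data distribution is diverse or the conditional structure is pronounced.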


Junhee Kim
