[22’ ICLR] Towards a Unified View of Parameter-Efficient Transfer Learning

Date: 2024.06.24 Updated: 2024.06.24

카테고리: Language

🔍 Abstract

ICLR 2022 Spotlight를 받은 PEFT 방법론에 대한 분석 논문이다. 이 논문에서는 Serial Adapter, Prefix Tuning, LoRA를 재구조화하여 체계적으로 분석하고, 이들의 장점을 분석하여 Parallel Adapter와 MAM Adapter라는 새로운 PEFT 방법을 제안한다. 저자들의 깊이 있는 분석이 돋보이는 논문이다.

1. Preliminaries

1.1. Transformer Architecture

들어가기에 앞서, Transformer의 구조를 다시 한 번 상기해보자. Transformer block은 (1) multi-head self-attention과 (2) FFN(Feed-Forward Network)으로 구성되어 있다. 일반적인 attention function을 다음과 같이 나타낼 수 있다.

이때 queries $Q \in \mathbb{R}^ {n \times d_ k}$이고, key-value pairs $K \in \mathbb{R}^ {m \times d_ k}$, $V \in \mathbb{R}^ {m \times d_ v}$이다. Multi-head attention의 경우 $N_ h$ heads로 나누어 attention을 수행한다. 각 head projection을 $W_ q ^ {(i)}, W_ k ^ {(i)}, W_ v ^ {(i)} \in \mathbb{R}^ {d \times d_ k}$로 나타내고, query vector $x \in \mathbb{R}^ {d}$, attention을 수행할 $m$개의 vector $C \in \mathbb{R}^ {m \times d}$라 하면, MHA(Multi-Head Attention)은 다음과 같이 나타낼 수 있다.

이때 $W_ o \in \mathbb{R}^ {d \times d}$이다. 이렇게 계산된 $d$-dimensional vector를 FFN에 통과시켜 Transformer block을 완성한다.

이때 $W_ 1 \in \mathbb{R}^ {d \times d_ m}$, $W_ 2 \in \mathbb{R}^ {d_ m \times d}$이고, 일반적으로 $d_ m = 4d$로 설정한다. 이후 Residual Connection과 Layer Normalization을 거쳐 하나의 module이 완성된다.

1.2. PEFT Methods

다음으로 Adapter, Prefix Tuning, LoRA의 수식을 정리하여 보자. 먼저 adapter의 경우 transformer layer 사이에 작은 module을 추가하는 것으로, down-projection $W_ {\text{down}} \in \mathbb{R}^ {d \times r}$과 nonlinear function $f$, up-projection $W_ {\text{up}} \in \mathbb{R}^ {r \times d}$로 구성된다. 이때 $r$은 adapter의 bottleneck dimension이다.

한편 prefix tuning은 기존 key, value에 prefix vector $P_ k, P_ v \in \mathbb{R}^ {l \times d}$를 추가하는 것으로, $l$ 또한 bottleneck dimension으로 볼 수 있으며 MHA를 다음과 같이 나타낼 수 있다. 이때 $P_ k ^ {(i)}, P_ v ^ {(i)} \in \mathbb{R}^ {l \times d / N_ h}$이다.

마지막으로 LoRA는 low-rank decomposition을 사용하여 $\Delta W$를 $W_ {\text{down}} W_ {\text{up}}$로 나타낼 수 있다. 이때 $W_ {\text{down}} \in \mathbb{R}^ {d \times r}$, $W_ {\text{up}} \in \mathbb{R}^ {r \times d}$이다. 또한, tunable scalar hyperparameter $s \geq 1$을 추가할 수 있다.

2. A Unified View

2.1. A Closer Look at Prefix Tuning

한편, prefix tuning의 식을 정리하여 adapter와 유사한 방식으로 정리할 수 있다. 먼저 prefix tuning을 기존 attention 부분과 prefix 부분을 나누어 보자.

이를 정리하면,

이때 $\lambda {x}$는 prefix에 대한 normalized attention weight으로, 일종의 gating coefficient로 볼 수 있다.

$W_ 1 = W_ q P_ k ^ T, W_ 2 = P_ v$라고 하고, $f$를 softmax function이라고 하면 다음과 같이 쓸 수 있고, 이는 adapter의 식과 유사하다.

또한 $W_ 1 \in \mathbb{R}^ {d_ h \times l}$, $W_ 2 \in \mathbb{R}^ {l \times d_ h}$이기에, $l$이 bottleneck dimension이라는 직관과 맞아떨어진다. 대신 prefix tuning은 adapter와 3가지 차이점을 가진다.

$\Delta h$를 계산하기 위해 Prefix Tuning은 PLM layer input인 $x$를 사용하는 반면, adapter는 PLM layer output인 $h$를 사용한다. 따라서 Prefix Tuning은 parallel computation으로 볼 수 있고, adapter는 sequential computation으로 볼 수 있다.
Adapter는 Prefix Tuning보다 여러 module에 쉽게 추가할 수 있어 flexible하다. 즉, Adapter는 Attention과 FFN에 모두 추가할 수 있지만, Prefix Tuning은 Attention에만 추가할 수 있다.
Adapter는 single-head인 반면, Prefix Tuning은 multi-head attention head에 모두 추가할 수 있기에 더 expressive하다. 즉, adapter의 full-rank update는 $r \geq d$여야 가능한 반면, Prefix Tuning은 $l \geq d / N_ h$이기만 해도 가능하다.

2.2. Transferring Design Elements

따라서, 저자들은 이러한 내용들을 정리하고 지금까지 찾아지지 않은 새로운 PEFT 방법론들을 제안하였다.

Parallel Adapter: Adapter의 구조와 Prefix Tuning의 Parellel Characteristic을 결합한 방법이다.
Multi-head Parallel Adapter: Parallel Adapter에서 더 Prefix Tuning에 가까워진 방법으로, multi-head attention head에 모두 추가할 수 있다.
Scaled Parallel Adapter: Parallel Adapter에서 LoRA의 scaling $s$를 추가한 방법이다.

지금까지의 내용을 표로 정리하면 다음과 같다.

3. Experiments

3.1. Systemic Ablation Study

따라서 저자들은 insertion form, modified representation, composition function에 대한 ablation study를 수행하였다. 각각의 결론을 보자.

Insertion Form: Sequential vs Parallel

Insertion Form의 경우 parallel이 sequential보다 성능이 좋았다.

Modified Representation: Attention vs FFN

Modified Representation의 경우 FFN이 Attention보다 성능이 좋았다. 그러나, parameter budget이 아주 적은 상황, 즉 0.1%의 parameter만 tunable한 경우에는 반대로 head attention modification이 효과적이었다.

Composition Function: Simple vs Scaled

Composition Function의 경우 scaled가 simple보다 성능이 좋았다.

3.2. MAM Adapter

Systemic Ablation Study를 통해 3가지 교훈을 얻을 수 있었다.

Scaled Parallel Adapter가 FFN modification에 가장 효과적이다.
FFN은 large capacity 상황에서 modification하는 것이 효과적이다.
Head attention은 limited capacity 상황에서 modification하는 것이 효과적이다.

따라서 저자들은 이를 모두 종합하여 MAM Adapter를 제안하였다. 즉, 적은 parameter로 head attention prefix tuning을 시행하고 ($l =30$), 대부분의 parameter budget은 scaled parallel adapter로 FFN modification을 시행하는 방식이다($r = 512$). 이처럼 장점만을 통합하여 제안한 MAM Adapter는 다른 PEFT 방법들보다 우수한 성능을 보였고, 가장 full fine-tuning에 근접한 성능을 보였다.

💡 Summary

PEFT에 대한 체계적인 분석과 정리가 돋보이는 논문이다. Adapter, Prefix Tuning, LoRA를 재구조화하여 Parallel Adapter와 MAM Adapter를 제안한 점이 인상적이다.

📃 Reference

[22’ ICLR] Towards a Unified View of Parameter-Efficient Transfer Learning

I'm rubatoyeong