Language Model Post List

Date: 2024.06.25 Updated: 2024.08.13

카테고리: Listup

📏 Alignment Tuning

1. RLHF

[19’] RLHF: Fine-Tuning Language Models from Human Preferences
- Human Preference Alignment를 위해 Reward Model을 학습하고, 이후에 Policy를 학습하는 RLHF 방법론을 처음으로 제시함
- RLHF 방법으로 PPO(Proximal Policy Optimization)을 사용하여 학습함
[23’ NIPS] Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- 훈련 과정이 복잡한 PPO 대신 RL 없이 더 성능이 좋고 더 잘 수렴하는 DPO(Direct Preference Optimization) 방법론을 제시함
[24’ ICML] Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
- 기존 논문들이 PPO보다 DPO를 선호하지만, 역설적으로 DPO가 PPO보다 OOD 데이터를 선호하여 성능이 떨어지는 문제를 발견함
- PPO의 성능을 높이는 몇 가지 테크닉을 제시하여 PPO가 DPO보다 우월함을 입증함

2. RLAIF

[22’] Constitutional AI: Harmlessness from AI Feedback
- RLAIF의 개념을 처음으로 제안함
[23’] RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
- Human Preference가 아닌 AI Preference를 사용해도 RLHF와 유사한 성능을 낼 수 있음
- Direct RLAIF라는 방법론을 제시하였으며, 이는 기존 RLHF Pipeline보다 간단하고 성능이 좋음
[24’ ICML] Self-Rewarding Language Models
- 더 이상 human preference를 사용하지 않고, LLM 자기 자신의 feedback만으로도 LLM을 학습시킬 수 있는 방법론을 제시함

🔥 Parameter-Efficient Model Adaptation

1. PEFT Method

[Summary] Brief Summary of Parameter-Efficient Fine-Tuning for Language Models
- 중요한 PEFT 방법인 Adapter Tuning, Prefix Tuning, Prompt Tuning, LoRA에 대해 간략하게 요약한 글
[19’ ICML] Parameter-Efficient Transfer Learning for NLP
- PEFT 방법으로는 거의 최초인 Serial Adapter를 처음으로 제안함
[22’ ICLR] Towards a Unified View of Parameter-Efficient Transfer Learning
- Adapter, Prefix Tuning, LoRA를 하나의 관점으로 통합하고 이를 통해 Parallel Adapter, MAM Adapter를 제안함
[21’ ACL] Prefix-Tuning: Optimizing Continuous Prompts for Generation
- Prefix sequence를 각 Transformer layer에 추가하는 Prefix Tuning을 제시함, Deep Prompt Tuning으로 해석되기도 함
[21’ EMNLP] The Power of Scale for Parameter-Efficient Prompt Tuning
- Input에 tunable soft prompt를 추가하는 Prompt Tuning을 제시함
[21’] GPT Understands, Too
- Prompt Tuning과 거의 유사한 방법으로, 이 논문을 통해 GPT가 NLU(Natural Language Understanding) task를 수행할 수 있음을 입증함
[22’ ACL] P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
- Multi-scale, Multi-task에서도 Prefix Tuning이 적용 가능하다는 것을 입증함
[22’ ICLR] LoRA: Low-Rank Adaptation of Large Language Models
- LLM Fine-tuning을 low-dimension reparameterization으로 수행할 수 있다는 것을 보이고, 이를 통해 LoRA(Low-Rank Adaptation) 방법론을 제시함
[24’] A Note on LoRA
- LoRA를 개발한 Microsoft에서 LoRA에 대한 추가적인 Insight를 제공하는 일종의 아티클
[24’] LoRA Learns Less and Forgets Less
- LoRA가 Full Fine-tuning에 비해 Performance는 떨어지지만 Regularization의 효과가 크다는 것을 보임
[24’ ICLR] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- LLaMA에 적용 가능한 Prefix Tuning Variant를 제시하였으며, 초반 학습에서 Zero-initialized Attention을 사용하는 것이 중요하다는 것을 보임
- Multimodal LLaMA-Adapter도 개발하여 Multimodal LLM의 일종으로 분류되기도 함

2. PEFT Analysis

[Summary] Analysis of Parameter-Efficient Fine-Tuning Techniques for Large Language Models
- 세 개의 체계적인 분석 논문을 바탕으로 PEFT 방법을 비교 및 분석한 글, 세 논문은 다음과 같음
- [23’ Nature Machine Intelligence] Parameter-efficient fine-tuning of large-scale pre-trained language models
- [23’ ICLR Workshop] Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs
- [23’ EMNLP] LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models

🤔 Chain-of-Thought

1. Chain-of-Thought

[22’ NIPS] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Chain-of-Thought 개념을 처음으로 제안함, 이를 Manual-CoT라고도 함
[22’ NIPS] Zero-shot-CoT: Large Language Models are Zero-Shot Reasoners
- “Let’s think step by step”이라는 prompt를 사용하여 LLM이 zero-shot으로 reasoning을 수행할 수 있음을 입증함, 이를 Zero-shot-CoT라고도 함
[23’ ICLR] Auto-CoT: Automatic Chain of Thought Prompting in Large Language Models
- Zero-shot-CoT를 사용해 Few-shot sample을 만든 뒤 Manual-CoT를 진행하여, 모든 CoT 과정을 자동화하는 Auto-CoT를 제시함

2. Others

[24’] Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding
- 하나의 LLM이 (1) 스스로 복잡한 task를 분할하여 각 expert에 할당하는 meta model과 (2) expert model의 역할을 모두 하는 Meta-Prompting을 제시함, 이는 집단지성과 유사한 방식

🕶 Mechanical Interpretability

1. Attribution Analysis

Token-wise Analysis

[19’ ACL-WS] What Does BERT Look At? An Analysis of BERT’s Attention
- BERT에서 처음으로 Attention Weight을 분석, 각 Head의 역할을 해석
- 불필요한 경우 [SEP]과 같은 deliminator token에 높은 Attention을 보이는 경향이 있다는 것을 밝힘; 이로부터 no-op hypothesis 제시
[20’ EMNLP] Attention is Not Only a Weight: Analyzing Transformers with Vector Norms
- Attention Weight만으로는 Model Behavior를 설명하는 데 부족하다는 문제를 제시하고, Value Vector를 고려한 Norm-based Analysis를 제안
- No-op hypothesis를 Value Vector를 통해 다시 해석하여 Attention Weight과 Value Vector가 서로 Cancel하여 무의미한 Vector를 생성한다는 것을 밝힘
[21’ EMNLP] Incorporating Residual and Normalization Layers into Analysis of Masked Language Models
- Attention Block 전체를 고려한 ATTNRESLN을 제안
- Context-mixing effect와 Preserving effect를 분석하여 Attention의 Contribution이 그렇게 크지 않다는 것을 밝힘
[24’ ICLR] Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps
- Non-linear Function인 FFN을 Integrated Gradients를 통해 Linearlize하고, 이를 사용한 해석 방법인 ATBFFRESLN을 제안
- FFN이 Input Contextualization을 Modify하여 특정 Linguistic Composition을 강조한다는 것을 밝힘
- Residual Connection, Layer Normalization에 의해 FFN의 역할이 희석된다는 것을 밝힘

Layer-wise Analysis

[23’ EMNLP] Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
- LLM의 In-Context Learning을 Label Word를 중심으로 설명하고, 이를 통해 LLM의 작동 원리를 이해하고자 함
- Shallow Layer에서는 Few-shot sample의 Semantic Information이 Label Word Token에 aggregation되고, Deep Layer에서 Label Word Token이 Reference로 사용되어 LLM의 최종 output을 결정한다는 가설을 제시하고 증명함
[24’] Not All Layers of LLMs Are Necessary During Inference
- Inference 시에 Layer Early Stopping을 통해서도 거의 동일한 성능을 얻을 수 있으며, 어려운 task일수록 더 많은 layer가 필요함
- Early Stopping을 언제 진행할지 결정하기 위해 가장 중요한 Feature는 Token Probability이며, 이를 활용하여 AdaInfer를 제안함
[24’ ICML-WS] The Remarkable Robustness of LLMs: Stages of Inference?
- LLM의 Intermediate Layer는 Layer Swap, Ablation에 Robust하다는 것을 발견하고, 이것이 Residual Connection 덕분이라고 주장함
- LLM이 Layer에 따라 4단계의 Inference Stage (1) detokenization, (2) feature engineering, (3) prediction ensembling, (4) residual sharpening를 거친다는 가설을 제시함

2. Attention & Probability

Contrastive Decoding

[23’ ACL] Contrastive Decoding: Open-ended Text Generation as Optimization
- Expert LM과 Amateur LM의 Logit을 비교하여 Decoding하는 Contrastive Decoding이라는 새로운 방식을 제안하였고, 이는 다른 Decoding 방식보다 훨씬 효과적임
- False Negative, False Positive를 줄이기 위해 High Probability Token에 대해서만 Decoding을 진행하는 APC(Adaptive Plausibility Constraints)를 도입하였고, 이는 성능에 굉장히 중요함
[23’] Contrastive Decoding Improves Reasoning in Large Language Models
- Contrastive Decoding의 정도를 조절할 수 있는 새로운 hyperparameter $\beta$를 도입하여 일반화된 Contrastive Decoding 방식을 제안함
- Reasoning Task에서 CD가 효과적임을 보여주었고, CD의 장단점을 분석함
[23’] Alleviating Hallucinations of Large Language Models through Induced Hallucinations
- Hallucination 문제를 해결하기 위해 Hallucination을 유도하는 Data를 생성하고 이를 사용해 Fine-tuning한 LLM을 사용하여 Contrastive Decoding을 수행하는 Induce-then-Contrast Decoding을 제안함
- Induce-then-Contrast Decoding은 Hallucination을 줄이는 데 효과적이었으며, 다른 방법들보다 성능이 좋고 강건함
[24’ NAACL] Trusting Your Evidence: Hallucinate Less with Context-aware Decoding
- Input Context를 잘 고려하지 않아 Hallucination이 일어나는 문제를 해결하기 위해 Context가 없는 LM과 대조하여 Context를 증폭시키는 Context-aware Decoding을 제안함
[24’ NAACL] Enhancing Contextual Understanding in Large Language Models through Contrastive Decoding
- Non-parametric knowledge만 중요시했던 CAD와는 달리 parametric knowledge와 non-parametric knowledge가 모두 중요하다고 주장하며 Multi-Input Contrastive Decoding을 제안함
- Dynamic $\alpha$를 사용하여 Parametric Knowledge와 Non-parametric Knowledge의 confidence에 따라 Contrastive Decoding의 가중치를 조절함
[24’] Adaptive Contrastive Decoding in Retrieval-Augmented Generation for Handling Noisy Contexts
- Context가 Noisy한 경우에 Context가 최종 답변에 혼란을 줄 수 있다는 가정 하에 혼란을 주는 경우 가중치를 줄이는 Adaptive Contrastive Decoding을 제안함
[24’ ICLR] DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
- Factual Knowledge는 중간의 특정 Layer에서 얻어지며, 따라서 Early Layer와 Late Layer 간의 Contrastive Decoding을 통해 Factual Knowledge를 증폭시킬 수 있음
- Early Layer는 Last Layer와의 JSD가 가장 먼 Layer를 선택하여 DoLa를 수행함

Attention Interpretation

[24’ ICLR] PASTA: Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
- User-specified Text를 강조하고자 해당 부분의 Attention을 높이는 Attention Steering을 도입하였으며, 성능 향상에 도움이 되는 Head를 찾아내어 선택적으로 Steering을 적용하는 것이 중요함을 확인
- Steering의 정도를 결정하는 Scaling Factor $\alpha$는 Robust하여 큰 영향을 미치지 않음
[24’ ICLR] Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models
- Factual Queries, 즉 Constraint에 대한 Attention이 높은 것과 LLM의 Factual Correctness 사이에는 강한 상관관계가 있음
- 따라서 Attention Weight을 통해 Factual Error를 감지할 수 있으며, 이를 SAT Probe라고 명명함; 이는 기존에 Hallucination을 감지하기 위해 사용하던 LLM의 Confidence와 상호보완적임

Attention Sink

[24’ ICLR] StreamingLLM: Efficient Streaming Language Models with Attention Sinks
- Attention Sink 개념을 최초로 제안하였으며, 이것이 Attention Mechanism의 안정화에 중요한 역할을 한다는 것을 밝힘
- 이를 사용하여 개량된 Window Attention 방식인 StreamingLLM을 제안하였으며, 이를 통해 Memory Usage를 크게 줄이면서 성능을 유지할 수 있었음
[24’ ICML] ACT: Unveiling and Harnessing Hidden Attention Sinks
- Attention Sink는 First Token 뿐만 아니라 Less Semantic Token에도 발생함을 밝힘
- 일부 Head에서는 Attention Sink를 줄이면 성능이 향상되는 것을 확인함
[24’] Spectral Filters, Dark Signals, and Attention Sinks
- Unembedding Matrix를 SVD로 분해하여 Spectral Filter로 나타내는 새로운 분석 방법을 제안함
- Singular Value가 매우 낮은 Right Singular Vector가 Span하는 Dark Signal이 Attention Sink와 Generation Quality에 중요한 역할을 한다는 것을 밝힘
- HMLV(High Mean Low Variance) Token을 통해 Attention Bar를 정의하고, 이 Token이 Bos Token과 같이 Dark Subspace를 가지고 있음을 확인함

3. Long-context LLM

[24’ TACL] Lost in the Middle: How Language Models Use Long Contexts
- LLM이 Long Context를 사용할 때 중간 부분에서 정보를 잘 가져오지 못한다는 U-shape effect를 발견함
[24’ ICML-WS] Transformers need glasses! Information over-squashing in language tasks
- Copying, Counting과 같은 간단한 문제일지라도 Sequence가 길어지면 LLM이 Task를 잘 수행하지 못한다는 것을 발견하고, 이를 Representational Collapse와 Over-squashing이라는 개념으로 설명함

I'm rubatoyeong

Language Model Post List

📏 Alignment Tuning

1. RLHF

2. RLAIF

🔥 Parameter-Efficient Model Adaptation

1. PEFT Method

2. PEFT Analysis

🤔 Chain-of-Thought

1. Chain-of-Thought

2. Others

🕶 Mechanical Interpretability

1. Attribution Analysis

Token-wise Analysis

Layer-wise Analysis

2. Attention & Probability

Contrastive Decoding

Attention Interpretation

Attention Sink

3. Long-context LLM

Listup 카테고리 내 다른 글 보러가기

댓글 남기기

최근 글 보기