Transformer

NAVER AI TECH 2023. 4. 5. 14:48

Memory Consumption and Computation Speed

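Compared with a recurrent layer, self-attention consumes more memory, because it stores an $n \times n$ matrix of attention weights for a sequence of length $n$. It computes faster, though, because all positions are processed in parallel instead of one time step after another.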
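A rough way to see this trade-off is to count what each layer has to hold and how many steps must run one after another. The sketch below is a back-of-the-envelope illustration, not a measurement. The function name `compare_layers` and the simplified counts (a single attention head, one hidden state per time step, constants ignored) are assumptions made for this example.

```python
def compare_layers(n: int, d: int) -> dict:
    """Rough per-layer counts for sequence length n and model width d."""
    return {
        # Self-attention keeps an n x n matrix of attention weights.
        "attention_weight_entries": n * n,
        # A recurrent layer keeps one d-dimensional hidden state per step.
        "recurrent_hidden_entries": n * d,
        # All positions attend at once, so one parallel step is enough.
        "attention_sequential_steps": 1,
        # Each hidden state depends on the previous one.
        "recurrent_sequential_steps": n,
    }

counts = compare_layers(n=1024, d=64)
print(counts)
```

With these hypothetical sizes the attention matrix has 16 times as many entries as the stacked hidden states, while the recurrent layer needs 1024 sequential steps. The memory gap only appears when the sequence length exceeds the model width.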

Scaling Factor

Suppose the entries of the matrices $Q$ and $K$ are drawn independently from a distribution with mean 0 and variance 1, and we compute $QK^T$.

Then each entry of $QK^T$ has mean 0 and variance $d_k$.
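The variance claim follows from a short calculation. It assumes that all entries of $Q$ and $K$ are mutually independent, and it writes $q_{il}$ and $k_{jl}$ for those entries (notation introduced here for illustration):

$$
(QK^T)_{ij} = \sum_{l=1}^{d_k} q_{il} k_{jl}, \qquad
\mathbb{E}[q_{il} k_{jl}] = \mathbb{E}[q_{il}]\,\mathbb{E}[k_{jl}] = 0, \qquad
\mathrm{Var}(q_{il} k_{jl}) = \mathbb{E}[q_{il}^2]\,\mathbb{E}[k_{jl}^2] = 1
$$

$$
\mathrm{Var}\big((QK^T)_{ij}\big) = \sum_{l=1}^{d_k} \mathrm{Var}(q_{il} k_{jl}) = d_k
$$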

Because softmax is built on the exponential function, large entries produce disproportionately large outputs, and the largest entry ends up dominating the result.
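A small numeric example makes this saturation visible. The sketch below uses only the Python standard library. The score values are made up for illustration, and the factor 8 corresponds to $\sqrt{d_k}$ for a hypothetical $d_k = 64$.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

small = softmax([1.0, 2.0, 3.0])     # moderate scores
large = softmax([8.0, 16.0, 24.0])   # the same scores multiplied by 8

print([round(p, 3) for p in small])
print([round(p, 3) for p in large])
```

With the moderate scores every position keeps a noticeable share of the probability mass. With the larger scores the output is close to one-hot, which is the regime where softmax gradients become very small.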

To prevent this, $QK^T$ is divided by $\sqrt{d_k}$, which brings its entries back to mean 0 and variance 1.
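The effect of the scaling can be checked with a quick simulation. This is a minimal sketch under stated assumptions: entries are sampled as independent standard normals, and `d_k = 64`, the sample count, and the seed are arbitrary choices for the example.

```python
import math
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable
d_k = 64         # hypothetical key dimension
trials = 2000    # number of sampled (q, k) pairs

def random_vector(dim):
    """Draw a vector with independent standard normal entries."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

# Each dot product plays the role of one entry of QK^T.
raw = []
for _ in range(trials):
    q, k = random_vector(d_k), random_vector(d_k)
    raw.append(sum(a * b for a, b in zip(q, k)))

# Divide every dot product by sqrt(d_k), as in scaled dot-product attention.
scaled = [x / math.sqrt(d_k) for x in raw]

raw_var = statistics.pvariance(raw)
scaled_var = statistics.pvariance(scaled)
print(f"variance before scaling: {raw_var:.2f}")    # should be near d_k
print(f"variance after scaling:  {scaled_var:.2f}")  # should be near 1
```

The exact numbers depend on the random draws, so the printed variances should land near `d_k` and near 1 rather than match them exactly.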

The original paper, "Attention Is All You Need", explains the motivation as follows: "While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$."

Masked Self-Attention

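Suppose we are given the decoder's matrix of attention weights, $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$, that is, the weights before they are multiplied by $V$.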

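Masking can be done as follows.

First, set the entries above the main diagonal (the upper-right values) to 0.

Then renormalize each row so that it sums to 1, turning it back into a probability distribution.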

1	0	0
0.472	0.528	0
0.25	0.31	0.44
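The two steps above can be sketched in plain Python. The input `weights` is a hypothetical matrix chosen so that the output lands near the values shown above; it is not the post's original matrix. The function names are also assumptions for this example. The sketch includes the route more common in practice, setting future scores to negative infinity before the softmax, which yields the same result.

```python
import math

def softmax_rows(scores):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def mask_then_renormalize(weights):
    """Zero out entries above the diagonal, then rescale rows to sum to 1."""
    out = []
    for i, row in enumerate(weights):
        kept = [w if j <= i else 0.0 for j, w in enumerate(row)]
        total = sum(kept)
        out.append([w / total for w in kept])
    return out

def masked_softmax(scores):
    """Equivalent route: set future scores to -inf before the softmax."""
    masked = [
        [s if j <= i else -math.inf for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]
    return softmax_rows(masked)

# Hypothetical attention weights; each row sums to 1.
weights = [
    [0.46, 0.29, 0.25],
    [0.34, 0.38, 0.28],
    [0.25, 0.31, 0.44],
]
masked = mask_then_renormalize(weights)
for row in masked:
    print([round(w, 3) for w in row])

# Hypothetical raw scores used to compare the two routes.
scores = [[1.0, 2.0, 3.0], [0.5, 0.1, 2.0], [0.0, 1.0, -1.0]]
route_a = mask_then_renormalize(softmax_rows(scores))
route_b = masked_softmax(scores)
```

Both routes agree because dividing by the sum of the kept exponentials is the same operation whether the discarded terms are dropped before or after the softmax.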

