⚙️

Relative positional encoding

참고

[논문리뷰] Relative Position Representations in Transformer

이때 이 매커니즘만으로는 시퀀스의 순서를 모델링할 수 없다. 예를 들어 "철수 / 가 / 영희 / 를 / 좋아해"라는 시퀀스와 "영희 / 가 / 철수 / 를 / 좋아해"라는 시퀀스에서 "철수"에 해당하는 attention layer의 아웃풋은 두 문장에서 완벽하게 동일하다. 이러한 문제를 해결하기 위해 2017년에 발표된 Transformer 논문에서는 인풋에 위치 인코딩 (position encoding) 을 더해주는 방법을 사용하였다.

https://littlefoxdiary.tistory.com/94

import math
import torch
def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
    relative_buckets = 0
    if bidirectional:
        num_buckets //= 2
        relative_buckets += (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))
    # now relative_position is in the range [0, inf)

    # half of the buckets are for exact increments in positions
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact

    # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance
    relative_postion_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)
    relative_postion_if_large = torch.min(
        relative_postion_if_large, torch.full_like(relative_postion_if_large, num_buckets - 1)
    )

    relative_buckets += torch.where(is_small, relative_position, relative_postion_if_large)
    return relative_buckets
Python
복사

인풋 시퀀스 512인 문장에 대해 Self-attention 상황을 가정하면, relative position은 다음과 같이 구할 수 있다.

이제 이 relative position을 _relative_position_bucket에 대입하면 상대적인 거리에 따른 버킷 값을 얻을 수 있다.

이 버킷 값은 scalar로 임베딩 하여 attention score을 구할 때 logit에 더해져 위치 정보를 반영하게 된다.

(가까운 거리에 있는 토큰이 더 큰 가중치를 받을 수 있게 학습되지 않을까?)