Rotary Position Embeddings (RoPE)¶
Rotary Position Embeddings for encoding position information in attention layers.
Overview¶
RoPE encodes position information by rotating query and key vectors in 2D subspaces. It's used in LLaMA, Mistral, Qwen, and most modern LLMs.
Mathematical formula:

For a vector \(x\) at position \(m\), each pair of dimensions \((x_{2i}, x_{2i+1})\) is rotated by the angle \(m\theta_i\):

\[
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
\]

where \(\theta_i = \frac{1}{\text{base}^{2i/d}}\), \(d\) is the head dimension, and the typical base is 10000.
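As a concrete check of the formula, the rotation is easy to write directly in PyTorch. The sketch below is standalone (it does not use this library's kernels): it rotates a single head-dimension vector using interleaved pairs and verifies that the dot product of two rotated vectors depends only on their relative position.

```python
import torch

def rotate(x: torch.Tensor, m: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate one [head_dim] vector to position m using interleaved pairs."""
    d = x.shape[-1]
    theta = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)  # [d/2]
    cos, sin = torch.cos(m * theta), torch.sin(m * theta)
    x1, x2 = x[0::2], x[1::2]              # pairs (x0, x1), (x2, x3), ...
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(64), torch.randn(64)
# Same relative distance (5) at different absolute positions -> same dot product
a = torch.dot(rotate(q, 10), rotate(k, 5))
b = torch.dot(rotate(q, 105), rotate(k, 100))
print(torch.allclose(a, b, atol=1e-4))  # True
```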
Key Properties¶
- Relative position encoding: The dot product between rotated vectors depends only on their relative position
- Long-range decay: Attention naturally decays with distance due to the rotation frequencies
- No learned parameters: Position encodings are computed, not learned
Performance Characteristics¶
| Configuration | PyTorch | Triton | Speedup |
|---|---|---|---|
| head_dim=128, seq=2048 | 67 μs | 23 μs | 2.9x |
| head_dim=128, seq=8192 | 267 μs | 92 μs | 2.9x |
| head_dim=64, seq=2048 | 34 μs | 12 μs | 2.8x |
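These timings will vary by GPU and driver, so treat them as indicative. A rough way to reproduce the comparison on your own hardware is to time apply_rope with use_triton forced on and off (both documented in the API reference below). The harness here is a sketch, not part of the library, and the batch size and head count are assumptions since the table does not state them.

```python
import torch
from rotalabs_accel import build_rope_cache, apply_rope

def bench_us(fn, iters: int = 100) -> float:
    """Average runtime in microseconds, measured with CUDA events."""
    for _ in range(10):                      # warmup
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1000 / iters   # ms -> us

# Assumed shapes: batch=1, 32 heads, head_dim=128, seq=2048
q = torch.randn(1, 2048, 32, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 2048, 32, 128, device="cuda", dtype=torch.float16)
cos, sin = build_rope_cache(2048, 128, device="cuda")

print("PyTorch:", bench_us(lambda: apply_rope(q, k, cos, sin, use_triton=False)))
print("Triton: ", bench_us(lambda: apply_rope(q, k, cos, sin, use_triton=True)))
```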
Usage Examples¶
Module API (Recommended)¶
```python
import torch
from rotalabs_accel import RotaryEmbedding

# Create RoPE module
rope = RotaryEmbedding(
    dim=128,           # Head dimension
    max_seq_len=8192,  # Maximum sequence length
    base=10000.0,      # Frequency base (standard is 10000)
)

# Query and Key tensors
# Shape: [batch, seq_len, num_heads, head_dim]
q = torch.randn(2, 512, 32, 128, device="cuda", dtype=torch.float16)
k = torch.randn(2, 512, 32, 128, device="cuda", dtype=torch.float16)

# Apply RoPE
q_rot, k_rot = rope(q, k, seq_len=512)
```
Functional API¶
```python
from rotalabs_accel import build_rope_cache, apply_rope

# Build cache once (at model initialization)
cos, sin = build_rope_cache(
    seq_len=8192,
    head_dim=128,
    base=10000.0,
    device="cuda",
)

# Apply during forward pass (q, k as in the previous example)
# Slice cache to the actual sequence length
seq_len = q.shape[1]
q_rot, k_rot = apply_rope(q, k, cos[:seq_len], sin[:seq_len])
```
With Grouped Query Attention (GQA)¶
RoPE works with different numbers of Q and K heads:
```python
# LLaMA 3 style: 32 Q heads, 8 KV heads
q = torch.randn(2, 512, 32, 128, device="cuda")  # 32 query heads
k = torch.randn(2, 512, 8, 128, device="cuda")   # 8 key/value heads

# RoPE is applied per head, so differing Q/K head counts are handled automatically
q_rot, k_rot = rope(q, k, seq_len=512)
```
Position Offset (for KV Cache)¶
During generation with KV cache, you need to offset positions:
```python
# First token: position 0
q1, k1 = rope(q[:, :1], k[:, :1], seq_len=1)
cached_k = k1

# Next token: position 1
# Pass offset so the rotation starts from the correct position
q2, k2 = rope(q[:, 1:2], k[:, 1:2], seq_len=1, offset=1)
cached_k = torch.cat([cached_k, k2], dim=1)
```
API Reference¶
Functions¶
apply_rope ¶
apply_rope(q: Tensor, k: Tensor, cos: Tensor, sin: Tensor, use_triton: Optional[bool] = None) -> Tuple[torch.Tensor, torch.Tensor]
Apply Rotary Position Embeddings to query and key tensors.
Uses Triton kernel on CUDA when available, otherwise falls back to PyTorch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| q | Tensor | Query tensor [batch, seq, heads, head_dim] | required |
| k | Tensor | Key tensor [batch, seq, heads, head_dim] | required |
| cos | Tensor | Cosine cache for positions | required |
| sin | Tensor | Sine cache for positions | required |
| use_triton | Optional[bool] | Force Triton (True) or PyTorch (False). None = auto. | None |
Returns:
| Type | Description |
|---|---|
| Tuple[Tensor, Tensor] | Tuple of (q_rotated, k_rotated) with same shapes as inputs. |
Example
```python
q = torch.randn(2, 16, 4, 32)
k = torch.randn(2, 16, 4, 32)
cos, sin = build_rope_cache(16, 32)
q_rot, k_rot = apply_rope(q, k, cos, sin)
```
Source code in src/rotalabs_accel/kernels/rope.py
rope_torch ¶
rope_torch(q: Tensor, k: Tensor, cos: Tensor, sin: Tensor) -> Tuple[torch.Tensor, torch.Tensor]
PyTorch reference implementation of RoPE.
Works on any device (CPU or CUDA).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| q | Tensor | Query tensor [batch, seq, heads, head_dim] | required |
| k | Tensor | Key tensor [batch, seq, heads, head_dim] | required |
| cos | Tensor | Cosine cache [seq, head_dim/2] or broadcastable | required |
| sin | Tensor | Sine cache [seq, head_dim/2] or broadcastable | required |
Returns:
| Type | Description |
|---|---|
| Tuple[Tensor, Tensor] | Tuple of (q_rotated, k_rotated). |
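A minimal usage sketch, mirroring the apply_rope example above; it assumes the (q, k, cos, sin) arguments documented here and that the function is importable from the module shown under "Source code" (the import path is an assumption).

```python
import torch
from rotalabs_accel import build_rope_cache
from rotalabs_accel.kernels.rope import rope_torch  # assumed import path (see source location below)

# CPU tensors are fine: this is the pure-PyTorch reference path
q = torch.randn(2, 16, 4, 32)   # [batch, seq, heads, head_dim]
k = torch.randn(2, 16, 4, 32)
cos, sin = build_rope_cache(16, 32)
q_rot, k_rot = rope_torch(q, k, cos, sin)
```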
Source code in src/rotalabs_accel/kernels/rope.py
build_rope_cache ¶
build_rope_cache(seq_len: int, head_dim: int, base: float = 10000.0, device: Optional[device] = None, dtype: dtype = torch.float32) -> Tuple[torch.Tensor, torch.Tensor]
Build cosine and sine caches for RoPE.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seq_len | int | Maximum sequence length. | required |
| head_dim | int | Dimension of each attention head. | required |
| base | float | Base for the frequency computation (default: 10000). | 10000.0 |
| device | Optional[device] | Device for the tensors. | None |
| dtype | dtype | Data type for the tensors. | float32 |
Returns:
| Type | Description |
|---|---|
| Tuple[Tensor, Tensor] | Tuple of (cos_cache, sin_cache), each of shape [seq_len, head_dim/2]. |
Example
```python
cos, sin = build_rope_cache(2048, 128, device='cuda')
print(cos.shape)  # torch.Size([2048, 64])
```
Source code in src/rotalabs_accel/kernels/rope.py
Classes¶
RotaryEmbedding ¶
Bases: Module
Rotary Position Embedding module.
Uses Triton kernel on CUDA when available, otherwise falls back to PyTorch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dim | int | Dimension of each attention head (head_dim). | required |
| max_seq_len | int | Maximum sequence length to cache. | 2048 |
| base | float | Base for frequency computation (default: 10000). | 10000.0 |
Example
```python
rope = RotaryEmbedding(dim=32, max_seq_len=128)
q = torch.randn(2, 16, 4, 32)
k = torch.randn(2, 16, 4, 32)
q_rot, k_rot = rope(q, k)
```
Source code in src/rotalabs_accel/kernels/rope.py
__init__ ¶
Source code in src/rotalabs_accel/kernels/rope.py
forward ¶
forward(q: Tensor, k: Tensor, position_ids: Optional[Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor]
Apply RoPE to query and key tensors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| q | Tensor | Query tensor [batch, seq, heads, head_dim] | required |
| k | Tensor | Key tensor [batch, seq, heads, head_dim] | required |
| position_ids | Optional[Tensor] | Optional position indices [batch, seq]. | None |
Returns:
| Type | Description |
|---|---|
| Tuple[Tensor, Tensor] | Tuple of (q_rotated, k_rotated). |
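For illustration, a short sketch of passing explicit position indices, e.g. for a single decode step where the new token sits at absolute position 42. It assumes only the documented signature above (position_ids of shape [batch, seq]).

```python
import torch
from rotalabs_accel import RotaryEmbedding

rope = RotaryEmbedding(dim=128, max_seq_len=8192)
q = torch.randn(1, 1, 32, 128)        # one new query token
k = torch.randn(1, 1, 8, 128)         # one new key token (GQA head count)
position_ids = torch.tensor([[42]])   # [batch, seq] absolute positions
q_rot, k_rot = rope(q, k, position_ids=position_ids)
```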
Source code in src/rotalabs_accel/kernels/rope.py
Implementation Notes¶
Cache Precomputation¶
The cos/sin tables are computed once and reused:
```python
def build_rope_cache(seq_len, head_dim, base=10000.0, device="cuda"):
    # Inverse frequencies: theta_i = 1 / base^(2i / head_dim)
    # (created on the target device so the outer product below does not mix devices)
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))

    # Position angles: outer product of positions and frequencies -> [seq_len, head_dim/2]
    t = torch.arange(seq_len, device=device, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)

    # Cache cos and sin
    cos = torch.cos(freqs)
    sin = torch.sin(freqs)
    return cos, sin
```
Memory Layout¶
The rotation is applied to pairs of adjacent dimensions:
- \((x_0, x_1)\) rotated by \(\theta_0\)
- \((x_2, x_3)\) rotated by \(\theta_1\)
- etc.
This "interleaved" layout matches LLaMA and most modern models. Some older models use "sequential" layout where first half and second half are paired.
Extended Context (YaRN, NTK)¶
For extended context lengths, you can modify the base frequency:
```python
# NTK-aware scaling for 4x context extension
base = 10000 * 4.0
rope = RotaryEmbedding(dim=128, max_seq_len=32768, base=base)
```
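The snippet above simply multiplies the base by the target context scale, which is a common quick fix. The "NTK-aware" variant (described in the YaRN paper referenced below, not specific to this library) raises the scale to the power d/(d-2) so the highest frequencies are left nearly untouched; as a sketch:

```python
from rotalabs_accel import RotaryEmbedding

# NTK-aware heuristic (standard formula, not a library helper):
# base' = base * s^(d / (d - 2)) for a context scale factor s and head dim d
dim, scale = 128, 4.0
ntk_base = 10000.0 * scale ** (dim / (dim - 2))
rope = RotaryEmbedding(dim=dim, max_seq_len=32768, base=ntk_base)
```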
References¶
- RoFormer: Enhanced Transformer with Rotary Position Embedding - Original RoPE paper
- LLaMA: Open and Efficient Foundation Language Models - Uses RoPE
- YaRN: Efficient Context Window Extension - Context extension with RoPE