RoPEMultiheadAttention
class RoPEMultiheadAttention(embed_dim, num_heads, dropout=0.0, bias=True, rope_base=10000.0, device=None, dtype=None)
Bases: PositionalAttentionBase

Multi-head attention with rotary position embeddings applied to Q and K.
Query, key, and value tensors are first projected into per-head features. The entire head embedding is then rotated in position space, block by block in 2D pairs:
\[\mathbf{q}_r' = R(\mathbf{P}_Q)\,\mathbf{q}_r, \qquad \mathbf{k}_r' = R(\mathbf{P}_K)\,\mathbf{k}_r, \qquad \mathbf{v}' = \mathbf{v},\]

where R is the block-wise 2D rotation induced by the sine/cosine tables. The attention scores are then

\[\mathbf{A} = \operatorname{softmax}\left( \frac{\mathbf{q}'\mathbf{k}'^\top}{\sqrt{d_h}} + \mathbf{M}\right), \qquad \mathbf{O} = \mathbf{A}\mathbf{v}'.\]
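To make the block-wise rotation concrete, here is a minimal sketch of R applied to per-head features. It is illustrative only: it assumes the common RoPE convention of pairing feature i with feature i + d_h/2 and positions given as a flat (T,) tensor; the module's actual pairing and table construction may differ.

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Rotate per-head features by position (illustrative sketch).

    x: (B, H, T, d_h) per-head features; positions: (T,) positions.
    Pairs feature i with feature i + d_h // 2 and rotates each 2D pair
    by the angle positions * base ** (-2 * i / d_h).
    """
    d_h = x.shape[-1]
    half = d_h // 2
    # One frequency per 2D block, spanning the rotary spectrum.
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = positions.to(x.dtype)[:, None] * freqs          # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Block-wise 2D rotation: [x1, x2] -> [x1*cos - x2*sin, x1*sin + x2*cos]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```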
Shape

- query, key, value: (B, T, D).
- q_positions, k_positions: (P,) or (B, P).
- q_position_mask, k_position_mask: boolean masks with the same shape as the corresponding position tensor.
- Returns:
  (output, attn_weights) where output has the same leading layout as the input and attn_weights is (B, T_q, T_k) when requested.
Attributes:

- embed_dim: Total feature width D.
- num_heads: Number of attention heads.
- head_dim: Width of each head, head_dim = embed_dim / num_heads.
- dropout: Dropout probability applied to attention weights during training.
- q_proj, k_proj, v_proj, out_proj: Learnable linear projections used to form queries, keys, values, and the final output.
- rotary_emb: Helper module that builds the sine and cosine tables used by RoPE.
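The attributes above fit together roughly as in the skeleton below. This is a hedged sketch, not the module's actual source; the class name is hypothetical and the rotary_emb helper's type is not named in this documentation.

```python
import torch.nn as nn

# Illustrative skeleton only; the real __init__ may differ in detail.
class RoPEAttentionSkeleton(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.0, bias=True,
                 rope_base=10000.0, device=None, dtype=None):
        super().__init__()
        self.embed_dim = embed_dim              # total feature width D
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads  # width of each head
        self.dropout = dropout
        factory_kwargs = {"device": device, "dtype": dtype}
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias, **factory_kwargs)
        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias, **factory_kwargs)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias, **factory_kwargs)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias, **factory_kwargs)
        # self.rotary_emb would build the sin/cos tables from rope_base.
```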
Initialize the rotary-position attention block.
Parameters:

- embed_dim: Model width D.
- num_heads: Number of attention heads.
- dropout: Dropout probability on attention weights. Default: 0.0.
- bias: If True, adds learnable input and output projection biases. Default: True.
- rope_base: Frequency base used to build the rotary spectrum. Default: 10000.0.
- device (torch.device, optional): Parameter factory options.
- dtype (torch.dtype, optional): Parameter factory options.
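A minimal construction sketch, assuming RoPEMultiheadAttention is importable from its host package (the import path is not documented here):

```python
import torch

# Import path is an assumption; adjust to the actual package layout.
# from somepkg.attention import RoPEMultiheadAttention

mha = RoPEMultiheadAttention(
    embed_dim=512,       # model width D
    num_heads=8,         # head_dim = 512 / 8 = 64
    dropout=0.1,
    bias=True,
    rope_base=10000.0,
    device=torch.device("cpu"),
    dtype=torch.float32,
)
```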
forward(query, key, value, *, q_positions=None, k_positions=None, q_position_mask=None, k_position_mask=None, attn_mask=None, key_padding_mask=None, need_weights=False, is_causal=False)
Project inputs, apply RoPE to each head, then compute attention.
Shape

- query, key, value: see PositionalAttentionBase.

- Returns:
  (output, attn_weights) with the same leading layout as the input and optional attention weights when requested.
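Continuing the construction sketch above, a self-attention call might look like the following; the shapes follow the Shape and Returns notes, but the call itself is illustrative.

```python
import torch

B, T, D = 2, 16, 512
x = torch.randn(B, T, D)
positions = torch.arange(T)   # explicit positions shared by queries and keys

out, attn = mha(
    x, x, x,
    q_positions=positions,
    k_positions=positions,
    need_weights=True,
    is_causal=True,
)
assert out.shape == (B, T, D)      # same leading layout as the input
assert attn.shape == (B, T, T)     # (B, T_q, T_k) when need_weights=True
```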