TransformerEncoderLayer#

class TransformerEncoderLayer(d_model, self_attn, dim_feedforward=2048, dropout=0.1, activation=GELU(approximate='none'), layer_norm_eps=1e-05, norm_first=False, norm_module='rmsnorm', bias=True, device=None, dtype=None)[source]#

Bases: Module

Transformer encoder layer with optional positional attention.

Given an input sequence \(\mathbf{X} \in \mathbb{R}^{B \times T \times D}\), the layer computes:

\[\mathbf{X}' = \mathbf{X} + \operatorname{Dropout}\!\left( \operatorname{Attn}(\mathbf{X}, \mathbf{X}, \mathbf{X})\right), \qquad \mathbf{Y} = \mathbf{X}' + \operatorname{FFN}(\mathbf{X}'),\]

with layer normalization applied either before each residual branch (norm_first=True, pre-norm) or after (norm_first=False, post-norm). The attention operator \(\operatorname{Attn}\) is either a torch.nn.MultiheadAttention module or any PositionalAttentionBase subclass; the latter injects positional information into the query and key streams.
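The wiring can be sketched as follows. This is illustrative only, not the library's source; it assumes the attention module returns an (output, weights) pair, as torch.nn.MultiheadAttention does:

    import torch
    from torch import nn

    # Illustrative sketch of the residual wiring above; not the library's
    # actual implementation. `attn` is assumed to return an (output, weights)
    # pair, as torch.nn.MultiheadAttention does. Per the formula, dropout
    # wraps the attention branch; any FFN dropout lives inside `ffn` itself.
    def encoder_layer_sketch(x, attn, ffn, norm1, norm2, drop, norm_first):
        if norm_first:
            # Pre-norm: normalize before each residual branch.
            h = norm1(x)
            x = x + drop(attn(h, h, h, need_weights=False)[0])
            x = x + ffn(norm2(x))
        else:
            # Post-norm: normalize after each residual addition.
            x = norm1(x + drop(attn(x, x, x, need_weights=False)[0]))
            x = norm2(x + ffn(x))
        return x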

Attributes:#

self_attn:

Attention module used for the source self-attention block.

feed_forward_block:

Sequential feed-forward network applied after self-attention.

norm1, norm2:

Normalization layers (RMSNorm or LayerNorm) applied around the residual branches.

norm_first:

Whether normalization is applied before each residual branch.
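Continuing the sketch above, the attributes map onto it roughly as follows. The exact composition of feed_forward_block is an assumption here (a standard two-layer MLP); this page only states that it is a Sequential network:

    d_model, dim_feedforward, p = 64, 256, 0.1
    self_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                      dropout=p, batch_first=True)
    feed_forward_block = nn.Sequential(   # assumed FFN layout
        nn.Linear(d_model, dim_feedforward),
        nn.GELU(),
        nn.Dropout(p),
        nn.Linear(dim_feedforward, d_model),
    )
    norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    x = torch.randn(2, 16, d_model)       # (B, T, D)
    y = encoder_layer_sketch(x, self_attn, feed_forward_block,
                             norm1, norm2, nn.Dropout(p), norm_first=True)
    print(y.shape)                        # torch.Size([2, 16, 64])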

Initialize the encoder layer.

Parameters:
  • d_model (int) – Model width D.

  • self_attn (torch.nn.MultiheadAttention | PositionalAttentionBase) – Self-attention module.

  • dim_feedforward (int) – Width of the feed-forward hidden layer. Default: 2048.

  • dropout (float) – Dropout probability. Default: 0.1.

  • activation (torch.nn.Module) – Feed-forward activation module. Default: torch.nn.GELU().

  • layer_norm_eps (float) – Epsilon used by the normalization layers. Default: 1e-5.

  • norm_first (bool) – If True, apply normalization before each residual branch (pre-norm); otherwise after (post-norm). Default: False.

  • norm_module (Literal['layernorm', 'rmsnorm']) – Normalization layer type. Default: 'rmsnorm'.

  • bias (bool) – If True, use learnable normalization biases. Default: True.

  • device (torch.device, optional) – Parameter factory option selecting the device.

  • dtype (torch.dtype, optional) – Parameter factory option selecting the dtype.
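A minimal construction sketch; the import path mypackage.layers is hypothetical, so substitute wherever TransformerEncoderLayer lives in your package:

    import torch
    from torch import nn
    from mypackage.layers import TransformerEncoderLayer  # hypothetical path

    d_model = 512
    # batch_first=True matches the layer's (B, T, D) convention.
    self_attn = nn.MultiheadAttention(d_model, num_heads=8,
                                      dropout=0.1, batch_first=True)
    layer = TransformerEncoderLayer(
        d_model,
        self_attn,
        dim_feedforward=2048,
        dropout=0.1,
        norm_first=True,          # pre-norm
        norm_module="layernorm",  # or the default "rmsnorm"
    )
    out = layer(torch.randn(4, 32, d_model))  # (B, T, D) -> (B, T, D)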

forward(src, src_mask=None, src_key_padding_mask=None, is_causal=False, *, src_positions=None, src_position_mask=None)[source]#

Apply the encoder layer to a batch-first source sequence.

Parameters:
  • src (torch.Tensor) – Input sequence.

  • src_mask (torch.Tensor, optional) – Additive attention mask for the source sequence.

  • src_key_padding_mask (torch.Tensor, optional) – Boolean mask for padded source elements.

  • is_causal (bool) – Whether to apply a causal attention mask. Default: False.

  • src_positions (torch.Tensor, optional) – Absolute positions for the source sequence, used by positional attention backends.

  • src_position_mask (torch.Tensor, optional) – Boolean mask for padded source positions, used by positional attention backends.

Returns:

The encoded source sequence.

Return type:

torch.Tensor

Shape#

  • src: (B, T, D).

  • src_positions: (T,) or (B, T).

  • src_position_mask: boolean mask with the same layout as src_positions.

  • Returns: encoded source with shape (B, T, D).
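Continuing the construction example above, a sketch of a masked forward call. Mask semantics are assumed to follow torch.nn.MultiheadAttention conventions (additive float src_mask; boolean src_key_padding_mask with True marking padding):

    B, T = 4, 32
    src = torch.randn(B, T, 512)

    # Additive causal mask of shape (T, T).
    src_mask = nn.Transformer.generate_square_subsequent_mask(T)

    # True marks padded tokens to be ignored by attention.
    src_key_padding_mask = torch.zeros(B, T, dtype=torch.bool)
    src_key_padding_mask[:, -4:] = True   # last four tokens are padding

    out = layer(src, src_mask=src_mask,
                src_key_padding_mask=src_key_padding_mask)
    print(out.shape)                      # torch.Size([4, 32, 512])

    # src_positions / src_position_mask are consumed only by
    # PositionalAttentionBase backends, e.g.:
    #   out = layer(src, src_positions=torch.arange(T))  # (T,), broadcast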
