eTransformerEncoderLayer#
- class eTransformerEncoderLayer(in_rep, nhead, dim_feedforward=2048, dropout=0.1, activation=<function relu>, layer_norm_eps=1e-05, batch_first=True, norm_first=True, norm_module='rmsnorm', bias=True, device=None, dtype=None, init_scheme='xavier_uniform')[source]#
Bases: Module

Equivariant Transformer encoder layer with the same API as torch.nn.TransformerEncoderLayer. Applies eMultiheadAttention followed by an equivariant feed-forward block built from eLinear layers and eLayerNorm, mirroring PyTorch’s ordering (pre- or post-norm) while constraining every linear map to commute with the group action.

The layer defines:

\[\mathbf{f}_{\mathbf{\theta}}: \mathcal{X}^{T} \to \mathcal{X}^{T}.\]

Functional equivariance constraint:

\[\mathbf{f}_{\mathbf{\theta}}(\rho_{\mathcal{X}}(g)\mathbf{x}_{1:T}) = \rho_{\mathcal{X}}(g)\,\mathbf{f}_{\mathbf{\theta}}(\mathbf{x}_{1:T}) \quad \forall g\in\mathbb{G},\]

where \(\rho_{\mathcal{X}}(g)\) acts on the feature/channel axis at every token.
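The constraint above can be checked numerically. The following sketch (illustrative, not from this library) uses a cyclic-shift permutation as \(\rho_{\mathcal{X}}(g)\) and a channel-mixing map built as a polynomial in that permutation, which therefore commutes with it:

```python
import torch

T, D = 4, 6  # tokens, channels

# rho(g): a cyclic shift of the channel axis (a permutation representation).
g = torch.roll(torch.eye(D), shifts=1, dims=0)

# An equivariant channel map: any polynomial in rho(g) commutes with rho(g).
W = 0.5 * torch.eye(D) + 0.25 * g + 0.1 * (g @ g)

def f(x):
    # Apply the same channel-mixing map at every token.
    return x @ W.T

x = torch.randn(T, D)
lhs = f(x @ g.T)   # f(rho(g) x)
rhs = f(x) @ g.T   # rho(g) f(x)
print(torch.allclose(lhs, rhs, atol=1e-6))  # True
```

The actual layer enforces this property for every linear map and the attention block, not just a single channel mixing.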
Create an equivariant Transformer encoder layer.
- Parameters:
  - in_rep (Representation) – Input representation \(\rho_{\text{in}}\).
  - nhead (int) – Number of attention heads.
  - dim_feedforward (int) – Hidden dimension of the feedforward network.
  - dropout (float) – Dropout probability.
  - activation (str | Callable[[Tensor], Tensor]) – Activation function ('relu' or 'gelu').
  - layer_norm_eps (float) – Epsilon for layer normalization.
  - batch_first (bool) – If True, input/output shape is (B, T, D).
  - norm_first (bool) – If True, apply normalization before attention/feedforward.
  - norm_module (Literal['layernorm', 'rmsnorm']) – Normalization layer type ('layernorm' or 'rmsnorm').
  - bias (bool) – Whether to use bias in linear layers.
  - device – Tensor device.
  - dtype – Tensor dtype.
  - init_scheme (str | None) – Initialization scheme for equivariant layers.
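Since the constructor mirrors torch.nn.TransformerEncoderLayer (with in_rep, a Representation, replacing the integer d_model), the plain-PyTorch analogue of a typical configuration looks like this; the equivariant version would accept the same keyword arguments apart from the first:

```python
import torch
import torch.nn as nn

# Plain-PyTorch analogue of eTransformerEncoderLayer's constructor:
# `in_rep` would take the place of `d_model` (with width in_rep.size).
layer = nn.TransformerEncoderLayer(
    d_model=64,            # in_rep.size in the equivariant layer
    nhead=8,
    dim_feedforward=2048,
    dropout=0.1,
    activation="gelu",
    layer_norm_eps=1e-5,
    batch_first=True,      # input/output shape (B, T, D)
    norm_first=True,       # pre-norm, matching the equivariant default
)

x = torch.randn(2, 10, 64)  # (B, T, D)
y = layer(x)
print(y.shape)  # torch.Size([2, 10, 64])
```

The equivariant layer preserves this shape contract: the output lives in the same representation space as the input.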
- forward(src, src_mask=None, src_key_padding_mask=None, is_causal=False)[source]#
Pass the input through the equivariant encoder layer.
- Parameters:
  - src (Tensor) – Input sequence of shape (T, B, D) or (B, T, D) depending on batch_first, with last dimension equal to in_rep.size.
  - src_mask (Tensor | None) – Optional attention mask for the input sequence.
  - src_key_padding_mask (Tensor | None) – Optional padding mask for the batch.
  - is_causal (bool) – If True, applies a causal mask to the self-attention block.
- Return type:
  Tensor
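Because the forward signature matches torch.nn.TransformerEncoderLayer.forward, the masking arguments can be illustrated with the standard PyTorch layer (a hedged stand-in; the equivariant layer takes the same arguments):

```python
import torch
import torch.nn as nn

# Standard PyTorch layer as a stand-in with the same forward signature.
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4,
                                   batch_first=True, norm_first=True)
layer.eval()  # disable dropout for a deterministic pass

B, T, D = 2, 5, 32
src = torch.randn(B, T, D)

# Causal mask: -inf above the diagonal, so position t attends only to <= t.
causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
with torch.no_grad():
    out = layer(src, src_mask=causal, is_causal=True)

# Key-padding mask: True marks tokens to ignore (here the last token of
# each sequence in the batch).
pad = torch.zeros(B, T, dtype=torch.bool)
pad[:, -1] = True
with torch.no_grad():
    out_pad = layer(src, src_key_padding_mask=pad)

print(out.shape, out_pad.shape)  # both (B, T, D)
```

In both cases the output keeps the input shape; masking only restricts which tokens each position may attend to.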