SequenceModels Implementations Examples

Basics

To briefly recall, SequenceModel is a VathosModel object, which is a Vathos Layer object, which in turn is an nn.Module from torch. It therefore behaves like a normal nn.Module, with plenty of extra features on top. SequenceModel takes its structure from the pre-norm Transformer architecture; each layer applies:

x = x + self.spatial_mixer(self.norm1(x))
x = x + self.channel_mixer(self.norm2(x))
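
As a rough illustration of the two residual updates above, here is a minimal pre-norm block in plain PyTorch. The class name PreNormBlock, the choice of nn.LayerNorm for norm1/norm2, and the placeholder mixers are assumptions made for this sketch; they are not taken from the Vathos source.

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Hypothetical stand-in for one SequenceModel layer (not the Vathos class)."""

    def __init__(self, d_model, spatial_mixer, channel_mixer):
        super().__init__()
        # Assumption: LayerNorm plays the role of norm1/norm2; Vathos may use another norm.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.spatial_mixer = spatial_mixer  # mixes along the sequence axis L
        self.channel_mixer = channel_mixer  # mixes along the channel axis d_model

    def forward(self, x):
        # Normalize first, mix, then add the result back onto the residual stream.
        x = x + self.spatial_mixer(self.norm1(x))
        x = x + self.channel_mixer(self.norm2(x))
        return x


d_model = 16
block = PreNormBlock(
    d_model,
    spatial_mixer=nn.Identity(),                # placeholder spatial mixer
    channel_mixer=nn.Linear(d_model, d_model),  # placeholder channel mixer
)
x = torch.randn(2, 5, d_model)  # [B, L, d_model]
y = block(x)
print(y.shape)  # same shape as the input
```

Because each sub-layer only adds to x, any mixer that preserves the [B, L, d_model] shape can be dropped into either slot.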

Transformer Implementation as an example

from Vathos.blocks import *   # provides SequenceModel, MLP, MultiHeadAttentionMixer, EasyEmbedder and UnbiasedLinear used here, plus many other layers
model = SequenceModel(
        vocab_size=100, #  Set this to your vocab size
        d_model=128,    #  Model dimension (aka embedding dimension / hidden size)
        n_layers=6,     #  Number of layers
        max_len=2116,   #  Sinusoidal positional encodings and other encodings often require this parameter
        pos_encoder=True,  # True uses SinusoidalPositionalEncoding, False applies no positional encoding. You can also pass a custom positional encoding layer here
        embedder=EasyEmbedder, # The embedder: an nn.Module or Vathos Layer whose __init__ takes at least d_model and vocab_size; it maps a [B, L] torch.Tensor of type Long to a [B, L, d_model] float output
        unembedder=UnbiasedLinear, # The layer that maps the inner representation of shape [B, L, d_model] to [B, L, vocab_size]
        channel_mixer=MLP,  # The channel mixer does not mix along the length of the sequence, only across the embedding (channels). It is the feed-forward network of the original Transformer
        channel_args={'expand': 2, 'activation': SwiGLU, 'depth':2}, # Arguments for each layer are passed this way, as a dict of kwargs
        spatial_mixer=MultiheadAttentionMixer, # The spatial mixer constructor; in our case multi-head attention, since we are building a Transformer
        rope=False, # Only works if MultiheadAttentionMixer is passed as the spatial mixer
        spatial_args={'n_heads': 8, 'causal': True} # kwargs passed to each spatial mixer instance
    )

To clarify important things:

Each layer argument (embedder, unembedder, spatial_mixer, channel_mixer) has to be an nn.Module or Vathos Layer (recommended) constructor, not an instance!
The arguments of the various constructors are passed as dictionaries. For every layer, the d_model argument is redundant and must be omitted, since the model already passes it to each layer.
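
The two rules above can be sketched in plain Python. This is only a guess at the general pattern, not the actual Vathos code: ToyMixer and build_layers are hypothetical names, and the assumption is that the model calls each constructor once per layer with d_model as the first positional argument.

```python
class ToyMixer:
    """Hypothetical layer: d_model comes first, everything else is a keyword."""

    def __init__(self, d_model, n_heads=1, causal=False):
        self.d_model = d_model
        self.n_heads = n_heads
        self.causal = causal


def build_layers(constructor, d_model, n_layers, args=None):
    """Assumed pattern: call the constructor once per layer, injecting d_model."""
    args = args or {}
    return [constructor(d_model, **args) for _ in range(n_layers)]


# Pass the class itself plus a dict of kwargs; d_model is supplied by the builder.
layers = build_layers(ToyMixer, d_model=128, n_layers=6,
                      args={"n_heads": 8, "causal": True})
print(len(layers), layers[0].d_model, layers[0].n_heads)

# Repeating d_model inside the dict collides with the value the builder passes.
try:
    build_layers(ToyMixer, d_model=128, n_layers=1, args={"d_model": 128})
    duplicate_rejected = False
except TypeError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Under this assumption, passing an already built object would make every layer share the same weights, which is why a constructor is expected instead.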

Note

Most of the arguments passed here can be left at their defaults if you are building a simple 1D autoregressive model such as an LLM.

To be precise (arg: default_value): pos_encoder=True, embedder=EasyEmbedder, unembedder=UnbiasedLinear, spatial_mixer=MultiHeadAttention, spatial_args= {'causal': True, 'n_heads': 8}, rope=False

Implementation of a simple KroneckerMixer:

import torch.nn as nn  # needed for nn.GELU below
from Vathos.blocks import *
from Vathos.torch_layers.Kroneckers import KroneckerMixer1
model = SequenceModel(
        vocab_size=VOCAB_SIZE,
        d_model=D_MODEL,
        n_layers=6,
        max_len=2116,
        pos_encoder=True,
        embedder=EasyEmbedder,
        unembedder=UnbiasedLinear,
        channel_mixer=MLP,
        channel_args={'depth': 2, 'expand': 2, 'activation': nn.GELU},
        rope=True,
        spatial_mixer=KroneckerMixer1,
        spatial_args={'max_len': 2116} # Note that d_model is not passed
    ).to(device)

Implementation of a classification ViT:

from Vathos.blocks import *

self.model = SequenceModel(
            vocab_size=n_class,
            d_model=d_model,
            n_layers=n_layers,
            pos_encoder=self.pe,
            rope=self.rope,
            embedder=PatchEmbedder,
            embedder_args={"d_model": d_model, "img_size": img_size, "patch_size": patch_size, "in_chans": in_chans},
            unembedder=MeanClassificationHead,
            spatial_args={"causal": False, "n_heads": n_heads, "rope": self.rope},
            channel_args={"expand": ff_expand, "activation": ACTIVS[activation], "depth": 2}
        ).to(device)

How to build Spatial Mixers and Channel Mixers that work with SequenceModel:

Spatial Mixer

A spatial mixer has to be an nn.Module or Vathos Layer (recommended). Its constructor must take d_model as the first argument, followed by any other arguments you need:

__init__(self, d_model, ...)

and a forward method

forward(self, x: torch.Tensor) -> torch.Tensor

The input tensor is expected to have shape [B, L, d_model], and the output is expected to have the same shape [B, L, d_model].
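
One way to check a candidate mixer against this contract before plugging it into a model is a small shape test. The helper check_spatial_mixer and the trivial IdentityMixer below are hypothetical names written for this sketch in plain PyTorch; they are not part of Vathos.

```python
import torch
import torch.nn as nn


class IdentityMixer(nn.Module):
    """Trivial mixer that satisfies the contract: d_model first, shape preserved."""

    def __init__(self, d_model, scale=1.0):
        super().__init__()
        self.d_model = d_model
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale


def check_spatial_mixer(constructor, d_model=32, batch=2, length=7, **kwargs):
    """Hypothetical helper: build the mixer and verify [B, L, d_model] in and out."""
    mixer = constructor(d_model, **kwargs)
    x = torch.randn(batch, length, d_model)
    y = mixer(x)
    assert y.shape == x.shape, f"expected {tuple(x.shape)}, got {tuple(y.shape)}"
    return y


out = check_spatial_mixer(IdentityMixer, scale=0.5)
print(tuple(out.shape))
```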

Here is an example of a basic spatial mixer, where the full L-by-L spatial mixing matrix is learned (in MLP-Mixer style):

import math
import torch
import torch.nn as nn
from Vathos.blocks import *  # assumption: this also exports the Layer base class

class LinearMixer(Layer): # simply inherit the Layer class
    __name__ = "Linear Mixer" # you can provide a name
    __complexity__ = "O(L^2 d)" # you can provide a complexity so the model can compute its full complexity

    def __init__(self, d_model, max_len=1000, causal=True):
        super().__init__()
        self.causal = causal
        self.d_model = d_model
        # Learned [max_len, max_len] mixing matrix, sliced to the sequence length in forward
        self.W = nn.Parameter(torch.randn(max_len, max_len) * 0.01)

    def forward(self, x):
        x = x / math.sqrt(self.d_model)
        if self.causal:
            # Lower-triangular mask: position i only sees positions <= i
            return torch.tril(self.W[:x.shape[1], :x.shape[1]]) @ x
        else:
            return self.W[:x.shape[1], :x.shape[1]] @ x
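
To see what the causal branch buys you, the sketch below perturbs only the later positions of a sequence and checks that the earlier outputs stay the same. LinearMixerSketch is a hypothetical copy of the logic above that inherits from nn.Module instead of the Vathos Layer class, so it can run without Vathos installed.

```python
import math
import torch
import torch.nn as nn


class LinearMixerSketch(nn.Module):
    """Copy of the LinearMixer logic on nn.Module so it runs without Vathos."""

    def __init__(self, d_model, max_len=1000, causal=True):
        super().__init__()
        self.causal = causal
        self.d_model = d_model
        self.W = nn.Parameter(torch.randn(max_len, max_len) * 0.01)

    def forward(self, x):
        x = x / math.sqrt(self.d_model)
        W = self.W[: x.shape[1], : x.shape[1]]
        if self.causal:
            W = torch.tril(W)  # zero out weights on future positions
        return W @ x  # [L, L] @ [B, L, d_model] broadcasts over the batch


torch.manual_seed(0)
mixer = LinearMixerSketch(d_model=8, max_len=16, causal=True)
x = torch.randn(2, 10, 8)
x_changed = x.clone()
x_changed[:, 6:] = torch.randn(2, 4, 8)  # perturb only positions 6..9

with torch.no_grad():
    y, y_changed = mixer(x), mixer(x_changed)

# Outputs at positions 0..5 must not depend on the perturbed later positions.
past_unchanged = torch.allclose(y[:, :6], y_changed[:, :6])
print(past_unchanged)
```

With causal=False the same check would be expected to fail, since every output position then mixes in every input position.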

After that, simply pass the LinearMixer constructor as spatial_mixer in the SequenceModel init and, if needed, pass its parameters through spatial_args, e.g.:

self.model = SequenceModel(
            vocab_size=10,
            d_model=128,
            n_layers=6,
            pos_encoder=False,
            embedder=EasyEmbedder,
            unembedder=UnbiasedLinear,
            spatial_mixer=LinearMixer,
            spatial_args={"causal": True, "max_len": 2048},
            channel_args={"expand": 2, "activation": nn.GELU, "depth": 2}
        ).to(device)

Hope you understood ¯_(ツ)_/¯
