SequenceModel Implementation Examples
To briefly recall, SequenceModel is a VathosModel object, which is a Vathos Layer, which in turn is a torch nn.Module. It therefore behaves like a normal nn.Module, plus plenty of extra features. SequenceModel takes its structure from the pre-norm Transformer architecture; each layer applies:
x = x + self.spatial_mixer(self.norm1(x))
x = x + self.channel_mixer(self.norm2(x))

from Vathos.blocks import *  # this contains SequenceModel, MLP, MultiHeadAttentionMixer, EasyEmbedder and UnbiasedLinear for this code, but also many other layers
model = SequenceModel(
    vocab_size=100,  # set this to your vocab size
    d_model=128,  # the model dimension (aka embedding dimension / hidden size)
    n_layers=6,  # the number of layers
    max_len=2116,  # sinusoidal positional encodings and other encodings often require this parameter to be set
    pos_encoder=True,  # True uses SinusoidalPositionalEncoding, False does nothing; you can also pass a custom positional encoding Layer here
    embedder=EasyEmbedder,  # the embedder: an nn.Module or Vathos Layer whose __init__ takes at least d_model and vocab_size; it maps a [B, L] Long tensor to a [B, L, d_model] float tensor
    unembedder=UnbiasedLinear,  # the layer that maps the inner representation of shape [B, L, d_model] to [B, L, vocab_size]
    channel_mixer=MLP,  # the channel mixer does not mix along the length of the sequence, only across the embedding (channels); it is the feed-forward network of the original Transformer
    channel_args={'expand': 2, 'activation': SwiGLU, 'depth': 2},  # arguments for every layer are passed this way, as a dict of kwargs
    spatial_mixer=MultiheadAttentionMixer,  # the spatial mixer constructor; in our case multi-head attention, since we are building a Transformer
    rope=False,  # only works if MultiheadAttentionMixer is passed as the spatial mixer
    spatial_args={'n_heads': 8, 'causal': True}  # kwargs passed to each spatial mixer instance
)

To clarify important things:
- Each layer argument (embedder, unembedder, spatial_mixer, channel_mixer) has to be an nn.Module or Vathos Layer (recommended) constructor, not an instance!
- The arguments of the various constructors are passed as dictionaries. For every layer, the d_model argument is redundant and must be omitted, since the model already passes it to each layer!
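
The two points above can be sketched in plain PyTorch. This is only an illustration of the constructor-plus-kwargs convention and of the pre-norm update, not Vathos' actual implementation: `PreNormBlock`, `MeanMixer` and `TinyMLP` are hypothetical names, and the use of `nn.LayerNorm` for the norms is an assumption.

```python
import torch
from torch import nn


class MeanMixer(nn.Module):
    """Hypothetical spatial mixer: replaces each token by the sequence mean."""

    def __init__(self, d_model, scale=1.0):
        super().__init__()
        self.scale = scale

    def forward(self, x):  # x: [B, L, d_model]
        return self.scale * x.mean(dim=1, keepdim=True).expand_as(x)


class TinyMLP(nn.Module):
    """Hypothetical channel mixer: acts on the last (channel) dimension only."""

    def __init__(self, d_model, expand=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expand * d_model),
            nn.GELU(),
            nn.Linear(expand * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)


class PreNormBlock(nn.Module):
    """Hypothetical pre-norm block built from constructors plus kwargs dicts."""

    def __init__(self, d_model, spatial_mixer, channel_mixer,
                 spatial_args=None, channel_args=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # assumption: LayerNorm as the norm
        self.norm2 = nn.LayerNorm(d_model)
        # The block supplies d_model itself, so the dicts must not contain it.
        self.spatial_mixer = spatial_mixer(d_model, **(spatial_args or {}))
        self.channel_mixer = channel_mixer(d_model, **(channel_args or {}))

    def forward(self, x):
        x = x + self.spatial_mixer(self.norm1(x))
        x = x + self.channel_mixer(self.norm2(x))
        return x


# Classes are passed, not instances; the block instantiates them.
block = PreNormBlock(
    d_model=16,
    spatial_mixer=MeanMixer,
    channel_mixer=TinyMLP,
    channel_args={"expand": 4},
)
out = block(torch.randn(2, 5, 16))
print(out.shape)  # torch.Size([2, 5, 16])
```

In this sketch, putting `d_model` inside `spatial_args` or `channel_args` raises a `TypeError`, because the block already passes it positionally. That is one plausible reason for the "omit d_model" rule.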
Note
Most of the arguments passed above can be left at their default values if you're building a simple 1D auto-regressive model such as an LLM.
To be precise, the defaults (arg=default_value) are: pos_encoder=True, embedder=EasyEmbedder, unembedder=UnbiasedLinear, spatial_mixer=MultiHeadAttention, spatial_args={'causal': True, 'n_heads': 8}, rope=False

from torch import nn  # needed for nn.GELU below
from Vathos.blocks import *
from Vathos.torch_layers.Kroneckers import KroneckerMixer1

# VOCAB_SIZE, D_MODEL and device are assumed to be defined elsewhere
model = SequenceModel(
    vocab_size=VOCAB_SIZE,
    d_model=D_MODEL,
    n_layers=6,
    max_len=2116,
    pos_encoder=True,
    embedder=EasyEmbedder,
    unembedder=UnbiasedLinear,
    channel_mixer=MLP,
    channel_args={'depth': 2, 'expand': 2, 'activation': nn.GELU},
    rope=True,
    spatial_mixer=KroneckerMixer1,
    spatial_args={'max_len': 2116}  # note that d_model is not passed
).to(device)

from Vathos.blocks import *
self.model = SequenceModel(
    vocab_size=n_class,
    d_model=d_model,
    n_layers=n_layers,
    pos_encoder=self.pe,
    rope=self.rope,
    embedder=PatchEmbedder,
    embedder_args={"d_model": d_model, "img_size": img_size, "patch_size": patch_size, "in_chans": in_chans},
    unembedder=MeanClassificationHead,
    spatial_args={"causal": False, "n_heads": n_heads, "rope": self.rope},
    channel_args={"expand": ff_expand, "activation": ACTIVS[activation], "depth": 2}
).to(device)

A spatial mixer has to be an nn.Module or a Vathos Layer (recommended). Its constructor must take d_model as the first argument, followed by any other arguments you need:
__init__(self, d_model, ...)
and it must define a forward method:
forward(self, x: torch.Tensor) -> torch.Tensor
The input tensor is expected to have shape [B, L, d_model], and the output is expected to have the same shape, [B, L, d_model].
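
Before wiring a custom mixer into a model, the shape contract described above can be checked in isolation. The following is a minimal sketch in plain PyTorch; `check_spatial_mixer`, `CumulativeMeanMixer` and `BrokenMixer` are hypothetical helpers written for this illustration and are not part of Vathos.

```python
import torch
from torch import nn


def check_spatial_mixer(mixer_cls, d_model=32, seq_len=10, batch=4, **mixer_args):
    """Build mixer_cls(d_model, **mixer_args) and verify [B, L, d_model] in and out."""
    mixer = mixer_cls(d_model, **mixer_args)
    x = torch.randn(batch, seq_len, d_model)
    with torch.no_grad():
        y = mixer(x)
    if y.shape != x.shape:
        raise ValueError(
            f"expected output shape {tuple(x.shape)}, got {tuple(y.shape)}"
        )
    return mixer


class CumulativeMeanMixer(nn.Module):
    """Toy mixer: token t becomes the mean of tokens 0..t."""

    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model

    def forward(self, x):
        counts = torch.arange(1, x.shape[1] + 1, device=x.device, dtype=x.dtype)
        return x.cumsum(dim=1) / counts.view(1, -1, 1)


class BrokenMixer(nn.Module):
    """Pools over the sequence, so it violates the shape contract."""

    def __init__(self, d_model):
        super().__init__()

    def forward(self, x):
        return x.mean(dim=1)  # [B, d_model], the L axis is lost


check_spatial_mixer(CumulativeMeanMixer)  # passes silently
try:
    check_spatial_mixer(BrokenMixer)
except ValueError as err:
    print("rejected:", err)
```

The helper takes the class and its kwargs separately, mirroring the way spatial_mixer and spatial_args are passed to SequenceModel.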
Here is an example of a basic spatial mixer, where the full L-by-L spatial mixing matrix is learned (in MLP-Mixer style):
import math
import torch
from torch import nn

class LinearMixer(Layer):  # simply inherit from the Vathos Layer class
    __name__ = "Linear Mixer"  # you can provide a name
    __complexity__ = "O(L^2 d)"  # you can provide a complexity so the model can compute its full complexity
    def __init__(self, d_model, max_len=1000, causal=True):
        super().__init__()
        self.causal = causal
        self.d_model = d_model
        self.W = nn.Parameter(torch.randn(max_len, max_len) * 0.01)  # learned mixing matrix
    def forward(self, x):
        x = x / math.sqrt(self.d_model)
        L = x.shape[1]  # use only the top-left L x L part of W
        if self.causal:
            return torch.tril(self.W[:L, :L]) @ x  # lower-triangular: no access to future tokens
        else:
            return self.W[:L, :L] @ x

After that, simply pass the LinearMixer constructor as spatial_mixer in the SequenceModel init and, if needed, pass its parameters through spatial_args, e.g.
self.model = SequenceModel(
    vocab_size=10,
    d_model=128,
    n_layers=6,
    pos_encoder=False,
    embedder=EasyEmbedder,
    unembedder=UnbiasedLinear,
    spatial_mixer=LinearMixer,
    spatial_args={"causal": True, "max_len": 2048},
    channel_args={"expand": 2, "activation": nn.GELU, "depth": 2}
).to(device)

Hope you understood ¯_(ツ)_/¯
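
As an optional sanity check for a mixer like LinearMixer, one can test that the causal variant really ignores future tokens. The sketch below uses a plain nn.Module copy of the same idea so it runs without Vathos; `PlainLinearMixer` and `leaks_future` are hypothetical names introduced only for this illustration.

```python
import math

import torch
from torch import nn


class PlainLinearMixer(nn.Module):
    """Plain-PyTorch copy of the LinearMixer idea (nn.Module instead of Layer)."""

    def __init__(self, d_model, max_len=1000, causal=True):
        super().__init__()
        self.causal = causal
        self.d_model = d_model
        self.W = nn.Parameter(torch.randn(max_len, max_len) * 0.01)

    def forward(self, x):
        L = x.shape[1]
        W = self.W[:L, :L]
        if self.causal:
            W = torch.tril(W)  # zero out weights on future positions
        return W @ (x / math.sqrt(self.d_model))


def leaks_future(mixer, d_model, seq_len=8, t=3):
    """True if changing tokens after position t alters outputs at positions <= t."""
    x = torch.randn(1, seq_len, d_model)
    x_perturbed = x.clone()
    x_perturbed[:, t + 1:] += 1.0  # modify only the "future" tokens
    with torch.no_grad():
        y, y_perturbed = mixer(x), mixer(x_perturbed)
    return not torch.allclose(y[:, : t + 1], y_perturbed[:, : t + 1])


torch.manual_seed(0)
print(leaks_future(PlainLinearMixer(16, max_len=32, causal=True), 16))   # False
print(leaks_future(PlainLinearMixer(16, max_len=32, causal=False), 16))  # True
```

The same `leaks_future` check can be pointed at any module that follows the [B, L, d_model] contract, which is handy when writing a mixer meant for auto-regressive use.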