[feat] Hybrid Mamba model with Mamba and discrete Mamba 2 layers #194

Open
wants to merge 44 commits into main

Conversation

Contributor

@oleksost oleksost commented Mar 20, 2025

✨ Description

This PR integrates Mamba-1 and discrete Mamba-2 blocks into the Fast-LLM training pipeline; this is the initial step toward addressing #68.
It introduces a basic hybrid architecture that can interleave transformer, Mamba-1, and discrete Mamba-2 blocks.

Next steps:

Training with a simple hybrid model can be tested as follows:

  1. Install the mamba_ssm and causal-conv1d dependencies: pip install "mamba_ssm[causal-conv1d]==2.2.4"
  2. Launch training by passing

        "args": [
                    "train",
                    "hybrid_ssm",
                    "--config",
                    "path/to/hybrid_config.yaml"
                ],

and the following simple config to build a hybrid model:

model:
  base_model:
    transformer:
      num_layers: 6
      use_flash_attention: no  
    ssm:
      dt_rank: auto
      state_size: 16
      expansion_factor: 2
      debug_ssm: false
    block_pattern: ["m", "t", "m", "m2", "m", "m"] # mixing transformer, mamba 1 and descrete mamba layers
  
  distributed:
    training_dtype: bf16
    tensor_parallel: 1 
    pipeline_parallel: 1
    world_size: 1 

training:
  train_iters: 1000  
  logs:
    interval: 10
  validation:
    iterations: 25
    interval: 1000
  wandb:  
    project_name: fast-llm-ssm-test
    group_name: ssm
    entity_name: null

data:
  datasets:
    Training:
      type: memmap
      path: /home/toolkit/dev/fast-llm-tutorial/dataset/shard_0_0
    Validation:
      type: memmap
      path: /home/toolkit/dev/fast-llm-tutorial/dataset/shard_0_0
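
For a quick end-to-end check, the same launch can also be driven programmatically. This is a sketch only: the fast_llm(args) entry function appears in this PR's CLI changes, but the import path below is a guess:

    # Hypothetical smoke test: call the CLI entry point with the same arguments
    # as the launch configuration above. The module path is an assumption.
    from fast_llm.tools.cli import fast_llm

    fast_llm(["train", "hybrid_ssm", "--config", "path/to/hybrid_config.yaml"])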

To load the Llamba-1B model, add the following to the config:

pretrained:
  format: llamba
  path: /mnt/checkpoints/pretrained_models/Llamba-1B

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

  • added minimal mamba-1 layer and block
  • added hybrid model and corresponding configs
  • the implementation follows the one from https://github.com/Zyphra/Zamba2 and https://github.com/state-spaces/mamba

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

🗒️ Additional Notes

  • Currently, some parameters used to define the hybrid model's architecture live in the transformer config (e.g. num_layers); they should probably be moved to higher-level configs at some point.

@oleksost oleksost marked this pull request as draft March 20, 2025 02:47
@oleksost oleksost changed the title from "Mamba 1 blocks" to "[feat] Mamba 1 blocks" Mar 20, 2025
@tscholak tscholak added the enhancement (New feature or request) label Mar 23, 2025
@oleksost oleksost changed the title from "[feat] Mamba 1 blocks" to "[feat] Hybrid Mamba-1 model" Mar 31, 2025
@oleksost oleksost requested a review from tscholak March 31, 2025 13:06
@oleksost oleksost requested a review from jlamypoirier March 31, 2025 17:22
@tscholak tscholak marked this pull request as ready for review March 31, 2025 20:01
@oleksost oleksost changed the title from "[feat] Hybrid Mamba-1 model" to "[feat] Hybrid Mamba model with mamba and discrete mamba 2 layers" Mar 31, 2025
This was linked to issues Apr 7, 2025
hint=FieldHint.core,
)

default_block: str = Field(
Collaborator

Redundant with block_pattern.

Contributor Author
@oleksost oleksost Apr 8, 2025

Currently, this is needed to load Llamba model: we set default_block to m2 in _create_config_converters of LLambaHuggingfaceCheckpointHandler and the block_pattern is then created in the __post_init__ of HybridSSMBaseModelConfig. This is a bit cumbersome indeed.

Collaborator

Why can't it set block_pattern? Also this would need to go in _validate, not __post_init__

Contributor Author

AFAIU, _create_config_converters does not know about num_layers in the loaded config, so we cannot set the block pattern there, as it depends on the number of layers.

Moved the __post_init__ logic for the block config to _validate.

Collaborator

Oh so block_pattern is not specifying a repeated pattern, but the entire list? Why not just repeating the list up to num_layers instead, as the name suggests?

Contributor Author
@oleksost oleksost Apr 11, 2025

Renamed block_pattern to hybrid_block_layout; default_block repeated num_layers times is now used when hybrid_block_layout is not specified.

Collaborator

Ok, but why not a single variable with automated repetition?

Collaborator

@oleksost, try merging these but use this default factory:

    hybrid_block_layout: list[str] = Field(
        default_factory=lambda: ['m'],
        desc="Pattern of blocks to use in the model. 't' for Transformer, 'm' for Mamba1, 'm2' for Descrete Mamba2.",
        hint=FieldHint.core,
    )

that avoids the mutable default trap.
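
The automated repetition itself can then be a small validation step. The following is a sketch only (the helper name is hypothetical; in Fast-LLM this would live inside the config's _validate and use transformer.num_layers):

    # Sketch: a single-entry layout (e.g. the default ['m']) is repeated to cover
    # every layer; an explicit layout must already match the layer count.
    def expand_block_layout(hybrid_block_layout: list[str], num_layers: int) -> list[str]:
        if len(hybrid_block_layout) == 1:
            return hybrid_block_layout * num_layers
        assert len(hybrid_block_layout) == num_layers, "hybrid_block_layout must have num_layers entries"
        return hybrid_block_layout

For example, expand_block_layout(["m"], 6) yields six Mamba-1 blocks, while expand_block_layout(["m", "t", "m", "m2", "m", "m"], 6) passes through unchanged.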

@oleksost oleksost requested a review from jlamypoirier April 8, 2025 15:42
@oleksost oleksost requested a review from jlamypoirier April 11, 2025 15:00


@config_class()
class SSMArchitectureConfig(BaseModelArchitectureConfig):
Collaborator

Please adjust field names for our naming conventions.

hint=FieldHint.core,
)

dt_rank: str | int = Field(
Collaborator

Please use None for derived defaults: dt_rank: int = Field(default=None, ...)
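
For illustration, the full field could then look like this; the desc wording is an assumption, and the actual derivation from the hidden size would happen during validation:

    dt_rank: int = Field(
        default=None,
        desc="Rank of the dt projection; derived from the hidden size when None.",
        hint=FieldHint.core,
    )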

strict: bool = True,
flat: bool = False,
) -> typing.Self:
if "hybrid_block_layout" in default and isinstance(default["hybrid_block_layout"], dict):
Collaborator

Why would it be a dict? There must be another problem elsewhere.

Collaborator

Do we have tests that check whether serializing/deserializing lists of strings works correctly? Strings have the special property of being list-like themselves, kind of; there could be issues because of that, and they may not be caught by tests that only cover lists of ints, say.

Collaborator

@oleksost I see a patch to set_nested_dict_value below. My hunch is that this isn't working properly
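
A minimal sketch of such a round-trip check, using plain PyYAML rather than Fast-LLM's own config serializer (so it only illustrates the string-vs-list concern, not the actual code path):

    import yaml

    def test_hybrid_block_layout_roundtrip():
        # A str is itself a sequence, so make sure the layout survives as a list
        # of strings rather than being split into characters or coerced elsewhere.
        original = {"hybrid_block_layout": ["m", "t", "m2"]}
        restored = yaml.safe_load(yaml.safe_dump(original))
        assert restored["hybrid_block_layout"] == ["m", "t", "m2"]
        assert all(isinstance(block, str) for block in restored["hybrid_block_layout"])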

)
elif block_type == "m2":
# Create Mamba2 descrete block
mixer_cls = partial(DiscreteMamba2, layer_idx=i)
Collaborator

Not needed, you're already passing layer_idx to the mixer_cls() call

return init_


class MambaLayer(torch.nn.Module):
Collaborator

This doesn't work with TP; we need to explicitly prevent it. (Not sure about PP.)

Collaborator

we're gonna have TP eventually, not in scope for this PR though
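
In the meantime, explicitly preventing it could look roughly like the sketch below; the function name and where the two values come from are assumptions (in practice the check would sit in a _validate method):

    # Sketch: fail fast when tensor parallelism is requested together with SSM blocks.
    def check_ssm_parallelism(hybrid_block_layout: list[str], tensor_parallel: int) -> None:
        if tensor_parallel > 1 and any(block != "t" for block in hybrid_block_layout):
            raise NotImplementedError("SSM blocks (Mamba layers) do not support tensor parallelism yet.")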

@tscholak
Collaborator

@jlamypoirier, naming-wise, "layer" implies a single functional unit, like attention or an MLP. But here, the repeated unit is a composite: It includes a mixer (attention, SSM, etc.), normalization, residuals, and post-mixer processing (MLPs, MoEs, etc.). This structure isn't atomic. It's multiple layers stitched into a reusable computation unit, which is typically referred to as a "block" in other model families (e.g., ResNet, Swin, and even “Transformer blocks” in many papers and blogs).
Now that we're supporting architectures beyond transformers (Mamba, etc.), the term “block” avoids misleading assumptions:

  1. "Layer" strongly implies a fixed layout rooted in the transformer design.
  2. "Block" reflects the actual structure: a swappable, composite computation unit.

Keeping the name BaseBlock lets us consistently subclass it for Mamba, attention, etc., without implying that all of them are "layers" in the same architectural tradition. In this sense, "TransformerLayer" can be a Block instance, but not every Block is a TransformerLayer.

Also, and I find this the most important argument: internally and casually, we already call them blocks. Making the code match mental models reduces friction.

@jlamypoirier
Collaborator

@tscholak Sounds reasonable; the problem is that these things are called "layers" everywhere else in Fast-LLM. Should we think about renaming these too?

@@ -32,7 +32,6 @@ def fast_llm(args=None):
sys.exit(1)
except Exception: # noqa
logger.critical(traceback.format_exc())
sys.exit(1)
Collaborator

I know we need this line to be removed for the debugger to work. You could do this instead:

    except Exception:  # noqa
        if sys.gettrace():
            raise
        logger.critical(traceback.format_exc())
        sys.exit(1)

Collaborator

That would work, we do need that line outside a debugger. Same thing needed for ValidationError above.
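
Applied to the ValidationError branch, the same guard might look like this; the handler body shown here just mirrors the generic handler above and is an assumption:

    except ValidationError:  # noqa
        if sys.gettrace():
            raise
        logger.critical(traceback.format_exc())
        sys.exit(1)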

else:
d[int(keys[-1])] = value
else:
d[keys[-1]] = value
Collaborator

This whole function is mysterious. It blindly creates empty dictionaries (setdefault(key, {})), even when we might actually want lists.
On top of this, do we need this patch? What does this do? The special case for lists is confusing. Do we have tests for the added behavior?

Collaborator

This method can't really support lists; let's not make up a new handling rule for them.
I don't really see what the use would be here anyway, since it's easy to just pass a list of strings.
Also, set_nested_dict_value has been rewritten in main...

Collaborator
@tscholak tscholak left a comment

Looks already very good to me, thanks @oleksost
This should go in asap!
I had a few comments and suggestions. Mostly, I think we don't want to be too picky with this PR at this point because we're going to be actively working on improving many parts of this anyway in the next weeks.

@@ -91,7 +94,7 @@ def _debug_log(self, tensor: torch.Tensor | None, name: str, kwargs: dict[str, t
distributed=self._tensor_space.distributed,
)

def forward(
def _forward_impl(
Collaborator

Why the renaming?

Collaborator
@jlamypoirier jlamypoirier left a comment

I guess we can ignore some issues with the current SSM implementation, if we handle it properly. This means clear warnings that the model is experimental and may break at any point, e.g. at model runtime and/or in file headers. We still need to fix the code changes outside the model though (transformers.py and set_nested_dict_value).

Also keep in mind that future modifications may break experiment configs, pretrained models and checkpoints, hence the importance of getting good config and parameter structures as soon as possible.


@@ -1,5 +1,6 @@
import logging
import typing
from abc import ABC, abstractmethod
Collaborator

Import

Collaborator

This is a good moment to reflect on the pattern.

Right now, Fast-LLM expects contributors to internalize a set of nuanced import rules (see https://servicenow.github.io/Fast-LLM/contributing/style-guide/#imports) that go beyond what most Python projects require. That may have worked at some point, but it doesn't scale. New contributors can't memorize this, and even returning ones keep tripping over it.

If this style is important, it needs to be enforced and automatically fixable through linting or a pre-commit hook. If it can't be, we should let go of it. Patterns that can't be learned quickly or applied automatically create friction and slow down the team.

@jlamypoirier, could you file a ticket outlining what it would take to automate this rule? That's the only sustainable way forward.

@tscholak
Collaborator

This means clear warnings that the model is experimental and may break at any point, e.g. at model runtime and/or in file headers.

@oleksost, can you emit a warning in the logger when someone tries to instantiate the config for this model class? That should be enough for now.
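
A sketch of what that could look like; the helper name, its placement (e.g. called from the hybrid config's _validate), and the wording are all assumptions:

    import logging

    logger = logging.getLogger(__name__)

    def _warn_experimental() -> None:
        # Intended to be called when the hybrid SSM config is instantiated/validated.
        logger.warning(
            "The hybrid SSM model is experimental and may change or break at any time "
            "(runtime behavior, config structure, checkpoint formats)."
        )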

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Llamba support
[feat] Support Mamba 2 blocks
3 participants