val_bpb: 1.3587
Architecture: Mamba
Optimizer: Muon
Artifact Size: 15.93 MB
Training Techniques
Architecture
Mamba
A purely recurrent Mamba state-space model baseline with no attention layers.
parameters: {"d_model":640,"d_inner":1280,"d_state":34,"d_conv":4,"num_layers":8,"head_adapter_rank":16,"vocab_size":1056}
weight tying
The output logit matrix is tied to the input embedding matrix, with a low-rank head adapter added on top.
parameters: null
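The tied head with a rank-16 adapter can be sketched as follows; the matrix names are illustrative, as the submission reports only "tied embedding logits with a low-rank head adapter".

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, rank = 640, 1056, 16

E = rng.standard_normal((vocab, d_model)) * 0.02  # shared embedding / logit matrix
A = rng.standard_normal((rank, d_model)) * 0.02   # adapter down-projection
B = np.zeros((vocab, rank))                       # adapter up-projection, zero-init

h = rng.standard_normal(d_model)                  # final hidden state
logits = E @ h + B @ (A @ h)                      # tied logits + low-rank correction
```

With B zero-initialized the adapter starts as a no-op, so training begins from the plain tied-embedding head and the adapter learns a cheap rank-16 correction (about 2 * 16 * 640 extra parameters on top of the shared matrix).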
other
A "fat state" design: d_state is raised to 34 to increase recurrent memory capacity.
parameters: {"d_state":34}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"used_for":"2D weight matrices"}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalar parameters","fused":true}
Weight Averaging
EMA
parameters: {"decay":0.997}
Quantization
GPTQ-lite
bits: 6
scope: all weights; embeddings int8
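The exact "GPTQ-lite" procedure is not reported; full GPTQ additionally compensates rounding error with second-order (Hessian-based) updates. The sketch below shows only the core per-channel symmetric round-to-nearest step at 6 bits, as a simplified stand-in:

```python
import numpy as np

def quantize_int6(W):
    # Per-output-channel symmetric quantization to signed int6 ([-32, 31]).
    qmax = 2 ** (6 - 1) - 1                               # 31
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)              # guard all-zero rows
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.default_rng(0).standard_normal((8, 16)).astype(np.float32)
q, s = quantize_int6(W)
err = np.abs(dequantize(q, s) - W).max()                  # bounded by scale / 2
```

Embeddings are kept at int8 per the scope above, presumably because lookup tables are more sensitive to rounding than matmul weights.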
QAT
bits: 6
scope: weights
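Pairing 6-bit QAT with the post-training quantizer means the forward pass sees quantized weights during training while gradients flow through unchanged (the straight-through estimator). The submission does not detail its QAT recipe; this is a minimal framework-free sketch of that fake-quantization step:

```python
import numpy as np

def fake_quant(w, bits=6):
    # Forward: round to the 6-bit grid, return values in float.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        scale = 1.0
    return np.round(w / scale) * scale

def fake_quant_grad(upstream_grad):
    # Straight-through estimator: round() is treated as identity in backward,
    # so the weight gradient is just the upstream gradient.
    return upstream_grad
```

Training against the quantized forward pass lets the weights settle into values that survive the final int6 export with less loss than quantizing a purely full-precision model.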
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":8}
Other
other
Score-first test-time training: LoRA adapters are updated on the previous window's tokens before the current window is scored, so no window is ever scored by a model that has already trained on it.
parameters: null
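The score-then-adapt ordering can be illustrated with a fully self-contained toy, where the "LoRA adapter" is a single scalar bias trained by SGD. The names and shapes are illustrative, not the submission's; the point is only the loop order: each window is scored first, and only then used for adaptation.

```python
import numpy as np

def score_first_ttt(windows, lr=0.1):
    b = 0.0                                    # stand-in for the adapter state
    losses = []
    for w in windows:
        losses.append(np.mean((w - b) ** 2))   # 1. score the window first
        b -= lr * np.mean(2 * (b - w))         # 2. then adapt on its tokens
    return losses, b

windows = [np.full(8, 3.0), np.full(8, 3.0), np.full(8, 3.0)]
losses, b = score_first_ttt(windows)
# losses decrease across windows as the adapter tracks the data,
# yet every score is computed before the model has seen that window
```

In the actual submission the adapted parameters are the rank-8 LoRA matrices listed above rather than a scalar, but the bookkeeping is the same.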
other
Online entropy-based data filtering using a zlib compression ratio heuristic at batch load time.
parameters: {"thresholds":[4,2.5,2,1.8]}
LR Schedule
warmup + stable + cosine decay
parameters: {"warmup":0.1,"stable":0.7,"decay":0.2}
Regularization
gradient checkpointing
parameters: null
Novel Contributions
- Pure Mamba SSM baseline within the 16 MB / 10-minute constraint
- Fat State design with d_state=34 for richer recurrent memory
- Score-First LoRA TTT adapted for recurrent architectures
- Entropy-based data filtering heuristic at batch load time
- GPTQ-lite int6 compression pipeline with zstd artifact compression