PR #1371

open

Non-record: Olmo Hybrid (GDN + Attention) for long-context training — 8k/16k/32k crossover study

by aarjunsrinivasan
val_bpb
1.4709
Architecture
Hybrid
Optimizer
Muon
Artifact Size
14.06MB

Training Techniques

Architecture
Gated DeltaNet
Replaces most attention layers with Gated DeltaNet linear-recurrence layers, interleaved with full attention in an Olmo Hybrid-style pattern.
parameters: {"layers":9,"gdn_layers":7,"attention_layers":2,"ratio":"3:1","head_dim_ratio":0.75,"expand_v":2,"conv_size":4}
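From the listed parameters (9 layers, 7 GDN + 2 attention, 3:1 ratio), the interleave can be sketched as follows; the exact placement of the attention layers is an assumption, not stated in the submission:

```python
def hybrid_layout(n_layers=9, ratio=3):
    """Hypothetical hybrid layout: every (ratio+1)-th layer is full attention,
    the rest are Gated DeltaNet. With n_layers=9 and ratio=3 this yields
    7 GDN layers and 2 attention layers, matching the listed counts."""
    return ["attn" if (i + 1) % (ratio + 1) == 0 else "gdn" for i in range(n_layers)]
```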
GQA
Uses grouped-query attention in the attention layers.
parameters: {"q_heads":8,"kv_heads":4}
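With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads. A minimal sketch of that grouping (standard GQA; the repo's actual implementation may fuse this differently):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Grouped-query attention: q has more heads than k/v, and each kv head
    is shared by a group of query heads via repeat_interleave.
    Shapes: q [B, Hq, T, D], k/v [B, Hkv, T, D], with Hq % Hkv == 0."""
    groups = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(groups, dim=1)  # expand kv heads to match q heads
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```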
U-Net skip connections
Retains the U-Net style skip/residual structure from the baseline transformer.
parameters: {"layers":9}
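One way to realize U-Net skips over an odd number of blocks (9 here) is to save the outputs of the first half and add them, with learned weights, into the mirrored second half. The pairing and weighting below are assumptions about the baseline's structure:

```python
import torch
import torch.nn as nn

class UNetStack(nn.Module):
    """Sketch of U-Net style skips: outputs of the first floor(n/2) blocks
    are added back into the mirrored last floor(n/2) blocks, scaled by
    learned per-skip weights. The middle block (for odd n) has no skip."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.skip_w = nn.Parameter(torch.ones(len(blocks) // 2))

    def forward(self, x):
        n, half = len(self.blocks), len(self.blocks) // 2
        saved = []
        for i, blk in enumerate(self.blocks):
            if i >= n - half:
                # mirrored decoder block: add the matching encoder output
                x = x + self.skip_w[n - 1 - i] * saved.pop()
            x = blk(x)
            if i < half:
                saved.append(x)
        return x
```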
RoPE
Uses rotary positional embeddings.
parameters: null
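RoPE rotates channel pairs of q/k by a position-dependent angle. A reference implementation of the standard formulation (base frequency 10000 assumed, since the card lists no parameters):

```python
import torch

def rope(x, base=10000.0):
    """Rotary positional embeddings over x of shape [B, H, T, D]:
    each even/odd channel pair is rotated by theta_i * t, where
    theta_i = base^(-2i/D) and t is the token position."""
    B, H, T, D = x.shape
    inv_freq = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    ang = torch.arange(T, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure rotation, the norm of the input is preserved.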
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"tied_embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02}
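The listed `warmup_momentum_start` of 0.92 against a final momentum of 0.99 suggests a momentum warmup schedule. A sketch of a linear warmup; the warmup horizon is an assumption, not given in the card:

```python
def muon_momentum(step, total_steps, start=0.92, end=0.99, warmup_frac=0.1):
    """Linearly warm Muon's momentum from `start` to `end` over the first
    `warmup_frac` of training, then hold it at `end` (horizon assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac
```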
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iters":158,"max_wallclock_seconds":600}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
sequence_length
train_length: 16384
eval_length: 16384
sequence_length
train_length: 32768
eval_length: 32768
Quantization
int8
bits: 8
scope: all
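The card only states 8-bit quantization over all weights; a common scheme consistent with that is symmetric per-tensor int8 (the exact scheme used is an assumption):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: scale by max-abs so values
    map into [-127, 127], then round. Returns int8 codes and the scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale
```

The round-trip error per element is at most half a quantization step, i.e. `scale / 2`.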
Compression
zlib
level: null
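To fit the 16MB artifact limit, the quantized weights are zlib-compressed. A sketch of the packing step (the compression level is unspecified in the card; 9 is used here for illustration):

```python
import zlib
import numpy as np

def pack_artifact(quantized_tensors, level=9):
    """Concatenate int8 tensors into one byte stream and zlib-compress it.
    Returns the compressed blob plus raw and compressed sizes in bytes."""
    raw = b"".join(t.tobytes() for t in quantized_tensors)
    blob = zlib.compress(raw, level)
    return blob, len(raw), len(blob)
```

Decompression with `zlib.decompress` recovers the exact byte stream, so the step is lossless.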

Novel Contributions

  • Introduces an Olmo Hybrid-style GDN + attention architecture for parameter-golf long-context training.
  • Shows a clear performance crossover between 8k and 16k context where the hybrid begins outperforming full attention.
  • Demonstrates that the hybrid avoids the baseline's 32k long-context blow-up under the same 600-second wall-clock budget.
  • Uses multi-seed runs to quantify variance at 16k and 32k context lengths.
  • Provides reproduction scripts, logs, and a detailed long-context study within the 16MB artifact limit.