PR #1371

open

Non-record: Olmo Hybrid (GDN + Attention) for long-context training — 8k/16k/32k crossover study

by aarjunsrinivasan
val_bpb
1.4709
Architecture
Hybrid
Optimizer
Muon
Artifact Size
14.06MB

Training Techniques

Architecture
Gated DeltaNet
Replaces most attention layers with Gated DeltaNet linear-recurrence layers, interleaved with full attention in an Olmo Hybrid-style pattern.
parameters: {"layers":9,"gdn_layers":7,"attention_layers":2,"ratio":"3:1","head_dim_ratio":0.75,"expand_v":2,"conv_size":4}
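From the listed parameters (9 layers, 7 GDN + 2 attention, 3:1 ratio), the interleave can be sketched as follows; the exact placement of the attention layers is an assumption, not stated in the submission:

```python
def hybrid_layout(n_layers=9, ratio=3):
    """Hypothetical hybrid layout: every (ratio+1)-th layer is full attention,
    the rest are Gated DeltaNet. With n_layers=9 and ratio=3 this yields
    7 GDN layers and 2 attention layers, matching the listed counts."""
    return ["attn" if (i + 1) % (ratio + 1) == 0 else "gdn" for i in range(n_layers)]
```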
GQA
Uses grouped-query attention in the attention layers.
parameters: {"q_heads":8,"kv_heads":4}
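With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads. A minimal sketch of that grouping (standard GQA; the repo's actual implementation may fuse this differently):

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """Grouped-query attention: q has more heads than k/v, and each kv head
    is shared by a group of query heads via repeat_interleave.
    Shapes: q [B, Hq, T, D], k/v [B, Hkv, T, D], with Hq % Hkv == 0."""
    groups = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(groups, dim=1)  # expand kv heads to match q heads
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```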
U-Net skip connections
Retains the U-Net style skip/residual structure from the baseline transformer.
parameters: {"layers":9}
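One way to realize U-Net skips over an odd number of blocks (9 here) is to save the outputs of the first half and add them, with learned weights, into the mirrored second half. The pairing and weighting below are assumptions about the baseline's structure:

```python
import torch
import torch.nn as nn

class UNetStack(nn.Module):
    """Sketch of U-Net style skips: outputs of the first floor(n/2) blocks
    are added back into the mirrored last floor(n/2) blocks, scaled by
    learned per-skip weights. The middle block (for odd n) has no skip."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.skip_w = nn.Parameter(torch.ones(len(blocks) // 2))

    def forward(self, x):
        n, half = len(self.blocks), len(self.blocks) // 2
        saved = []
        for i, blk in enumerate(self.blocks):
            if i >= n - half:
                # mirrored decoder block: add the matching encoder output
                x = x + self.skip_w[n - 1 - i] * saved.pop()
            x = blk(x)
            if i < half:
                saved.append(x)
        return x
```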
RoPE
Uses rotary positional embeddings.
parameters: null
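RoPE rotates channel pairs of q/k by a position-dependent angle. A reference implementation of the standard formulation (base frequency 10000 assumed, since the card lists no parameters):

```python
import torch

def rope(x, base=10000.0):
    """Rotary positional embeddings over x of shape [B, H, T, D]:
    each even/odd channel pair is rotated by theta_i * t, where
    theta_i = base^(-2i/D) and t is the token position."""
    B, H, T, D = x.shape
    inv_freq = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    ang = torch.arange(T, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., ::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., ::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure rotation, the norm of the input is preserved.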
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"tied_embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02}
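The listed `warmup_momentum_start` of 0.92 against a final momentum of 0.99 suggests a momentum warmup schedule. A sketch of a linear warmup; the warmup horizon is an assumption, not given in the card:

```python
def muon_momentum(step, total_steps, start=0.92, end=0.99, warmup_frac=0.1):
    """Linearly warm Muon's momentum from `start` to `end` over the first
    `warmup_frac` of training, then hold it at `end` (horizon assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac
```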
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_iters":158,"max_wallclock_seconds":600}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
sequence_length
train_length: 16384
eval_length: 16384
sequence_length
train_length: 32768
eval_length: 32768
Quantization
int8
bits: 8
scope: all
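The card only states 8-bit quantization over all weights; a common scheme consistent with that is symmetric per-tensor int8 (the exact scheme used is an assumption):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: scale by max-abs so values
    map into [-127, 127], then round. Returns int8 codes and the scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale
```

The round-trip error per element is at most half a quantization step, i.e. `scale / 2`.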
Compression
zlib
level: null
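To fit the 16MB artifact limit, the quantized weights are zlib-compressed. A sketch of the packing step (the compression level is unspecified in the card; 9 is used here for illustration):

```python
import zlib
import numpy as np

def pack_artifact(quantized_tensors, level=9):
    """Concatenate int8 tensors into one byte stream and zlib-compress it.
    Returns the compressed blob plus raw and compressed sizes in bytes."""
    raw = b"".join(t.tobytes() for t in quantized_tensors)
    blob = zlib.compress(raw, level)
    return blob, len(raw), len(blob)
```

Decompression with `zlib.decompress` recovers the exact byte stream, so the step is lossless.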

Novel Contributions

  • Introduces an Olmo Hybrid-style GDN + attention architecture for parameter-golf long-context training.
  • Shows a clear performance crossover between 8k and 16k context where the hybrid begins outperforming full attention.
  • Demonstrates that the hybrid avoids the baseline's 32k long-context blow-up under the same 600-second wall-clock budget.
  • Uses multi-seed runs to quantify variance at 16k and 32k context lengths.
  • Provides reproduction scripts, logs, and a detailed long-context study within the 16MB artifact limit.