PR #1170

open

Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (single seed)

by Christopher-Lee-McClendon
val_bpb: 1.1199
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,745,776 bytes

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
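A minimal sketch of the query-to-KV head mapping this implies: with 8 query heads sharing 4 KV heads, each KV head serves a group of two query heads. The consecutive-grouping convention below is the usual one, but the PR's exact grouping isn't shown.

```python
# Grouped query attention (GQA) head sharing: 8 query heads, 4 KV heads,
# so each KV head is shared by 8 // 4 = 2 query heads.
HEADS = 8
KV_HEADS = 4

def kv_head_for(query_head: int) -> int:
    """Return the index of the KV head a given query head attends with."""
    group_size = HEADS // KV_HEADS
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(HEADS)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```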
BigramHash
Bigram hash embedding component used in the model.
parameters: {"vocab":4096,"dim":128}
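One plausible reading of this component, sketched below: each position's embedding draws on a hashed (previous token, current token) bigram looked up in a 4096-slot table of 128-dim vectors. The hash function and how the vector is mixed into the main embedding are assumptions, not the PR's actual implementation.

```python
# Illustrative bigram hash embedding lookup. Table size and dimension match
# the listed parameters; the multiplicative hash is a placeholder.
import random

VOCAB_BUCKETS = 4096
DIM = 128
random.seed(0)
table = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(VOCAB_BUCKETS)]

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative hash of the token pair into the table.
    return (prev_tok * 1000003 + cur_tok) % VOCAB_BUCKETS

def bigram_embedding(prev_tok: int, cur_tok: int) -> list[float]:
    return table[bigram_bucket(prev_tok, cur_tok)]

vec = bigram_embedding(17, 42)
print(len(vec))  # 128
```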
XSA
Cross-sequence attention applied to all transformer layers.
parameters: {"layers":11}
Value Residual
Adds value residual connections in attention.
parameters: null
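A minimal sketch of what a value residual typically looks like: a deeper layer's attention values are blended with the values computed at the first layer. The mixing weight is usually learned; the fixed 0.5 below is a placeholder assumption.

```python
# Value residual: blend this layer's attention values with the first
# layer's values. `alpha` stands in for a learned mixing weight.

def mix_values(v_layer: list[float], v_first: list[float], alpha: float = 0.5) -> list[float]:
    return [alpha * a + (1 - alpha) * b for a, b in zip(v_layer, v_first)]

print(mix_values([1.0, 2.0], [3.0, 4.0]))  # [2.0, 3.0]
```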
Gated Attention
Uses gated attention in the transformer blocks.
parameters: null
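A sketch of one common form of attention gating: the attention output is multiplied elementwise by a sigmoid gate. In the model the gate would come from a learned projection of the block input; where exactly the gate is applied here is an assumption.

```python
# Elementwise sigmoid gating on an attention output. Gate logits are given
# directly for illustration; in practice they come from a learned projection.
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated(attn_out: list[float], gate_logits: list[float]) -> list[float]:
    return [o * sigmoid(g) for o, g in zip(attn_out, gate_logits)]

print(gated([2.0, 2.0], [0.0, 100.0]))  # zero logit halves; large logit passes through
```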
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"variant":"squared","slope":0.5}
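The activation above, sketched with slope 0.5. Naively squaring a LeakyReLU output would make negative inputs positive, so a sign-preserving squaring is one plausible reading; which variant the PR actually uses isn't shown.

```python
# LeakyReLU-squared activation (sign-preserving variant, an assumption):
# apply LeakyReLU with negative slope 0.5, square the magnitude, keep the sign.

def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    y = x if x > 0 else slope * x
    return y * y if y > 0 else -(y * y)

print(leaky_relu_squared(3.0))   # 9.0
print(leaky_relu_squared(-2.0))  # -1.0: (0.5 * -2)^2 with the sign kept
```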
Partial RoPE
Applies partial rotary positional encoding.
parameters: {"dimensions":16,"base":10000}
weight tying
Tied input and output embeddings.
parameters: null
MLP3x
Uses 3x MLP expansion.
parameters: {"multiplier":3}
NativeFlowMatcher
Conditional flow matching velocity network inserted after final LayerNorm to correct hidden states.
parameters: {"hidden_dim":256}
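A very rough sketch of the correction step the description implies: a velocity field v(h, t) predicts how to move the post-LayerNorm hidden state, integrated with a few Euler steps. The linear `velocity` below is a hypothetical stand-in for the learned network (hidden_dim 256), and the number of integration steps is an assumption.

```python
# Flow-matching-style hidden-state correction via Euler integration.
# `velocity` is a placeholder for the learned velocity network.

def velocity(h: list[float], t: float) -> list[float]:
    # Hypothetical placeholder dynamics: pull the state toward zero.
    return [-x for x in h]

def correct_hidden(h: list[float], steps: int = 4) -> list[float]:
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(h, k * dt)
        h = [x + dt * vx for x, vx in zip(h, v)]
    return h

print(correct_hidden([1.0, -1.0]))  # each Euler step scales by (1 - 1/4)
```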
Quantization
mixed int6/int5
bits: null
scope: MLP layers
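A sketch of symmetric quantization at a given bit width, as would apply to the mixed int6/int5 MLP weights. Per-tensor scaling and round-to-nearest are assumptions; the PR's exact scheme (and which layers get 6 vs 5 bits) isn't shown.

```python
# Symmetric quantization: map floats to signed integers in [-qmax, qmax]
# with a single scale, then dequantize back.

def quantize(w: list[float], bits: int) -> tuple[list[int], float]:
    qmax = 2 ** (bits - 1) - 1  # 31 for int6, 15 for int5
    scale = max(abs(x) for x in w) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

q, s = quantize([0.5, -1.0, 0.25], bits=6)
print(dequantize(q, s))  # close to the original values
```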
Compression
zstd
level: 16
Evaluation
sliding window eval
parameters: {"stride":64}
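A sketch of what stride-64 sliding-window evaluation usually means: the model sees a full context window, but only the last `stride` tokens of each window are scored, so every token is evaluated exactly once with near-maximal context. Pairing the window length with eval_length 1024 (listed below) is an assumption.

```python
# Sliding-window evaluation spans: tokens in [score_from, end) are scored,
# with [start, end) supplied as context.

def eval_windows(n_tokens: int, window: int = 1024, stride: int = 64):
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, pos))
        pos = end
    return spans

spans = eval_windows(200, window=128, stride=64)
print(spans)
```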
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":10,"chunk_size":32768,"freeze_blocks":2,"momentum":0.9}
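A sketch of the "score-first" pattern that keeps this TTT legal: each chunk is scored with the current weights before the model trains on it, so no token is ever evaluated by weights that have already seen it. The `score` and `update` callables below are hypothetical stand-ins for the real model's loss and optimizer step.

```python
# Score-first test-time training loop: evaluate each chunk, then adapt on it.

def score_first_ttt(chunks, score, update, epochs: int = 10):
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        total_loss += score(chunk)   # evaluate BEFORE any update on this chunk
        total_tokens += len(chunk)
        for _ in range(epochs):      # then train on the same chunk
            update(chunk)
    return total_loss / total_tokens

# Toy usage: "loss" is the chunk length, "update" is a no-op.
avg = score_first_ttt([[1] * 4, [1] * 4], score=len, update=lambda c: None)
print(avg)  # 1.0
```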
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adam_for_scalars_and_embeddings":true,"matrix_lr":0.025,"scalar_lr":0.025}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":2800}
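A sketch of the learning-rate multiplier these parameters describe: a linear ramp over the first 20 steps, a flat region, then a linear warmdown to zero over the final 2800 steps. The total step count below is an assumed value for illustration.

```python
# Warmup/warmdown LR schedule multiplier.

def lr_mult(step: int, total_steps: int, warmup: int = 20, warmdown: int = 2800) -> float:
    if step < warmup:
        return (step + 1) / warmup
    if step >= total_steps - warmdown:
        return max(0.0, (total_steps - step) / warmdown)
    return 1.0

print(lr_mult(0, 5000))     # 0.05 at the first step
print(lr_mult(1000, 5000))  # 1.0 in the flat region
print(lr_mult(4999, 5000))  # near 0 at the end
```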
Regularization
logit softcap
parameters: {"value":30}
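The softcap above follows the standard formula cap * tanh(x / cap): logits are bounded smoothly to (-30, 30) while staying near-identity for small values. Where in the model it is applied is not shown here.

```python
# Logit softcap: smooth, bounded squashing of logits.
import math

def softcap(x: float, cap: float = 30.0) -> float:
    return cap * math.tanh(x / cap)

print(softcap(1.0))  # ~1.0: near-identity for small logits
print(softcap(1e6))  # ~30.0: bounded above by the cap
```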
LN scale
parameters: {"value":1}
Weight Averaging
EMA
parameters: {"decay":0.997}
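A minimal sketch of EMA weight averaging with decay 0.997: after each training step the shadow weights move a small fraction toward the live weights, and the shadow copy is what gets evaluated.

```python
# Exponential moving average of model weights.
DECAY = 0.997

def ema_update(shadow: list[float], live: list[float], decay: float = DECAY) -> list[float]:
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, live)]

shadow = [0.0]
for _ in range(1000):
    shadow = ema_update(shadow, [1.0])
print(shadow[0])  # approaches 1.0 as updates accumulate
```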

Novel Contributions

  • NativeFlowMatcher (NFM) hidden-state correction module trained jointly with the autoregressive objective
  • Legal score-first test-time training combined with NFM
  • Single-seed exploratory evaluation of NFM + legal TTT
  • Mixed int6/int5 quantization with auto-downgrade to fit the 16MB artifact budget