PR #1170

open

Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (single seed)

by Christopher-Lee-McClendon
val_bpb: 1.1199
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,745,776 bytes

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
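A minimal sketch of the query-to-KV head mapping this implies: with 8 query heads sharing 4 KV heads, each KV head serves a group of two query heads. The consecutive-grouping convention below is the usual one, but the PR's exact grouping isn't shown.

```python
# Grouped query attention (GQA) head sharing: 8 query heads, 4 KV heads,
# so each KV head is shared by 8 // 4 = 2 query heads.
HEADS = 8
KV_HEADS = 4

def kv_head_for(query_head: int) -> int:
    """Return the index of the KV head a given query head attends with."""
    group_size = HEADS // KV_HEADS
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(HEADS)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```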
BigramHash
Bigram hash embedding component used in the model.
parameters: {"vocab":4096,"dim":128}
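One plausible reading of this component, sketched below: each position's embedding draws on a hashed (previous token, current token) bigram looked up in a 4096-slot table of 128-dim vectors. The hash function and how the vector is mixed into the main embedding are assumptions, not the PR's actual implementation.

```python
# Illustrative bigram hash embedding lookup. Table size and dimension match
# the listed parameters; the multiplicative hash is a placeholder.
import random

VOCAB_BUCKETS = 4096
DIM = 128
random.seed(0)
table = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(VOCAB_BUCKETS)]

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    # Simple multiplicative hash of the token pair into the table.
    return (prev_tok * 1000003 + cur_tok) % VOCAB_BUCKETS

def bigram_embedding(prev_tok: int, cur_tok: int) -> list[float]:
    return table[bigram_bucket(prev_tok, cur_tok)]

vec = bigram_embedding(17, 42)
print(len(vec))  # 128
```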
XSA
Cross-sequence attention applied to all transformer layers.
parameters: {"layers":11}
Value Residual
Adds value residual connections in attention.
parameters: null
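A minimal sketch of what a value residual typically looks like: a deeper layer's attention values are blended with the values computed at the first layer. The mixing weight is usually learned; the fixed 0.5 below is a placeholder assumption.

```python
# Value residual: blend this layer's attention values with the first
# layer's values. `alpha` stands in for a learned mixing weight.

def mix_values(v_layer: list[float], v_first: list[float], alpha: float = 0.5) -> list[float]:
    return [alpha * a + (1 - alpha) * b for a, b in zip(v_layer, v_first)]

print(mix_values([1.0, 2.0], [3.0, 4.0]))  # [2.0, 3.0]
```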
Gated Attention
Uses gated attention in the transformer blocks.
parameters: null
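A sketch of one common form of attention gating: the attention output is multiplied elementwise by a sigmoid gate. In the model the gate would come from a learned projection of the block input; where exactly the gate is applied here is an assumption.

```python
# Elementwise sigmoid gating on an attention output. Gate logits are given
# directly for illustration; in practice they come from a learned projection.
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated(attn_out: list[float], gate_logits: list[float]) -> list[float]:
    return [o * sigmoid(g) for o, g in zip(attn_out, gate_logits)]

print(gated([2.0, 2.0], [0.0, 100.0]))  # zero logit halves; large logit passes through
```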
LeakyReLU
Uses LeakyReLU squared activation.
parameters: {"variant":"squared","slope":0.5}
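The activation above, sketched with slope 0.5. Naively squaring a LeakyReLU output would make negative inputs positive, so a sign-preserving squaring is one plausible reading; which variant the PR actually uses isn't shown.

```python
# LeakyReLU-squared activation (sign-preserving variant, an assumption):
# apply LeakyReLU with negative slope 0.5, square the magnitude, keep the sign.

def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    y = x if x > 0 else slope * x
    return y * y if y > 0 else -(y * y)

print(leaky_relu_squared(3.0))   # 9.0
print(leaky_relu_squared(-2.0))  # -1.0: (0.5 * -2)^2 with the sign kept
```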
Partial RoPE
Applies partial rotary positional encoding.
parameters: {"dimensions":16,"base":10000}
weight tying
Tied input and output embeddings.
parameters: null
MLP3x
Uses 3x MLP expansion.
parameters: {"multiplier":3}
NativeFlowMatcher
Conditional flow matching velocity network inserted after final LayerNorm to correct hidden states.
parameters: {"hidden_dim":256}
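A very rough sketch of the correction step the description implies: a velocity field v(h, t) predicts how to move the post-LayerNorm hidden state, integrated with a few Euler steps. The linear `velocity` below is a hypothetical stand-in for the learned network (hidden_dim 256), and the number of integration steps is an assumption.

```python
# Flow-matching-style hidden-state correction via Euler integration.
# `velocity` is a placeholder for the learned velocity network.

def velocity(h: list[float], t: float) -> list[float]:
    # Hypothetical placeholder dynamics: pull the state toward zero.
    return [-x for x in h]

def correct_hidden(h: list[float], steps: int = 4) -> list[float]:
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(h, k * dt)
        h = [x + dt * vx for x, vx in zip(h, v)]
    return h

print(correct_hidden([1.0, -1.0]))  # each Euler step scales by (1 - 1/4)
```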
Quantization
mixed int6/int5
bits: null
scope: MLP layers
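A sketch of symmetric quantization at a given bit width, as would apply to the mixed int6/int5 MLP weights. Per-tensor scaling and round-to-nearest are assumptions; the PR's exact scheme (and which layers get 6 vs 5 bits) isn't shown.

```python
# Symmetric quantization: map floats to signed integers in [-qmax, qmax]
# with a single scale, then dequantize back.

def quantize(w: list[float], bits: int) -> tuple[list[int], float]:
    qmax = 2 ** (bits - 1) - 1  # 31 for int6, 15 for int5
    scale = max(abs(x) for x in w) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

q, s = quantize([0.5, -1.0, 0.25], bits=6)
print(dequantize(q, s))  # close to the original values
```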
Compression
zstd
level: 16
Evaluation
sliding window eval
parameters: {"stride":64}
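A sketch of what stride-64 sliding-window evaluation usually means: the model sees a full context window, but only the last `stride` tokens of each window are scored, so every token is evaluated exactly once with near-maximal context. Pairing the window length with eval_length 1024 (listed below) is an assumption.

```python
# Sliding-window evaluation spans: tokens in [score_from, end) are scored,
# with [start, end) supplied as context.

def eval_windows(n_tokens: int, window: int = 1024, stride: int = 64):
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - window)
        spans.append((start, end, pos))
        pos = end
    return spans

spans = eval_windows(200, window=128, stride=64)
print(spans)
```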
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":10,"chunk_size":32768,"freeze_blocks":2,"momentum":0.9}
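A sketch of the "score-first" pattern that keeps this TTT legal: each chunk is scored with the current weights before the model trains on it, so no token is ever evaluated by weights that have already seen it. The `score` and `update` callables below are hypothetical stand-ins for the real model's loss and optimizer step.

```python
# Score-first test-time training loop: evaluate each chunk, then adapt on it.

def score_first_ttt(chunks, score, update, epochs: int = 10):
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        total_loss += score(chunk)   # evaluate BEFORE any update on this chunk
        total_tokens += len(chunk)
        for _ in range(epochs):      # then train on the same chunk
            update(chunk)
    return total_loss / total_tokens

# Toy usage: "loss" is the chunk length, "update" is a no-op.
avg = score_first_ttt([[1] * 4, [1] * 4], score=len, update=lambda c: None)
print(avg)  # 1.0
```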
Sequence Length
sequence_length
train_length: 2048
eval_length: 1024
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"adam_for_scalars_and_embeddings":true,"matrix_lr":0.025,"scalar_lr":0.025}
LR Schedule
warmdown
parameters: {"warmup_steps":20,"warmdown_steps":2800}
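A sketch of the learning-rate multiplier these parameters describe: a linear ramp over the first 20 steps, a flat region, then a linear warmdown to zero over the final 2800 steps. The total step count below is an assumed value for illustration.

```python
# Warmup/warmdown LR schedule multiplier.

def lr_mult(step: int, total_steps: int, warmup: int = 20, warmdown: int = 2800) -> float:
    if step < warmup:
        return (step + 1) / warmup
    if step >= total_steps - warmdown:
        return max(0.0, (total_steps - step) / warmdown)
    return 1.0

print(lr_mult(0, 5000))     # 0.05 at the first step
print(lr_mult(1000, 5000))  # 1.0 in the flat region
print(lr_mult(4999, 5000))  # near 0 at the end
```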
Regularization
logit softcap
parameters: {"value":30}
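The softcap above follows the standard formula cap * tanh(x / cap): logits are bounded smoothly to (-30, 30) while staying near-identity for small values. Where in the model it is applied is not shown here.

```python
# Logit softcap: smooth, bounded squashing of logits.
import math

def softcap(x: float, cap: float = 30.0) -> float:
    return cap * math.tanh(x / cap)

print(softcap(1.0))  # ~1.0: near-identity for small logits
print(softcap(1e6))  # ~30.0: bounded above by the cap
```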
LN scale
parameters: {"value":1}
Weight Averaging
EMA
parameters: {"decay":0.997}
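A minimal sketch of EMA weight averaging with decay 0.997: after each training step the shadow weights move a small fraction toward the live weights, and the shadow copy is what gets evaluated.

```python
# Exponential moving average of model weights.
DECAY = 0.997

def ema_update(shadow: list[float], live: list[float], decay: float = DECAY) -> list[float]:
    return [decay * s + (1 - decay) * w for s, w in zip(shadow, live)]

shadow = [0.0]
for _ in range(1000):
    shadow = ema_update(shadow, [1.0])
print(shadow[0])  # approaches 1.0 as updates accumulate
```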

Novel Contributions

  • NativeFlowMatcher (NFM) hidden-state correction module trained jointly with the autoregressive objective
  • Legal score-first test-time training combined with NFM
  • Single-seed exploratory evaluation of NFM + legal TTT
  • Mixed int6/int5 quantization with auto-downgrade to fit the 16MB artifact budget