PR #2014
RECORD
open
Record: PR1855/PR1953 base + Progressive context growth (val_bpb: 1.05759, 3-seed)
by simonbissonnette
val_bpb
1.0576
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.98 MB
Training Techniques
Sequence Length
sequence_length
train_length: 3072
eval_length: 3072
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
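A minimal sketch of what a warmdown schedule with `warmdown_frac` 0.85 could look like. The PR summary only gives the fraction, so the linear shape, the decay to zero, and the function name `lr_multiplier` are assumptions made for illustration.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Hypothetical warmdown: hold the LR multiplier at 1.0, then decay linearly to 0.0.

    The final `warmdown_frac` of training is spent in the decay phase.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmdown_frac <= 0.0:
        return 1.0  # no warmdown phase at all
    progress = step / total_steps
    warmdown_start = 1.0 - warmdown_frac
    if progress < warmdown_start:
        return 1.0
    # Linear ramp from 1.0 at warmdown_start down to 0.0 at the last step.
    return max(0.0, (1.0 - progress) / warmdown_frac)
```

Under these assumptions the learning rate stays flat for the first 15% of steps and decays for the remaining 85%.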
Test-Time Training
LoRA TTT
parameters: {"rank":80,"mask":"no_qv","local_lr_mult":0.75,"short_score_first":true,"short_doc_len":2000,"short_chunk_size":24,"phased":true,"num_phases":1,"prefix_docs":2500}
Architecture
SmearGate
BOS-fixed SmearGate used in the attention/gating stack
parameters: {"gate_window":12,"scale":0.5}
XSA
XSA applied across all layers
parameters: {"layers":11}
Partial RoPE
Partial rotary position embeddings
parameters: {"dimensions":16}
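As an illustration of partial rotary embeddings, the sketch below rotates only the first 16 dimensions of a single head vector and passes the rest through unchanged. The half-split pairing of dimensions, the base of 10000, and the function name `partial_rope` are assumptions; the PR summary states only the dimension count.

```python
import math


def partial_rope(x: list[float], position: int, rope_dims: int = 16,
                 base: float = 10000.0) -> list[float]:
    """Apply a rotary embedding to the first `rope_dims` entries of one head vector.

    Assumed pairing: dimension i is rotated together with dimension i + rope_dims // 2.
    Entries from index `rope_dims` onward are returned unchanged.
    """
    if rope_dims % 2 != 0 or rope_dims > len(x):
        raise ValueError("rope_dims must be even and no larger than the head size")
    half = rope_dims // 2
    out = list(x)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rope_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out
```

Because each pair undergoes a pure rotation, the norm of the rotated slice is preserved, and position 0 leaves the vector unchanged.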
Depth recurrence
Layers 3-5 are looped with recurrence enabled partway through training
parameters: {"layers":[3,4,5],"frac":0.35}
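One way to picture the looped layers is as a layer execution order that changes once recurrence switches on. The sketch below assumes zero-indexed layers, an 11-layer stack, that `frac` is the training-progress point at which looping is enabled, and two passes through the loop; the pass count and the name `layer_schedule` are hypothetical, not taken from the PR.

```python
def layer_schedule(num_layers: int, loop_layers: list[int], progress: float,
                   enable_frac: float = 0.35, num_passes: int = 2) -> list[int]:
    """Return the order in which layers are executed at a given training progress.

    Before `enable_frac`, every layer runs once. Afterwards, the contiguous
    `loop_layers` block is repeated `num_passes` times (hypothetical count).
    """
    first, last = loop_layers[0], loop_layers[-1]
    before = list(range(0, first))
    after = list(range(last + 1, num_layers))
    passes = num_passes if progress >= enable_frac else 1
    return before + list(loop_layers) * passes + after


print(layer_schedule(11, [3, 4, 5], progress=0.10))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(layer_schedule(11, [3, 4, 5], progress=0.50))
# [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

The looped layers reuse the same weights on each pass, so effective depth grows without adding parameters to the artifact.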
GQA
Grouped-query attention with fewer KV heads than query heads
parameters: {"query_heads":8,"kv_heads":4}
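With 8 query heads and 4 KV heads, each KV head is shared by two query heads. The sketch below shows that mapping under the common contiguous-grouping convention; the convention and the helper name `kv_head_for_query` are assumptions, since the PR summary lists only the head counts.

```python
def kv_head_for_query(q_head: int, query_heads: int = 8, kv_heads: int = 4) -> int:
    """Map a query head index to the KV head it reads from (contiguous groups)."""
    if query_heads % kv_heads != 0:
        raise ValueError("query_heads must be a multiple of kv_heads")
    if not 0 <= q_head < query_heads:
        raise ValueError("q_head out of range")
    group_size = query_heads // kv_heads
    return q_head // group_size


print([kv_head_for_query(q) for q in range(8)])
# [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the number of KV heads halves the size of the key and value projections, which helps when the artifact budget is tight.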
Parallel decoder
Parallel lane starting at layer 8, with the final lanes aggregated by their mean
parameters: {"start_layer":8}
SparseAttnGate
Sparse attention gating used with fixed gate scale
parameters: {"gate_window":12,"scale":0.5}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_on":"embedding/scalars","beta2":0.99}
Weight Averaging
EMA
parameters: {"decay":0.9965}
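The EMA entry corresponds to the standard exponential moving average update over model weights. The sketch below applies it to a plain dictionary of scalars for clarity; the function name `ema_update` and the dictionary representation are illustrative assumptions, and a real training loop would apply the same rule per tensor.

```python
def ema_update(ema: dict[str, float], weights: dict[str, float],
               decay: float = 0.9965) -> dict[str, float]:
    """Update the averaged weights in place: ema = decay * ema + (1 - decay) * w."""
    for name, w in weights.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * w
    return ema


# Usage: call once per optimizer step, then evaluate or export the averaged copy.
averaged = {"w": 1.0}
ema_update(averaged, {"w": 0.0})
```

A decay of 0.9965 gives the average an effective horizon of roughly 1 / (1 - 0.9965), or about 286 steps.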
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 7
scope: embeddings
LQER
bits: 4
scope: correction
AWQ-lite
bits: 8
scope: groups
Compression
pergroup
level: null
Evaluation
stride-based eval
parameters: {"stride":1536,"context_length":3072}
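To illustrate stride-based evaluation with a 3072-token context and a stride of 1536, the sketch below enumerates overlapping windows so that every target position is scored exactly once. The function name `eval_windows` and the tuple layout are assumptions; the PR's actual evaluation loop is not shown in this summary.

```python
def eval_windows(num_tokens: int, context_length: int = 3072,
                 stride: int = 1536) -> list[tuple[int, int, int]]:
    """Return (window_start, window_end, score_start) triples.

    Each window feeds tokens [window_start, window_end) to the model, but only
    positions [score_start, window_end) count toward the loss. Positions before
    score_start serve as context that was already scored by an earlier window.
    """
    if not 0 < stride <= context_length:
        raise ValueError("stride must be in (0, context_length]")
    windows = []
    scored_upto = 0  # positions in [0, scored_upto) have been scored
    start = 0
    while scored_upto < num_tokens:
        end = min(start + context_length, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


for w in eval_windows(7000):
    print(w)
# (0, 3072, 0)
# (1536, 4608, 3072)
# (3072, 6144, 4608)
# (4608, 7000, 6144)
```

After the first window, each scored token sees at least 1536 tokens of preceding context, and the scored spans tile the whole sequence without gaps or overlap.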
Regularization
weight decay
parameters: {"ttt_weight_decay":0.5}
Novel Contributions
- Progressive training-context schedule from 1k to 2k to 3k context
- Short-document score-first TTT chunk schedule
- Long-context TTT with no_qv mask and reduced local LR
- Preserving full validation target coverage while staying under the artifact cap
- Stacking the new schedule on top of the PR #1855 / PR #1953 lineage
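The progressive training-context schedule in the first item above can be sketched as a step function of training progress. The final length of 3072 matches the listed `train_length`; the exact values 1024 and 2048 for "1k" and "2k", the stage boundaries at one third and two thirds of training, and the name `context_length_at` are hypothetical placeholders, since the summary does not state them.

```python
def context_length_at(
    progress: float,
    stages: tuple[tuple[float, int], ...] = ((0.0, 1024), (1 / 3, 2048), (2 / 3, 3072)),
) -> int:
    """Return the training sequence length for a progress value in [0, 1].

    `stages` holds (start_progress, length) pairs in increasing order; the last
    stage whose start has been reached wins. Boundaries here are placeholders.
    """
    length = stages[0][1]
    for stage_start, stage_length in stages:
        if progress >= stage_start:
            length = stage_length
    return length


print([context_length_at(p) for p in (0.0, 0.5, 0.9)])
# [1024, 2048, 3072]
```

Starting with short sequences makes early steps cheaper, while ending at 3072 tokens keeps the trained context aligned with the 3072-token evaluation length.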