PR #2133
open
Record candidate: PR #2014 base + GATE_WINDOW=8 + GPTQ_CALIBRATION_BATCHES=32
by codemath3000
val_bpb
1.0576
Architecture
Transformer
Optimizer
Muon
Artifact Size
—
Training Techniques
Architecture
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"dimensions":16}
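A minimal sketch of partial RoPE with 16 rotary dimensions, applied to a single head vector. The head size of 64 in the usage note, the base of 10000, and the pairing of dimension i with dimension i + 8 are assumptions for illustration, not details taken from the PR; `partial_rope` is a hypothetical helper.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; pass the rest through."""
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)  # entries beyond rotary_dims are copied unchanged
    half = rotary_dims // 2
    for i in range(half):
        # One rotation frequency per pair (assumed "rotate-half" pairing).
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * cos_t - b * sin_t
        out[i + half] = a * sin_t + b * cos_t
    return out
```

For a 64-dimensional head, `partial_rope(q, position=7)` would rotate `q[:16]` and leave `q[16:]` position-independent.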
depth recurrence
Loops layers 3-5 with recurrence enabled partway through training.
parameters: {"layers":[3,5],"frac":0.35}
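One way to read the depth-recurrence parameters is as a layer execution order that changes once 35% of training has elapsed. The sketch below assumes 11 layers (the count listed under XSA), zero-based layer indices, and a single extra pass through the looped block; `layer_schedule` and `repeats` are hypothetical names, and the PR may implement this differently.

```python
def layer_schedule(step, total_steps, num_layers=11, loop=(3, 5), frac=0.35, repeats=2):
    """Return the order in which layer indices are executed at a given step."""
    order = list(range(num_layers))
    if step < frac * total_steps:
        # Recurrence not yet enabled: plain feed-forward order.
        return order
    start, end = loop
    looped = list(range(start, end + 1))
    # Run the looped block `repeats` times in place of its single pass.
    return order[:start] + looped * repeats + order[end + 1:]
```

Under these assumptions, late in training the model would execute 14 layer applications per forward pass while still storing only 11 layers of weights.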
SmearGate
BOS-fixed SmearGate used in the attention stack.
parameters: null
Gated Attention
Sparse attention gating with a fixed window over recent tokens.
parameters: {"window":8,"scale":0.5}
XSA
Applies XSA across all layers.
parameters: {"layers":11}
Optimizer
Muon
weight_decay: 0.5
momentum: null
other_params: {"beta2":0.99,"adam_on":"embedding/scalars"}
Weight Averaging
EMA
parameters: {"decay":0.9965}
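The EMA entry corresponds to the standard exponential moving average of weights. A minimal sketch on plain Python lists follows; `ema_update` is a hypothetical helper, and how often the PR applies the update is not stated here.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Blend the current weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

With decay 0.9965 the average has an effective horizon of roughly 1 / (1 - 0.9965), or about 286 updates.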
Quantization
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 7
scope: embeddings
mixed int6/int7
bits: null
scope: weights + embeddings
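To make the int6/int7 bit widths concrete, the sketch below uses plain symmetric round-to-nearest quantization per row. This is not GPTQ: GPTQ additionally uses calibration data to compensate for rounding error, which is not reproduced here. The per-row scaling and the helper names are assumptions for illustration.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one row to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 63 for int7
    peak = max(abs(x) for x in row)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(x / scale))) for x in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]
```

Each extra bit halves the step size, so under this scheme the 7-bit embeddings would carry about half the worst-case rounding error of the 6-bit matrices.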
GPTQ-lite
bits: 8
scope: AWQ-lite group quant
QAT
bits: null
scope: TTT quantized phased LoRA
Test-Time Training
score-first TTT
parameters: {"rank":80,"learning_rate":0.0001,"local_lr_mult":0.75,"mask":"no_qv","batch_size":24,"chunk_size":48}
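"Score-first" is read here as: each chunk is scored with weights that have not yet been trained on it, and adaptation happens only afterwards. The loop below sketches that ordering with caller-supplied callbacks; `score_fn`, `update_fn`, and `state` are hypothetical stand-ins for the model's loss, the LoRA update, and the adapter weights.

```python
def score_first_ttt(tokens, chunk_size, score_fn, update_fn, state):
    """Score each chunk before adapting on it, so no chunk sees its own update."""
    total = 0.0
    for i in range(0, len(tokens), chunk_size):
        chunk = tokens[i:i + chunk_size]
        total += score_fn(state, chunk)   # evaluate with pre-update state
        state = update_fn(state, chunk)   # adapt only after scoring
    return total, state
```

With the listed chunk_size of 48, every 48-token chunk would contribute to the reported metric before it influences the adapter.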
Sequence Length
sequence_length
train_length: 3072
eval_length: 3072
Evaluation
stride-based eval
parameters: {"stride":1536}
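With an evaluation length of 3072 and a stride of 1536, a common sliding-window scheme advances the window by half its length and scores only tokens not already scored. The sketch below assumes that scheme; `eval_windows` is a hypothetical helper and the PR's exact bookkeeping may differ.

```python
def eval_windows(num_tokens, window=3072, stride=1536):
    """Return (start, end, score_from) spans; tokens in [score_from, end) are scored."""
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end        # everything up to `end` has now been scored once
        start += stride
    return spans
```

Under this assumption every token after the first window is scored with at least 1536 tokens of preceding context.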
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
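A warmdown fraction of 0.85 is read here as: the learning rate stays at its peak for the first 15% of steps, then decays over the remaining 85%. The linear decay to zero in this sketch is an assumption, as is the helper name `lr_multiplier`.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.85):
    """Multiplier on the peak learning rate at a given step."""
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return 1.0                                   # constant phase
    # Linear decay from 1.0 at warmdown_start to 0.0 at total_steps (assumed shape).
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```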
Regularization
weight decay
parameters: {"value":0.5}
Compression
custom
level: null
Novel Contributions
- Reduced SparseAttnGate window from 12 to 8
- Increased GPTQ calibration batches from 16 to 32
- Kept the PR #2014 stack otherwise unchanged for a clean ablation