PR #1855
Status: RECORD, open
Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack - val_bpb 1.06108 (3-seed mean)
by codemath3000
val_bpb
1.0611
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.9 MB
Training Techniques
Architecture
U-Net skip connections
Encoder-decoder skip connections with skip gates across layers.
parameters: {"layers":11}
parallel decoder
Two-lane parallel decoder from later layers with learned lane mixing.
parameters: {"start_layer":8}
Partial RoPE
Rotary position embeddings applied to a subset of dimensions with YaRN.
parameters: {"dimensions":"16/64"}
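As a rough illustration of the "16/64" setting, the sketch below rotates only the first 16 entries of a 64-dimensional head vector and passes the remaining 48 through unchanged. The function name `partial_rope`, the base of 10000, and the pairing of entry `i` with entry `i + 8` are assumptions for illustration; the YaRN frequency scaling mentioned above is not included.

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Rotate the first `rot_dims` entries of `vec` by position; pass the rest through."""
    out = list(vec)
    half = rot_dims // 2
    for i in range(half):
        # One rotation angle per pair, decreasing in frequency with i.
        theta = pos * base ** (-2.0 * i / rot_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

head = [0.1 * (i + 1) for i in range(64)]
rotated = partial_rope(head, pos=7)
```

Because each pair undergoes a plain 2D rotation, the vector norm is preserved and only the rotated slice carries positional information.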
Sparse Attention Gate
Narrow head-output gate applied to sparse attention outputs.
parameters: {"gate_window":12,"scale":0.5}
SmearGate
Position-mixing gate with BOS leak masking to prevent cross-document leakage.
parameters: {"bos_fixed":true}
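One plausible reading of the BOS fix, sketched under assumptions: each position adds a gated copy of the previous position's vector, and the gate is forced to zero wherever the current token is a BOS marker, so the last token of one document cannot leak into the first token of the next. The names `smear` and `bos_id`, and the constant scalar gate, are hypothetical; the actual gate is presumably learned and input-dependent.

```python
def smear(xs, tokens, gate, bos_id=0):
    """Mix each position with its predecessor, except where a new document starts."""
    out = [list(xs[0])]  # the first position has no predecessor
    for t in range(1, len(xs)):
        # Zero the gate at document boundaries (assumed meaning of "bos_fixed").
        g = 0.0 if tokens[t] == bos_id else gate
        out.append([cur + g * prev for cur, prev in zip(xs[t], xs[t - 1])])
    return out

tokens = [0, 5, 6, 0, 7]  # two documents, each starting with BOS (id 0)
xs = [[1.0], [2.0], [3.0], [4.0], [5.0]]
mixed = smear(xs, tokens, gate=0.5)
```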
LeakyReLU
Fused LeakyReLU-square MLP activation.
parameters: {"slope":0.5}
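A minimal scalar sketch of the activation, assuming "LeakyReLU-square" means applying LeakyReLU with the listed slope and then squaring the result. The entry does not say whether the sign of negative inputs is preserved, so this version simply squares; the kernel fusion itself is not modelled.

```python
def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU followed by squaring (assumed definition)."""
    y = x if x >= 0.0 else slope * x
    return y * y
```

Under this assumption a negative input of -2.0 maps to (0.5 * -2.0)^2 = 1.0, so negative pre-activations still contribute a small positive signal.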
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend_steps":5,"variant":"Polar-Express Newton-Schulz"}
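For orientation, the sketch below shows the general idea behind Newton-Schulz orthogonalization with 5 steps, using the classical cubic iteration X <- 1.5 X - 0.5 X X^T X. This is a swapped-in textbook variant: the Polar-Express coefficients used in the PR are not given in this summary and are not reproduced here. The helper names are hypothetical, and the input is assumed to be a nonzero matrix.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=5):
    """Push the singular values of `g` toward 1 (classical cubic iteration)."""
    # Normalize by the Frobenius norm so all singular values start in (0, 1].
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

update = newton_schulz([[3.0, 1.0], [1.0, 2.0]])
gram = matmul(update, transpose(update))  # close to the identity after 5 steps
```

With only a few steps, poorly conditioned inputs will not be fully orthogonalized, which is one reason tuned coefficient schedules exist.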
Quantization
GPTQ
bits: 6
scope: matrix weights
GPTQ
bits: 7
scope: embeddings
GPTQ
bits: 8
scope: attn-gate
int4
bits: 4
scope: LQER correction
Regularization
logit softcap
parameters: {"value":30}
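Logit softcapping is commonly implemented as a scaled tanh, which is the form sketched here with the listed value of 30. Whether the PR uses exactly this form is an assumption.

```python
import math

def softcap(logit, cap=30.0):
    """Squash a logit smoothly into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while very large ones saturate near the cap.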
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
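The scale formula can be tabulated directly. The sketch assumes zero-based layer indices and uses the 11 layers listed under Architecture; where the factor is applied inside each block is not stated in this summary.

```python
import math

def ln_scale(layer):
    """Per-layer scale factor 1/sqrt(layer+1), with zero-based `layer`."""
    return 1.0 / math.sqrt(layer + 1)

scales = [ln_scale(i) for i in range(11)]
```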
LR Schedule
warmdown
parameters: {"frac":0.85,"min_lr":0.1}
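One way to read these parameters, offered as an assumption: the learning rate stays at its peak for the first 15% of training, then decays linearly over the final 85% of steps to 0.1 times the peak. The function name and the linear shape are illustrative choices.

```python
def lr_multiplier(step, total_steps, frac=0.85, min_lr=0.1):
    """Multiplier on the peak LR: flat, then linear warmdown to `min_lr`."""
    start = (1.0 - frac) * total_steps  # step at which the warmdown begins
    if step <= start:
        return 1.0
    progress = (step - start) / (total_steps - start)
    return 1.0 - (1.0 - min_lr) * progress
```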
Weight Averaging
EMA
parameters: {"decay":0.9965}
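The standard exponential-moving-average update with the listed decay looks like the sketch below. Parameters are represented as a flat list of floats for simplicity, and the function name is hypothetical.

```python
def ema_update(avg, params, decay=0.9965):
    """Blend current parameters into the running average."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```

With a decay of 0.9965, each update moves the average 0.35% of the way toward the current weights.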
Test-Time Training
LoRA TTT
parameters: {"rank":80,"phases":3,"prefix_docs":2500}
Sequence Length
sequence_length
train_length: null
eval_length: 2500
Compression
per-group lrzip+brotli
level: null
Other
other
LQER asymmetric int4 rank-4 quant-error correction on top-3 tensors.
parameters: {"rank":4,"top_k":3}
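The general idea of low-rank quantization-error correction can be sketched as follows: quantize a weight matrix, then approximate the residual W - Q with a rank-4 product and keep those factors alongside Q. This is a simplified illustration under several assumptions: it uses plain round-to-nearest quantization instead of GPTQ, finds the factors by power iteration with deflation, keeps them in floating point rather than int4, and skips both the asymmetric scheme and the top-3 tensor selection. All function names are hypothetical.

```python
import math
import random

def quantize(w, bits=6):
    """Symmetric round-to-nearest quantization with one scale per matrix."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for row in w for x in row) / qmax
    return [[round(x / scale) * scale for x in row] for row in w]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_direction(e, iters=30):
    """Power iteration on E^T E; returns (E v, v) with v a unit vector."""
    cols = len(e[0])
    v = [1.0 / math.sqrt(cols)] * cols
    et = [list(c) for c in zip(*e)]
    for _ in range(iters):
        w = matvec(et, matvec(e, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return matvec(e, v), v

def lqer(w, bits=6, rank=4):
    """Return the quantized matrix, rank-1 correction factors, and final residual."""
    q = quantize(w, bits)
    resid = [[a - b for a, b in zip(rw, rq)] for rw, rq in zip(w, q)]
    factors = []
    for _ in range(rank):
        u, v = top_direction(resid)
        factors.append((u, v))
        # Deflate: remove the captured component u v^T from the residual.
        resid = [[r - ui * vj for r, vj in zip(row, v)]
                 for row, ui in zip(resid, u)]
    return q, factors, resid

def frob(m):
    return math.sqrt(sum(x * x for row in m for x in row))

random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
Q, factors, resid = lqer(W)
err_before = frob([[a - b for a, b in zip(rw, rq)] for rw, rq in zip(W, Q)])
err_after = frob(resid)  # error remaining after the rank-4 correction
```

At inference time the corrected weight would be reconstructed as Q plus the sum of the outer products in `factors`.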
Novel Contributions
- BOS-fixed SmearGate cross-document leak fix
- Sparse attention head-output gate
- LQER asymmetric int4 rank-4 quant-error correction
- Polar-Express Newton-Schulz Muon optimizer setup
- 9-hyperparameter greedy forward-selected stack
- Phased TTT evaluation with 3 phases and 2500-doc prefix
- Per-group lrzip+brotli artifact compression