PR #2022
openRecord: SP10240 + SimCTG + QAHSP + post-quant TTT — 1.07197 ttt-sliding-window (3-seed mean, std 0.00023)
by BharathSShankar
val_bpb
1.0720
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.96 MB
Training Techniques
Architecture
depth recurrence
11-layer model in which layers 3-5 form a recurrent block that is looped 3 times.
parameters: {"layers":11,"recurrence_loops":3,"recurrence_range":"3-5"}
Parallel Residuals
Parallel residual connections introduced from layer 7 onward.
parameters: {"start_layer":7}
LeakyReLU
Squared LeakyReLU with negative slope 0.5 is used as the SwiGLU gate activation.
parameters: {"negative_slope":0.5}
Partial RoPE
Partial rotary positional embeddings applied to 16 of the 64 head dimensions; the remaining 48 are left unrotated.
parameters: {"dimensions":"16/64"}
XSA
XSA attention used in all layers.
parameters: {"layers":11}
weight tying
Input and output embeddings are tied.
parameters: null
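Weight tying shares one matrix between the input embedding and the output projection, which matters at this artifact size. A minimal sketch (d_model is illustrative):

```python
import torch.nn as nn

vocab, d_model = 10240, 512            # d_model is an assumption
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)
lm_head.weight = embed.weight          # single shared parameter matrix
```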
KV head count
Model uses 8 query heads and 4 KV heads (grouped-query attention, 2 query heads per KV head).
parameters: {"heads":8,"kv_heads":4}
tokenizer
SP10240 tokenizer.
parameters: {"vocab_size":10240}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"Polar Express NS Muon"}
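Muon orthogonalizes each 2-D weight update with a Newton-Schulz iteration before applying it; "Polar Express" refers to an optimized per-step coefficient schedule for that iteration. A sketch of the standard quintic iteration (the fixed coefficients below are the common Muon ones, not the Polar Express schedule):

```python
import torch

def newton_schulz5(G, steps=5, eps=1e-7):
    """Approximate the orthogonal polar factor of G with a quintic
    Newton-Schulz iteration (standard Muon coefficients)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```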
Regularization
SimCTG
parameters: {"lambda":0.3,"margin":0.4}
Quantization
STE QAT
bits: 6
scope: activations
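A sketch of straight-through fake quantization for 6-bit activations: round to the int6 grid in the forward pass, let gradients pass through unchanged in the backward pass. The per-tensor absmax scale is an assumption:

```python
import torch

def fake_quant_ste(x, bits=6):
    """QAT for activations: forward uses the quantized value,
    backward treats the rounding as identity (STE)."""
    qmax = 2 ** (bits - 1) - 1                          # 31 for int6
    scale = x.detach().abs().amax().clamp(min=1e-8) / qmax
    xq = (x / scale).round().clamp(-qmax - 1, qmax) * scale
    return x + (xq - x).detach()                        # STE trick
```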
GPTQ
bits: 6
scope: matrices
GPTQ
bits: 7
scope: token embeddings
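Both GPTQ entries (6-bit matrices, 7-bit token embeddings) share the same core update. A heavily simplified sketch, with no blocking or grouping; H is the input-covariance Hessian accumulated on calibration data, and the per-row absmax scale is an assumption:

```python
import torch

def gptq_quantize(W, H, bits=6, damp=0.01):
    """Simplified GPTQ: quantize W column by column, propagating each
    column's rounding error into later columns via H^-1."""
    cols = W.size(1)
    H = H + damp * H.diagonal().mean() * torch.eye(cols)
    Hinv = torch.linalg.cholesky(
        torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    W, Q = W.clone(), torch.empty_like(W)
    for i in range(cols):
        q = (W[:, i:i+1] / scale).round().clamp(-qmax - 1, qmax) * scale
        Q[:, i:i+1] = q
        err = (W[:, i:i+1] - q) / Hinv[i, i]
        W[:, i:] -= err @ Hinv[i:i+1, i:]    # error feedback
    return Q
```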
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: {"enabled":true,"epochs":1,"learning_rate":0.005}
Compression
brotli
level: null
lzma
level: null
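Compression levels are not recorded, and whether the two codecs are chained or alternatives is not stated. A sketch assuming best-of-both at maximum settings:

```python
import lzma
import brotli   # third-party: pip install brotli

def compress_artifact(blob: bytes) -> bytes:
    """Compress the serialized weights with both codecs and keep the
    smaller result (levels shown are assumptions, not the record's)."""
    candidates = [
        brotli.compress(blob, quality=11),
        lzma.compress(blob, preset=9),
    ]
    return min(candidates, key=len)
```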
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- QAHSP quant-aware activation regularizer pushing hidden states onto an int6 grid during training (a hypothetical sketch follows this list)
- Post-quant test-time training on already-graded eval tokens after the legal pre-quant grading pass
- Bug fix to eval_val_ttt enabling post-quant TTT to complete
- Record 3-seed mean result with low standard deviation
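QAHSP is this record's own technique, described only by the one line above. The sketch below is a hypothetical reconstruction of that description; everything beyond "penalize hidden states' distance from an int6 grid" (the scale choice, the loss form, the weight) is a guess:

```python
import torch

def qahsp_penalty(hidden, bits=6):
    """Hypothetical QAHSP reconstruction: penalize each hidden state's
    squared distance to its nearest point on a symmetric int6 grid, so
    activations sit near representable values before post-training
    quantization. Scale and loss form are assumptions."""
    qmax = 2 ** (bits - 1) - 1
    scale = hidden.detach().abs().amax().clamp(min=1e-8) / qmax
    grid = (hidden / scale).round().clamp(-qmax - 1, qmax) * scale
    return (hidden - grid.detach()).pow(2).mean()

# loss = mle_loss + alpha * qahsp_penalty(h)   # alpha not recorded
```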