PR #1408 (open)

Record: dTTT + BigramHash 3072×112 — val_bpb 1.0800 (3-seed mean)

by aamodbhatt
val_bpb: 1.0800
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.9 MB

Training Techniques

Architecture
BigramHash
Increased bigram hash feature capacity for more expressive n-gram context features (see the first sketch after this group).
parameters: {"vocab_size":3072,"dimension":112}
XSA
Applied XSA across all layers.
parameters: {"layers":11}
Partial RoPE
Used partial rotary positional embeddings (see the sketch after this group).
parameters: {"dimensions":16}
VE128
Added a Value Residual / VE module (see the sketch after this group).
parameters: {"dim":128,"layers":[9,10]}
Test-Time Training
full TTT
parameters: {"learning_rate":0.0005,"epochs":10,"freeze_blocks":0,"per_block_lr_scale":"0.3x-1.0x"}
Quantization
GPTQ
bits: 6
scope: all
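
GPTQ proper solves a Hessian-weighted least-squares problem per layer, which is too long to reproduce here. The sketch below only illustrates the listed bit width and scope with plain per-channel 6-bit round-to-nearest quantization applied to every linear layer; it is not GPTQ itself.

```python
import torch

def quantize_6bit_rtn(weight: torch.Tensor):
    """Per-output-channel symmetric 6-bit round-to-nearest quantization.
    weight: (out_features, in_features). GPTQ additionally compensates the
    rounding error using the input Hessian, which this sketch omits."""
    qmax = 2 ** (6 - 1) - 1                       # 31 for signed 6-bit
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-12)
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale                               # dequantize with q.float() * scale

# Applied to every linear layer ("scope: all"):
# for name, module in model.named_modules():
#     if isinstance(module, torch.nn.Linear):
#         q, s = quantize_6bit_rtn(module.weight.data)
#         module.weight.data = q.float() * s      # fake-quant for evaluation
```
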
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"learning_rate":0.0005}
Weight Averaging
SWA
parameters: {"every_steps":50}
Compression
lzma
level: null
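
level: null is read as using the default LZMA preset. A sketch of compressing the serialized checkpoint with Python's standard-library lzma; the artifact filename is an assumption.

```python
import lzma

# "level: null" -> use the default preset / filters
with open("model.safetensors", "rb") as f:         # artifact filename assumed
    raw = f.read()

compressed = lzma.compress(raw)
with open("model.safetensors.xz", "wb") as f:
    f.write(compressed)

print(f"{len(raw)/1e6:.1f} MB -> {len(compressed)/1e6:.1f} MB")
```
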
Evaluation
sliding window eval
parameters: {"stride":64}
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
LR Schedule
cosine decay
parameters: {"warmdown_steps":4000}

Novel Contributions

  • BigramHash 3072×112 expansion for more expressive n-gram features
  • Discriminative TTT with per-block adaptive learning rates
  • Full-Hessian GPTQ int6 quantization, combined with XSA applied to all layers
  • Record-setting 3-seed mean val_bpb of 1.0800