PR #486
Record: 11L TrigramHash + ValueResidual + GradQuant + Cosine TTT (mean val_bpb=1.0887, best 1.0879)
Status: closed
by ndokutovich
val_bpb: 1.1101
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.34 MB
Training Techniques
Architecture
TrigramHash
Adds a hashed 3-token-context embedding to the input representation before the transformer blocks.
parameters: {"buckets":4096,"dim":128}
ValueResidual
Caches V vectors from the first attention layer and blends them into later layers with learned scalars.
parameters: null
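A sketch of ResFormer-style value blending as described above; the sigmoid parameterization of the per-layer learned scalar is an assumption.

```python
import torch
import torch.nn as nn

class ValueResidualMixer(nn.Module):
    """Blend each later layer's V with the V cached from layer 0."""

    def __init__(self, n_layers: int):
        super().__init__()
        self.mix_logit = nn.Parameter(torch.zeros(n_layers))  # one scalar per layer

    def forward(self, v: torch.Tensor, v_first: torch.Tensor, layer: int) -> torch.Tensor:
        if layer == 0:
            return v                                # layer 0 defines the cache
        lam = torch.sigmoid(self.mix_logit[layer])
        return lam * v + (1.0 - lam) * v_first
```

Inside each attention layer, V would be computed as usual and passed through the mixer before the attention weights are applied.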
BigramHash
Hashed bigram embedding used as part of the model input representation.
parameters: {"buckets":4096}
SmearGate
Custom gating component used in the MLP stack.
parameters: null
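The PR gives no detail beyond the name, so the following is a speculative reading: a learned sigmoid gate that "smears" the previous position's hidden state into the current one. Treat every line as an assumption.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Speculative sketch of a gate mixing position t with position t-1."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); shift right so position t sees t-1.
        prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        g = torch.sigmoid(self.gate(x))             # (batch, seq, 1) mixing weight
        return x + g * prev
```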
MLP3x
Three-layer MLP variant with relu-squared activations.
parameters: null
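A sketch matching the description, assuming a plain stack of three linear layers with relu-squared between them; the hidden width is left as a parameter since the PR does not list it.

```python
import torch
import torch.nn as nn

class MLP3x(nn.Module):
    """Three linear layers with relu-squared activations between them."""

    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, d_model)

    @staticmethod
    def relu2(x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x).square()               # max(0, x)^2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc3(self.relu2(self.fc2(self.relu2(self.fc1(x)))))
```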
Partial RoPE
Applies rotary position embeddings to only part of the head dimension.
parameters: {"dimensions":"16/64"}
GQA
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"layers":11,"heads":8,"kv_heads":4}
Quantization
mixed int5/int6/int7 QAT
bits: null
scope: all
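A sketch of quantization-aware training at one of the listed widths, via fake quantization with a straight-through estimator; per-tensor symmetric scaling is an assumption about the PR's scheme.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Fake-quantize a weight tensor for QAT with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                      # 63 / 31 / 15 for int7 / int6 / int5
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()                   # forward: w_q; backward: identity
```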
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"learning_rate":0.0005}
Weight Averaging
EMA
parameters: {"decay":0.997}
Initialization
OrthoInit
Orthogonal initialization.
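A sketch using torch's built-in orthogonal init on linear weights; which modules are covered and the gain are assumptions.

```python
import torch.nn as nn

def ortho_init(model: nn.Module) -> None:
    """Orthogonally initialize every Linear weight; zero the biases."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```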
Regularization
layerwise LN scale
parameters: null
Compression
zstd
level: 22
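A sketch of compressing the serialized checkpoint at the listed level 22 (zstd's maximum standard level), using the zstandard Python bindings; the exact serialization format in the PR is not specified.

```python
import io
import torch
import zstandard  # pip install zstandard

def save_compressed(model: torch.nn.Module, path: str) -> None:
    """Serialize the state dict, then compress with zstd at level 22."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with open(path, "wb") as f:
        f.write(zstandard.ZstdCompressor(level=22).compress(buf.getvalue()))
```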
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","epochs":10,"freeze_blocks":0,"time_seconds":154}
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"warmup_steps":1500}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- TrigramHash embedding extending bigram hashing to 3-token context
- Value Residual (ResFormer-style) cross-layer value blending
- Gradient-guided adaptive quantization with per-tensor sensitivity ranking
- Mixed-precision quantization assigning int7/int6/int5 based on gradient sensitivity (sketched after this list)
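A sketch of the ranking-and-assignment step: score each tensor by accumulated gradient magnitude and assign int7 to the most sensitive band, int6 to the middle, int5 to the rest. The mean-|grad| metric and the band fractions are assumptions; the PR only states that per-tensor gradient sensitivity drives the assignment.

```python
import torch

def assign_bit_widths(grad_scores: dict[str, float],
                      fractions: tuple = (0.25, 0.5, 0.25)) -> dict[str, int]:
    """Rank tensors by sensitivity score and assign int7 / int6 / int5 bands."""
    ranked = sorted(grad_scores, key=grad_scores.get, reverse=True)
    n7 = int(len(ranked) * fractions[0])
    n6 = int(len(ranked) * fractions[1])
    return {name: (7 if i < n7 else 6 if i < n7 + n6 else 5)
            for i, name in enumerate(ranked)}

# One possible sensitivity score, collected after a backward pass:
# grad_scores = {n: p.grad.abs().mean().item()
#                for n, p in model.named_parameters() if p.grad is not None}
```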