PR #1430
openRecord: Per-Sample SLOT + N-gram Order-22 + TTT + LR=0.432 — val_bpb 0.39642 (3-seed mean)
by renqianluo
val_bpb
0.3964
Architecture
Transformer
Optimizer
AdamW
Artifact Size
15.90MB
Training Techniques
Architecture
BigramHash
Causal backoff n-gram mixer using hashed n-gram probabilities up to order 22.
parameters: {"order":22,"buckets":4000000}
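The record does not include the mixer's code; as a minimal sketch, one plausible reading of "hashed n-gram probabilities up to order 22 with causal backoff" is a fixed table of 4,000,000 buckets keyed by hashed (order, context) pairs, with prediction backing off from the longest matching context to shorter ones. All names below are hypothetical; a dict stands in for the dense count table.

```python
MAX_ORDER = 22        # values from the record's parameters
BUCKETS = 4_000_000

def bucket(key: tuple) -> int:
    # Hash an (order, *context) key into a fixed table slot.
    return hash(key) % BUCKETS

class HashedNgram:
    def __init__(self):
        self.counts = {}  # sparse stand-in for a dense count table

    def update(self, tokens):
        # Count every n-gram of order 1..MAX_ORDER ending at each position.
        for i in range(len(tokens)):
            for order in range(1, MAX_ORDER + 1):
                if i + 1 < order:
                    break  # not enough history for this order
                ctx = tuple(tokens[i + 1 - order:i])
                hist = self.counts.setdefault(bucket((order,) + ctx), {})
                hist[tokens[i]] = hist.get(tokens[i], 0) + 1

    def predict(self, tokens):
        # Causal backoff: try the longest context first, then shorter orders.
        for order in range(min(MAX_ORDER, len(tokens) + 1), 0, -1):
            ctx = tuple(tokens[len(tokens) - order + 1:])
            hist = self.counts.get(bucket((order,) + ctx))
            if hist:
                total = sum(hist.values())
                return {t: c / total for t, c in hist.items()}
        return {}
```

Hash collisions between different contexts sharing a bucket are tolerated by design; the bucket count trades accuracy against the artifact-size budget.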
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"steps":24,"lr_start":0.432,"lr_end":0.001,"beta1":0.6,"beta2":0.5,"batch_size":128}
LR Schedule
cosine decay
parameters: {"lr_start":0.432,"lr_end":0.001}
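With the schedule's endpoints and the optimizer's 24 steps from the record, a standard cosine decay (function name assumed) looks like:

```python
import math

def cosine_lr(step: int, steps: int = 24,
              lr_start: float = 0.432, lr_end: float = 0.001) -> float:
    # Cosine decay from lr_start at step 0 down to lr_end at the final step.
    t = step / max(steps - 1, 1)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * t))
```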
Test-Time Training
full TTT
parameters: {"optimizer":"AdamW","epochs":1,"learning_rate":0.001,"freeze_blocks":"0-9","second_pass_fraction":0.1,"floor_lr":0.0001}
Quantization
GPTQ
bits: 6
scope: model weights
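GPTQ proper compensates rounding error column-by-column using second-order (Hessian) information, which is beyond a short sketch. As a hedged stand-in, the snippet below shows only the symmetric 6-bit round-to-grid step that determines the storage cost (int6 range -32..31); names and scales are assumed, not from the record.

```python
import numpy as np

def quantize6(w, bits=6):
    # Symmetric round-to-nearest onto a 6-bit integer grid.
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale                 # int6 values held in int8

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```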
Other
other
Per-sample SLOT optimization with sequence-specific hidden delta and logit bias.
parameters: {"hidden_delta_shape":"[bsz,1,512]","logit_bias_shape":"[bsz,1,1024]","params":1536}
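The shapes imply 512 + 1024 = 1536 trainable values per sequence: a delta added to the hidden state before the output projection, plus a bias on the logits, with the base model frozen. A minimal NumPy sketch of one SLOT gradient step on cross-entropy (all function names assumed; the record's actual update rule is not given):

```python
import numpy as np

HIDDEN, VOCAB = 512, 1024   # shapes from the record's parameters

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def slot_logits(hidden, W_out, delta, bias):
    # hidden: [T, HIDDEN] frozen activations; delta [HIDDEN] and bias [VOCAB]
    # are the only per-sample parameters.
    return (hidden + delta) @ W_out + bias          # [T, VOCAB]

def slot_step(hidden, W_out, targets, delta, bias, lr=0.01):
    # One SGD step on mean cross-entropy w.r.t. the 1536 per-sample params.
    p = softmax(slot_logits(hidden, W_out, delta, bias))
    p[np.arange(len(targets)), targets] -= 1.0      # dCE/dlogits
    p /= len(targets)
    bias -= lr * p.sum(axis=0)                      # [VOCAB]
    delta -= lr * (p @ W_out.T).sum(axis=0)         # [HIDDEN]
    return delta, bias
```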
other
Multi-token prediction with 2 heads and auxiliary loss.
parameters: {"heads":2,"loss_weight":0.1}
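With two heads, the natural reading is that head 1 predicts the next token, head 2 the token after that, and the second head's loss enters with weight 0.1. A sketch of the combined loss (function name assumed):

```python
def mtp_loss(losses_head1, losses_head2, loss_weight=0.1):
    # losses_head*: per-position cross-entropies from each prediction head.
    # The auxiliary (second-token) head is down-weighted by loss_weight.
    main = sum(losses_head1) / len(losses_head1)
    aux = sum(losses_head2) / len(losses_head2)
    return main + loss_weight * aux
```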
Novel Contributions
- Per-sample SLOT with sequence-specific hidden delta and logit bias
- Causal backoff n-gram mixer with entropy-adaptive blending
- Order-22 hashed n-gram table within artifact budget
- Test-time training with a second pass over the first 10% of chunks
- GPTQ int6 quantization with damp=0.005
- Multi-token prediction with two heads
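The "entropy-adaptive blending" in the contribution list is not specified further; one plausible form, sketched here purely as an assumption, gates the n-gram mixer by its own confidence: a low-entropy (peaked) n-gram distribution gets high weight, a near-uniform one falls back to the transformer.

```python
import math

def blend(p_ngram, p_model, max_entropy):
    # Entropy-adaptive gate (hypothetical): entropy 0 -> weight 1 on the
    # n-gram distribution; max_entropy -> weight 0 (transformer only).
    h = -sum(p * math.log(p) for p in p_ngram if p > 0)
    w = max(0.0, 1.0 - h / max_entropy)
    return [w * a + (1.0 - w) * b for a, b in zip(p_ngram, p_model)]
```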