PR #764 (open)

Record: Curriculum Learning + LeakyReLU(0.9)² + 7-gram Backoff (val_bpb=0.9633)

by ndokutovich
val_bpb: 0.9633
Architecture: Transformer
Optimizer:
Artifact Size: 15.56 MB

Training Techniques

Architecture
  MLP3x: Transformer with 3x MLP and LeakyReLU(0.9)^2 activation; also includes XSA, BigramHash, SmearGate, SWA, EMA.
    parameters: {"layers": 11, "dimensions": 512, "gqa": "8/4"}
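A minimal sketch of the squared-LeakyReLU activation named above. The slope 0.9 comes from the listed parameters; the squaring convention (following the common ReLU^2 pattern, which makes negative inputs positive after the square) is an assumption, since the PR does not spell it out:

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, slope: float = 0.9) -> np.ndarray:
    # LeakyReLU with the reported slope, then an elementwise square.
    # The leaky slope keeps gradient flow for negative inputs; the
    # square follows the ReLU^2 pattern (an assumed reading here).
    y = np.where(x >= 0, x, slope * x)
    return y * y
```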
Quantization
  int6 QAT + GPTQ
    bits: 6
    scope: all
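As a rough illustration of 6-bit weight quantization, a symmetric per-tensor fake-quant step (the scheme is an assumption; the actual run combines QAT with GPTQ, whose per-layer error correction is not modeled here):

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor fake quantization to 6 bits (assumed scheme).

    int6 spans [-32, 31]; during QAT the forward pass uses these
    rounded values while gradients pass through unchanged.
    """
    qmax = 2 ** (6 - 1) - 1                      # 31
    max_abs = float(np.abs(w).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantized weights
```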
Weight Averaging
  EMA (parameters: null)
  SWA (parameters: null)
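An exponential moving average of model weights, as listed above, can be sketched as one in-place update per training step (the decay value is an assumption; the PR lists no parameters):

```python
import numpy as np

def ema_update(shadow: dict, params: dict, decay: float = 0.999) -> None:
    """One EMA step over named parameters, in place:

        shadow[k] <- decay * shadow[k] + (1 - decay) * params[k]

    The shadow (averaged) weights are typically swapped in for evaluation.
    """
    for k, p in params.items():
        shadow[k] = decay * shadow[k] + (1.0 - decay) * p
```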
Evaluation
  sliding window eval
    parameters: {"stride": 64}
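Sliding-window evaluation with stride 64 can be sketched as follows; only the stride comes from the listed parameters, and the span layout (score each token once, with up to a full window of prior context) is an assumption:

```python
def sliding_window_spans(seq_len: int, window: int, stride: int = 64):
    """Return (ctx_start, new_start, end) spans for strided evaluation.

    Tokens in [new_start, end) are scored with context starting at
    ctx_start, so each token is scored exactly once but sees up to
    `window` tokens of history.
    """
    spans = []
    new_start = 0
    while new_start < seq_len:
        if new_start == 0:
            end = min(window, seq_len)       # first window scores everything
        else:
            end = min(new_start + stride, seq_len)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, new_start, end))
        new_start = end
    return spans
```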
  7-gram backoff
    parameters: {"order": 7}
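One plausible form of the 7-gram backoff is a stupid-backoff n-gram table: look up the longest matching context and fall back to shorter ones with a fixed discount. Only the order (7) is stated; the backoff factor and table layout below are assumptions:

```python
from collections import defaultdict

def build_ngram_counts(tokens, order=7):
    """Count n-grams up to `order`, keyed as context -> next-token counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            ctx, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
            counts[ctx][nxt] += 1
    return counts

def backoff_prob(counts, context, token, order=7, alpha=0.4):
    """Stupid backoff: back off to shorter contexts, discounting by alpha."""
    for n in range(order - 1, -1, -1):
        ctx = tuple(context[-n:]) if n else ()
        ctx_counts = counts.get(ctx)
        if ctx_counts and token in ctx_counts:
            total = sum(ctx_counts.values())
            return (alpha ** (order - 1 - n)) * ctx_counts[token] / total
    return 0.0
```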
Test-Time Training
  score-first TTT
    parameters: {"epochs": 3, "freeze_last_blocks": 2}
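One reading of "legal score-first TTT": the score is recorded before any test-time updates, so the reported number never reflects weights adapted on the eval data, after which the model fine-tunes for a few epochs with the last blocks frozen. This interpretation and every name in the sketch (the toy model and its methods) are assumptions:

```python
class Block:
    def __init__(self):
        self.trainable = True

class ToyModel:
    """Stand-in model; all names here are illustrative, not the PR's API."""
    def __init__(self, n_blocks=4):
        self.blocks = [Block() for _ in range(n_blocks)]
        self.updates = 0
    def evaluate(self, tokens):
        return len(tokens) * 0.1        # placeholder score
    def train_on(self, tokens):
        self.updates += 1               # placeholder adaptation step

def score_first_ttt(model, eval_tokens, epochs=3, freeze_last_blocks=2):
    # Score BEFORE any test-time updates ("legal" / score-first),
    # then adapt on the eval stream with the last blocks frozen.
    score = model.evaluate(eval_tokens)
    for block in model.blocks[-freeze_last_blocks:]:
        block.trainable = False
    for _ in range(epochs):
        model.train_on(eval_tokens)
    return score
```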
Other
  Curriculum learning via shard reordering by model perplexity, hardest shards first.
    parameters: {"shard_ordering": "hardest_first"}
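The curriculum step reduces to a single sort once per-shard perplexities are measured. The `perplexity_of` helper below is hypothetical; only the hardest-first ordering comes from the PR:

```python
def order_shards_hardest_first(shards, perplexity_of):
    """Reorder training shards by measured model perplexity, descending,
    so the hardest (highest-perplexity) shards are trained on first.

    `perplexity_of` is an assumed callable: shard -> mean perplexity
    under the current model.
    """
    return sorted(shards, key=perplexity_of, reverse=True)
```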
  LeakyReLU slope raised from 0.5 to 0.9.
    parameters: {"slope": 0.9}

Novel Contributions

  • Curriculum learning via shard reordering by model perplexity
  • LeakyReLU(0.9)^2 slope optimization
  • 7-gram backoff evaluation cache
  • Legal score-first test-time training
  • Built on PR #753, combining its improvements with those above