PR #764
Record: Curriculum Learning + LeakyReLU(0.9)² + 7-gram Backoff (val_bpb=0.9633)
by ndokutovich
val_bpb: 0.9633
Architecture: Transformer
Optimizer: —
Artifact Size: 15.56 MB
Training Techniques
Architecture
MLP3x
Transformer with 3x MLP and LeakyReLU(0.9)^2 activation; also includes XSA, BigramHash, SmearGate, SWA, EMA.
parameters: {"layers":11,"dimensions":512,"gqa":"8/4"}
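The activation named above squares the output of a LeakyReLU with negative slope 0.9. A minimal pure-Python sketch (the function name is illustrative, not from the PR):

```python
def leaky_relu_sq(x, slope=0.9):
    """LeakyReLU(0.9)^2: apply LeakyReLU with negative slope 0.9, then square.

    The output is always non-negative; with slope 0.9 the negative branch
    is nearly symmetric with the positive one.
    """
    y = x if x >= 0 else slope * x
    return y * y
```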
Quantization
int6 QAT + GPTQ
bits: 6
scope: all
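int6 QAT keeps weights representable in 6-bit signed integers during training. A sketch of symmetric per-tensor 6-bit fake quantization, the forward-pass half of QAT (GPTQ's error-compensating rounding is not shown; the helper name is illustrative):

```python
def fake_quant_int6(weights, bits=6):
    """Round weights to 6-bit signed integers, then dequantize back.

    Symmetric per-tensor scaling: qmax = 2**(bits-1) - 1 = 31 for int6,
    so quantized values lie in [-32, 31].
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # guard all-zero input
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [qi * scale for qi in q]
```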
Weight Averaging
EMA
parameters: null
SWA
parameters: null
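Both EMA and SWA maintain a running average of the weights alongside the training run. A minimal EMA update over a parameter dict (the decay value is an assumption; the PR does not state it):

```python
def ema_update(ema, params, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * params, per entry.

    decay=0.999 is a typical default, assumed here, not taken from the PR.
    """
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in ema}
```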
Evaluation
sliding window eval
parameters: {"stride":64}
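With stride 64, evaluation windows overlap so that every token is scored with substantial left context, but each window contributes only its newest tokens to the loss. A sketch of the span bookkeeping (the window size is illustrative; only the stride comes from the PR):

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    """Return (ctx_start, ctx_end, n_scored) spans covering n_tokens.

    Each span is fed to the model as context [ctx_start, ctx_end); only
    the last n_scored tokens of the span are newly scored, so every token
    is scored exactly once.
    """
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```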
7-gram backoff
parameters: {"order":7}
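A 7-gram backoff scores each token against its longest previously seen n-gram context, falling back to shorter contexts when the full 7-gram is unseen. A minimal stupid-backoff sketch (the alpha penalty and class name are assumptions; the PR only specifies order 7):

```python
from collections import defaultdict

class StupidBackoff7:
    """N-gram scorer with stupid backoff, order 7 as in the PR."""

    def __init__(self, order=7, alpha=0.4):
        self.order, self.alpha = order, alpha
        self.counts = defaultdict(int)

    def train(self, tokens):
        # count every n-gram up to the full order
        for n in range(1, self.order + 1):
            for i in range(len(tokens) - n + 1):
                self.counts[tuple(tokens[i:i + n])] += 1

    def score(self, context, token):
        # back off from the longest context, multiplying alpha per step
        ctx = tuple(context[-(self.order - 1):])
        penalty = 1.0
        while ctx:
            if self.counts.get(ctx + (token,), 0) > 0:
                return penalty * self.counts[ctx + (token,)] / self.counts[ctx]
            ctx, penalty = ctx[1:], penalty * self.alpha
        total = sum(c for k, c in self.counts.items() if len(k) == 1)
        return penalty * self.counts.get((token,), 0) / total
```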
Test-Time Training
score-first TTT
parameters: {"epochs":3,"freeze_last_blocks":2}
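The TTT configuration adapts the model for 3 epochs while keeping the last 2 transformer blocks frozen. A sketch of just the freezing bookkeeping, with blocks represented abstractly by index (only `epochs` and `freeze_last_blocks` come from the PR; the function name is illustrative):

```python
def ttt_schedule(n_blocks, epochs=3, freeze_last_blocks=2):
    """Which block indices are trainable during test-time training, per epoch.

    The last `freeze_last_blocks` blocks stay frozen throughout; every
    epoch fine-tunes the same leading blocks.
    """
    trainable = list(range(n_blocks - freeze_last_blocks))
    return [trainable for _ in range(epochs)]
```

With the 11 layers from the architecture parameters, this leaves blocks 0 through 8 trainable for all 3 epochs.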
Other
other
Curriculum learning via shard reordering by model perplexity, hardest shards first.
parameters: {"shard_ordering":"hardest_first"}
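The reordering itself reduces to a single sort: measure each shard's perplexity under the current model, then train on the hardest shards first. A minimal sketch (the perplexity measurement itself is not shown, and the function name is illustrative):

```python
def order_shards_hardest_first(shard_perplexity):
    """Curriculum ordering: highest-perplexity (hardest) shards first.

    shard_perplexity maps shard id -> model perplexity on that shard.
    """
    return sorted(shard_perplexity, key=shard_perplexity.get, reverse=True)
```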
other
LeakyReLU negative slope raised from 0.5 to 0.9.
parameters: {"slope":0.9}
Novel Contributions
- Curriculum learning via shard reordering by model perplexity
- LeakyReLU(0.9)^2 slope optimization
- 7-gram backoff evaluation cache
- Legal score-first test-time training
- Built on PR #753 with combined improvements