PR #764 (open)

Record: Curriculum Learning + LeakyReLU(0.9)² + 7-gram Backoff (val_bpb=0.9633)

by ndokutovich
val_bpb: 0.9633
Architecture: Transformer
Optimizer:
Artifact Size: 15.56 MB

Training Techniques

Architecture
  MLP3x: Transformer with 3x MLP and LeakyReLU(0.9)^2 activation; also includes XSA, BigramHash, SmearGate, SWA, EMA.
    parameters: {"layers": 11, "dimensions": 512, "gqa": "8/4"}
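A minimal sketch of the squared-LeakyReLU activation named above. The slope 0.9 comes from the listed parameters; the squaring convention (following the common ReLU^2 pattern, which makes negative inputs positive after the square) is an assumption, since the PR does not spell it out:

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, slope: float = 0.9) -> np.ndarray:
    # LeakyReLU with the reported slope, then an elementwise square.
    # The leaky slope keeps gradient flow for negative inputs; the
    # square follows the ReLU^2 pattern (an assumed reading here).
    y = np.where(x >= 0, x, slope * x)
    return y * y
```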
Quantization
  int6 QAT + GPTQ
    bits: 6
    scope: all
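As a rough illustration of 6-bit weight quantization, a symmetric per-tensor fake-quant step (the scheme is an assumption; the actual run combines QAT with GPTQ, whose per-layer error correction is not modeled here):

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor fake quantization to 6 bits (assumed scheme).

    int6 spans [-32, 31]; during QAT the forward pass uses these
    rounded values while gradients pass through unchanged.
    """
    qmax = 2 ** (6 - 1) - 1                      # 31
    max_abs = float(np.abs(w).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantized weights
```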
Weight Averaging
  EMA (parameters: null)
  SWA (parameters: null)
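An exponential moving average of model weights, as listed above, can be sketched as one in-place update per training step (the decay value is an assumption; the PR lists no parameters):

```python
import numpy as np

def ema_update(shadow: dict, params: dict, decay: float = 0.999) -> None:
    """One EMA step over named parameters, in place:

        shadow[k] <- decay * shadow[k] + (1 - decay) * params[k]

    The shadow (averaged) weights are typically swapped in for evaluation.
    """
    for k, p in params.items():
        shadow[k] = decay * shadow[k] + (1.0 - decay) * p
```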
Evaluation
  sliding window eval
    parameters: {"stride": 64}
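Sliding-window evaluation with stride 64 can be sketched as follows; only the stride comes from the listed parameters, and the span layout (score each token once, with up to a full window of prior context) is an assumption:

```python
def sliding_window_spans(seq_len: int, window: int, stride: int = 64):
    """Return (ctx_start, new_start, end) spans for strided evaluation.

    Tokens in [new_start, end) are scored with context starting at
    ctx_start, so each token is scored exactly once but sees up to
    `window` tokens of history.
    """
    spans = []
    new_start = 0
    while new_start < seq_len:
        if new_start == 0:
            end = min(window, seq_len)       # first window scores everything
        else:
            end = min(new_start + stride, seq_len)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, new_start, end))
        new_start = end
    return spans
```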
  7-gram backoff
    parameters: {"order": 7}
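One plausible form of the 7-gram backoff is a stupid-backoff n-gram table: look up the longest matching context and fall back to shorter ones with a fixed discount. Only the order (7) is stated; the backoff factor and table layout below are assumptions:

```python
from collections import defaultdict

def build_ngram_counts(tokens, order=7):
    """Count n-grams up to `order`, keyed as context -> next-token counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            ctx, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
            counts[ctx][nxt] += 1
    return counts

def backoff_prob(counts, context, token, order=7, alpha=0.4):
    """Stupid backoff: back off to shorter contexts, discounting by alpha."""
    for n in range(order - 1, -1, -1):
        ctx = tuple(context[-n:]) if n else ()
        ctx_counts = counts.get(ctx)
        if ctx_counts and token in ctx_counts:
            total = sum(ctx_counts.values())
            return (alpha ** (order - 1 - n)) * ctx_counts[token] / total
    return 0.0
```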
Test-Time Training
  score-first TTT
    parameters: {"epochs": 3, "freeze_last_blocks": 2}
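One reading of "legal score-first TTT": the score is recorded before any test-time updates, so the reported number never reflects weights adapted on the eval data, after which the model fine-tunes for a few epochs with the last blocks frozen. This interpretation and every name in the sketch (the toy model and its methods) are assumptions:

```python
class Block:
    def __init__(self):
        self.trainable = True

class ToyModel:
    """Stand-in model; all names here are illustrative, not the PR's API."""
    def __init__(self, n_blocks=4):
        self.blocks = [Block() for _ in range(n_blocks)]
        self.updates = 0
    def evaluate(self, tokens):
        return len(tokens) * 0.1        # placeholder score
    def train_on(self, tokens):
        self.updates += 1               # placeholder adaptation step

def score_first_ttt(model, eval_tokens, epochs=3, freeze_last_blocks=2):
    # Score BEFORE any test-time updates ("legal" / score-first),
    # then adapt on the eval stream with the last blocks frozen.
    score = model.evaluate(eval_tokens)
    for block in model.blocks[-freeze_last_blocks:]:
        block.trainable = False
    for _ in range(epochs):
        model.train_on(eval_tokens)
    return score
```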
Other
  Curriculum learning via shard reordering by model perplexity, hardest shards first.
    parameters: {"shard_ordering": "hardest_first"}
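The curriculum step reduces to a single sort once per-shard perplexities are measured. The `perplexity_of` helper below is hypothetical; only the hardest-first ordering comes from the PR:

```python
def order_shards_hardest_first(shards, perplexity_of):
    """Reorder training shards by measured model perplexity, descending,
    so the hardest (highest-perplexity) shards are trained on first.

    `perplexity_of` is an assumed callable: shard -> mean perplexity
    under the current model.
    """
    return sorted(shards, key=perplexity_of, reverse=True)
```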
  LeakyReLU slope raised from 0.5 to 0.9.
    parameters: {"slope": 0.9}

Novel Contributions

  • Curriculum learning via shard reordering by model perplexity
  • LeakyReLU(0.9)^2 slope optimization
  • 7-gram backoff evaluation cache
  • Legal score-first test-time training
  • Built on PR #753, combining its improvements with those above