PR #659 (closed)
Record: 5-gram Eval Cache + LeakyReLU² + Parallel Muon | val_bpb: 1.0920 (3-seed mean, std 0.0007) | ~15.9 MB | 8×H100 SXM
by deanbrr
val_bpb: 1.0920
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: ~15.9 MB
Training Techniques
Architecture
- MLP3x: three-layer MLP block
- LeakyReLU: LeakyReLU(0.5) squared in the MLP (slope=0.5, power=2)
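A minimal sketch of this activation, assuming the listed parameters mean f(x) = (LeakyReLU with slope 0.5)² applied elementwise; whether the sign is preserved after squaring is not stated, so this takes the plain square:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with negative slope 0.5, then squared elementwise.

    Squaring gives a smooth, ReLU^2-like curve while the leaky slope
    keeps gradient signal flowing for negative inputs.
    """
    y = np.where(x >= 0.0, x, slope * x)
    return y * y

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
out = leaky_relu_squared(x)
# elementwise values: 1.0, 0.25, 0.0, 1.0, 4.0
```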
- BigramHash: used in the architecture (size=1536)
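The record only gives the table size (1536). One plausible reading is a hashed bigram-embedding table, sketched below; the hash function, embedding dimension, and how the vector is consumed are all assumptions:

```python
import numpy as np

class BigramHash:
    """Sketch of a hashed bigram-embedding component. Table size 1536 is
    from the listed parameters; dim=16 and the mixing hash are assumptions."""

    def __init__(self, size=1536, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        self.table = rng.standard_normal((size, dim)).astype(np.float32) * 0.02

    def __call__(self, prev_tok, tok):
        # Mix the (previous, current) token pair into one bucket index.
        h = (prev_tok * 1_000_003 + tok) % self.size
        return self.table[h]  # e.g. added to the regular token embedding

bh = BigramHash()
# the same bigram always maps to the same vector
v1, v2 = bh(17, 42), bh(17, 42)
```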
- XSA: applied to the last 4 layers (layers=4)
- RoPE: partial rotary positional embeddings (16 of 64 dimensions)
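A sketch of partial RoPE on one head vector, rotating only the first 16 of 64 dimensions per the listed parameters; the base frequency of 10000 and the pairing layout are common defaults, assumed here:

```python
import numpy as np

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of a 64-dim head vector (the
    listed 16/64 split); base frequency and pair layout are assumptions."""
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair frequencies
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:rot_dims]
    out = x.copy()
    out[:half] = x1 * cos - x2 * sin            # standard 2-D rotation
    out[half:rot_dims] = x1 * sin + x2 * cos
    return out                                  # dims 16..63 pass through

v = np.ones(64)
r = partial_rope(v, pos=5)
```

Rotations are norm-preserving on the rotated slice, and the remaining 48 dimensions carry no positional signal at all.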
- VE128: applied to layers 9-10 (layers=[9,10])
Regularization
- layerwise LN scale: 1/sqrt(layer+1)
Weight Averaging
- EMA + SWA (ema_decay=0.997, swa_interval=50)
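A sketch of tracking both averages, using the listed ema_decay=0.997 and swa_interval=50; scalar weights stand in for tensors, and how the two averages are ultimately combined or selected is not stated in the record:

```python
class WeightAverager:
    """Tracks an exponential moving average (decay 0.997) alongside a
    stochastic weight average snapshotted every 50 steps, per the listed
    parameters; the final combination rule is an assumption."""

    def __init__(self, w0, ema_decay=0.997, swa_interval=50):
        self.ema = w0
        self.ema_decay = ema_decay
        self.swa_interval = swa_interval
        self.swa_sum, self.swa_n = 0.0, 0

    def update(self, w, step):
        self.ema = self.ema_decay * self.ema + (1.0 - self.ema_decay) * w
        if step % self.swa_interval == 0:       # periodic SWA snapshot
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)

wa = WeightAverager(0.0)
for step in range(1, 201):
    wa.update(2.0, step)
# SWA snapshots at steps 50/100/150/200 average to exactly 2.0;
# the EMA lies strictly between the start (0.0) and the target (2.0).
```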
Quantization
- GPTQ-lite (bits=6, scope=all)
Compression
- lzma
Optimizer
- Parallel Muon (weight decay, momentum, and other hyperparameters not specified)
Evaluation
- sliding window eval (stride=128)
- online n-gram cache eval (ngram_max_n=5, confidence_threshold=0.5, min_count=3, ngram_lambda=0.15)
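The online n-gram cache eval can be sketched as below. The parameters (max n = 5, confidence threshold 0.5, min count 3, λ = 0.15) come from the listed config; the counting scheme and the mixing formula are one plausible reading of the description, not the PR's exact code:

```python
import math
from collections import defaultdict

class NgramEvalCache:
    """CPU-side cache of n-gram counts (n up to 5) accumulated only from
    tokens that were already scored, so lookups stay strictly backward-
    looking and cost nothing on the GPU."""

    def __init__(self, max_n=5, min_count=3, conf=0.5, lam=0.15):
        self.max_n, self.min_count, self.conf, self.lam = max_n, min_count, conf, lam
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> next -> count

    def observe(self, history):
        """Record the n-grams ending at the newest (just-scored) token."""
        for n in range(2, self.max_n + 1):
            if len(history) >= n:
                ctx, nxt = tuple(history[-n:-1]), history[-1]
                self.counts[ctx][nxt] += 1

    def adjust_logprob(self, history, next_tok, model_logprob):
        """Mix in the n-gram probability when the cache is confident; the
        max() safety gate guarantees the result never drops below the
        model's own log-probability."""
        for n in range(self.max_n, 1, -1):            # longest context first
            if len(history) < n - 1:
                continue
            ctx = tuple(history[-(n - 1):])
            nexts = self.counts.get(ctx)
            if not nexts:
                continue
            total = sum(nexts.values())
            if total < self.min_count:
                continue
            if max(nexts.values()) / total < self.conf:
                continue                              # cache not confident
            p_ngram = nexts.get(next_tok, 0) / total
            mixed = math.log((1.0 - self.lam) * math.exp(model_logprob)
                             + self.lam * p_ngram)
            return max(model_logprob, mixed)          # safety gate
        return model_logprob

cache = NgramEvalCache()
history = []
for tok in [7, 8, 9] * 4:          # strongly repetitive validation text
    history.append(tok)
    cache.observe(history)

better = cache.adjust_logprob(history, 7, math.log(0.1))  # cache helps
same = cache.adjust_logprob(history, 8, math.log(0.4))    # gate protects
```

Because the gate takes the max of the model's log-probability and the mixed one, the adjustment can only ever improve the bpb on a token, never hurt it.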
Test-Time Training
- TTT disabled
Novel Contributions
- Online 5-gram evaluation cache accumulated from already-scored tokens during sliding-window validation
- Confidence-gated log-sum-exp mixing with a safety gate that can never worsen a prediction
- Strictly backward-looking CPU-only n-gram lookup strategy with zero GPU cost
- Eval-time improvement only, with no training changes to the base model
- Stride-based evaluation configuration tuned to fit within the time budget
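The stride-based evaluation schedule can be sketched as follows; only stride=128 comes from the listed config, while the window size of 512 and the exact scheduling policy are assumptions:

```python
def sliding_windows(n_tokens, window=512, stride=128):
    """Stride-based sliding-window eval schedule: each window scores only
    the tokens not yet scored, so every token is scored exactly once while
    later windows get left context from earlier tokens. Only stride=128
    comes from the listed config; window=512 is an assumption."""
    spans, start, scored = [], 0, 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored))  # context [start,end), score [scored,end)
        scored = end
        start += stride
    return spans

spans = sliding_windows(1000)
# every token in [0, 1000) falls in exactly one scored span
```

A smaller stride gives each scored token more left context at the cost of more forward passes, which is the knob tuned to fit the time budget.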