PR #1095
openRecord: Seed-Regenerated Random Model + Incremental N-gram Cache — val_bpb 0.0905
by vimeto
val_bpb: 0.0905
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.09 MB
Training Techniques
Initialization
OrthoInit
Frozen orthogonal random projections regenerated from deterministic 8-byte seeds at load time.
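The idea of storing only an 8-byte seed and regenerating the frozen projection at load time can be sketched as follows. This is a minimal illustration, not the PR's code: the seeded-Gaussian-plus-QR construction and the sign fix are assumptions about how a deterministic orthogonal matrix might be rebuilt.

```python
import numpy as np

def regen_orthogonal(seed: bytes, shape: tuple) -> np.ndarray:
    """Rebuild a frozen random orthogonal projection from an 8-byte seed.

    Only the seed is stored in the artifact; QR of a seeded Gaussian
    draw deterministically reproduces the same orthonormal matrix.
    """
    rng = np.random.default_rng(int.from_bytes(seed, "little"))
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    # sign-fix each column so the factorization is unique/deterministic
    q *= np.sign(np.diag(r))
    return q

W = regen_orthogonal(b"\x01\x02\x03\x04\x05\x06\x07\x08", (64, 64))
# W @ W.T is (numerically) the identity, yet only 8 bytes are stored
```

Because the matrix is frozen, regeneration at load time is exact: the same seed always yields the same weights.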
Quantization
int8
bits: 8
scope: LoRA adapters
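A minimal sketch of int8 quantization as it might apply to the LoRA adapters. The PR's exact scheme is not specified here; this assumes symmetric per-tensor quantization (per-channel scales would be a natural variant).

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map the max-abs value
    to 127 and round. Reconstruction error is bounded by scale / 2."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
err = float(np.abs(w - w_hat).max())  # <= s / 2
```

Storing the adapters as int8 quarters their size versus float32, consistent with the small reported BPB loss.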
Compression
lzma
level: 9
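Artifact packing with LZMA at the listed level can be shown with the standard library. The payload here is stand-in data; how the PR serializes its tensors is an assumption.

```python
import lzma
import numpy as np

# Stand-in for serialized int8 adapter bytes; low-entropy weight data
# compresses well under LZMA preset 9 (the "level: 9" above).
payload = np.zeros(1 << 16, dtype=np.int8).tobytes()
packed = lzma.compress(payload, preset=9)
restored = lzma.decompress(packed)
# round-trips exactly; packed is far smaller than payload
```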
Architecture
LeakyReLU
Uses a squared LeakyReLU activation with negative slope 0.5 in the MLP.
parameters: {"slope":0.5,"squared":true}
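The activation with the parameters above ({"slope":0.5,"squared":true}) is straightforward to state as code; a scalar sketch:

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU with negative slope 0.5, then squared. Squaring keeps
    the output non-negative while the leak preserves gradient signal
    for negative pre-activations."""
    y = x if x >= 0 else slope * x
    return y * y

# leaky_relu_squared(2.0)  -> 4.0
# leaky_relu_squared(-2.0) -> 1.0   (0.5 * -2 = -1.0, squared)
```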
weight tying
Ties the token embedding and output projection (unembedding) weights.
parameters: null
GQA
Uses grouped-query attention (GQA) with fewer KV heads than query heads.
parameters: {"heads":8,"kv_heads":4}
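With 8 query heads sharing 4 KV heads, each KV head serves 2 query heads, halving KV storage. A minimal sketch of the head-sharing step (shapes and the repeat-based expansion are a common GQA implementation, assumed here rather than taken from the PR):

```python
import numpy as np

def expand_kv(kv: np.ndarray, n_heads: int = 8, n_kv_heads: int = 4):
    """Expand grouped KV heads so each query head sees its shared KV head.

    kv: (n_kv_heads, seq, head_dim) -> (n_heads, seq, head_dim),
    repeating each KV head n_heads // n_kv_heads times.
    """
    group = n_heads // n_kv_heads  # query heads per KV head (here 2)
    return np.repeat(kv, group, axis=0)

kv = np.arange(4 * 3 * 2, dtype=np.float32).reshape(4, 3, 2)
k_full = expand_kv(kv)  # shape (8, 3, 2); adjacent query heads share a KV head
```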
MLP3x
Uses an MLP hidden-dimension multiplier of 3.0.
parameters: {"multiplier":3}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
AdamW
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
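The EMA update with decay 0.997 can be sketched in a few lines. This is a scalar-list illustration of the standard EMA rule; the PR's code would operate on parameter tensors.

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * params.
    With decay 0.997 the average has an effective horizon of
    roughly 1 / (1 - 0.997) ≈ 333 steps."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

avg = [0.0]
for _ in range(10):
    avg = ema_update(avg, [1.0])
# after n steps toward 1.0 from 0.0, avg = 1 - 0.997**n
```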
Sequence Length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64}
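Sliding-window evaluation with stride 64 scores each token with near-full left context: every forward pass scores only its last `stride` tokens, so each token is evaluated exactly once. A sketch of the window bookkeeping (the generator and tuple layout are assumptions, not the PR's code):

```python
def eval_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (ctx_start, ctx_end, score_start) triples. Each window
    scores only the tokens in [score_start, ctx_end), i.e. at most
    `stride` of them, while conditioning on up to `window` tokens."""
    for score_start in range(0, n_tokens, stride):
        ctx_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, ctx_end - window)
        yield (ctx_start, ctx_end, score_start)
```

This trades roughly `window / stride` times more forward passes for a lower (more honest) BPB than chunked evaluation.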
Other
Incrementally builds an n-gram cache during training, freezes it for evaluation, and blends neural and n-gram probabilities with an entropy-adaptive alpha.
parameters: {"ngram_orders":"2-7","cache_type":"INT16","multi_gpu_sync":"all_reduce"}
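The count-then-freeze-then-blend idea can be sketched for a single n-gram order. This is a minimal illustration under stated assumptions: the class name, the normalized-entropy formula for alpha, and the dict-based probability blend are all hypothetical; the PR uses orders 2–7, an INT16 cache, and all_reduce sync across GPUs, none of which this sketch reproduces.

```python
import math
from collections import defaultdict

class NgramBlender:
    """Count n-grams during training, freeze, then mix n-gram and
    neural next-token distributions. Alpha grows with the neural
    model's entropy: trust counts more when the model is uncertain."""

    def __init__(self, order: int = 2):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))
        self.frozen = False

    def observe(self, tokens):
        if self.frozen:
            return
        for i in range(len(tokens) - self.order + 1):
            ctx = tuple(tokens[i:i + self.order - 1])
            self.counts[ctx][tokens[i + self.order - 1]] += 1

    def freeze(self):
        self.frozen = True  # evaluation sees a fixed cache

    def blend(self, ctx, neural_probs):
        ctx = tuple(ctx[-(self.order - 1):])
        total = sum(self.counts[ctx].values())
        if total == 0:
            return neural_probs  # no counts: fall back to the model
        ngram = {t: c / total for t, c in self.counts[ctx].items()}
        # entropy of the neural distribution, normalized to [0, 1]
        h = -sum(p * math.log(p) for p in neural_probs.values() if p > 0)
        alpha = h / math.log(len(neural_probs)) if len(neural_probs) > 1 else 0.0
        return {t: (1 - alpha) * neural_probs.get(t, 0.0)
                   + alpha * ngram.get(t, 0.0)
                for t in set(neural_probs) | set(ngram)}
```

Because counting is a dictionary increment per token, the training-time overhead is small, matching the "negligible overhead" claim below.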
Novel Contributions
- Seed-regenerated frozen orthogonal random base weights stored as 8-byte seeds instead of full matrices
- Incremental n-gram cache built during training with negligible overhead
- Entropy-adaptive blending of neural and n-gram probabilities
- INT8 quantization of LoRA adapters with small BPB loss
- Orthogonal initialization enabling stable training of deeper random-weight models