PR #1052 (closed)
Merge: Autoresearch/mar28 experiments on 4xH20

val_bpb: 1.1978
Architecture: Transformer
Optimizer: Muon
Artifact Size:
Training Techniques

Optimizer: Muon
  weight_decay: 0.04
  momentum: 0.99
  other_params: {"warmdown_schedule": true}
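The Muon settings above can be sketched as orthogonalized momentum SGD: accumulate a momentum buffer, approximately orthogonalize it with a Newton-Schulz iteration, and apply decoupled weight decay. This is a minimal numpy sketch; weight_decay=0.04 and momentum=0.99 come from the PR metadata, while the learning rate, the five-iteration count, and the quintic coefficients are assumptions (the coefficients follow the public Muon reference implementation).

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.

    Coefficients follow the public Muon reference implementation (assumed,
    not stated in this PR).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # Frobenius-normalize so spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X               # acts as f(s) = a*s + b*s^3 + c*s^5 on singular values
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon-style update: momentum, orthogonalize, decoupled weight decay."""
    momentum_buf = momentum * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    param = param * (1.0 - lr * weight_decay) - lr * update
    return param, momentum_buf
```

The lr value here is a placeholder; only the weight decay and momentum are recorded in the PR.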
Quantization: mixed int6
  bits: 6
  scope: artifact
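A plausible reading of "mixed int6" is symmetric per-tensor quantization to the 6-bit signed range, with some sensitive tensors left at higher precision; the sketch below covers only the 6-bit path, and storing codes in int8 (rather than bit-packing 6 bits per weight) is a simplification.

```python
import numpy as np

INT6_MAX = 31  # 6-bit signed two's-complement range is [-32, 31]

def quantize_int6(w):
    """Symmetric per-tensor quantization of float weights to int6 codes."""
    scale = np.abs(w).max() / INT6_MAX if w.size else 1.0
    q = np.clip(np.round(w / scale), -INT6_MAX - 1, INT6_MAX).astype(np.int8)
    return q, float(scale)

def dequantize_int6(q, scale):
    """Recover approximate float weights from int6 codes and a scale."""
    return q.astype(np.float32) * scale
```

With symmetric rounding the reconstruction error per weight is at most half a quantization step (scale / 2).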
Evaluation: sliding window eval
  parameters: {"stride": 64}
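Sliding-window evaluation with a stride of 64 typically scores each token exactly once while giving it up to window - stride tokens of left context. A sketch of the span bookkeeping; the window size is an assumption, since the PR records only the stride.

```python
def sliding_window_spans(n_tokens, window, stride=64):
    """Return (start, end, n_new) spans for sliding-window evaluation.

    Each window covers tokens [start, end); only the n_new trailing tokens
    are newly scored, so every token is scored exactly once.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        n_new = end - prev_end          # tokens not yet scored by earlier windows
        spans.append((start, end, n_new))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

For example, with 300 tokens and a (hypothetical) window of 128, the spans are (0, 128, 128), (64, 192, 64), (128, 256, 64), (192, 300, 44).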
Weight Averaging: EMA
  parameters: {"decay": [0.995, 0.997]}
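The EMA entry lists two decay values, which suggests two averaged copies of the weights are tracked in parallel; which copy feeds the final artifact is not stated here, so this sketch simply keeps both.

```python
import numpy as np

class EMAWeights:
    """Maintain one exponential-moving-average copy of the weights per decay."""

    def __init__(self, params, decays=(0.995, 0.997)):
        self.decays = decays
        self.shadows = [{k: v.copy() for k, v in params.items()} for _ in decays]

    def update(self, params):
        """Blend current weights into each shadow copy: s = d*s + (1-d)*w."""
        for d, shadow in zip(self.decays, self.shadows):
            for k, v in params.items():
                shadow[k] = d * shadow[k] + (1.0 - d) * v
```

With decay 0.995 the averaging horizon is roughly 1 / (1 - 0.995) = 200 update steps.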
Architecture
MLP width: expanded the MLP width from 3x to 3.5x
  parameters: {"from": 3, "to": 3.5}
LeakyReLU: LeakyReLU squared activation
  parameters: {"power": 2, "slope": 0.5}
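A sketch of the "LeakyReLU squared" activation with the PR's parameters (power=2, slope=0.5). Reading it as squaring the leaky output is an assumption; a sign-preserving variant (y * abs(y)) is another plausible interpretation.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """f(x) = leaky_relu(x; slope) ** 2.

    Note: plain squaring makes the negative branch non-negative; a
    sign-preserving variant (y * np.abs(y)) is another plausible reading.
    """
    y = np.where(x >= 0, x, slope * x)
    return y * y
```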
BigramHash: character bigram hash embeddings
  parameters: {"dimensions": 4096}
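A sketch of hashing character bigrams into a fixed-size embedding table. Reading "dimensions": 4096 as the number of hash buckets is an assumption (it could also mean the embedding width), and the mixing function below is a placeholder; the PR does not record the actual hash.

```python
N_BUCKETS = 4096  # "dimensions" from the PR, read here as the hash-table size

def bigram_bucket(prev_byte, cur_byte, n_buckets=N_BUCKETS):
    """Hash one character (byte) bigram to a bucket index in [0, n_buckets)."""
    h = (prev_byte * 31 + cur_byte) * 2654435761  # placeholder multiplicative mix
    return h % n_buckets

def bigram_indices(text, n_buckets=N_BUCKETS):
    """Map a string to one embedding-table index per consecutive byte pair."""
    data = text.encode("utf-8")
    return [bigram_bucket(data[i - 1], data[i], n_buckets)
            for i in range(1, len(data))]
```

Each index would select a row of a learned (n_buckets, embed_dim) table that is summed into the token embedding.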
MLP4x: removed the bigram embeddings and used a larger MLP
  parameters: null
MHA: added full multi-head attention
  parameters: {"kv_heads": 8}
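"Full multi-head attention" with kv_heads=8 is read here as every query head having its own key/value head (no grouped-query sharing). A shape-level numpy sketch, omitting causal masking, biases, and positional encoding to stay short.

```python
import numpy as np

def multi_head_attention(x, wq, wk, wv, wo, n_heads=8):
    """Full MHA: kv_heads == n_heads, so no KV sharing across query heads.

    x: (seq, d_model); wq/wk/wv/wo: (d_model, d_model).
    """
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, seq, seq)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax over keys
    out = (probs @ v).transpose(1, 0, 2).reshape(seq, d_model)
    return out @ wo
```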
Compression: zstd
  level: 22
Sequence Length
  train_length: 8192
  eval_length: null
LR Schedule: warmdown
  parameters: {"warmdown_steps": 4000}
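A sketch of the warmdown schedule: hold the base learning rate, then decay linearly to zero over the final 4,000 steps. Only warmdown_steps comes from the PR; the base rate and the absence of a warmup phase are assumptions.

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_steps=4000):
    """Hold base_lr, then decay linearly to zero over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1.0 at decay_start, 0.0 at end
    return base_lr * max(frac, 0.0)
```

For a hypothetical 10,000-step run, the rate stays at base_lr until step 6,000 and reaches zero at step 10,000.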
Test-Time Training: full TTT
  parameters: {"chunk": 65536}

Novel Contributions

  • Muon optimizer tuning with weight decay, momentum, and warmdown schedule
  • Mixed-precision int6 quantization to fit the artifact under 16MB
  • Sliding window evaluation with stride 64
  • EMA weight averaging
  • BigramHash character embeddings
  • Sequence packing to 8192 tokens
  • MLP width expansion and LeakyReLU squared activation
  • Full multi-head attention with 8 KV heads
  • Test-time training with a 65,536-token chunk size