PR #1438 (open)

Non-record: Mixed-Int6 LZMA9 B3072 Warm5000

by sabdulmajid
val_bpb: 1.2029
Architecture: Transformer
Optimizer:
Artifact Size: 15,991,188 bytes

Training Techniques

Weight Averaging: EMA
parameters: {"decay":0.997}
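The EMA entry uses the standard exponential-moving-average update on the weights. A minimal sketch, with plain Python dicts of float lists standing in for the model's tensors (the decay value comes from the submission's parameters; everything else here is illustrative):

```python
DECAY = 0.997  # from the submission's {"decay": 0.997}

def ema_update(ema, params, decay=DECAY):
    """In-place EMA update: ema <- decay * ema + (1 - decay) * params."""
    for name, p in params.items():
        e = ema[name]
        for i, v in enumerate(p):
            e[i] = decay * e[i] + (1.0 - decay) * v
    return ema
```

After each optimizer step the shadow weights drift slowly toward the live weights; the EMA copy is what gets exported for evaluation.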
Architecture: XSA (applied to the last 4 layers)
parameters: {"layers":4}
Architecture: BigramHash (bigram hash embedding stack)
parameters: {"vocab_size":3072,"dimensions":128}
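A bigram hash embedding maps each adjacent token pair into a fixed number of buckets and looks up a learned vector per bucket. The bucket count (3072) and dimension (128) come from the parameters above; the hash function, the BOS handling, and how the result is combined with the unigram embedding are not specified in the submission, so those parts of this sketch are assumptions:

```python
import numpy as np

VOCAB_BUCKETS = 3072   # {"vocab_size": 3072}
DIM = 128              # {"dimensions": 128}

rng = np.random.default_rng(0)
# Stand-in for a learned table; in training this would be a parameter.
bigram_table = rng.standard_normal((VOCAB_BUCKETS, DIM)).astype(np.float32)

def bigram_bucket(prev_tok: int, tok: int) -> int:
    # Assumed multiplicative hash of the (previous, current) token pair;
    # the submission does not specify the exact hash.
    return ((prev_tok * 1000003) ^ tok) % VOCAB_BUCKETS

def bigram_embed(tokens):
    """Return one 128-d hashed-bigram embedding per position."""
    out = np.zeros((len(tokens), DIM), dtype=np.float32)
    prev = 0  # assume a BOS id of 0 for the first position
    for i, t in enumerate(tokens):
        out[i] = bigram_table[bigram_bucket(prev, t)]
        prev = t
    return out
```

Hashing keeps the table small (3072 rows instead of a full vocab-squared bigram table) at the cost of collisions between unrelated bigrams.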
Architecture: LeakyReLU (squared LeakyReLU activation)
parameters: {"squared":true,"slope":0.5}
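One plausible reading of "LeakyReLU squared" with slope 0.5 is an elementwise square of the LeakyReLU output; note this makes the negative branch non-negative, and the submission does not say whether the sign is preserved instead, so treat this as a sketch of one interpretation:

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    # LeakyReLU followed by an elementwise square:
    #   x > 0:  x**2
    #   x <= 0: (slope * x)**2
    y = x if x > 0 else slope * x
    return y * y
```

With slope 0.5, an input of -2.0 maps to -1.0 under LeakyReLU and then to 1.0 after squaring.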
Quantization: mixed int6 (bits: 6, scope: mlp;attn;embed)
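The int6 export covers the MLP, attention, and embedding weights. A common way to do this is symmetric per-tensor quantization into the 6-bit range [-32, 31]; the submission says "mixed" without detailing how tensors differ, so this sketch shows only the basic per-tensor scheme:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor int6 quantization sketch (values in [-32, 31])."""
    peak = np.abs(w).max()
    scale = peak / 31.0 if peak > 0 else 1.0
    # 6-bit codes stored in int8; the extra 2 bits would be packed away
    # in a real artifact to hit the size cap.
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step (scale / 2) per element.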
Compression: lzma (level: 9)
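The final artifact is LZMA-compressed at the maximum preset. In Python's standard library this is `lzma.compress` with `preset=9`; the "extreme" variant mentioned in the contributions adds the `PRESET_EXTREME` flag. Function names here are illustrative:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # Preset 9 with the extreme flag trades compression time for ratio,
    # which matters when squeezing under a 16MB artifact cap.
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
```

Decompression needs no preset; `lzma.decompress` recovers the original bytes.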
Evaluation: sliding window eval
parameters: {"seed":42}
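Sliding-window evaluation typically scores a long sequence in strides, giving each scored chunk up to a full window of left context. The submission only records a seed, so the window and stride values below are illustrative, not the submission's settings:

```python
def sliding_window_spans(n_tokens: int, window: int = 2048, stride: int = 512):
    """Yield (context_start, score_start, score_end) index triples so that
    every token is scored exactly once with up to `window` tokens of context."""
    spans = []
    start = 0
    while start < n_tokens:
        ctx_start = max(0, start + stride - window)
        spans.append((ctx_start, start, min(start + stride, n_tokens)))
        start += stride
    return spans
```

Each span is fed to the model as `tokens[context_start:score_end]`, with the loss accumulated only over `[score_start, score_end)`.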
Sequence Length: train_length: 2048, eval_length: null
LR Schedule: warmdown
parameters: {"warmdown_steps":5000}
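A warmdown schedule usually holds the learning rate constant and then decays it linearly to zero over the final `warmdown_steps`. The 5000-step warmdown and 16000 total steps come from the submission's parameters; the base learning rate and the linear decay shape are assumptions:

```python
def lr_at(step: int, base_lr: float = 1e-3,
          total_steps: int = 16000, warmdown_steps: int = 5000) -> float:
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    decay_start = total_steps - warmdown_steps  # step 11000 here
    if step < decay_start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / warmdown_steps)
```

Halfway through the warmdown (step 13500) the rate is half the base value, reaching zero at step 16000.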
Other: long single-GPU unlimited-compute training run
parameters: {"steps":16000,"training_time_seconds":57979.039}

Novel Contributions

  • Non-record unlimited-compute 16MB submission
  • Longer single-GPU training of the EMA + XSA(last-4) + BigramHash3072 + LeakyReLU^2 flat-transformer stack
  • Broad mixed-int6 export over mlp;attn;embed
  • LZMA9 extreme compression to fit the 16MB artifact cap
  • Preserved raw checkpoint and re-exported it independently for evaluation