| Field | Value |
|---|---|
| val_bpb | 1.4612 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15,983,603 bytes (≈16 MB) |
Training Techniques
Architecture
- **XSA**: cross-layer self-attention, applied only in the final 4 layers (parameters: `{"layers": 4}`)
- **Tied embeddings**: input and output embeddings are tied
- **MLP activation**: squared ReLU (ReLU²) in the MLP layers
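Squared ReLU is simple to state. A minimal framework-agnostic sketch (plain Python rather than the submission's actual training code):

```python
def relu_squared(x: float) -> float:
    """Squared ReLU: max(x, 0) ** 2, applied elementwise in the MLP.

    Zero for negative inputs like plain ReLU, but grows quadratically
    for positive inputs and has a continuous first derivative at 0.
    """
    return max(x, 0.0) ** 2


# Inside an MLP block this replaces the usual ReLU/GELU:
#   h = relu_squared(W1 @ x + b1);  y = W2 @ h + b2
```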
Sequence Length
- train_length: 256; eval_length: 256
LR Schedule
- **warmdown** (parameters: `{"warmdown_iters": 200}`)
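The card gives only `warmdown_iters: 200`; a common reading of "warmdown" is a constant learning rate followed by a linear decay to zero over the final iterations. A sketch under that assumption (the exact decay shape is not specified here):

```python
def warmdown_lr(step: int, total_steps: int, warmdown_iters: int = 200,
                base_lr: float = 1.0) -> float:
    """Constant LR, then linear decay to 0 over the last `warmdown_iters` steps."""
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    # Fraction of the warmdown remaining: 1.0 at decay_start, 0.0 at the end.
    remaining = (total_steps - step) / warmdown_iters
    return base_lr * remaining
```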
Compression
- custom **packed_zstd** (level: unspecified)
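The packed_zstd layout itself is not documented in the card. The general idea of packed serialization, pack weights into a compact fixed-width binary form and then compress the byte stream, can be sketched as follows; zlib stands in for zstd here (zstd is not in the Python standard library), and the float16 packing is an illustrative assumption:

```python
import struct
import zlib


def pack_params(params: list[float]) -> bytes:
    """Pack parameters as little-endian float16, then compress the bytes.

    Illustrative only: zlib substitutes for zstd, and the real
    submission's packed format is not specified in the card.
    """
    raw = struct.pack(f"<{len(params)}e", *params)  # "e" = IEEE 754 half-float
    return zlib.compress(raw, level=9)


def unpack_params(blob: bytes) -> list[float]:
    """Invert pack_params: decompress, then unpack float16 values."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 2}e", raw))
```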
Evaluation
- **stride-based eval** (parameters: `{"stride": 256}`)
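With the stride equal to the eval length (both 256 here), evaluation windows tile the token stream without overlap, so each token's loss is counted exactly once. A sketch of the window generation (the handling of a short tail, dropped below, is an assumption):

```python
def eval_windows(n_tokens: int, length: int = 256, stride: int = 256):
    """Yield (start, end) spans for stride-based evaluation.

    With stride == length the windows are non-overlapping and cover the
    stream left to right; a final fragment shorter than `length` is
    dropped in this sketch.
    """
    start = 0
    while start + length <= n_tokens:
        yield (start, start + length)
        start += stride
```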
Other
- Checkpoint frontier saving every 25 steps (parameters: `{"checkpoint_interval_steps": 25}`)
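"Checkpoint frontier" is not defined in the card. One plausible reading: every 25 steps, keep a checkpoint only if it improves on the best validation loss seen so far, so the saved checkpoints form a monotonically improving frontier. A hypothetical sketch under that interpretation:

```python
def frontier_saves(val_losses: dict[int, float], interval: int = 25) -> list[int]:
    """Return the steps at which a checkpoint would be saved.

    Hypothetical policy: at each multiple of `interval`, save only when
    validation loss beats the best seen so far, so the saved set is an
    improving frontier rather than every periodic checkpoint.
    """
    best = float("inf")
    saved = []
    for step in sorted(val_losses):
        if step % interval != 0:
            continue
        if val_losses[step] < best:
            best = val_losses[step]
            saved.append(step)
    return saved
```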
Novel Contributions
- Cross-layer self-attention (XSA) applied only in the final 4 layers of an 11-layer decoder-only transformer
- Squared ReLU (ReLU²) activation in MLP layers
- Training on a single H100 80GB GPU, with a ~16 MB submitted artifact
- Custom packed serialization with packed_zstd compression
- Checkpoint frontier saving every 25 steps
- Demonstration that artifact size, rather than raw BPB, is the main bottleneck