val_bpb: 1.1400
Architecture: Transformer
Optimizer: —
Artifact Size: 15,598,112 B
Training Techniques

Sequence Length
- sequence_length: {"train_length": 4096, "eval_length": null}
Architecture
- MLP3x: MLP widened to 3.25x (parameters: {"multiplier": 3.25})
- LeakyReLU: squared LeakyReLU activation (parameters: {"power": 2})
- weight tying: tied input and output embeddings (parameters: null)
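The squared LeakyReLU entry above can be read as a LeakyReLU followed by an elementwise power, with the listed power of 2. A minimal sketch of one plausible reading, assuming the activation is simply `leaky_relu(x)` raised to that power; the negative slope value is not given in the card, so the default below is hypothetical:

```python
def squared_leaky_relu(x: float, slope: float = 0.01, power: int = 2) -> float:
    """LeakyReLU followed by an elementwise power (power=2 per the card).

    Note: `slope` is an assumed default, and squaring makes the negative
    branch non-negative; the submission may handle the sign differently.
    """
    y = x if x > 0 else slope * x
    return y ** power
```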
Quantization
- late QAT: int8 quantization-aware training (bits: 8, scope: attn/KV)
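Late QAT means quantization-aware training is enabled late in the run: the forward pass simulates int8 rounding so the attention/KV weights adapt before export. A minimal fake-quantization sketch for a single value, assuming symmetric per-tensor int8 with a given scale (the card does not specify the scale or calibration scheme):

```python
def fake_quant_int8(x: float, scale: float) -> float:
    """Simulate int8 quantization: snap x to the nearest representable
    level q * scale with q clamped to [-128, 127], then dequantize."""
    q = max(-128, min(127, round(x / scale)))
    return q * scale
```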
Evaluation
- sliding window eval (parameters: {"stride": 64})
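Sliding-window evaluation scores a long sequence with overlapping context windows: each step advances by the stride and scores only the tokens not yet covered, so every token is scored once with as much left context as the window allows. A minimal span-generation sketch; the card fixes only stride=64 (and eval_length is null above), so the window size here is a hypothetical default:

```python
def sliding_window_spans(n_tokens: int, window: int = 4096, stride: int = 64):
    """Return (begin, end, first_scored) triples: the model sees tokens
    [begin, end) as context, and only tokens [first_scored, end) are
    scored, so the scored ranges tile the sequence exactly once."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```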
Novel Contributions
- 3-seed locked submission with reported mean score
- Single recipe combining sp4096 training, widened MLP, squared LeakyReLU, late int8 QAT for attention/KV, and tied embeddings
- Submitted artifact is the seed 1339 run with byte audit under the 16 MB cap
- Uses sliding window evaluation with stride 64