PR #2055

open

Submission: RationalRaven — sliding64 mean 1.139957 (3-seed)

by twistloop
val_bpb: 1.1400
Architecture: Transformer
Optimizer: (not listed)
Artifact Size: 15,598,112 B

Training Techniques

Sequence Length
sequence_length
train_length: 4096
eval_length: null
Architecture
MLP3x
MLP widened to 3.25x
parameters: {"multiplier":3.25}
LeakyReLU
Uses squared LeakyReLU activation
parameters: {"power":2}
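
A minimal sketch of what a squared LeakyReLU with power 2 could look like, in plain Python for a single scalar. The negative slope of 0.5 and the function name are assumptions for illustration; the card only states the power. Note that squaring makes the output non-negative for negative inputs as well.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5, power: int = 2) -> float:
    """Apply LeakyReLU, then raise the result to `power` (2 per the card).

    negative_slope is a hypothetical value; the submission does not state it.
    """
    leaky = x if x >= 0.0 else negative_slope * x
    return leaky ** power


if __name__ == "__main__":
    for v in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(v, leaky_relu_squared(v))
```

In a real model this would be applied elementwise to the hidden activations of the widened MLP.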
weight tying
Tied input and output embeddings
parameters: null
Quantization
late QAT
bits: 8
scope: attn/KV
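
As a rough illustration of late int8 QAT, the sketch below fake-quantizes a list of weights with a symmetric per-tensor scale and only switches quantization on late in training. The per-tensor scheme, the 0.8 start fraction, and both function names are assumptions; the card states only 8 bits, the attn/KV scope, and that QAT is applied late.

```python
def fake_quantize_int8(values, enabled=True):
    """Round values to a symmetric int8 grid and map them back to floats.

    Assumed scheme: one scale per tensor, integer range [-127, 127].
    """
    if not enabled:
        return list(values)
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / 127.0
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # clamp to the int8 grid
        out.append(q * scale)                      # dequantize
    return out


def qat_enabled(step, total_steps, start_fraction=0.8):
    """Hypothetical 'late' schedule: quantize only after start_fraction of training."""
    return step / total_steps >= start_fraction


if __name__ == "__main__":
    weights = [1.0, -0.5, 0.25, 0.0]
    for step in (50, 90):
        print(step, fake_quantize_int8(weights, enabled=qat_enabled(step, 100)))
```

In an actual training loop the rounding step would need a straight-through gradient; that part is omitted here.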
Evaluation
sliding window eval
parameters: {"stride":64}
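
One common way to implement sliding window evaluation is sketched below: windows advance by the stride, and each window scores only the tokens not already scored by the previous one, so every token is counted once. The function name and the window length are assumptions (the card lists eval_length as null); only the stride of 64 comes from the submission.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (begin, end, score_from) spans.

    The model sees tokens [begin, end); only tokens in [score_from, end)
    contribute to the loss, so no token is scored twice.
    """
    spans = []
    prev_end = 0
    begin = 0
    while prev_end < num_tokens:
        end = min(begin + window, num_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        begin += stride
    return spans


if __name__ == "__main__":
    # Hypothetical window of 4096 tokens with the card's stride of 64.
    spans = sliding_windows(num_tokens=4096 + 3 * 64, window=4096, stride=64)
    for begin, end, score_from in spans:
        print(begin, end, score_from, end - score_from)
```

Under these assumptions, every window after the first scores 64 new tokens, each with at least 4032 tokens of preceding context.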

Novel Contributions

  • 3-seed locked submission with reported mean score
  • Single recipe combining sp4096 training, widened MLP, squared LeakyReLU, late int8 QAT for attention/KV, and tied embeddings
  • Submitted artifact is the seed 1339 run; its byte audit (15,598,112 B) is under the 16 MB cap
  • Uses sliding window evaluation with stride 64
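
The byte audit and the rounding of the reported score can be checked with a few lines of arithmetic. The sketch below assumes nothing about how the cap is defined and therefore tests both readings of "16 MB" (decimal and binary); the constant and function names are illustrative. Per-seed scores are not listed in the card, so only the reported mean is used.

```python
ARTIFACT_BYTES = 15_598_112   # artifact size from the card
REPORTED_MEAN = 1.139957      # 3-seed mean from the submission title

# Both interpretations of the cap; which one applies is an assumption.
CAPS = {
    "decimal_16MB": 16_000_000,
    "binary_16MiB": 16 * 1024 * 1024,
}


def headroom(artifact_bytes, cap_bytes):
    """Bytes left under the cap (negative means over the cap)."""
    return cap_bytes - artifact_bytes


if __name__ == "__main__":
    for name, cap in CAPS.items():
        left = headroom(ARTIFACT_BYTES, cap)
        print(name, "OK" if left >= 0 else "OVER", left)
    # The card's val_bpb of 1.1400 is the mean rounded to four decimals.
    print(f"{REPORTED_MEAN:.4f}")
```

The artifact fits under either reading, with 401,888 B of headroom against the decimal cap.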