PR #1015 (open)
Parameter Golf submission: Vocab768_LinearPhaseInit_GatedXSA_EMA_…
by shram86
val_bpb: 1.2115
Architecture: Transformer
Optimizer: —
Artifact Size: 15,082,805 bytes
Training Techniques
Architecture: XSA
Gated XSA applied to the last 2 layers of the Transformer.
parameters: {"layers":2,"mode":"gated"}
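The PR does not define XSA, but the "gated" mode suggests the common pattern of a learned sigmoid gate that blends the attention sublayer's output with its input instead of a plain additive residual. A minimal sketch of that gating pattern (all names and shapes here are illustrative, not taken from the submission):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_output(x, attn_out, gate_w, gate_b):
    """Blend an attention sublayer's output with its input via a learned gate.

    g = sigmoid(x @ gate_w + gate_b) is computed per position and channel;
    the sublayer returns g * attn_out + (1 - g) * x in place of the usual
    x + attn_out residual, letting the model modulate how much attention
    output enters the stream.
    """
    g = sigmoid(x @ gate_w + gate_b)
    return g * attn_out + (1.0 - g) * x

# Toy shapes: 4 positions, model dim 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
attn_out = rng.standard_normal((4, 8))
gate_w = rng.standard_normal((8, 8)) * 0.02
gate_b = np.zeros(8)
y = gated_attention_output(x, attn_out, gate_w, gate_b)
print(y.shape)  # (4, 8)
```

Because the gate is in (0, 1), each output element is a convex combination of the input and the attention output.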
Value Residual
Value bias/residual applied to the last 2 layers.
parameters: {"layers":2,"dimension":128}
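Value residual typically means mixing a later layer's value vectors with those of an earlier (often the first) layer, so late layers keep direct access to early-layer values. A sketch under that assumption; the fixed mixing weight and the use of the first layer's values are illustrative, not confirmed by the PR:

```python
import numpy as np

def value_residual(v_layer, v_first, lam=0.5):
    """Mix the current layer's value vectors with the first layer's.

    v' = lam * v_layer + (1 - lam) * v_first. In practice lam may be a
    learned scalar per layer; a fixed 0.5 is used here for illustration.
    """
    return lam * v_layer + (1.0 - lam) * v_first

rng = np.random.default_rng(1)
v_first = rng.standard_normal((4, 128))  # dimension 128 as in the PR params
v_layer = rng.standard_normal((4, 128))
v_mixed = value_residual(v_layer, v_first)
print(v_mixed.shape)  # (4, 128)
```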
Quantization: late QAT (quantization-aware training)
bits: null
scope: matrix_only
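QAT usually means running the forward pass through fake-quantized weights so training absorbs the quantization error, and "matrix_only" scope suggests quantizing only 2-D weight matrices while leaving biases and norm gains in full precision. A sketch of that scheme; the PR leaves bits null, so the 8-bit width below is an assumed placeholder:

```python
import numpy as np

def fake_quantize(w, bits=8):
    """Symmetric per-tensor fake quantization: round to a signed integer
    grid, then dequantize, so the forward pass sees quantization error
    (gradients would flow via a straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    if scale == 0:
        return w
    return np.round(w / scale) * scale

def qat_forward_params(params, bits=8):
    """matrix_only scope: fake-quantize 2-D weight matrices only; vectors
    (biases, norm gains) stay in full precision."""
    return {name: fake_quantize(w, bits) if w.ndim == 2 else w
            for name, w in params.items()}

rng = np.random.default_rng(2)
params = {"w_out": rng.standard_normal((8, 8)), "bias": rng.standard_normal(8)}
q = qat_forward_params(params)
print(np.allclose(q["bias"], params["bias"]))  # True: vectors untouched
```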
Weight Averaging: EMA (exponential moving average)
parameters: {"start_step":9094}
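An EMA of the weights maintains a shadow copy updated as a decayed average of the training weights; starting it late (here at step 9094) avoids averaging in early, noisy checkpoints. A minimal sketch; the decay value and loop structure are illustrative:

```python
def ema_update(shadow, params, decay=0.999):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params.
    The shadow copy, not the raw weights, is used for evaluation."""
    return {k: decay * shadow[k] + (1.0 - decay) * params[k]
            for k in params}

START_STEP = 9094  # start_step from the PR parameters
shadow = None
params = {"w": 1.0}
for step in range(9090, 9100):
    params["w"] += 0.01            # stand-in for an optimizer update
    if step == START_STEP:
        shadow = dict(params)      # initialize shadow from current weights
    elif step > START_STEP:
        shadow = ema_update(shadow, params)
print(shadow is not None)  # True
```

After the start step the shadow weights trail the raw weights, smoothing out step-to-step noise.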
Initialization: phase-mix init
Linear phase-mix initialization.
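The PR does not document what linear phase-mix initialization does. One purely speculative reading, sketched here only to make the term concrete, is an embedding init where each row is built from sinusoids whose phase varies linearly with the token id, plus a small random component. Every detail below is an assumption, not the submission's actual initializer:

```python
import numpy as np

def phase_mix_init(vocab_size, dim, seed=0):
    """Speculative sketch of a 'linear phase-mix' embedding init:
    sinusoids with linearly spaced phases per token id, lightly mixed
    with Gaussian noise. Not taken from the PR's code."""
    rng = np.random.default_rng(seed)
    phases = np.linspace(0.0, 2 * np.pi, vocab_size, endpoint=False)
    freqs = np.arange(1, dim + 1)
    base = np.sin(np.outer(phases, freqs))           # (vocab, dim)
    noise = rng.standard_normal((vocab_size, dim)) * 0.02
    return base / np.sqrt(dim) + noise

emb = phase_mix_init(768, 128)  # vocab size 768 as in the PR title
print(emb.shape)  # (768, 128)
```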
Sequence Length
train_length: 1024
eval_length: null
Other
FlashAttention 3 backend used for attention computation.
parameters: {"backend":"flash_attn_3"}
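FlashAttention 3 is a fused kernel that computes exact scaled dot-product attention without materializing the full score matrix. For reference, the mathematical result it produces (this numpy version is only the spec, not the kernel):

```python
import numpy as np

def attention_reference(q, k, v, causal=True):
    """Reference scaled dot-product attention for one head.
    FlashAttention 3 computes the same result in a tiled, fused kernel
    that never builds the (T, T) score matrix in memory."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if causal:
        mask = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(3)
q, k, v = (rng.standard_normal((5, 16)) for _ in range(3))
out = attention_reference(q, k, v)
print(out.shape)  # (5, 16)
```

With causal masking, position 0 attends only to itself, so the first output row equals the first value row exactly.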
Novel Contributions
- Custom sp768 tokenizer export with vocab size 768
- Linear phase-mix initialization
- Gated XSA on the last 2 layers
- EMA during late training
- Late matrix-only QAT
- FlashAttention 3 backend
- Tokenizer and dataset export published to Hugging Face and loaded via patched manifest-driven loader