PR #1635

closed

Non-record: TWEO early-cosine outlier regularization on SP1024 baseline

by PapaFranku4647
val_bpb
1.1063
Architecture
Transformer
Optimizer
Artifact Size
~15.94 MB

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: model weights
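The PR does not include its GPTQ calibration path here; as a minimal sketch of what a mixed int6/int8 weight scheme does, assuming plain symmetric round-to-nearest quantization (not GPTQ's error-compensated updates) and a hypothetical per-tensor split between 6-bit and 8-bit:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor round-to-nearest quantization to a signed grid."""
    qmax = 2 ** (bits - 1) - 1              # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax   # map the largest weight to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Hypothetical split: most tensors at 6 bits, sensitivity-critical ones
# at 8 bits. The PR does not specify how tensors are assigned.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q6, s6 = quantize_symmetric(w, bits=6)
q8, s8 = quantize_symmetric(w, bits=8)
err6 = float(np.abs(dequantize(q6, s6) - w).max())
err8 = float(np.abs(dequantize(q8, s8) - w).max())
```

With the same dynamic range, the 8-bit grid has a step roughly 4x finer than the 6-bit one, which is why only the least sensitive tensors would go to int6.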
Architecture
depth recurrence
Repeated a small middle portion of the stack, with recurrence enabled only mid-training and MLPs untied in the repeated block.
parameters: {"layers":[4,5],"num_layers":11,"start_step":3000,"untie_mlp":true}
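The recurrence schedule above can be sketched as a layer-order function, using the PR's recorded parameters (layers 4-5 repeated, 11 layers total, active from step 3000, MLPs untied in the repeat); the function name and the `"untied_mlp"` tag are hypothetical, not from the PR:

```python
def forward_layer_order(step: int, num_layers: int = 11,
                        recur: tuple = (4, 5),
                        start_step: int = 3000) -> list:
    """Return the sequence of layer indices executed at a training step.

    Before start_step the stack is a plain 0..num_layers-1 pass; from
    start_step on, the recurrent block is run a second time right after
    its first pass. untie_mlp means the repeated pass uses its own MLP
    weights, marked here as (idx, "untied_mlp") entries.
    """
    order = list(range(num_layers))
    if step >= start_step:
        lo, hi = recur
        repeat = [(i, "untied_mlp") for i in range(lo, hi + 1)]
        order = order[: hi + 1] + repeat + order[hi + 1 :]
    return order
```

Delaying activation to step 3000 means early training sees the plain 11-layer stack, and the extra depth (and the fresh untied MLP weights) only engage mid-run.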
parallel residuals
Split attention and MLP into separate residual lanes starting from a later layer, with learned routing between lanes.
parameters: {"start_layer":7}
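A minimal sketch of the two block layouts, contrasting the standard sequential residual with parallel lanes from `start_layer` 7 onward; the softmax routing over two scalar logits (`route_logits`) is an assumption about what "learned routing" means here, not the PR's actual parameterization:

```python
import numpy as np

def sequential_block(x, attn, mlp):
    """Standard residual block: attention, then MLP, in series."""
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp, route_logits):
    """Parallel residual lanes: attention and MLP both read the same
    input; a learned softmax routing mixes the two lane outputs before
    the residual add. route_logits is a learned (2,) parameter."""
    w = np.exp(route_logits) / np.exp(route_logits).sum()
    return x + w[0] * attn(x) + w[1] * mlp(x)

# Toy lanes so the behavior is easy to check by hand.
attn = lambda x: 0.5 * x
mlp = lambda x: -0.5 * x
x = np.ones(4)
y_par = parallel_block(x, attn, mlp, np.zeros(2))  # equal routing
y_seq = sequential_block(x, attn, mlp)
```

With equal routing the toy lanes cancel (`y_par == x`), while the sequential block compounds them; the learned logits let training shift weight between the two lanes per block.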
Evaluation
sliding window eval
parameters: null
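The PR records no parameters for the sliding-window eval, so window and stride below are free arguments; this is a sketch of the usual scheme, where each token is scored exactly once but with extra left context carried over from the previous window:

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int):
    """Plan the forward passes for sliding-window evaluation.

    Returns (ctx_start, score_start, end) triples: the model sees
    tokens [ctx_start, end) but only [score_start, end) contribute to
    the loss, so every token is scored once with up to window - stride
    tokens of extra left context.
    """
    spans, scored = [], 0
    while scored < n_tokens:
        ctx_start = max(0, scored - (window - stride))
        end = min(ctx_start + window, n_tokens)
        spans.append((ctx_start, scored, end))
        scored = end
    return spans

# Example: 10 tokens, window 4, stride 2 (illustrative values only).
spans = sliding_window_spans(10, window=4, stride=2)
```

This gives tokens deep in the sequence more context than a naive chunked eval would, typically lowering measured bpb at the cost of more forward passes.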
Compression
brotli
level: null
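The compression level is recorded as null, so `quality=11` (brotli's maximum) below is an assumption; a sketch of checking a serialized artifact against the 16MB budget with the third-party `brotli` package:

```python
import brotli

LIMIT_BYTES = 16 * 1024 * 1024  # the 16MB artifact limit

# Stand-in bytes for the real quantized-weights artifact.
blob = b"quantized-weights " * 50_000
blob_c = brotli.compress(blob, quality=11)  # quality is assumed, PR says null
under_limit = len(blob_c) <= LIMIT_BYTES
```

Since the artifact is mostly low-bit integer weights, the achievable ratio depends heavily on how the quantized tensors are serialized before compression.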

Novel Contributions

  • Mixed-quantization GPTQ path with autoregressive self-generated calibration
  • Parallel residual lanes for attention and MLP starting at layer 7
  • Mini depth recurrence over layers 4 and 5
  • Delayed activation of recurrence mid-training
  • Untying repeated MLP weights in the recurrent block
  • Artifact kept under the 16MB limit