PR #1635

closed

Non-record: TWEO early-cosine outlier regularization on SP1024 baseline

by PapaFranku4647
val_bpb
1.1063
Architecture
Transformer
Optimizer
Artifact Size
~15.94 MB

Training Techniques

Quantization
mixed int6/int8
bits: 6
scope: model weights
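The PR does not include its GPTQ calibration path here; as a minimal sketch of what a mixed int6/int8 weight scheme does, assuming plain symmetric round-to-nearest quantization (not GPTQ's error-compensated updates) and a hypothetical per-tensor split between 6-bit and 8-bit:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    """Symmetric per-tensor round-to-nearest quantization to a signed grid."""
    qmax = 2 ** (bits - 1) - 1              # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax   # map the largest weight to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Hypothetical split: most tensors at 6 bits, sensitivity-critical ones
# at 8 bits. The PR does not specify how tensors are assigned.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q6, s6 = quantize_symmetric(w, bits=6)
q8, s8 = quantize_symmetric(w, bits=8)
err6 = float(np.abs(dequantize(q6, s6) - w).max())
err8 = float(np.abs(dequantize(q8, s8) - w).max())
```

With the same dynamic range, the 8-bit grid has a step roughly 4x finer than the 6-bit one, which is why only the least sensitive tensors would go to int6.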
Architecture
depth recurrence
Repeated a small middle portion of the stack, with recurrence enabled only mid-training and MLPs untied in the repeated block.
parameters: {"layers":[4,5],"num_layers":11,"start_step":3000,"untie_mlp":true}
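The recurrence schedule above can be sketched as a layer-order function, using the PR's recorded parameters (layers 4-5 repeated, 11 layers total, active from step 3000, MLPs untied in the repeat); the function name and the `"untied_mlp"` tag are hypothetical, not from the PR:

```python
def forward_layer_order(step: int, num_layers: int = 11,
                        recur: tuple = (4, 5),
                        start_step: int = 3000) -> list:
    """Return the sequence of layer indices executed at a training step.

    Before start_step the stack is a plain 0..num_layers-1 pass; from
    start_step on, the recurrent block is run a second time right after
    its first pass. untie_mlp means the repeated pass uses its own MLP
    weights, marked here as (idx, "untied_mlp") entries.
    """
    order = list(range(num_layers))
    if step >= start_step:
        lo, hi = recur
        repeat = [(i, "untied_mlp") for i in range(lo, hi + 1)]
        order = order[: hi + 1] + repeat + order[hi + 1 :]
    return order
```

Delaying activation to step 3000 means early training sees the plain 11-layer stack, and the extra depth (and the fresh untied MLP weights) only engage mid-run.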
parallel residuals
Split attention and MLP into separate residual lanes starting from a later layer, with learned routing between lanes.
parameters: {"start_layer":7}
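A minimal sketch of the two block layouts, contrasting the standard sequential residual with parallel lanes from `start_layer` 7 onward; the softmax routing over two scalar logits (`route_logits`) is an assumption about what "learned routing" means here, not the PR's actual parameterization:

```python
import numpy as np

def sequential_block(x, attn, mlp):
    """Standard residual block: attention, then MLP, in series."""
    x = x + attn(x)
    return x + mlp(x)

def parallel_block(x, attn, mlp, route_logits):
    """Parallel residual lanes: attention and MLP both read the same
    input; a learned softmax routing mixes the two lane outputs before
    the residual add. route_logits is a learned (2,) parameter."""
    w = np.exp(route_logits) / np.exp(route_logits).sum()
    return x + w[0] * attn(x) + w[1] * mlp(x)

# Toy lanes so the behavior is easy to check by hand.
attn = lambda x: 0.5 * x
mlp = lambda x: -0.5 * x
x = np.ones(4)
y_par = parallel_block(x, attn, mlp, np.zeros(2))  # equal routing
y_seq = sequential_block(x, attn, mlp)
```

With equal routing the toy lanes cancel (`y_par == x`), while the sequential block compounds them; the learned logits let training shift weight between the two lanes per block.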
Evaluation
sliding window eval
parameters: null
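The PR records no parameters for the sliding-window eval, so window and stride below are free arguments; this is a sketch of the usual scheme, where each token is scored exactly once but with extra left context carried over from the previous window:

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int):
    """Plan the forward passes for sliding-window evaluation.

    Returns (ctx_start, score_start, end) triples: the model sees
    tokens [ctx_start, end) but only [score_start, end) contribute to
    the loss, so every token is scored once with up to window - stride
    tokens of extra left context.
    """
    spans, scored = [], 0
    while scored < n_tokens:
        ctx_start = max(0, scored - (window - stride))
        end = min(ctx_start + window, n_tokens)
        spans.append((ctx_start, scored, end))
        scored = end
    return spans

# Example: 10 tokens, window 4, stride 2 (illustrative values only).
spans = sliding_window_spans(10, window=4, stride=2)
```

This gives tokens deep in the sequence more context than a naive chunked eval would, typically lowering measured bpb at the cost of more forward passes.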
Compression
brotli
level: null
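The compression level is recorded as null, so `quality=11` (brotli's maximum) below is an assumption; a sketch of checking a serialized artifact against the 16MB budget with the third-party `brotli` package:

```python
import brotli

LIMIT_BYTES = 16 * 1024 * 1024  # the 16MB artifact limit

# Stand-in bytes for the real quantized-weights artifact.
blob = b"quantized-weights " * 50_000
blob_c = brotli.compress(blob, quality=11)  # quality is assumed, PR says null
under_limit = len(blob_c) <= LIMIT_BYTES
```

Since the artifact is mostly low-bit integer weights, the achievable ratio depends heavily on how the quantized tensors are serialized before compression.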

Novel Contributions

  • Mixed-quantization GPTQ path with autoregressive self-generated calibration
  • Parallel residual lanes for attention and MLP starting at layer 7
  • Mini depth recurrence over layers 4 and 5
  • Delayed activation of recurrence mid-training
  • Untying repeated MLP weights in the recurrent block
  • Artifact kept under the 16MB limit