PR #1635
closed · Non-record: TWEO early-cosine outlier regularization on SP1024 baseline
by PapaFranku4647
val_bpb
1.1063
Architecture
Transformer
Optimizer
—
Artifact Size
~15.94 MB
Training Techniques
Quantization
mixed int6/int8
bits: 6
scope: model weights
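A rough sketch of what a mixed int6/int8 weight path can look like. The per-row symmetric fake-quantization and the "int8 for embedding-like tensors, int6 elsewhere" policy below are illustrative assumptions, not the PR's actual GPTQ implementation:

```python
# Minimal mixed-precision fake-quantization sketch (pure Python, no framework).
# int8_names is a hypothetical sensitivity policy, not the PR's actual rule.

def fake_quantize_row(row, bits):
    """Round one weight row onto a symmetric int grid, return dequantized floats."""
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 127 for int8
    scale = max(abs(x) for x in row) / qmax
    if scale == 0.0:
        return list(row)
    return [max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in row]

def quantize_weights(weights, int8_names=("embedding",)):
    """Quantize each named weight matrix: int8 for sensitive tensors, int6 elsewhere."""
    out = {}
    for name, matrix in weights.items():
        bits = 8 if any(tag in name for tag in int8_names) else 6
        out[name] = [fake_quantize_row(row, bits) for row in matrix]
    return out
```

A real GPTQ pass would additionally use calibration activations to compensate rounding error column by column; this sketch only shows the bit-width routing.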
Architecture
depth recurrence
Repeated a small middle portion of the stack, with recurrence enabled only mid-training and MLPs untied in the repeated block.
parameters: {"layers":[4,5],"num_layers":11,"start_step":3000,"untie_mlp":true}
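The parameters above imply roughly the following forward schedule: layers 4–5 of the 11-layer stack run a second time once training passes step 3000, with separate (untied) MLP weights on the repeated pass. A sketch with stubbed layers; `run_stack`, `blocks`, and `untied_mlp` are illustrative names, not the PR's code:

```python
# Depth-recurrence sketch: each block is a pair of callables (attn, mlp),
# applied as x = mlp(attn(x)). After the last recurrent layer, the recurrent
# span is replayed once, optionally swapping in untied MLPs.

def run_stack(x, blocks, step, recur=(4, 5), start_step=3000, untied_mlp=None):
    for i, (attn, mlp) in enumerate(blocks):
        x = mlp(attn(x))
        # Delayed activation: replay the span only after start_step.
        if step >= start_step and i == recur[-1]:
            for j in range(recur[0], recur[-1] + 1):
                attn_j, mlp_j = blocks[j]
                mlp_j = (untied_mlp or {}).get(j, mlp_j)  # untie_mlp: true
                x = mlp_j(attn_j(x))
    return x
```

With 11 blocks this gives an effective depth of 13 once recurrence switches on, at the parameter cost of only the untied MLP copies.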
parallel residuals
Split attention and MLP into separate residual lanes starting from a later layer, with learned routing between lanes.
parameters: {"start_layer":7}
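A minimal scalar sketch of parallel residual lanes with a learned mix, assuming (per the parameters above) lanes start at layer 7. The scalar `gate` stands in for a learned routing weight; in a real model it would be a trained parameter, possibly per-channel:

```python
# Parallel-lanes sketch: from start_layer onward, attention and MLP both read
# the same input and write to separate residual lanes, combined by a gate.

def parallel_block(x, attn, mlp, gate):
    """Run attention and MLP side by side and mix their residual lanes."""
    attn_lane = x + attn(x)
    mlp_lane = x + mlp(x)
    return gate * attn_lane + (1.0 - gate) * mlp_lane

def sequential_block(x, attn, mlp):
    """Standard ordering: the MLP consumes the attention output."""
    x = x + attn(x)
    return x + mlp(x)

def run(x, blocks, gates, start_layer=7):
    for i, (attn, mlp) in enumerate(blocks):
        if i >= start_layer:
            x = parallel_block(x, attn, mlp, gates[i])
        else:
            x = sequential_block(x, attn, mlp)
    return x
```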
Evaluation
sliding window eval
parameters: null
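Sliding-window evaluation scores a token stream longer than the context by sliding a fixed window with a stride and counting loss only on tokens not yet scored. A sketch; `nll_fn(context, n_new)` is a hypothetical stand-in for a model call returning summed nats over the last `n_new` tokens of the window:

```python
import math

# Strided sliding-window bpb sketch: each window re-reads some context for
# conditioning but only the previously unscored suffix contributes loss.

def sliding_window_bpb(tokens, nll_fn, window=1024, stride=512, bytes_per_token=1.0):
    total_nats, scored, prev_end, begin = 0.0, 0, 0, 0
    while prev_end < len(tokens):
        end = min(begin + window, len(tokens))
        context = tokens[begin:end]
        n_new = end - prev_end              # only score tokens not seen yet
        total_nats += nll_fn(context, n_new)
        scored += n_new
        prev_end = end
        begin += stride
    # bits per byte = (nats / ln 2) / bytes scored
    return total_nats / math.log(2) / (scored * bytes_per_token)
```

A smaller stride gives each scored token more conditioning context at the cost of more forward passes.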
Compression
brotli
level: null
Novel Contributions
- Mixed-quantization GPTQ path with autoregressive self-generated calibration
- Parallel residual lanes for attention and MLP starting at layer 7
- Mini depth recurrence over layers 4 and 5
- Delayed activation of recurrence mid-training
- Untying repeated MLP weights in the recurrent block
- Artifact kept under the 16 MB limit