PR #1882

open

add 10L hybrid SWA 2048 record

by jmattew
val_bpb: 1.2013
Architecture: Hybrid
Optimizer: AdamW
Artifact Size: 15,551,524 bytes

Training Techniques

Architecture
weight tying
Kept tied token embeddings in fp16 during int8 export to reduce quantization loss (see the export sketch under Quantization below).
parameters: null
sliding window attention
Hybrid attention with local sliding-window layers and periodic full-attention layers.
parameters: {"window":1024,"global_layer_stride":2}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: {"token_lr":0.07,"matrix_lr":0.06,"scalar_lr":0.05}
Regularization
weight decay
parameters: {"decoupled":true,"applied_to":"Muon matrix parameters"}
Weight Averaging
SWA
parameters: null
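
The Weight Averaging entry lists SWA (stochastic weight averaging) with no parameters. A standard PyTorch way to maintain the averaged checkpoint is torch.optim.swa_utils.AveragedModel; whether the submission uses this utility, when averaging starts, and how often it updates are all assumptions here.

```python
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(8, 8)        # stand-in for the 10-layer hybrid model
swa_model = AveragedModel(model)     # maintains a running average of the weights

for step in range(100):              # training-loop sketch
    # ... forward / backward / optimizer.step() would happen here ...
    if step >= 50:                   # when averaging starts is an assumption
        swa_model.update_parameters(model)

# swa_model holds the averaged weights used for the evaluated/exported checkpoint.
```
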
Quantization
int8
bits: 8
scope: model weights
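
A sketch of the export path implied by the Quantization and weight tying entries: model weights are quantized to int8 with a per-tensor symmetric scale, while the tied token embedding is passed through in fp16. The tensor-name check and the scale scheme are assumptions.

```python
import torch

def export_state_dict(model: torch.nn.Module) -> dict:
    """Quantize weights to int8 with a per-tensor symmetric scale, keeping the tied
    token embedding in fp16 (per the weight tying entry). Name check is illustrative."""
    out = {}
    for name, t in model.state_dict().items():
        if "embed" in name:                        # tied embedding: fp16 passthrough
            out[name] = t.half()
            continue
        scale = t.abs().max().clamp(min=1e-8) / 127.0
        q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
        out[name] = {"q": q, "scale": scale}       # restore later as q.float() * scale
    return out
```
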
Compression
zlib
level: null
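
A sketch of the final packaging step, assuming the state dict is serialized with torch.save and compressed with zlib at the default level (the record lists level: null), then checked against the 16MB artifact limit.

```python
import io
import zlib
import torch

def compress_artifact(state_dict: dict) -> bytes:
    buf = io.BytesIO()
    torch.save(state_dict, buf)             # serialize the quantized state dict
    blob = zlib.compress(buf.getvalue())    # level unspecified (null), so the default is used
    assert len(blob) <= 16 * 1024 * 1024, "over the 16MB artifact limit"
    return blob
```
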
LR Schedule
warmdown
parameters: {"warmdown_iters":2500}

Novel Contributions

  • 10-layer hybrid transformer with sliding-window attention
  • 2048-token training sequence length
  • Periodic full-attention layers every 2 blocks
  • FP16 tied embedding passthrough during int8 export
  • AdamW token/scalar parameters with decoupled Muon matrix weight decay
  • SWA-based record submission under the 16MB limit