PR #1144

open

Add PartialRoPE 16/64 experiment records

val_bpb: 1.3572
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.47 MB

Training Techniques

Architecture
Partial RoPE
Applied rotary position embeddings to only a subset of attention head dimensions, leaving the rest position-free.
parameters: {"dimensions":16,"total_dimensions":64,"fraction":0.25}
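A minimal sketch of the idea, using the recorded 16-of-64 split; the function name and the rotate-half pairing convention are illustrative, not taken from the PR's code:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply RoPE to only the first `rot_dims` of each head's dimensions,
    passing the remaining dimensions through position-free.

    x: (seq_len, head_dim) activations for one head. Hypothetical helper;
    the pairing of dimensions follows the common rotate-half convention.
    """
    seq, _ = x.shape
    half = rot_dims // 2
    pos = np.arange(seq)[:, None]                    # (seq, 1)
    inv_freq = base ** (-np.arange(half) / half)     # (half,)
    ang = pos * inv_freq                             # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

The rotation is norm-preserving on the rotated block, and position 0 is the identity, so only relative position information is injected into those 16 dimensions.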
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
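A sketch of the KV-head sharing with the recorded 8/4 split; shapes and the softmax here are illustrative, not the repo's attention kernel:

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads
    query heads. q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Illustrative sketch (no causal mask)."""
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)   # share each KV head across its query group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Halving the KV heads halves the KV-cache and the K/V projection parameters while leaving the number of query heads unchanged.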
SmearGate
Included SmearGate in the model architecture.
parameters: null
BigramHash
Used BigramHashEmbedding for token representation.
parameters: {"bigrams":10240,"dim":128}
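A sketch of the lookup, using the recorded table size of 10240 and dim 128; the hash mixing constant and padding choice are illustrative assumptions, not the PR's implementation:

```python
import numpy as np

def bigram_hash_embed(tokens, table, n_bigrams=10240):
    """Look up a hashed-bigram embedding per position: hash the
    (prev_token, token) pair into a fixed table of n_bigrams rows.
    The multiplier 1000003 and zero-padding at position 0 are illustrative."""
    prev = np.concatenate([[0], tokens[:-1]])       # pad the first position
    idx = (prev * 1000003 + tokens) % n_bigrams     # cheap bigram hash
    return table[idx]                               # (seq, dim)
```

Identical bigrams always hash to the same row, so the table learns cheap local-context features at a fraction of the cost of a true bigram vocabulary.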
MLP3x
Used a 3.0x MLP expansion.
parameters: {"multiplier":3}
ReLU²
Used ReLU squared activation in the MLP.
parameters: null
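The two MLP choices above combine into one block; a minimal sketch with the recorded 3x expansion (bias-free, as is common in this family of models, though the PR does not say):

```python
import numpy as np

def mlp_relu2(x, w_in, w_out):
    """MLP block with ReLU-squared activation.
    w_in: (d, 3*d) for the 3x expansion; w_out: (3*d, d). Sketch only."""
    h = x @ w_in
    h = np.maximum(h, 0.0) ** 2   # ReLU²: zero and smooth at the origin
    return h @ w_out
```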
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Weight Averaging
SWA
parameters: {"start_phase":"warmdown","checkpoint_interval":25}
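The averaging step can be sketched as a running mean over checkpoints, invoked every `checkpoint_interval` (25) steps once the warmdown phase begins; this plain-dict version is illustrative, whereas the actual run would average model state dicts:

```python
def swa_update(avg_state, new_state, n_averaged):
    """Update a running average of checkpoint weights (SWA).
    avg_state: dict of averaged weights; n_averaged: checkpoints so far."""
    for name, w in new_state.items():
        avg_state[name] = (avg_state[name] * n_averaged + w) / (n_averaged + 1)
    return avg_state, n_averaged + 1
```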
LR Schedule
warmdown
parameters: {"warmdown_steps":1000}
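The schedule shape implied by the record: constant LR, then a linear decay to zero over the final 1000 steps. The base LR and total step count here are placeholders, not values from the run:

```python
def lr_at(step, total_steps, base_lr, warmdown_steps=1000):
    """Constant LR until warmdown, then linear decay to zero.
    base_lr and total_steps are hypothetical; warmdown_steps is from the record."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```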
Regularization
magnitude pruning
parameters: {"sparsity":0.1}
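A sketch of the pruning step at the recorded sparsity of 0.1; whether the real run thresholds per tensor or globally across the model is not specified, so this assumes a per-tensor threshold:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.1):
    """Zero out the smallest-magnitude `sparsity` fraction of a weight tensor.
    Per-tensor threshold; ties at the threshold may prune slightly more."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```

Pruning before quantization concentrates the remaining mass on larger weights, which also compresses better under the zstd stage below.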
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"embeddings_scalars_optimizer":"AdamW"}
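The core of a Muon step is an approximate orthogonalization of the gradient matrix via a Newton-Schulz iteration; a sketch using the widely used quintic coefficients (the record confirms only that Muon is used, with AdamW for embeddings and scalars):

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a 2D gradient, the core Muon transform.
    Quintic coefficients are the commonly published ones; illustrative only."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # bound singular values by 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x.T if transposed else x
```

After a few iterations the singular values are driven toward 1, so the update direction is roughly orthogonal regardless of the gradient's conditioning.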
Quantization
mixed int5/int6
bits: null
scope: MLP and attention weights
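A sketch of symmetric round-to-nearest quantization at a given bit width (e.g. 5 bits for some tensors, 6 for others); the per-tensor scale and the assignment of bit widths to tensors are assumptions, since the record leaves them unspecified:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric round-to-nearest quantization to signed `bits`-bit integers.
    Per-tensor scale; int5/int6 values are stored in an int8 container here."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the per-weight reconstruction error by half the scale.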

Novel Contributions

  • Partial RoPE applied to 16 of 64 attention dimensions
  • Longer training sequence length of 1024
  • Reduced tokens per batch to fit more optimization steps within the wall-clock limit
  • Tuned the warmdown schedule and SWA checkpointing for this run
  • Aggressive magnitude pruning before quantization
  • Reproducibility records with three seeds