val_bpb: 1.3572
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.47 MB
Training Techniques
Architecture
Partial RoPE
Applied rotary position embeddings to only a subset of attention head dimensions, leaving the rest position-free.
parameters: {"dimensions":16,"total_dimensions":64,"fraction":0.25}
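A minimal NumPy sketch of partial RoPE as described above: only the first 16 of 64 head dimensions are rotated, and the rest pass through position-free. The rotation layout and frequency base are the common RoPE conventions, assumed rather than taken from the run.

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` dims of each head.

    x: (seq_len, head_dim) array for one attention head.
    """
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    pos = np.arange(seq_len)[:, None]            # (seq, 1)
    freqs = base ** (-np.arange(half) / half)    # (half,)
    angles = pos * freqs                         # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Remaining head_dim - rot_dims dimensions stay position-free.
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

Note that position 0 is left unchanged (zero rotation angle), which makes the untouched-dimension behavior easy to verify.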
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
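A sketch of the GQA sharing pattern with the 8-query-head / 4-KV-head split from the parameters above: each KV head is broadcast to its group of query heads before standard scaled dot-product attention. Shapes and the softmax details are illustrative.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped query attention.

    q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d), n_kv_heads | n_heads.
    """
    group = q.shape[0] // k.shape[0]
    # Each KV head serves a contiguous group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```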
SmearGate
Included a SmearGate module in the model architecture.
parameters: null
BigramHash
Used BigramHashEmbedding for token representation.
parameters: {"bigrams":10240,"dim":128}
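One plausible reading of a bigram-hash embedding, sized to the 10240-entry, 128-dim table above: each position hashes its (previous, current) token pair into a fixed table. The specific hash function and how the result is mixed with the unigram embedding are assumptions, not the run's exact scheme.

```python
import numpy as np

def bigram_hash_embed(token_ids, table):
    """Look up an embedding per position keyed by the (prev, cur) bigram.

    table: (n_bigrams, dim) array, e.g. (10240, 128).
    """
    n_bigrams = table.shape[0]
    prev = np.concatenate([[0], token_ids[:-1]])    # pad the first position
    idx = (prev * 1000003 + token_ids) % n_bigrams  # illustrative hash
    return table[idx]
```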
MLP3x
Expanded the MLP hidden dimension to 3x the model dimension (vs. the conventional 4x).
parameters: {"multiplier":3}
ReLU²
Used ReLU squared activation in the MLP.
parameters: null
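The two MLP entries above combine naturally into one block: a 3x hidden expansion with a ReLU-squared activation. A minimal sketch (weight scales and the small `d_model` are illustrative):

```python
import numpy as np

def mlp_relu2(x, w_in, w_out):
    """MLP block: hidden width 3x the model width, ReLU-squared activation."""
    h = np.maximum(x @ w_in, 0.0) ** 2   # relu(x)^2
    return h @ w_out

d_model = 8
w_in = np.random.randn(d_model, 3 * d_model) * 0.1   # 3x expansion
w_out = np.random.randn(3 * d_model, d_model) * 0.1
```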
Sequence Length
train_length: 1024
eval_length: null
Weight Averaging
SWA
parameters: {"start_phase":"warmdown","checkpoint_interval":25}
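A sketch of the SWA setup above: a running average of the weights, updated every 25 steps once the warmdown phase begins. Whether the run used a running mean or averaged saved checkpoints after the fact is an assumption; the incremental-mean form below is the common choice.

```python
import numpy as np  # params below are dicts of NumPy arrays

class SWAAverager:
    """Running average of model weights during warmdown, every `interval` steps."""

    def __init__(self, interval=25):
        self.interval = interval
        self.avg = None
        self.count = 0

    def maybe_update(self, step, params, in_warmdown):
        if not in_warmdown or step % self.interval != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in params.items()}
        else:
            for k in self.avg:
                # Incremental mean: avg += (x - avg) / n
                self.avg[k] += (params[k] - self.avg[k]) / self.count
```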
LR Schedule
warmdown
parameters: {"warmdown_steps":1000}
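The warmdown schedule above admits a simple reading: hold the base learning rate constant, then decay linearly to zero over the final 1000 steps. The linear shape and the decay-to-zero endpoint are assumptions.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=1000):
    """Constant LR, then linear decay to zero over the last warmdown_steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```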
Regularization
magnitude pruning
parameters: {"sparsity":0.1}
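Magnitude pruning at 10% sparsity, as listed above, zeroes the smallest-magnitude 10% of weights. A minimal unstructured, per-tensor sketch (the run's granularity, e.g. per-layer vs. global, is not specified):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.1):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```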
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"embeddings_scalars_optimizer":"AdamW"}
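Muon's core step orthogonalizes each 2-D gradient (or momentum) matrix via a Newton–Schulz iteration before the update; per the record above, embeddings and scalars fall back to AdamW. A sketch using the commonly published quintic coefficients, which are an assumption about this run, not confirmed values:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # common quintic coefficients (assumed)
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # work with a wide matrix
    for _ in range(steps):
        aat = x @ x.T
        x = a * x + (b * aat + c * aat @ aat) @ x
    return x.T if transposed else x
```

After a few iterations the singular values cluster near 1, so the update direction depends on the gradient's row/column spaces rather than its magnitudes.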
Quantization
mixed int5/int6
bits: null
scope: MLP and attention weights
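A generic symmetric quantizer sketch for the entry above. The run's exact int5/int6 scheme, and which layers get 5 vs. 6 bits, are unspecified; this shows the standard per-tensor round-to-nearest form at a given bit width.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to `bits` bits (generic sketch)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```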
Novel Contributions
- Partial RoPE applied to 16 of 64 attention dimensions
- Longer training sequence length of 1024
- Reduced tokens per batch to fit more optimization steps under the wallclock limit
- Warmdown and SWA tuning for the run
- Aggressive magnitude pruning before quantization
- Reproducibility records with three seeds