PR #110
Submission: Top-Heavy FFN Allocation + Packed Int6 Export (open, pending eval)
by mr-ashish-panday
val_bpb
1.2244
Architecture
Transformer
Optimizer
Muon
Artifact Size
4,273,390 bytes
Training Techniques
Architecture
MLP3x
Replaces uniform FFN width with OpenELM-style layer-wise top-heavy FFN scaling so later layers have larger feed-forward dimensions than earlier layers.
parameters: {"layers":9,"ffn_schedule":[768,960,1152,1344,1536,1728,1920,2112,2304]}
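The schedule above is a linear ramp from 768 to 2304 across the 9 layers. A minimal sketch of how such a top-heavy schedule could be generated (the function name and derivation are illustrative, not the submission's actual code):

```python
def ffn_schedule(n_layers: int, d_min: int, d_max: int) -> list:
    """Linearly interpolate FFN widths so later layers are wider
    than earlier ones (top-heavy allocation)."""
    step = (d_max - d_min) // (n_layers - 1)
    return [d_min + i * step for i in range(n_layers)]

print(ffn_schedule(9, 768, 2304))
# -> [768, 960, 1152, 1344, 1536, 1728, 1920, 2112, 2304]
```

This reproduces the submission's `ffn_schedule` exactly, so the per-layer widths reduce to three numbers (layer count, min width, max width).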
tied embeddings
Uses tied input embedding and output projection weights.
parameters: null
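Weight tying means the input embedding and the output projection are literally one tensor, so the artifact stores it once. A toy sketch of the mechanism, assuming a plain dict-of-lists model rather than the submission's actual classes:

```python
class TinyModel:
    """Illustrative model with tied input/output embeddings."""

    def __init__(self, vocab: int, d_model: int):
        # Input embedding table: one row per vocabulary item.
        self.embed = [[0.0] * d_model for _ in range(vocab)]
        # Tied output projection: the SAME object, so it is stored
        # (and quantized or kept in fp16) exactly once.
        self.lm_head = self.embed


m = TinyModel(vocab=4, d_model=2)
m.embed[0][0] = 1.5
assert m.lm_head is m.embed        # one shared parameter, not a copy
assert m.lm_head[0][0] == 1.5      # an update to one view hits both
```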
Quantization
int6
bits: 6
scope: large 2D matrices; fp16 for tied embedding
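A hedged sketch of what a packed int6 export with per-row fp16 scales could look like: each row is quantized to signed 6-bit integers in [-31, 31] with one fp16 scale per row, and four 6-bit values are packed into three bytes. All function names here are illustrative assumptions, not the submission's API; only the int6-with-fp16-scales scheme comes from the listing above.

```python
import struct

def quantize_row(row):
    """Quantize one weight row to int6 with a per-row fp16 scale."""
    amax = max(abs(v) for v in row) or 1.0
    scale = amax / 31.0
    q = [max(-31, min(31, round(v / scale))) for v in row]
    return struct.pack("<e", scale), q  # "<e" = little-endian fp16

def pack_int6(q):
    """Pack 6-bit two's-complement values, four values per 3 bytes."""
    out = bytearray()
    for i in range(0, len(q), 4):
        chunk = [v & 0x3F for v in q[i:i + 4]]
        chunk += [0] * (4 - len(chunk))  # zero-pad the final group
        bits = chunk[0] | (chunk[1] << 6) | (chunk[2] << 12) | (chunk[3] << 18)
        out += bits.to_bytes(3, "little")
    return bytes(out)

def unpack_int6(data, n):
    """Inverse of pack_int6: recover the first n signed 6-bit values."""
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "little")
        for shift in (0, 6, 12, 18):
            v = (bits >> shift) & 0x3F
            vals.append(v - 64 if v >= 32 else v)  # sign-extend
    return vals[:n]
```

The pack/unpack pair round-trips the quantized integers exactly, which is presumably what "exact packed int6 export" refers to: lossiness happens only at the quantization step, never in the packing.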
Compression
zlib
level: null
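Using stdlib `zlib` keeps the artifact self-contained: the evaluator can decompress with no third-party dependency such as zstd. A minimal sketch (the compression level is an assumption; the submission does not state one):

```python
import zlib

# Stand-in for the packed int6 weight bytes produced at export time.
payload = b"\x00\x01\x02" * 1000

# Compress once at export; level 9 is an assumed (maximum) setting.
blob = zlib.compress(payload, level=9)

# At evaluation time, stdlib zlib restores the exact bytes.
assert zlib.decompress(blob) == payload
assert len(blob) < len(payload)
```
<imports>
import zlib
</imports>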
Evaluation
sliding window eval
parameters: {"stride":64}
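In a sliding-window evaluation, the token stream is scored in overlapping windows that advance by `stride` tokens, and each window only counts loss on the tokens not already scored, so most tokens are conditioned on a long left context. A sketch of the window arithmetic, assuming a block size parameter the listing does not specify (only the stride of 64 is given):

```python
def sliding_window_spans(n_tokens: int, block_size: int, stride: int):
    """Yield (begin, end, n_scored) spans covering [0, n_tokens):
    windows advance by `stride`; `n_scored` counts only tokens not
    scored by an earlier window."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + block_size, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, but after the first window each scored token sees up to `block_size - stride` tokens of extra context compared with non-overlapping chunking.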
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"higher_momentum":true,"lower_lr":true,"warmdown":true,"gradient_clipping":true}
LR Schedule
warmdown
parameters: null
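A "warmdown" schedule typically holds the peak learning rate for most of training, then decays linearly to zero over a final fraction of steps. Since the submission lists no parameters, the split point below is an assumption for illustration:

```python
def warmdown_lr(step: int, total_steps: int, peak_lr: float,
                warmdown_frac: float = 0.2) -> float:
    """Hold peak_lr, then decay linearly to 0 over the last
    warmdown_frac of training. warmdown_frac=0.2 is an assumed value."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - start)
```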
Other
other
CPU dry-run mode for local smoke testing without CUDA.
parameters: {"dry_run":true,"steps":10}
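A dry-run switch of this kind is usually an environment-variable gate that shrinks the run onto CPU. A sketch assuming the `DRY_RUN=1` convention named in the contributions list; the non-dry-run step count is purely illustrative:

```python
def run_config(env: dict) -> dict:
    """Pick device and step count from the environment: DRY_RUN=1
    runs 10 steps on CPU so the pipeline can be smoke-tested
    without CUDA. 5000 full-run steps is an illustrative default."""
    dry = env.get("DRY_RUN") == "1"
    return {
        "device": "cpu" if dry else "cuda",
        "steps": 10 if dry else 5000,
    }
```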
Novel Contributions
- Top-heavy FFN allocation using OpenELM-style layer-wise scaling instead of a uniform 3x FFN.
- Exact packed int6 export path with per-row fp16 scales.
- Keeping the tied embedding in fp16 to preserve quantization-sensitive weights.
- Self-contained artifact export that avoids relying on external zstd at evaluation time.
- Sliding-window evaluation (stride 64) so each scored token is conditioned on more left context.
- CPU DRY_RUN=1 mode for local verification without GPU access.