PR #110
Submission: Top-Heavy FFN Allocation + Packed Int6 Export (open, pending eval)
by mr-ashish-panday
val_bpb
1.2244
Architecture
Transformer
Optimizer
Muon
Artifact Size
4,273,390 bytes
Training Techniques
Architecture
MLP3x
Replaces uniform FFN width with OpenELM-style layer-wise top-heavy FFN scaling so later layers have larger feed-forward dimensions than earlier layers.
parameters: {"layers":9,"ffn_schedule":[768,960,1152,1344,1536,1728,1920,2112,2304]}
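The schedule above is a linear ramp from 768 to 2304 across the 9 layers. A minimal sketch of how such a top-heavy schedule could be generated (the function name and derivation are illustrative, not the submission's actual code):

```python
def ffn_schedule(n_layers: int, d_min: int, d_max: int) -> list:
    """Linearly interpolate FFN widths so later layers are wider
    than earlier ones (top-heavy allocation)."""
    step = (d_max - d_min) // (n_layers - 1)
    return [d_min + i * step for i in range(n_layers)]

print(ffn_schedule(9, 768, 2304))
# -> [768, 960, 1152, 1344, 1536, 1728, 1920, 2112, 2304]
```

This reproduces the submission's `ffn_schedule` exactly, so the per-layer widths reduce to three numbers (layer count, min width, max width).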
tied embeddings
Uses tied input embedding and output projection weights.
parameters: null
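Weight tying means the input embedding and the output projection are literally one tensor, so the artifact stores it once. A toy sketch of the mechanism, assuming a plain dict-of-lists model rather than the submission's actual classes:

```python
class TinyModel:
    """Illustrative model with tied input/output embeddings."""

    def __init__(self, vocab: int, d_model: int):
        # Input embedding table: one row per vocabulary item.
        self.embed = [[0.0] * d_model for _ in range(vocab)]
        # Tied output projection: the SAME object, so it is stored
        # (and quantized or kept in fp16) exactly once.
        self.lm_head = self.embed


m = TinyModel(vocab=4, d_model=2)
m.embed[0][0] = 1.5
assert m.lm_head is m.embed        # one shared parameter, not a copy
assert m.lm_head[0][0] == 1.5      # an update to one view hits both
```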
Quantization
int6
bits: 6
scope: large 2D matrices; fp16 for tied embedding
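A hedged sketch of what a packed int6 export with per-row fp16 scales could look like: each row is quantized to signed 6-bit integers in [-31, 31] with one fp16 scale per row, and four 6-bit values are packed into three bytes. All function names here are illustrative assumptions, not the submission's API; only the int6-with-fp16-scales scheme comes from the listing above.

```python
import struct

def quantize_row(row):
    """Quantize one weight row to int6 with a per-row fp16 scale."""
    amax = max(abs(v) for v in row) or 1.0
    scale = amax / 31.0
    q = [max(-31, min(31, round(v / scale))) for v in row]
    return struct.pack("<e", scale), q  # "<e" = little-endian fp16

def pack_int6(q):
    """Pack 6-bit two's-complement values, four values per 3 bytes."""
    out = bytearray()
    for i in range(0, len(q), 4):
        chunk = [v & 0x3F for v in q[i:i + 4]]
        chunk += [0] * (4 - len(chunk))  # zero-pad the final group
        bits = chunk[0] | (chunk[1] << 6) | (chunk[2] << 12) | (chunk[3] << 18)
        out += bits.to_bytes(3, "little")
    return bytes(out)

def unpack_int6(data, n):
    """Inverse of pack_int6: recover the first n signed 6-bit values."""
    vals = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "little")
        for shift in (0, 6, 12, 18):
            v = (bits >> shift) & 0x3F
            vals.append(v - 64 if v >= 32 else v)  # sign-extend
    return vals[:n]
```

The pack/unpack pair round-trips the quantized integers exactly, which is presumably what "exact packed int6 export" refers to: lossiness happens only at the quantization step, never in the packing.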
Compression
zlib
level: null
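Using stdlib `zlib` keeps the artifact self-contained: the evaluator can decompress with no third-party dependency such as zstd. A minimal sketch (the compression level is an assumption; the submission does not state one):

```python
import zlib

# Stand-in for the packed int6 weight bytes produced at export time.
payload = b"\x00\x01\x02" * 1000

# Compress once at export; level 9 is an assumed (maximum) setting.
blob = zlib.compress(payload, level=9)

# At evaluation time, stdlib zlib restores the exact bytes.
assert zlib.decompress(blob) == payload
assert len(blob) < len(payload)
```
<imports>
import zlib
</imports>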
Evaluation
sliding window eval
parameters: {"stride":64}
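In a sliding-window evaluation, the token stream is scored in overlapping windows that advance by `stride` tokens, and each window only counts loss on the tokens not already scored, so most tokens are conditioned on a long left context. A sketch of the window arithmetic, assuming a block size parameter the listing does not specify (only the stride of 64 is given):

```python
def sliding_window_spans(n_tokens: int, block_size: int, stride: int):
    """Yield (begin, end, n_scored) spans covering [0, n_tokens):
    windows advance by `stride`; `n_scored` counts only tokens not
    scored by an earlier window."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + block_size, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, but after the first window each scored token sees up to `block_size - stride` tokens of extra context compared with non-overlapping chunking.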
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"higher_momentum":true,"lower_lr":true,"warmdown":true,"gradient_clipping":true}
LR Schedule
warmdown
parameters: null
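A "warmdown" schedule typically holds the peak learning rate for most of training, then decays linearly to zero over a final fraction of steps. Since the submission lists no parameters, the split point below is an assumption for illustration:

```python
def warmdown_lr(step: int, total_steps: int, peak_lr: float,
                warmdown_frac: float = 0.2) -> float:
    """Hold peak_lr, then decay linearly to 0 over the last
    warmdown_frac of training. warmdown_frac=0.2 is an assumed value."""
    start = int(total_steps * (1 - warmdown_frac))
    if step < start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - start)
```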
Other
other
CPU dry-run mode for local smoke testing without CUDA.
parameters: {"dry_run":true,"steps":10}
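A dry-run switch of this kind is usually an environment-variable gate that shrinks the run onto CPU. A sketch assuming the `DRY_RUN=1` convention named in the contributions list; the non-dry-run step count is purely illustrative:

```python
def run_config(env: dict) -> dict:
    """Pick device and step count from the environment: DRY_RUN=1
    runs 10 steps on CPU so the pipeline can be smoke-tested
    without CUDA. 5000 full-run steps is an illustrative default."""
    dry = env.get("DRY_RUN") == "1"
    return {
        "device": "cpu" if dry else "cuda",
        "steps": 10 if dry else 5000,
    }
```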
Novel Contributions
- Top-heavy FFN allocation using OpenELM-style layer-wise scaling instead of a uniform 3x FFN.
- Exact packed int6 export path with per-row fp16 scales.
- Keeping the tied embedding in fp16 to preserve quantization-sensitive weights.
- Self-contained artifact export that avoids relying on external zstd at evaluation time.
- Sliding-window evaluation (stride 64) so each scored token is conditioned on more left context.
- CPU DRY_RUN=1 mode for local verification without GPU access.