PR #1051
WIP: LeakyReLU(0.5)² MLP on 11L EMA + GPTQ-lite stack (`track_10min_16mb`)
by tejas-goyal
val_bpb: 1.2826
Architecture: Transformer
Optimizer: —
Artifact Size: 7,804,166 bytes
Training Techniques
Architecture
MLP3x
Replaces the ReLU-squared MLP activation with LeakyReLU(negative_slope=0.5) followed by square() in the 3x MLP.
parameters: {"negative_slope":0.5}
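The activation swap can be sketched scalar-wise (the actual MLP applies it elementwise to a tensor; `negative_slope=0.5` is the only parameter, per the card):

```python
def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU then square: unlike ReLU-squared, the negative side keeps a
    # scaled response (and gradient) instead of being clamped flat to zero.
    y = x if x >= 0 else negative_slope * x
    return y * y
```

Note that squaring makes the output non-negative on both sides, e.g. an input of -2.0 maps to (-1.0)² = 1.0 rather than 0 as ReLU-squared would give.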
Partial RoPE
Uses partial rotary positional embeddings as part of the parent stack.
parameters: null
XSA
Includes XSA attention modification from the parent stack.
parameters: null
VE128
Includes VE128 component from the parent stack.
parameters: null
SmearGate
Includes SmearGate component from the parent stack.
parameters: null
BigramHash
Includes BigramHash component from the parent stack.
parameters: null
Weight Averaging
EMA
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: model weights
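The card states only the bit width and scope; "GPTQ-lite" itself is not specified here. As a point of reference, a plain symmetric round-to-nearest int6 quantizer (without GPTQ's error-compensating column updates) looks like:

```python
def quantize_int6(weights):
    # Symmetric per-tensor int6: integer codes in [-32, 31].
    # Round-to-nearest sketch only -- GPTQ additionally redistributes each
    # column's rounding error into not-yet-quantized columns.
    qmax = 31
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    deq = [qi * scale for qi in q]  # dequantized values used at eval time
    return q, deq, scale
```

Six bits per weight is what keeps the ~7.8 MB artifact under the 16 MB track limit.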
Evaluation
sliding window eval
parameters: {"stride":64}
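Sliding-window eval with stride 64 scores each token with long left context by re-running overlapping windows and counting only the new tokens in each pass. A sketch of the chunking (the context window size is an assumption; the card only gives `stride=64`):

```python
def sliding_window_chunks(n_tokens, window=512, stride=64):
    # Returns (ctx_start, ctx_end, n_scored) triples: the first pass scores a
    # full window, every later pass advances by `stride` and scores only the
    # `stride` newly covered tokens, each seeing ~window tokens of context.
    chunks, scored = [], 0
    while scored < n_tokens:
        end = min(scored + (window if scored == 0 else stride), n_tokens)
        start = max(0, end - window)
        chunks.append((start, end, end - scored))
        scored = end
    return chunks
```

A smaller stride gives a tighter (lower) bpb estimate at the cost of proportionally more forward passes.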
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
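A warmdown schedule holds the learning rate flat and then decays it over the final steps of training. The linear-to-zero shape below is an assumption; only `warmdown_steps=3500` is given by the card:

```python
def warmdown_lr(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    # Hold base_lr until `warmdown_steps` steps remain, then decay
    # linearly to 0 at the final step.
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```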
Regularization
LN scale
parameters: null
Novel Contributions
- Swaps the parent record's ReLU-squared MLP activation for LeakyReLU(0.5)-squared with no extra parameters.
- Builds on an 11-layer EMA + GPTQ-lite + warmdown3500 + QAT@0.15 stack.
- Provides a WIP submission folder with smoke-run logs and reproducible training/export scripts.
- Uses GPTQ-lite int6 export and sliding-window evaluation for the 10-minute / 16 MB track.