val_bpb: 1.3081
Architecture: Transformer
Optimizer: —
Artifact Size: 12889311 bytes

Training Techniques

Quantization: GPTQ-lite (bits: 6, scope: all)
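The run records GPTQ-lite quantization at 6 bits over all weights. The error-compensating details of GPTQ-lite are not spelled out in this card, so the sketch below only illustrates the basic 6-bit round trip (symmetric round-to-nearest with a per-row scale) that any 6-bit scheme builds on; it is not the run's actual algorithm.

```python
# Illustrative 6-bit symmetric round-to-nearest quantization.
# GPTQ-lite (bits=6, scope=all) adds error-compensating updates on top of
# this; here we only show the basic 6-bit round trip on one weight row.

def quantize_6bit(weights):
    """Map floats to integers in [-31, 31] with a per-row scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 31.0          # 6 bits signed -> levels -31..31
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

row = [0.12, -0.95, 0.33, 0.0, 0.61]
q, scale = quantize_6bit(row)
recovered = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-12 for a, b in zip(row, recovered))
```

At 6 bits, each weight stores one of 63 levels, which is where the small artifact size comes from.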
Architecture

LeakyReLU: MLP activation changed from ReLU^2 to LeakyReLU(0.5)^2 (parameters: {"slope": 0.5})
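The activation change replaces ReLU(x)^2 with LeakyReLU(x)^2 at slope 0.5, so negative pre-activations contribute 0.25·x^2 instead of being zeroed. A minimal sketch of both, on plain Python floats:

```python
# ReLU^2 vs LeakyReLU(0.5)^2, the MLP activation change recorded above.

def relu_sq(x):
    """Original activation: squared ReLU."""
    return max(x, 0.0) ** 2

def leaky_relu_sq(x, slope=0.5):
    """New activation: squared LeakyReLU with negative slope 0.5."""
    return (x if x >= 0.0 else slope * x) ** 2

# Positive inputs are unchanged; negative ones now contribute slope^2 * x^2.
assert relu_sq(2.0) == leaky_relu_sq(2.0) == 4.0
assert relu_sq(-2.0) == 0.0
assert leaky_relu_sq(-2.0) == 1.0   # (0.5 * -2)^2
```

Note that squaring makes the negative branch non-monotonic: large-magnitude negative inputs produce large positive outputs, scaled down by slope^2 = 0.25.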
Partial RoPE: used in the attention stack (parameters: {"ratio": "16/64"})
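With ratio 16/64, a plausible reading is that only 16 of each head's 64 dimensions receive rotary position embeddings while the rest pass through unrotated; the exact split convention in the run's code is an assumption here.

```python
import math

# Partial RoPE sketch: rotate only the first 16 of a head's 64 dimensions
# (the 16/64 ratio recorded above); the remaining 48 are left untouched.
# Which dimensions get rotated, and the base, are assumptions.

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """x: one head's vector (len 64). Rotates dims [0, rot_dims) in pairs."""
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

vec = [1.0] * 64
rotated = partial_rope(vec, pos=3)
# Dims beyond the rotary slice are untouched; rotation preserves pair norms.
assert rotated[16:] == vec[16:]
assert abs(math.hypot(rotated[0], rotated[1]) - math.sqrt(2.0)) < 1e-9
```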
XSA: enabled on the last 4 layers (parameters: {"layers": 4})
Weight tying: tied input/output embedding weights (parameters: null)
MLP3x: uses a 3x-MLP stack (parameters: null)
VE128: value residual enabled on layers 9 and 10 (parameters: {"layers": [9, 10], "dimensions": 128})
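The card does not spell out the VE128 formulation or the role of dimensions=128. One common "value residual" formulation (ResFormer-style) blends an early layer's attention values into selected later layers; the sketch below uses that reading with a fixed mixing weight, purely as an illustration.

```python
# Illustrative value-residual mixing on layers 9 and 10. The run's exact
# rule, mixing weight, and use of the 128 dimensions are assumptions.

VALUE_RESIDUAL_LAYERS = {9, 10}

def mix_values(v_layer, v_first, layer_idx, lam=0.5):
    """Blend current-layer values with first-layer values on selected layers."""
    if layer_idx not in VALUE_RESIDUAL_LAYERS:
        return v_layer
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]

v1 = [0.0] * 128          # first-layer values (the recorded 128 dimensions)
v9 = [2.0] * 128          # layer-9 values before mixing
assert mix_values(v9, v1, layer_idx=9) == [1.0] * 128   # blended
assert mix_values(v9, v1, layer_idx=5) == v9            # untouched layer
```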
Weight Averaging

EMA (parameters: null)
SWA (parameters: null)
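The run records both EMA and SWA weight averaging with no parameters. Minimal sketches of the two update rules on flat weight lists; the decay value is an illustrative assumption, not taken from the run.

```python
# EMA vs SWA weight averaging, the two techniques recorded above.

def ema_update(avg, weights, decay=0.999):
    """Exponential moving average: recent weights dominate (decay assumed)."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg, weights)]

def swa_update(avg, weights, n_averaged):
    """Stochastic weight averaging: running equal-weight mean of snapshots."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, weights)]

avg = [0.0, 0.0]
avg = swa_update(avg, [2.0, 4.0], n_averaged=0)   # mean of one snapshot
avg = swa_update(avg, [4.0, 8.0], n_averaged=1)   # mean of two snapshots
assert avg == [3.0, 6.0]
```

EMA tracks the trailing trajectory continuously, while SWA averages discrete snapshots with equal weight; either can be used to produce the evaluated checkpoint.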
LR Schedule

warmdown (parameters: {"warmdown_steps": 3500})
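A "warmdown" schedule with warmdown_steps=3500 typically means holding the learning rate constant and then decaying it linearly to zero over the final 3500 steps. The base LR and total step count below are illustrative assumptions; only the 3500-step warmdown length comes from the card.

```python
# Constant LR followed by a linear warmdown over the last 3500 steps.

WARMDOWN_STEPS = 3500   # from the recorded parameters

def lr_at(step, total_steps, base_lr=1.0):
    """Constant LR, then linear decay to 0 over the last WARMDOWN_STEPS."""
    remaining = total_steps - step
    if remaining >= WARMDOWN_STEPS:
        return base_lr
    return base_lr * remaining / WARMDOWN_STEPS

assert lr_at(0, total_steps=10000) == 1.0              # flat phase
assert lr_at(10000 - 1750, total_steps=10000) == 0.5   # halfway down
assert lr_at(10000, total_steps=10000) == 0.0          # end of training
```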
Evaluation

Sliding window eval (parameters: null)
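Sliding-window evaluation re-reads overlapping context so that every scored token sees up to eval_length=1024 tokens of history, rather than resetting context at fixed 1024-token boundaries. The stride and the exact scheme the script prints are assumptions; the sketch only shows how windows can tile a sequence so each token is scored exactly once.

```python
# Sliding-window span generation: each window scores only the tokens past
# its overlap with the previous window. Window=1024 matches eval_length;
# the stride of 512 is an illustrative assumption.

def sliding_windows(n_tokens, window=1024, stride=512):
    """Yield (start, end, score_from): score only tokens past the overlap."""
    spans, start = [], 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else start + (window - stride)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_windows(2048)
# Every token is scored exactly once across windows.
scored = [t for (s, e, f) in spans for t in range(f, e)]
assert scored == list(range(2048))
```

This is why the script prints an extra evaluation line: the sliding-window number is generally lower (better) than the fixed-window one, since scored tokens get more context.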
Compression

zstd (level: null)

Sequence Length

train_length: 1024
eval_length: 1024
Novel Contributions
- Non-record 16MB submission folder for the v1 1xH100 screening run
- LeakyReLU(0.5)^2 MLP activation change ported from a March 23 record
- Explicit conservative roundtrip metric recorded in submission.json instead of the legacy trailing label
- Included the exact train log, submission metadata, and code snapshot used for the run
- Documented the additional sliding-window evaluation line printed by the script