PR #1444

open

Add non-record v1 1xH100 LeakyReLU GPTQ-lite submission

by hypnoastic
val_bpb: 1.3081
Architecture: Transformer
Optimizer:
Artifact Size: 12,889,311 bytes

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
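All weights are stored at 6 bits. The PR does not spell out GPTQ-lite's exact procedure, so the sketch below illustrates the effect of 6-bit weight storage with plain symmetric round-to-nearest quantization — a stand-in for illustration, not the submission's actual algorithm:

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 6) -> tuple[np.ndarray, float]:
    """Symmetric round-to-nearest b-bit quantization with a per-tensor scale.

    Illustrative stand-in only; the PR's GPTQ-lite procedure is not
    specified here.
    """
    qmax = 2 ** (bits - 1) - 1          # 31 for 6-bit signed
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0                      # avoid divide-by-zero on all-zero tensors
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct float weights from the integer codes."""
    return q.astype(np.float32) * scale
```

With a per-tensor scale, the worst-case roundtrip error is half a quantization step, which is the kind of bound a conservative roundtrip metric can report.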
Architecture
LeakyReLU
MLP activation changed from ReLU^2 to LeakyReLU(0.5)^2
parameters: {"slope":0.5}
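A minimal sketch of the activation swap, assuming "LeakyReLU(0.5)^2" means the elementwise square of LeakyReLU with negative slope 0.5 (the authoritative definition is in the included code snapshot):

```python
import numpy as np

def relu_sq(x: np.ndarray) -> np.ndarray:
    """Baseline MLP activation: ReLU squared."""
    return np.maximum(x, 0.0) ** 2

def leaky_relu_sq(x: np.ndarray, slope: float = 0.5) -> np.ndarray:
    """Assumed submission activation: LeakyReLU(slope), then squared.

    Unlike ReLU^2, negative inputs keep a nonzero (squared, hence
    positive) response instead of being zeroed out.
    """
    y = np.where(x >= 0.0, x, slope * x)
    return y * y
```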
Partial RoPE
Uses Partial RoPE in the attention stack
parameters: {"ratio":"16/64"}
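With ratio 16/64, only the first 16 of each 64-dim attention head receive rotary position embedding; the remaining dims pass through unrotated. A single-head sketch (the split point at the front of the head and the frequency base are assumptions):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    """Rotate the first `rot_dims` of each head dim, pass the rest through.

    x: (seq_len, head_dim) activations for one attention head.
    """
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)      # per-pair rotation rates
    ang = np.outer(np.arange(seq), inv_freq)          # (seq, half) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```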
XSA
XSA enabled on the last 4 layers
parameters: {"layers":4}
weight tying
Tied input and output embedding weights
parameters: null
MLP3x
Uses a 3x-MLP stack
parameters: null
VE128
Value residual enabled on layers 9 and 10
parameters: {"layers":[9,10],"dimensions":128}
Weight Averaging
EMA
parameters: null
SWA
parameters: null
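The submission lists both EMA and SWA weight averaging with no parameters. Minimal update rules for each, with the decay value an illustrative assumption:

```python
import numpy as np

def ema_update(avg: np.ndarray, current: np.ndarray, decay: float = 0.999) -> np.ndarray:
    """Exponential moving average: avg <- decay * avg + (1 - decay) * current."""
    return decay * avg + (1.0 - decay) * current

def swa_update(avg: np.ndarray, current: np.ndarray, n_seen: int) -> np.ndarray:
    """Running mean over checkpoints, in the stochastic weight averaging style."""
    return (avg * n_seen + current) / (n_seen + 1)
```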
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
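The schedule holds the learning rate flat and then decays it linearly to zero over the final 3,500 steps ("warmdown"). A sketch, with the total step count and base LR as placeholder assumptions:

```python
def warmdown_lr(step: int, total_steps: int,
                warmdown_steps: int = 3500, base_lr: float = 1.0) -> float:
    """Constant LR, then linear decay to 0 over the last warmdown_steps."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1.0 -> 0.0 across warmdown
    return base_lr * frac
```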
Evaluation
sliding window eval
parameters: null
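The extra evaluation line the script prints comes from a sliding-window pass: the sequence is scored in overlapping windows so most tokens are evaluated with substantial left context. A sketch assuming a per-window loss callback (window and stride values are illustrative, not taken from the run):

```python
import numpy as np

def sliding_window_eval(token_nll, tokens, window: int = 1024, stride: int = 512) -> float:
    """Average per-token loss over overlapping windows.

    token_nll(chunk) -> list of per-token losses for one window.
    After the first window, only tokens past the overlap are counted,
    so every scored token has at least window - stride tokens of context.
    """
    losses = []
    for start in range(0, max(len(tokens) - window, 0) + 1, stride):
        chunk = tokens[start:start + window]
        nll = token_nll(chunk)
        skip = 0 if start == 0 else window - stride  # drop re-scored overlap
        losses.extend(nll[skip:])
    return float(np.mean(losses))
```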
Compression
zstd
level: null
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024

Novel Contributions

  • Non-record 16MB submission folder for the v1 1xH100 screening run
  • LeakyReLU(0.5)^2 MLP activation change ported from a March 23 record
  • Explicit conservative roundtrip metric recorded in submission.json instead of the legacy trailing label
  • Included exact train log, submission metadata, and code snapshot used for the run
  • Documented the additional sliding-window evaluation line printed by the script