PR #2164

open

Submission: Asymmetric Logit Rescale + cap-fit bit allocation (PR #2140 fork) [draft, pending 8×H100]

by vimetoView on GitHub

val_bpb

1.0554

Architecture

Transformer

Optimizer

AdamW

Artifact Size

17.20 MB

Training Techniques

Architecture

SmearGate

Uses sparse attention gating and related attention-path modifications in the PR #2140 lineage.

parameters: {"enabled":1,"scale":0.5}

LeakyReLU

Uses a leaky ReLU activation with negative slope 0.3 in the model.

parameters: {"slope":0.3}

Regularization

logit softcap

parameters: {"asymmetric":true,"learnable_scalars":["softcap_pos","softcap_neg"]}
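
The parameters above name two learnable scalars but not the functional form. A minimal sketch, assuming the common tanh softcap with a separate cap on each side of zero. The function name and the cap values 30.0 and 15.0 are hypothetical, and in training the two caps would be learnable parameters rather than constants.

```python
import math

def asymmetric_softcap(logit: float, softcap_pos: float, softcap_neg: float) -> float:
    """Squash a logit with tanh, using a different cap on each side of zero.

    Assumed form: cap * tanh(logit / cap), with cap chosen by the logit's sign.
    """
    cap = softcap_pos if logit >= 0.0 else softcap_neg
    return cap * math.tanh(logit / cap)

# Hypothetical caps: positive logits saturate near +30, negative ones near -15.
capped = [asymmetric_softcap(x, softcap_pos=30.0, softcap_neg=15.0)
          for x in (-100.0, -1.0, 0.0, 1.0, 100.0)]
```

Both branches have slope 1 at zero, so small logits pass through nearly unchanged and only the saturation level differs between the positive and negative sides.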

Quantization

GPTQ

bits: 7

scope: embeddings

GPTQ

bits: 6

scope: matrix/block weights

GPTQ-lite

bits: 8

scope: all

mixed int6/int7/int8

bits: null

scope: mixed
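
To see how per-scope bit widths affect artifact size, a rough estimate can be sketched as below. The parameter counts are hypothetical placeholders, not figures from this submission, and the estimate ignores quantization scales, metadata, and any compression applied afterwards.

```python
def packed_megabytes(param_counts: dict, bits: dict) -> float:
    """Raw payload size in MB (1 MB = 1e6 bytes) for per-scope bit widths."""
    total_bits = sum(param_counts[scope] * bits[scope] for scope in param_counts)
    return total_bits / 8 / 1e6

# Hypothetical parameter counts, for illustration only.
counts = {"embeddings": 4_000_000, "matrices": 20_000_000}
widths = {"embeddings": 7, "matrices": 6}
estimate_mb = packed_megabytes(counts, widths)
```

With these placeholder counts the raw payload comes to 18.5 MB, and moving the embeddings from 7 to 8 bits would add 0.5 MB.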

Sequence Length

sequence_length

train_length: 3072

eval_length: 3072

Evaluation

stride-based eval

parameters: {"stride":1536}
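
A stride of 1536 with an evaluation length of 3072 suggests overlapping windows. The sketch below assumes the usual sliding-window scheme, in which each window scores only the tokens not already scored by an earlier window. The function name and the returned tuple layout are hypothetical, and the PR's actual scoring rule is not stated here.

```python
def stride_windows(num_tokens: int, window: int = 3072, stride: int = 1536):
    """Return (start, end, score_from) triples covering every token exactly once.

    Tokens in [score_from, end) are scored; tokens in [start, score_from)
    only provide context.
    """
    windows = []
    scored_up_to = 0
    start = 0
    while scored_up_to < num_tokens:
        end = min(start + window, num_tokens)
        windows.append((start, end, scored_up_to))
        scored_up_to = end
        if end == num_tokens:
            break
        start += stride
    return windows

spans = stride_windows(6144)
```

For 6144 tokens this yields three windows, and every scored token after the first window has at least 1536 tokens of preceding context.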

Test-Time Training

LoRA TTT

parameters: {"rank":80,"learning_rate":0.0001}
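
For context on what rank 80 means, here is a minimal LoRA sketch in pure Python, assuming the standard formulation y = W x + scale * B (A x) with W frozen. The helper names, the tiny matrices, and the 512-wide layer in the count example are hypothetical. Which layers the PR adapts at test time is not stated here.

```python
def lora_param_count(d_in: int, d_out: int, rank: int = 80) -> int:
    """Trainable parameters added by one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, scale: float = 1.0):
    """Frozen base projection plus the low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

# Rank-1 toy example: W is the 2x2 identity, A is 1x2, B is 2x1.
y = lora_forward([1.0, 2.0], W=[[1, 0], [0, 1]], A=[[1, 1]], B=[[1], [2]])
```

When B is initialized to zeros, as is conventional, the adapted layer starts out identical to the frozen base layer.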

LR Schedule

warmdown

parameters: {"warmdown_frac":0.85}
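
One way to read warmdown_frac is as the final fraction of training during which the learning rate decays. The sketch below assumes that reading, with a linear decay to zero. Both the decay shape and the function name are assumptions, not details confirmed by the PR.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.85) -> float:
    """Multiplier on the base learning rate: flat, then linear decay to zero."""
    if warmdown_frac <= 0.0:
        return 1.0
    progress = step / total_steps
    warmdown_start = 1.0 - warmdown_frac
    if progress < warmdown_start:
        return 1.0
    # Decays linearly from 1.0 at warmdown_start to 0.0 at the final step.
    return max(0.0, (1.0 - progress) / warmdown_frac)

schedule = [lr_multiplier(s, 1000) for s in (0, 150, 575, 1000)]
```

Under this reading, a value of 0.85 keeps the rate constant for only the first 15% of steps and spends the rest of the run decaying.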

Other

other

Asymmetric logit rescale with separate positive/negative softcap parameters.

parameters: {"enabled":1}

Novel Contributions

Asymmetric logit rescale with separate softcap_pos and softcap_neg scalars
Cap-fit bit allocation: EMBED_BITS set to 7 and MLP_CLIP_SIGMAS set to 11.5
GPTQ calibration batch increase from 16 to 32
GPTQ reserve time reduction from 4.0 to 2.0 seconds
Port of PR #2140 with H100-targeted cap-compliance adjustments
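
To illustrate how a bit width and a clip threshold in standard deviations interact, here is a minimal sketch of symmetric round-to-nearest quantization. It is not GPTQ, which additionally compensates rounding error using calibration data. The function names and the toy weights are hypothetical, and whether MLP_CLIP_SIGMAS is applied this way in the PR is an assumption.

```python
import statistics

def quantize_clip_sigmas(weights, bits: int, clip_sigmas: float):
    """Clip at clip_sigmas * std, then round to signed integer codes."""
    clip = clip_sigmas * statistics.pstdev(weights)
    qmax = 2 ** (bits - 1) - 1          # e.g. 31 for 6 bits
    scale = clip / qmax if clip > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate weights."""
    return [c * scale for c in codes]

toy = [-1.0, -0.5, 0.0, 0.5, 1.0]       # toy weights, for illustration only
codes, scale = quantize_clip_sigmas(toy, bits=6, clip_sigmas=11.5)
restored = dequantize(codes, scale)
```

A larger clip_sigmas widens the representable range at the cost of a coarser step size, so the setting trades outlier coverage against resolution for typical weights.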