PR #2164
open
Submission: Asymmetric Logit Rescale + cap-fit bit allocation (PR #2140 fork) [draft, pending 8×H100]
by vimeto
val_bpb
1.0554
Architecture
Transformer
Optimizer
AdamW
Artifact Size
17.20 MB
Training Techniques
Architecture
SmearGate
Sparse attention gating and related attention-path modifications inherited from the PR #2140 lineage.
parameters: {"enabled":1,"scale":0.5}
LeakyReLU
Leaky ReLU activation with a negative-side slope of 0.3.
parameters: {"slope":0.3}
Regularization
logit softcap
parameters: {"asymmetric":true,"learnable_scalars":["softcap_pos","softcap_neg"]}
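A minimal sketch of what an asymmetric softcap could look like, assuming the usual `cap * tanh(x / cap)` form with a separate cap on each side of zero. The names `softcap_pos` and `softcap_neg` come from the parameters above; the formula, the function name, and the cap values are assumptions, and the PR may define the rescale differently.

```python
import math

def asymmetric_softcap(logit: float, softcap_pos: float, softcap_neg: float) -> float:
    """Squash a logit with tanh, using a different cap on each side of zero."""
    # Pick the cap by sign; both branches give 0 at logit == 0 with slope 1,
    # so the function stays continuous and smooth through the origin.
    cap = softcap_pos if logit >= 0.0 else softcap_neg
    return cap * math.tanh(logit / cap)

# Hypothetical cap values, for illustration only.
capped = [
    asymmetric_softcap(x, softcap_pos=30.0, softcap_neg=15.0)
    for x in (-100.0, -1.0, 0.0, 1.0, 100.0)
]
```

In a trained model the two caps would be learnable scalars; here they are plain floats so the shape of the function is easy to inspect.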
Quantization
GPTQ
bits: 7
scope: embeddings
GPTQ
bits: 6
scope: matrix/block weights
GPTQ-lite
bits: 8
scope: all
mixed int6/int7/int8
bits: null
scope: mixed
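To make the bit widths above concrete, here is a rough sketch of symmetric round-to-nearest quantization of one weight row at 6, 7, and 8 bits. This is not GPTQ, which additionally compensates rounding error using calibration data; it only illustrates how the integer range and step size change with the bit width. The function names and the sample weights are hypothetical.

```python
def quantize_row(weights, bits):
    """Symmetric round-to-nearest quantization of one row to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6, 63 for int7, 127 for int8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

row = [0.12, -0.50, 0.33, 0.01]  # hypothetical weights
errors = {}
for bits in (6, 7, 8):
    codes, scale = quantize_row(row, bits)
    restored = dequantize_row(codes, scale)
    # Rounding error per weight is bounded by half a quantization step.
    errors[bits] = max(abs(a - b) for a, b in zip(row, restored))
```

Each extra bit roughly halves the step size, which is the trade-off a mixed int6/int7/int8 scheme is spending bytes on.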
Sequence Length
sequence_length
train_length: 3072
eval_length: 3072
Evaluation
stride-based eval
parameters: {"stride":1536}
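One common reading of stride-based evaluation is a sliding window: each window is `eval_length` tokens long, windows advance by `stride`, and a token is scored only in the first window that reaches it. The sketch below assumes that scheme; the function name and the tuple layout are hypothetical, and the PR's evaluation loop may differ in detail.

```python
def eval_windows(num_tokens, eval_length=3072, stride=1536):
    """Return (window_start, score_start, window_end) tuples.

    Tokens in [window_start, score_start) are context only; tokens in
    [score_start, window_end) contribute to the loss.
    """
    windows = []
    scored_until = 0
    start = 0
    while scored_until < num_tokens:
        end = min(start + eval_length, num_tokens)
        windows.append((start, scored_until, end))
        scored_until = end   # everything up to `end` has now been scored once
        start += stride
    return windows

windows = eval_windows(6144)
```

With a stride of half the window, every scored token after the first window sees at least 1536 tokens of preceding context.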
Test-Time Training
LoRA TTT
parameters: {"rank":80,"learning_rate":0.0001}
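For readers unfamiliar with LoRA, the sketch below shows the low-rank update in plain Python: the frozen weight `W` is left untouched and only the small factors `A` and `B` would be trained at test time. The tiny shapes, the `scale` argument, and the model width used in the parameter count are hypothetical; only the rank of 80 comes from the parameters above.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W is frozen, A and B are the adapter."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Tiny hypothetical shapes: d_in=3, d_out=2, rank=1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]   # rank x d_in
B = [[0.0], [0.0]]      # d_out x rank, zero-initialised so the adapter starts as a no-op
x = [1.0, 2.0, 3.0]
y = lora_forward(W, A, B, x)
```

For a hypothetical 512×512 projection, `lora_param_count(512, 512, 80)` gives 81,920 trainable values, against 262,144 in the full matrix.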
LR Schedule
warmdown
parameters: {"warmdown_frac":0.85}
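A warmdown fraction of 0.85 is sketched here as a learning-rate multiplier that stays at 1.0 for the first 15% of steps and then decays linearly to zero. The linear shape and the function name are assumptions; the source gives only the fraction.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.85):
    """Constant at 1.0, then linear decay to 0.0 over the final warmdown_frac of training."""
    warmdown_steps = round(total_steps * warmdown_frac)
    warmdown_start = total_steps - warmdown_steps
    if warmdown_steps == 0 or step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)

# Hypothetical 1000-step run: the decay starts at step 150.
schedule = [lr_multiplier(s, 1000) for s in (0, 150, 575, 1000)]
```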
Other
Asymmetric logit rescale with separate positive/negative softcap parameters.
parameters: {"enabled":1}
Novel Contributions
- Asymmetric logit rescale with separate softcap_pos and softcap_neg scalars
- Cap-fit bit allocation: EMBED_BITS set to 7 and MLP_CLIP_SIGMAS set to 11.5
- GPTQ calibration batch increase from 16 to 32
- GPTQ reserve time reduction from 4.0 to 2.0 seconds
- Port of PR #2140 with H100-targeted cap-compliance adjustments
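The cap-fit idea in the list above can be illustrated with back-of-the-envelope arithmetic: the packed payload is roughly the sum of parameter count times bit width over each weight group. The sketch below assumes that simple model and ignores quantization scales, metadata, and any compression; the parameter counts, the group names, and the cap value are hypothetical and are not taken from the PR.

```python
def packed_size_bytes(groups):
    """groups maps a name to (parameter_count, bits); returns raw packed payload bytes."""
    total_bits = sum(count * bits for count, bits in groups.values())
    return (total_bits + 7) // 8  # round up to whole bytes

def fits_cap(groups, cap_bytes):
    """True if the estimated payload is at or under the given size cap."""
    return packed_size_bytes(groups) <= cap_bytes

# Hypothetical parameter counts, chosen only to make the arithmetic easy to follow.
at_7_bits = {"embeddings": (8_000_000, 7), "blocks": (16_000_000, 6)}
at_8_bits = {"embeddings": (8_000_000, 8), "blocks": (16_000_000, 6)}

size_7 = packed_size_bytes(at_7_bits)
size_8 = packed_size_bytes(at_8_bits)
saved = size_8 - size_7  # one bit per embedding parameter
```

In this toy setup, dropping the embedding group by one bit saves one byte per eight embedding parameters, which is the kind of lever a cap-compliance adjustment would pull.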