PR #1945
Record: PR #1908 base + AWQ-lite + Asymmetric Logit Rescale - val_bpb 1.05932 (3-seed mean)
by alertcat
val_bpb
1.0593
Architecture
Transformer
Optimizer
—
Artifact Size
15,986,941 bytes
Training Techniques
Quantization
AWQ-lite
bits: 6
scope: all
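The record does not include the quantization code, but the listed settings (6-bit, scope: all) suggest a per-channel fake-quantization pass over every linear weight. A minimal sketch, assuming symmetric per-output-channel scales and omitting the calibration step an AWQ/GPTQ-style method would add:

```python
import torch

def quantize_dequantize_6bit(weight: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric per-output-channel fake quantization: round to signed
    # `bits`-bit integers, then map back to floats. Calibration-aware scale
    # selection (the "AWQ/GPTQ" part) is omitted in this sketch.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(weight / scale).clamp(-qmax - 1, qmax)
    return q * scale

def fake_quantize_model(model: torch.nn.Module, bits: int = 6) -> None:
    # Round-trip every linear weight in a model ("scope: all").
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                module.weight.copy_(quantize_dequantize_6bit(module.weight, bits))
```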
Architecture
SmearGate
Causal residual gate that mixes in a 1-token lookback via a content-conditioned sigmoid gate applied to the first feature dimensions.
parameters: {"window":12}
SparseAttnGate
Sparse per-head multiplicative gate inside attention.
parameters: null
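No parameters are listed, so the following is only a guess at the mechanism: a learnable per-head scalar, passed through a ReLU so individual heads can be gated exactly to zero, multiplied onto the per-head attention output.

```python
import torch
import torch.nn as nn

class SparseAttnGateSketch(nn.Module):
    # Hypothetical: one learnable scalar per attention head; ReLU lets a
    # head's gate reach exactly zero, giving the "sparse" behavior.
    def __init__(self, n_heads: int):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(n_heads))

    def forward(self, head_out: torch.Tensor) -> torch.Tensor:
        # head_out: (batch, n_heads, time, head_dim)
        return head_out * torch.relu(self.gate).view(1, -1, 1, 1)
```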
Regularization
logit softcap
parameters: {"asymmetric":true,"pos":"softcap_pos","neg":"softcap_neg"}
Test-Time Training
LoRA TTT
parameters: {"phases":3,"score_first":true}
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Compression
brotli
level: null
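A sketch of how the artifact could be brotli-compressed after serialization. The compression level is left unset in the record, so `quality=11` (maximum) below is an assumption.

```python
import io
import brotli  # pip install brotli
import torch

def save_compressed(model: torch.nn.Module, path: str, quality: int = 11) -> int:
    # Serialize the state_dict to an in-memory buffer, brotli-compress it,
    # and write the payload to disk.
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    payload = brotli.compress(buf.getvalue(), quality=quality)
    with open(path, "wb") as f:
        f.write(payload)
    return len(payload)  # compressed artifact size in bytes
```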
Novel Contributions
- Combines the AWQ-lite quantization from PR #1908 with asymmetric logit rescaling at eval time.
- Shows that asymmetric logit rescaling improves TTT recovery when paired with AWQ-lite quantization.
- Uses eval-only surgical edits to train_gpt.py while preserving the training path.
- Achieves a 3-seed mean val_bpb of 1.05932 under the competition constraints.