PR #1234
Non-record: AR Self-Generated GPTQ Calibration (val_bpb=1.1461)
Status: open
by ibarrajo
val_bpb: 1.1461
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 16 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: attention + MLP weights
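GPTQ proper solves a layerwise least-squares problem using a Hessian built from calibration activations; as a minimal numerical illustration of what a 6-bit weight grid means, here is a plain symmetric round-to-nearest quantize/dequantize sketch (helper names are hypothetical, and this deliberately omits GPTQ's Hessian-weighted error correction):

```python
# Minimal sketch: symmetric round-to-nearest quantization to 6 bits.
# This is NOT GPTQ itself (which corrects rounding error column by column
# using a calibration Hessian); it only shows the 6-bit grid weights snap to.

def quantize_rtn(weights, bits=6):
    """Quantize a list of floats onto a symmetric `bits`-bit integer grid."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.12, -0.5, 0.31, 0.02]
q, s = quantize_rtn(w, bits=6)
w_hat = dequantize(q, s)                            # reconstruction error <= s/2
```

The round-trip error per weight is bounded by half the quantization step, which is why the scope choice (attention + MLP weights) matters: those matrices dominate the artifact size.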
Architecture
BigramHash
Bigram hash embedding used in the model input representation.
parameters: {"dimensions":128,"vocab_size":6144}
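The PR only gives the table shape (6144 buckets, 128 dims); the hash function and how the vector is mixed into the input are not specified. A hedged sketch of the lookup, with an assumed hash:

```python
# Sketch of a bigram hash embedding lookup. Only the dimensions come from
# the PR metadata (6144 buckets x 128 dims); the blake2b-based hash and the
# toy table are assumptions for illustration.
import hashlib

VOCAB_SIZE = 6144   # number of hash buckets (from the PR metadata)
DIM = 128           # embedding dimension (from the PR metadata)

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Map a (previous token, current token) pair to one of VOCAB_SIZE buckets."""
    key = f"{prev_id}:{cur_id}".encode()
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big") % VOCAB_SIZE

# Toy embedding table; a real model learns these vectors.
table = [[0.0] * DIM for _ in range(VOCAB_SIZE)]

tokens = [5, 17, 17, 42]
buckets = [bigram_bucket(p, c) for p, c in zip(tokens, tokens[1:])]
vectors = [table[b] for b in buckets]   # one 128-dim vector per bigram
```

Hashing bigrams into a fixed bucket count keeps the table small at the cost of collisions, which the model can learn around.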
XSA
XSA applied to the last 11 layers of the network.
parameters: {"layers":11}
SmearGate
SmearGate used as an architectural component.
parameters: null
U-Net skip connections
U-Net style skip connections in the network.
parameters: null
ReLU²
ReLU squared MLP activation.
parameters: null
KV head count
Attention head configuration; with heads equal to kv_heads (8 each), the grouping is trivial and this is standard multi-head attention.
parameters: {"heads":8,"kv_heads":8}
Compression
zstd
level: 22
Regularization
magnitude pruning
parameters: {"sparsity":0.1}
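Only the sparsity (0.1) is given; a minimal sketch of unstructured magnitude pruning under that single parameter, with an illustrative flat weight list:

```python
# Sketch of unstructured magnitude pruning at 10% sparsity (the only
# parameter given in the PR); the flat weight list is illustrative.

def magnitude_prune(weights, sparsity=0.1):
    """Zero out the smallest-magnitude fraction of weights.

    Ties at the threshold may prune slightly more than the target fraction.
    """
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

w = [0.5, -0.01, 0.3, -0.2, 0.05, 0.9, -0.02, 0.1, 0.4, -0.6]
pruned = magnitude_prune(w, sparsity=0.1)   # zeros out only -0.01
```

At 10% sparsity this is mild regularization rather than a compression technique on its own, though the extra zeros also help the zstd pass above.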
Evaluation
sliding window eval
parameters: {"stride":64}
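Sliding-window evaluation re-scores a long sequence with a fixed context window stepped by the stride, counting each token exactly once. The stride (64) is from the PR; using the 2048 train length as the eval window is an assumption (eval_length is null below). A sketch of the window bookkeeping:

```python
# Sketch of sliding-window eval index bookkeeping. stride=64 is from the
# PR; window=2048 (matching train_length) is an assumption. Each tuple is
# (window_start, score_from, window_end): the model conditions on tokens
# [window_start, window_end) but only tokens [score_from, window_end)
# contribute to the loss, so every token is scored exactly once.

def sliding_windows(n_tokens, window=2048, stride=64):
    spans = []
    scored = 0                      # tokens already scored so far
    start = 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, scored, end))
        scored = end
        start += stride
    return spans
```

Smaller strides give each scored token more preceding context (better bpb) at the cost of many more forward passes.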
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.0001}
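The PR names "score-first" TTT without spelling out the loop; a plausible reading is that each evaluation chunk is scored with the current weights before the model takes gradient steps on it, so no token is ever trained on before it is scored. A toy sketch under that assumption (epochs=3 and lr=1e-4 are from the PR; the one-parameter "model" is purely illustrative):

```python
# Hedged sketch of "score-first" test-time training: score each chunk
# BEFORE adapting on it, so evaluation never sees trained-on tokens.
# epochs=3 and lr=1e-4 come from the PR; the toy squared-error "model"
# that predicts every value as a single scalar theta is illustrative.

def loss(theta, chunk):
    return sum((x - theta) ** 2 for x in chunk) / len(chunk)

def grad(theta, chunk):
    return sum(2 * (theta - x) for x in chunk) / len(chunk)

def score_first_ttt(chunks, theta=0.0, epochs=3, lr=1e-4):
    scores = []
    for chunk in chunks:
        scores.append(loss(theta, chunk))   # 1) score first
        for _ in range(epochs):             # 2) then adapt on the chunk
            theta -= lr * grad(theta, chunk)
    return scores, theta
```

The reported metric is then the mean of the "score first" losses, which stays honest even though the weights drift toward the eval distribution.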
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Autoregressive self-generated GPTQ calibration using model-generated sequences instead of training data
- Hessian collection from self-generated calibration tokens for post-training quantization
- Reallocation of training budget to reserve time for AR generation and calibration
- Combination of self-generated GPTQ calibration with score-first TTT on the Approach B base
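The core idea in the first two bullets can be sketched end to end: sample calibration sequences autoregressively from the model itself (instead of drawing them from training data), then accumulate the per-layer Hessian proxy H = XXᵀ that GPTQ uses. The toy sampler and 4-dim activations below are illustrative; only the overall flow follows the PR description:

```python
# Sketch of self-generated GPTQ calibration. A real implementation would
# call model.generate() and hook layer inputs; the toy sampler and 4-dim
# activation vectors here are stand-ins for illustration only.
import random

def generate_calibration(n_seqs=4, seq_len=8, vocab=16, seed=0):
    """Sample calibration sequences from the model itself (toy sampler)."""
    rng = random.Random(seed)
    seqs = []
    for _ in range(n_seqs):
        toks = []
        for _ in range(seq_len):
            # Stand-in for sampling from the model conditioned on `toks`
            # (autoregressive: each token depends on those generated so far).
            toks.append(rng.randrange(vocab))
        seqs.append(toks)
    return seqs

def accumulate_hessian(activations, dim=4):
    """Accumulate H[i][j] = sum over calibration tokens of x_i * x_j (H = X X^T)."""
    H = [[0.0] * dim for _ in range(dim)]
    for x in activations:
        for i in range(dim):
            for j in range(dim):
                H[i][j] += x[i] * x[j]
    return H

calib = generate_calibration()          # replaces training-data calibration
# ...run calib through each layer, collect its input activations, then:
H = accumulate_hessian([[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]])
```

The third bullet is the budget trade-off this implies: generation and Hessian collection happen on the clock, so training time is shortened to leave room for them.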