PR #738
openRecord: VRL + Full GPTQ + 5-gram Cache + Hidden-State kNN-LM (3-seed mean val_bpb=1.0970)
by gowtham0992
val_bpb
1.0970
Architecture
Transformer
Optimizer
—
Artifact Size
~15.7 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
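For scale, a minimal sketch of what a 6-bit integer grid looks like. This is plain round-to-nearest quantization, not GPTQ itself (GPTQ additionally corrects each rounding step with second-order information from calibration data); the function names are illustrative, not from the PR.

```python
def quantize_rtn_6bit(weights):
    # Plain round-to-nearest symmetric 6-bit quantization of one weight row.
    # NOTE: NOT GPTQ itself -- GPTQ corrects rounding error using Hessian
    # information from calibration data -- but the 6-bit grid is the same.
    qmax = 2 ** (6 - 1) - 1          # signed 6-bit range is [-32, 31]
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Map 6-bit integers back to floats for inference.
    return [q * scale for q in quantized]
```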
Architecture
VRL
Value Residual Learning added to the base stack
parameters: null
MLP3x
3x MLP expansion with LeakyReLU(0.5)^2
parameters: {"layers":11,"dimensions":512}
BigramHash
BigramHash component used in the model stack
parameters: {"size":2048}
XSA
XSA applied across all layers
parameters: {"layers":11}
RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16,"base_dimensions":64}
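A minimal sketch of the partial-rotary idea: only the first 16 of the 64 head dimensions are rotated, the rest pass through unchanged. The exact angle schedule below is an assumption, not the PR's code.

```python
import math

def partial_rope(head_vec, pos, rotary_dims=16, base=10000.0):
    # Partial RoPE sketch (angle schedule is an assumption): rotate pairs in
    # the first `rotary_dims` dims of a 64-dim head; leave the rest untouched.
    out = list(head_vec)
    for i in range(0, rotary_dims, 2):
        theta = pos / (base ** (i / rotary_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = head_vec[i], head_vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```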
KV head count
8 query heads sharing 4 KV heads (grouped-query attention)
parameters: {"heads":8,"kv_heads":4}
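With 8 query heads and 4 KV heads, pairs of query heads share one KV head, halving the KV cache. The head-to-group mapping is just integer division:

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    # Grouped-query attention: heads // kv_heads query heads share one KV
    # head, shrinking the KV cache by that factor (2x here).
    return query_head // (heads // kv_heads)
```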
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
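The stated formula is a one-liner; deeper layers get a progressively smaller LayerNorm output scale:

```python
import math

def layerwise_ln_scale(layer):
    # Per-layer LN scale 1/sqrt(layer+1): layer 0 is unscaled, deeper layers
    # contribute progressively damped residual updates.
    return 1.0 / math.sqrt(layer + 1)
```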
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"description":"tight SWA every 50 steps"}
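Both averaging schemes are standard; a minimal sketch with the listed hyperparameters (decay 0.997, SWA snapshot every 50 steps), treating parameters as a plain dict of scalars for illustration:

```python
def ema_update(ema, params, decay=0.997):
    # Exponential moving average of weights: ema <- decay*ema + (1-decay)*w.
    return {k: decay * ema[k] + (1 - decay) * v for k, v in params.items()}

class SWA:
    # Stochastic weight averaging: a running mean of snapshots taken every
    # `frequency` steps ("tight SWA every 50 steps").
    def __init__(self, frequency=50):
        self.frequency, self.n, self.avg = frequency, 0, None

    def maybe_update(self, step, params):
        if step % self.frequency != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = dict(params)
        else:
            self.avg = {k: a + (params[k] - a) / self.n
                        for k, a in self.avg.items()}
```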
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"window":128}
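One plausible reading of sliding-window eval, sketched below as an assumption (not the PR's code): each chunk of 128 new tokens is scored with as much left context as the model allows, so chunk boundaries don't truncate context.

```python
def sliding_window_positions(seq_len, context_len, window=128):
    # Yield (start, end, score_from): feed tokens[start:end] to the model,
    # score only tokens[score_from:end], then slide forward by `window`.
    pos = 0
    while pos < seq_len:
        end = min(pos + window, seq_len)
        start = max(0, end - context_len)
        yield (start, end, pos)
        pos = end
```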
Other
other
Online 5-gram cache with adaptive lambda and pre-committed confidence gate
parameters: {"n":5,"threshold":0.7,"min_observations":3}
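A minimal sketch of the cache's counting and gating logic with the listed parameters (n=5, threshold 0.7, min_observations 3). The class and method names are assumptions; the adaptive-lambda mixing into the model's distribution is omitted for brevity.

```python
from collections import defaultdict

class NGramCache:
    # Online 5-gram cache: counts (4-token context -> next token) as the
    # stream is consumed, and only fires when the gate passes: at least
    # `min_observations` counts AND top continuation holding >= `threshold`
    # of the mass.
    def __init__(self, n=5, threshold=0.7, min_observations=3):
        self.n, self.threshold, self.min_obs = n, threshold, min_observations
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, tokens):
        # Record every (n-1)-token context and its continuation.
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.counts[ctx][tokens[i + self.n - 1]] += 1

    def predict(self, context):
        # Return (token, confidence) if the gate passes, else None.
        ctx = tuple(context[-(self.n - 1):])
        dist = self.counts.get(ctx)
        if dist is None:
            return None
        total = sum(dist.values())
        if total < self.min_obs:
            return None
        token, count = max(dist.items(), key=lambda kv: kv[1])
        conf = count / total
        return (token, conf) if conf >= self.threshold else None
```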
other
Hidden-state kNN-LM: 512-dim hidden states stored in a GPU ring buffer, queried by L2 nearest neighbors, with the retrieved neighbors converted to a token distribution via an RBF kernel
parameters: {"hidden_dim":512,"k":32,"buffer_size":30000,"temperature":50}
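A minimal CPU sketch of the retrieval half of this component (the PR keeps the buffer and the L2 search on GPU). Class and method names are assumptions; listed defaults k=32, buffer_size=30000, temperature=50 are from the parameters above.

```python
import math
from collections import deque, Counter

class KNNLMCache:
    # Hidden-state kNN-LM sketch: store (hidden_state, next_token) pairs in a
    # bounded ring buffer; at query time take the k nearest stored states by
    # L2 distance and weight their tokens with an RBF kernel exp(-d^2 / T).
    def __init__(self, k=32, buffer_size=30000, temperature=50.0):
        self.k, self.temperature = k, temperature
        self.buffer = deque(maxlen=buffer_size)  # oldest entries drop out

    def add(self, hidden, token):
        self.buffer.append((hidden, token))

    def distribution(self, query):
        # Brute-force L2 search over the buffer, then RBF-weighted voting.
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(h, query)), t)
            for h, t in self.buffer
        )[: self.k]
        weights = Counter()
        for d2, t in scored:
            weights[t] += math.exp(-d2 / self.temperature)
        z = sum(weights.values())
        return {t: w / z for t, w in weights.items()}
```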
Novel Contributions
- Hidden-State kNN-LM using stored 512-dim hidden states in a GPU ring buffer
- Online 5-gram cache with adaptive lambda and pre-committed confidence gate
- GPTQ calibration performed inside the training budget to satisfy competition constraints
- Combination of n-gram cache and kNN cache for additive evaluation-time gains
- VRL-based base stack with full GPTQ quantization