PR #738
openRecord: VRL + Full GPTQ + 5-gram Cache + Hidden-State kNN-LM (3-seed mean val_bpb=1.0970)
by gowtham0992
val_bpb
1.0970
Architecture
Transformer
Optimizer
—
Artifact Size
~15.7 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: all
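For scale, a minimal sketch of what a 6-bit integer grid looks like. This is plain round-to-nearest quantization, not GPTQ itself (GPTQ additionally corrects each rounding step with second-order information from calibration data); the function names are illustrative, not from the PR.

```python
def quantize_rtn_6bit(weights):
    # Plain round-to-nearest symmetric 6-bit quantization of one weight row.
    # NOTE: NOT GPTQ itself -- GPTQ corrects rounding error using Hessian
    # information from calibration data -- but the 6-bit grid is the same.
    qmax = 2 ** (6 - 1) - 1          # signed 6-bit range is [-32, 31]
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Map 6-bit integers back to floats for inference.
    return [q * scale for q in quantized]
```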
Architecture
VRL
Value Residual Learning added to the base stack
parameters: null
MLP3x
3x MLP expansion with LeakyReLU(0.5)^2
parameters: {"layers":11,"dimensions":512}
BigramHash
BigramHash component used in the model stack
parameters: {"size":2048}
XSA
XSA applied across all layers
parameters: {"layers":11}
RoPE
Partial rotary positional embeddings
parameters: {"dimensions":16,"base_dimensions":64}
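A minimal sketch of the partial-rotary idea: only the first 16 of the 64 head dimensions are rotated, the rest pass through unchanged. The exact angle schedule below is an assumption, not the PR's code.

```python
import math

def partial_rope(head_vec, pos, rotary_dims=16, base=10000.0):
    # Partial RoPE sketch (angle schedule is an assumption): rotate pairs in
    # the first `rotary_dims` dims of a 64-dim head; leave the rest untouched.
    out = list(head_vec)
    for i in range(0, rotary_dims, 2):
        theta = pos / (base ** (i / rotary_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = head_vec[i], head_vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```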
KV head count
8 query heads sharing 4 KV heads (grouped-query attention)
parameters: {"heads":8,"kv_heads":4}
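With 8 query heads and 4 KV heads, pairs of query heads share one KV head, halving the KV cache. The head-to-group mapping is just integer division:

```python
def kv_head_for(query_head, heads=8, kv_heads=4):
    # Grouped-query attention: heads // kv_heads query heads share one KV
    # head, shrinking the KV cache by that factor (2x here).
    return query_head // (heads // kv_heads)
```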
Regularization
layerwise LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
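The stated formula is a one-liner; deeper layers get a progressively smaller LayerNorm output scale:

```python
import math

def layerwise_ln_scale(layer):
    # Per-layer LN scale 1/sqrt(layer+1): layer 0 is unscaled, deeper layers
    # contribute progressively damped residual updates.
    return 1.0 / math.sqrt(layer + 1)
```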
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"description":"tight SWA every 50 steps"}
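Both averaging schemes are standard; a minimal sketch with the listed hyperparameters (decay 0.997, SWA snapshot every 50 steps), treating parameters as a plain dict of scalars for illustration:

```python
def ema_update(ema, params, decay=0.997):
    # Exponential moving average of weights: ema <- decay*ema + (1-decay)*w.
    return {k: decay * ema[k] + (1 - decay) * v for k, v in params.items()}

class SWA:
    # Stochastic weight averaging: a running mean of snapshots taken every
    # `frequency` steps ("tight SWA every 50 steps").
    def __init__(self, frequency=50):
        self.frequency, self.n, self.avg = frequency, 0, None

    def maybe_update(self, step, params):
        if step % self.frequency != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = dict(params)
        else:
            self.avg = {k: a + (params[k] - a) / self.n
                        for k, a in self.avg.items()}
```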
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"window":128}
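One plausible reading of sliding-window eval, sketched below as an assumption (not the PR's code): each chunk of 128 new tokens is scored with as much left context as the model allows, so chunk boundaries don't truncate context.

```python
def sliding_window_positions(seq_len, context_len, window=128):
    # Yield (start, end, score_from): feed tokens[start:end] to the model,
    # score only tokens[score_from:end], then slide forward by `window`.
    pos = 0
    while pos < seq_len:
        end = min(pos + window, seq_len)
        start = max(0, end - context_len)
        yield (start, end, pos)
        pos = end
```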
Other
other
Online 5-gram cache with adaptive lambda and pre-committed confidence gate
parameters: {"n":5,"threshold":0.7,"min_observations":3}
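A minimal sketch of the cache's counting and gating logic with the listed parameters (n=5, threshold 0.7, min_observations 3). The class and method names are assumptions; the adaptive-lambda mixing into the model's distribution is omitted for brevity.

```python
from collections import defaultdict

class NGramCache:
    # Online 5-gram cache: counts (4-token context -> next token) as the
    # stream is consumed, and only fires when the gate passes: at least
    # `min_observations` counts AND top continuation holding >= `threshold`
    # of the mass.
    def __init__(self, n=5, threshold=0.7, min_observations=3):
        self.n, self.threshold, self.min_obs = n, threshold, min_observations
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, tokens):
        # Record every (n-1)-token context and its continuation.
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.counts[ctx][tokens[i + self.n - 1]] += 1

    def predict(self, context):
        # Return (token, confidence) if the gate passes, else None.
        ctx = tuple(context[-(self.n - 1):])
        dist = self.counts.get(ctx)
        if dist is None:
            return None
        total = sum(dist.values())
        if total < self.min_obs:
            return None
        token, count = max(dist.items(), key=lambda kv: kv[1])
        conf = count / total
        return (token, conf) if conf >= self.threshold else None
```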
other
Hidden-state kNN-LM: 512-dim hidden states stored in a GPU ring buffer, queried by L2 nearest neighbors, with the retrieved neighbors converted to a token distribution via an RBF kernel
parameters: {"hidden_dim":512,"k":32,"buffer_size":30000,"temperature":50}
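A minimal CPU sketch of the retrieval half of this component (the PR keeps the buffer and the L2 search on GPU). Class and method names are assumptions; listed defaults k=32, buffer_size=30000, temperature=50 are from the parameters above.

```python
import math
from collections import deque, Counter

class KNNLMCache:
    # Hidden-state kNN-LM sketch: store (hidden_state, next_token) pairs in a
    # bounded ring buffer; at query time take the k nearest stored states by
    # L2 distance and weight their tokens with an RBF kernel exp(-d^2 / T).
    def __init__(self, k=32, buffer_size=30000, temperature=50.0):
        self.k, self.temperature = k, temperature
        self.buffer = deque(maxlen=buffer_size)  # oldest entries drop out

    def add(self, hidden, token):
        self.buffer.append((hidden, token))

    def distribution(self, query):
        # Brute-force L2 search over the buffer, then RBF-weighted voting.
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(h, query)), t)
            for h, t in self.buffer
        )[: self.k]
        weights = Counter()
        for d2, t in scored:
            weights[t] += math.exp(-d2 / self.temperature)
        z = sum(weights.values())
        return {t: w / z for t, w in weights.items()}
```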
Novel Contributions
- Hidden-State kNN-LM using stored 512-dim hidden states in a GPU ring buffer
- Online 5-gram cache with adaptive lambda and pre-committed confidence gate
- GPTQ calibration performed inside the training budget to satisfy competition constraints
- Combination of n-gram cache and kNN cache for additive evaluation-time gains
- VRL-based base stack with full GPTQ quantization