PR #278

closed

Record: 8L Paid Prefix + Sparse Hard Blocks (1.0365)

by nicolasdickenmann

val_bpb

1.0365

Architecture

Transformer

Optimizer

AdamW

Artifact Size

16.53 MB

Training Techniques

Architecture

SmearGate

Adds SmearGate to the model architecture.

parameters: null

BigramHash

Uses a BigramHash component with hashed buckets for additional representation capacity.

parameters: {"buckets":2048,"dim":128}
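One way to read these parameters, as a minimal sketch: each (previous token, current token) pair is hashed into one of 2048 buckets, and the bucket indexes a 128-dimensional embedding table. The hash function, its multiplier constant, the padding token, and the names `bigram_bucket` and `bigram_bucket_ids` are assumptions for illustration; the summary does not specify them.

```python
NUM_BUCKETS = 2048  # "buckets" from the parameters above
EMBED_DIM = 128     # "dim" from the parameters above

# Shape an embedding table would have under this reading (assumption).
TABLE_SHAPE = (NUM_BUCKETS, EMBED_DIM)


def bigram_bucket(prev_token, token, num_buckets=NUM_BUCKETS):
    """Map a token bigram to a bucket index with a hypothetical multiplicative hash."""
    mixed = (prev_token * 1_000_003 + token) & 0xFFFFFFFF
    return mixed % num_buckets


def bigram_bucket_ids(tokens, pad_token=0):
    """Return one bucket id per position; position 0 pairs with pad_token (assumption)."""
    prev = [pad_token] + list(tokens[:-1])
    return [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
```

Because distinct bigrams can collide in the same bucket, the table trades exactness for a fixed parameter count.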

MLP3x

Uses a 3x expanded MLP.

parameters: null

tied embeddings

Ties the input and output embeddings and passes them through in FP16.

parameters: null

KV head count

8 attention heads sharing 4 KV heads.

parameters: {"heads":8,"kv_heads":4}
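Assuming the usual grouped-query attention layout (the summary gives only the head counts), consecutive query heads share one KV head. The sketch below shows that mapping; the function name is hypothetical.

```python
def kv_head_for_query_head(q_head, heads=8, kv_heads=4):
    """Return the KV head index used by a query head under grouped-query attention."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head (2 here)
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
# mapping == [0, 0, 1, 1, 2, 2, 3, 3]
```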

Quantization

int6

bits: 6

scope: model weights
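The summary does not say how the 6-bit codes are produced. A minimal sketch, assuming symmetric per-row quantization with one floating-point scale per row (the scheme and both function names are assumptions):

```python
def quantize_int6(row):
    """Quantize a row of floats to signed 6-bit integers in [-32, 31]."""
    max_abs = max(abs(w) for w in row)
    # Map the largest magnitude to 31 so the range stays symmetric.
    scale = max_abs / 31 if max_abs > 0 else 1.0
    codes = [max(-32, min(31, round(w / scale))) for w in row]
    return codes, scale


def dequantize_int6(codes, scale):
    """Recover approximate floats from 6-bit codes and the row scale."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error per weight is at most half the row scale.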

Compression

zstd

level: 22

Weight Averaging

SWA

parameters: null
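No SWA parameters are recorded. As a generic illustration of stochastic weight averaging (not necessarily this run's schedule), a running mean over checkpoints can be maintained as follows; `swa_update` is a hypothetical helper operating on flat weight lists.

```python
def swa_update(avg_weights, new_weights, n_averaged):
    """Fold one more checkpoint into a running mean of n_averaged checkpoints."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg_weights, new_weights)]


checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = checkpoints[0]
for n, ckpt in enumerate(checkpoints[1:], start=1):
    avg = swa_update(avg, ckpt, n)
# avg is now the element-wise mean of the three checkpoints
```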

Evaluation

sliding window eval

parameters: {"stride":64}
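A sketch of how a stride of 64 can drive sliding-window scoring: each step scores the next 64 tokens while conditioning on as much preceding context as fits. The context length is not given in the summary, so `context_len` is an assumed parameter, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(num_tokens, context_len, stride=64):
    """Return (begin, end, score_from) triples.

    Tokens in [score_from, end) are scored; the model sees [begin, end).
    """
    windows = []
    score_from = 0
    while score_from < num_tokens:
        end = min(score_from + stride, num_tokens)
        begin = max(0, end - context_len)  # keep at most context_len tokens
        windows.append((begin, end, score_from))
        score_from = end
    return windows
```

Every token is scored exactly once, and a smaller stride gives each scored token more left context at the cost of more forward passes.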

Initialization

OrthoInit

Orthogonal initialization with muP scaling.

Optimizer

AdamW

weight_decay: 0.04

momentum: null

other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}

LR Schedule

warmdown

parameters: {"warmup_steps":1500,"warmdown_iters":3000}
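The shape of the schedule is not stated. The sketch below assumes that `warmup_steps` is a linear learning-rate ramp and that the warmdown is a linear decay to zero over the final `warmdown_iters` steps; `total_steps` is a hypothetical argument, since the summary does not record the run length.

```python
def lr_multiplier(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """Return a multiplier in [0, 1] applied to the base learning rate."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear ramp (assumption)
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        return max(0.0, (total_steps - step) / warmdown_iters)  # linear decay
    return 1.0                                     # constant plateau
```

For example, with a hypothetical `total_steps=10000`, the multiplier is 1.0 from step 1499 through step 7000 and falls to 0.5 at step 8500.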

Other

other

Builds a sparse hard-block cache inline and uses it as a sparse paid-prefix blob during evaluation, spending the byte budget on the highest-loss validation blocks.

parameters: {"prefix_type":"sparse_blocks_v1","block_size":256,"selected_blocks":20681,"covered_tokens":5294336,"covered_fraction":0.0854,"prefix_bytes":4240256}
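The reported figures can be cross-checked with simple arithmetic. The derived quantities below (bytes per block, bits per covered token, implied validation size) are computed from the parameters above and are not stated in the summary; the implied total is approximate because `covered_fraction` is rounded.

```python
params = {
    "block_size": 256,
    "selected_blocks": 20681,
    "covered_tokens": 5294336,
    "covered_fraction": 0.0854,
    "prefix_bytes": 4240256,
}

# Covered tokens should equal blocks times block size.
tokens_from_blocks = params["selected_blocks"] * params["block_size"]

# Average prefix cost per block and per covered token.
bytes_per_block = params["prefix_bytes"] / params["selected_blocks"]
bits_per_token = params["prefix_bytes"] * 8 / params["covered_tokens"]

# Validation set size implied by the rounded coverage fraction.
implied_total_tokens = params["covered_tokens"] / params["covered_fraction"]
```

The block count and token count agree exactly (20681 × 256 = 5294336), and the prefix costs roughly 205 bytes per block, or about 6.4 bits per covered token.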

Novel Contributions

Replaces the contiguous paid prefix with an inline-built sparse hard-block cache.
Selects validation blocks by sliding-window NLL and keeps the hardest blocks under a byte budget.
Builds the sparse paid-prefix blob at eval time and uses it in the same run.
Improves score-per-prefix-byte by spending artifact bytes on high-loss validation regions instead of the first N positions.
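The selection step described above can be sketched as a ranking problem: score each fixed-size block by its summed per-token NLL, then keep the highest-loss blocks that fit the byte budget. This is an illustrative sketch, not the PR's code; it assumes a fixed byte cost per block, and `select_hard_blocks` and its arguments are hypothetical names.

```python
def select_hard_blocks(token_nll, block_size, block_cost_bytes, byte_budget):
    """Return sorted indices of the highest-loss blocks that fit the budget."""
    num_blocks = len(token_nll) // block_size
    block_loss = [
        sum(token_nll[b * block_size:(b + 1) * block_size])
        for b in range(num_blocks)
    ]
    # Rank blocks from hardest (highest summed NLL) to easiest.
    ranked = sorted(range(num_blocks), key=lambda b: block_loss[b], reverse=True)
    max_blocks = byte_budget // block_cost_bytes
    return sorted(ranked[:max_blocks])


# Toy example: four blocks of two tokens, budget for two blocks.
nll = [1, 1, 5, 5, 2, 2, 9, 9]
chosen = select_hard_blocks(nll, block_size=2, block_cost_bytes=10, byte_budget=25)
# chosen == [1, 3]
```

Compared with a contiguous prefix, which would always pick blocks 0 and 1 in this toy case, the ranking spends the same bytes on the blocks where the model loses the most.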