PR #278

closed

Record: 8L Paid Prefix + Sparse Hard Blocks (1.0365)

by nicolasdickenmann
val_bpb
1.0365
Architecture
Transformer
Optimizer
AdamW
Artifact Size
16.53 MB

Training Techniques

Architecture
SmearGate
Adds SmearGate to the model architecture.
parameters: null
BigramHash
Uses a BigramHash component with hashed buckets for additional representation capacity.
parameters: {"buckets":2048,"dim":128}
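A minimal sketch of a hashed bigram lookup, assuming each (previous token, current token) pair is hashed into one of the 2048 buckets, and each bucket indexes a 128-dimensional embedding row. The hash constant and the names `bigram_bucket` and `bigram_buckets` are illustrative assumptions; the PR's actual hash is not shown here.

```python
BUCKETS = 2048  # "buckets" from the parameters above
DIM = 128       # "dim" from the parameters above (width of each bucket's embedding row)


def bigram_bucket(prev_token: int, token: int, buckets: int = BUCKETS) -> int:
    """Map a (previous, current) token pair to a bucket index.

    The multiplier is an arbitrary odd constant picked for illustration only.
    """
    return ((prev_token * 1000003) ^ token) % buckets


def bigram_buckets(tokens: list) -> list:
    """Bucket index per position; position 0 has no predecessor, so 0 is assumed."""
    prev = [0] + tokens[:-1]
    return [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
```

Each returned index would select one row of a `BUCKETS x DIM` table whose output is added to the token representation. Distinct bigrams can collide in the same bucket, which is the trade-off for the small table.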
MLP3x
Uses a 3x expanded MLP.
parameters: null
tied embeddings
Ties the input and output embeddings; the tied matrix is passed through in FP16.
parameters: null
KV head count
8 attention heads sharing 4 KV heads (grouped-query attention).
parameters: {"heads":8,"kv_heads":4}
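As a sketch of what 8 query heads over 4 KV heads implies, the mapping below assigns each query head to a KV head. Contiguous grouping (heads 0 and 1 share KV head 0, and so on) is the common convention and is assumed here; `kv_head_for` is a hypothetical helper name.

```python
HEADS = 8     # query heads, from the parameters above
KV_HEADS = 4  # key/value heads, from the parameters above


def kv_head_for(query_head: int, heads: int = HEADS, kv_heads: int = KV_HEADS) -> int:
    """Return the KV head that a given query head reads from."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # 2 query heads per KV head here
    return query_head // group_size


mapping = [kv_head_for(h) for h in range(HEADS)]
```

With these settings the key and value projections are half the size they would be with 8 KV heads.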
Quantization
int6
bits: 6
scope: model weights
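A minimal sketch of 6-bit weight quantization, assuming a symmetric signed range of [-31, 31] and one scale per weight group. The grouping, the rounding mode, and the function names are assumptions; the source only states 6 bits applied to model weights.

```python
QMAX = 31  # symmetric signed 6-bit range [-31, 31] (assumed convention)


def quantize_int6(weights: list) -> tuple:
    """Quantize one group of float weights to integers in [-31, 31] plus a scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = [max(-QMAX, min(QMAX, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int6(q: list, scale: float) -> list:
    """Recover approximate float weights from quantized integers."""
    return [v * scale for v in q]
```

The round-trip error per weight is at most half a quantization step, that is `scale / 2`.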
Compression
zstd
level: 22
Weight Averaging
SWA
parameters: null
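The source gives no SWA parameters, so the following is only a generic sketch of stochastic weight averaging: a uniform running mean over weight snapshots. Which steps are snapshotted and how often is not stated; `swa_update` is a hypothetical helper.

```python
def swa_update(avg: list, new: list, n_averaged: int) -> list:
    """Fold one more weight snapshot into a uniform running average.

    avg is the mean of the first n_averaged snapshots; the result is the
    mean of n_averaged + 1 snapshots.
    """
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, new)]


# Usage: average three toy snapshots of a two-parameter model.
snapshots = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
swa_weights = list(snapshots[0])
for n, snap in enumerate(snapshots[1:], start=1):
    swa_weights = swa_update(swa_weights, snap, n)
```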
Evaluation
sliding window eval
parameters: {"stride":64}
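A sketch of how a stride-64 sliding-window evaluation can be laid out, assuming every token is scored exactly once and each window after the first scores only its last `stride` tokens. The context length is not given in the source; the 128 used in the usage line is a placeholder, and `sliding_windows` is a hypothetical name.

```python
def sliding_windows(n_tokens: int, context: int, stride: int) -> list:
    """Return (begin, end, score_from) spans.

    The model sees tokens[begin:end]; loss is counted only on
    tokens[score_from:end], so no token is scored twice.
    """
    spans = []
    score_from = 0
    while score_from < n_tokens:
        if score_from == 0:
            end = min(context, n_tokens)  # first window scores everything it sees
        else:
            end = min(score_from + stride, n_tokens)
        begin = max(0, end - context)
        spans.append((begin, end, score_from))
        score_from = end
    return spans


spans = sliding_windows(n_tokens=300, context=128, stride=64)
```

A smaller stride gives each scored token more left context at the cost of more forward passes.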
Initialization
OrthoInit
Orthogonal initialization with muP scaling.
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
LR Schedule
warmdown
parameters: {"warmup_steps":1500,"warmdown_iters":3000}
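One possible reading of these parameters, sketched as a learning-rate multiplier: linear warmup over the first 1500 steps, a flat phase, then a linear warmdown to zero over the final 3000 steps. The schedule shape, the total step count, and the name `lr_scale` are all assumptions; the source lists only the two parameter values.

```python
def lr_scale(step: int, total_steps: int,
             warmup_steps: int = 1500, warmdown_iters: int = 3000) -> float:
    """Multiplier applied to a base learning rate at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear ramp up to 1.0
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        return max(0.0, (total_steps - step) / warmdown_iters)  # linear decay to 0
    return 1.0


# Usage with a hypothetical 10,000-step run and the matrix_lr listed above.
TOTAL_STEPS = 10_000  # assumed, not from the source
matrix_lr_mid_warmdown = 0.025 * lr_scale(8_500, TOTAL_STEPS)
```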
Other
other
A sparse hard-block cache, built inline during evaluation, serves as the paid-prefix blob: it stores the highest-loss validation blocks that fit within a byte budget.
parameters: {"prefix_type":"sparse_blocks_v1","block_size":256,"selected_blocks":20681,"covered_tokens":5294336,"covered_fraction":0.0854,"prefix_bytes":4240256}
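The reported parameters can be cross-checked with a few lines of arithmetic. All inputs below are copied from the parameters line; the derived values are approximate because `covered_fraction` is rounded.

```python
block_size = 256
selected_blocks = 20681
covered_tokens = 5294336
covered_fraction = 0.0854
prefix_bytes = 4240256

# Covered tokens are exactly the selected blocks times the block size.
assert selected_blocks * block_size == covered_tokens

bytes_per_block = prefix_bytes / selected_blocks      # about 205.03 bytes
bytes_per_token = prefix_bytes / covered_tokens       # about 0.80 bytes
implied_total_tokens = covered_tokens / covered_fraction  # about 62.0 million
```

So the blob spends roughly 0.8 bytes per covered token and covers about 8.5% of a validation set of roughly 62 million tokens.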

Novel Contributions

  • Replaces the contiguous paid prefix with an inline-built sparse hard-block cache.
  • Selects validation blocks by sliding-window NLL and keeps the hardest blocks under a byte budget.
  • Builds the sparse paid-prefix blob at eval time and uses it in the same run.
  • Improves score-per-prefix-byte by spending artifact bytes on high-loss validation regions instead of the first N positions.
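The selection step described above can be sketched as a greedy pick of the highest-loss blocks under a byte budget. This assumes per-token NLL values from the sliding-window pass and a known byte cost per block; skipping blocks that do not fit and trying cheaper ones is a hypothetical policy, and both function names are illustrative.

```python
def block_mean_nll(token_nll: list, block_size: int) -> list:
    """Average per-token NLL over consecutive blocks of block_size tokens."""
    means = []
    for start in range(0, len(token_nll), block_size):
        chunk = token_nll[start:start + block_size]
        means.append(sum(chunk) / len(chunk))
    return means


def select_hard_blocks(block_nll: list, block_cost_bytes: list,
                       budget_bytes: int) -> tuple:
    """Greedily keep the hardest blocks whose total cost fits the budget."""
    order = sorted(range(len(block_nll)), key=lambda i: block_nll[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        cost = block_cost_bytes[i]
        if used + cost > budget_bytes:
            continue  # assumed policy: skip this block, try the next hardest
        chosen.append(i)
        used += cost
    return sorted(chosen), used


# Usage with toy numbers (not from the PR).
nll = [1.0, 3.5, 0.2, 2.9, 3.1]
cost = [200, 210, 190, 205, 400]
chosen, used = select_hard_blocks(nll, cost, budget_bytes=450)
```

In the toy run, blocks 1 and 3 are kept: block 4 has higher loss than block 3 but does not fit in the remaining budget. A contiguous prefix with the same budget would instead keep blocks 0 and 1, including the easy block 0.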