PR #327

open

Submission: TrigramHash + PartialRoPE + HeadTemp + stride32 (val_bpb: 1.1450)

by Ananddna
val_bpb
1.1450
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.9 MB

Training Techniques

Architecture
TrigramHash
Adds learned hashed embeddings for consecutive token triplets to capture 3-token patterns as atomic units.
parameters: {"buckets":8192,"dim":64}
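The idea can be sketched in a few lines: hash each consecutive token triplet into one of 8192 buckets and look up a learned 64-dim embedding for it (the bucket count and dim come from this submission's parameters; the hash mixing constants and the random initialisation below are assumptions, since the PR does not specify them).

```python
import random

BUCKETS, DIM = 8192, 64  # parameters from this submission

# Stand-in for a learned table of shape (BUCKETS, DIM); in the real model
# these rows are trained, here they are just randomly initialised.
random.seed(0)
trigram_table = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]

def trigram_bucket(t0, t1, t2):
    # Mix the three token ids with distinct multipliers, then mod into buckets.
    # The exact mixing constants are an assumption; any cheap hash works.
    return (t0 * 1000003 + t1 * 10007 + t2 * 101) % BUCKETS

def trigram_features(tokens):
    # One DIM-dim feature per position, added to the usual token embedding.
    # The first two positions have no full trigram, so they get zero vectors.
    feats = [[0.0] * DIM, [0.0] * DIM]
    for i in range(2, len(tokens)):
        feats.append(trigram_table[trigram_bucket(tokens[i-2], tokens[i-1], tokens[i])])
    return feats

feats = trigram_features([5, 17, 42, 42, 9])
```

Hash collisions are accepted by design: with 8192 buckets, unrelated triplets occasionally share an embedding, trading exactness for a small table.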
Partial RoPE
Applies rotary position embeddings to only part of each attention head dimension, leaving the rest position-free.
parameters: {"fraction":0.5}
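A minimal sketch of the partial variant, operating on a single head vector: with fraction 0.5 (this submission's setting), only the first half of the dims is rotated and the rest pass through untouched. The head size and rotary base below are assumptions.

```python
import math

def partial_rope(x, pos, fraction=0.5, base=10000.0):
    """Rotate the first `fraction` of the head dims by position-dependent
    angles (standard RoPE pairs); leave the remaining dims position-free.
    `x` is one head's query or key vector as a list of floats."""
    d = len(x)
    d_rot = int(d * fraction)
    d_rot -= d_rot % 2              # rotary pairs need an even count
    out = list(x)
    for i in range(0, d_rot, 2):
        theta = pos / (base ** (i / d_rot))
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

v = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
rotated = partial_rope(v, pos=3)   # first 4 dims rotated, last 4 unchanged
```

The untouched dims give attention a channel that ignores relative position entirely, which the rotated dims cannot provide.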
Per-head temperature scaling
Learns a separate temperature parameter for each attention head to vary attention sharpness.
parameters: null
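Mechanically this is a per-head divisor on the pre-softmax scores. A small sketch (the temperature values below are illustrative stand-ins for learned scalars; the PR reports no parameters for this technique):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    z = sum(es)
    return [e / z for e in es]

# Hypothetical learned temperatures, one per head (typically initialised to 1.0).
head_temps = [1.0, 0.5, 2.0, 1.3]

def attn_weights(scores, head):
    # Lower temperature -> sharper attention; higher -> more uniform.
    t = head_temps[head]
    return softmax([s / t for s in scores])

sharp = attn_weights([1.0, 2.0, 3.0], head=1)   # temp 0.5: peaked
soft  = attn_weights([1.0, 2.0, 3.0], head=2)   # temp 2.0: flatter
```

This lets some heads specialise in near-hard selection while others keep broad, averaging attention, instead of forcing one global 1/sqrt(d) scale on every head.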
BigramHash
Uses hashed embeddings for token pairs as a complementary n-gram feature.
parameters: {"buckets":10240}
SmearGate
Gating component used in the model architecture.
parameters: null
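The PR does not spell out SmearGate's mechanism. One plausible reading, offered purely as an assumption, is a learned gate that "smears" each position's representation with its predecessor's:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

smear_gate = -2.0   # hypothetical learned scalar; sigmoid(-2) ~= 0.12

def smear(xs):
    # Blend each position with the previous one via the learned gate.
    # Position 0 has no predecessor and passes through unchanged.
    g = sigmoid(smear_gate)
    out = [list(xs[0])]
    for prev, cur in zip(xs, xs[1:]):
        out.append([c + g * p for p, c in zip(prev, cur)])
    return out
```

If the actual component gates something else (e.g. a residual branch), the shape is the same: a learned sigmoid scalar controlling how much of an auxiliary signal is mixed in.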
MLP3x
Uses a 3x expanded MLP hidden size.
parameters: {"expansion":3}
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
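With 8 query heads over 4 KV heads (this submission's setting), each pair of query heads reads the same K/V projection, halving the KV cache. The grouping rule is just integer division:

```python
HEADS, KV_HEADS = 8, 4        # from this submission
GROUP = HEADS // KV_HEADS     # query heads per KV head

def kv_head_for(q_head):
    # Consecutive groups of GROUP query heads share one KV head.
    return q_head // GROUP

mapping = [kv_head_for(h) for h in range(HEADS)]
```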
tied embeddings
Shares weights between the input token embedding and the output (unembedding) projection, saving one vocab-sized matrix.
parameters: null
U-Net skip connections
Adds skip connections in a U-Net-like pattern across layers.
parameters: null
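A common realisation of this pattern (the pairing scheme below is an assumption, since the PR gives no parameters): layers in the first half of the stack save their outputs, and mirrored layers in the second half add them back before running.

```python
def unet_forward(x, layers):
    """U-Net-style skips over a stack of layers (callables here).
    Layer i in the first half pushes its output; the mirrored layer
    in the second half pops it and adds it to its input."""
    n = len(layers)
    saved = []
    for i, layer in enumerate(layers):
        if i >= n - n // 2 and saved:
            x = x + saved.pop()
        x = layer(x)
        if i < n // 2:
            saved.append(x)
    return x

# Toy check with scalar "activations" and +1 "layers".
y = unet_forward(0, [lambda v: v + 1] * 4)
```

The long skips give late layers direct access to early, less-processed features, much like the encoder-decoder shortcuts in an image U-Net.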
Evaluation
sliding window eval
parameters: {"stride":32}
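The stride-32 evaluation (reduced from 64 per the notes below) can be sketched as window planning: each window advances by `stride`, the model sees the whole window, but loss is counted only on tokens not yet scored, so nearly every token is evaluated with close to a full window of context. The window length of 1024 here is an assumed context size, not stated in the PR.

```python
def eval_spans(n_tokens, window=1024, stride=32):
    """Return (begin, end, score_from) triples: the model reads tokens
    [begin:end] but loss is counted only on [score_from:end]. Smaller
    stride -> more windows, but more context per scored token."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Halving the stride roughly doubles evaluation compute, which is the price paid here for the bpb improvement.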
Test-Time Training
LoRA TTT
parameters: {"rank":4}
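The rank-4 setting means each adapted weight W is augmented as W + A @ B, where A and B are thin matrices trained at test time while W stays frozen. A pure-Python sketch (the layer dimensions are illustrative; only the rank comes from this submission):

```python
import random

random.seed(0)
D_OUT, D_IN, RANK = 16, 16, 4   # rank from this submission; dims assumed

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Frozen base weight plus a trainable low-rank pair.
W = [[random.gauss(0, 0.02) for _ in range(D_IN)] for _ in range(D_OUT)]
A = [[0.0] * RANK for _ in range(D_OUT)]  # zero init: adapter starts as a no-op
B = [[random.gauss(0, 0.02) for _ in range(D_IN)] for _ in range(RANK)]

def effective_weight():
    delta = matmul(A, B)
    return [[W[i][j] + delta[i][j] for j in range(D_IN)] for i in range(D_OUT)]
```

Only A and B (2 * 16 * 4 = 128 values here, vs 256 in W) are updated during test-time training, keeping the adaptation cheap and the base weights untouched.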
Weight Averaging
SWA
parameters: {"frac":0.4,"every":50}
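Reading the parameters as "average snapshots taken every 50 steps over the final 40% of training", the averaging itself is a running mean over parameter vectors. A minimal sketch (the schedule, i.e. when `update` is called, lives outside this class):

```python
class SWA:
    """Running average of parameter snapshots (flat lists of floats).
    With this submission's settings, update() would be called every 50
    optimizer steps during the last 40% of training."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            # Incremental mean: avg += (params - avg) / n, elementwise.
            self.avg = [a + (p - a) / self.n for a, p in zip(self.avg, params)]
```

The averaged weights are what gets evaluated (and shipped); the raw final checkpoint is discarded.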
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Compression
zstd
level: 22

Novel Contributions

  • TrigramHashEmbedding for hashing token triplets into learned embeddings
  • Partial RoPE applied to only 50% of head dimensions
  • Per-head temperature scaling in attention
  • Reduced evaluation stride from 64 to 32
  • LoRA-based test-time training infrastructure