- val_bpb: 1.1609
- Architecture: Transformer
- Optimizer: Muon
- Artifact size: 13,977,633 bytes
Training Techniques

Quantization
- int6: 6-bit quantization applied to all weights, per-row
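A minimal sketch of per-row int6 quantization, assuming a symmetric grid with one float scale per row; the artifact's actual rounding scheme and bit-packing are not specified here, so those details are illustrative:

```python
import numpy as np

def quantize_int6_per_row(w):
    """Quantize a 2-D weight matrix to 6-bit integers, one scale per row."""
    # Signed 6 bits cover [-32, 31]; a symmetric [-31, 31] grid lets the
    # scale map each row's max |w| exactly onto the largest level.
    scales = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scales[scales == 0] = 1.0            # guard all-zero rows
    q = np.clip(np.round(w / scales), -31, 31).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int6_per_row(w)
w_hat = dequantize(q, s)                 # per-element error is at most scale/2
```

Note that the int8 codes here would still need packing at 6 bits per weight before compression to realize the artifact size reported above.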
Architecture
- MLP3x: 3x MLP with 1536 hidden size (hidden_size: 1536)
- GQA: grouped-query attention with 8 query heads and 4 KV heads (query_heads: 8, kv_heads: 4)
- Tied embeddings: input and output embeddings are tied
- SmearGate: custom gating mechanism used in the model
- BigramHash: bigram hash feature module (size: 2048x128)
- RoPE: rotary positional embeddings with NTK scaling (sequence_length: 2048)
- Partial RoPE: applies RoPE to only part of the dimensions (dimensions: 16/64)
- XSA: XSA applied on the last 4 layers (layers: 4)
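The GQA entry above can be illustrated with a small sketch (causal, single layer, NumPy). The 8/4 head configuration and the resulting 2-to-1 query-to-KV-head mapping come from the table; the tiny dimensions and everything else are illustrative:

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, q_heads=8, kv_heads=4):
    """Causal grouped-query attention: q_heads query heads share kv_heads
    K/V heads, so each K/V head serves q_heads // kv_heads query heads."""
    T, d = x.shape
    hd = d // q_heads                    # per-head dimension
    group = q_heads // kv_heads          # query heads per K/V head (2 here)
    q = (x @ wq).reshape(T, q_heads, hd)
    k = (x @ wk).reshape(T, kv_heads, hd)
    v = (x @ wv).reshape(T, kv_heads, hd)
    mask = np.triu(np.full((T, T), -1e9), k=1)   # causal mask
    out = np.empty_like(q)
    for h in range(q_heads):
        kh = h // group                  # shared K/V head for this query head
        scores = q[:, h] @ k[:, kh].T / np.sqrt(hd) + mask
        p = np.exp(scores - scores.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)
        out[:, h] = p @ v[:, kh]
    return out.reshape(T, d)

rng = np.random.default_rng(1)
T, d = 5, 16                             # tiny dims just for shape-checking
x = rng.standard_normal((T, d))
wq = rng.standard_normal((d, d))
wk = rng.standard_normal((d, (d // 8) * 4))   # kv_heads * head_dim columns
wv = rng.standard_normal((d, (d // 8) * 4))
y = gqa_attention(x, wq, wk, wv)
```

The point of sharing K/V heads is that the KV cache and the K/V projection weights shrink by the group factor while the query side keeps its full head count.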
Initialization
- OrthoInit: orthogonal initialization combined with muP
Optimizer
- Muon (weight_decay: 0.04)
Weight Averaging
- EMA (decay: 0.997)
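EMA weight averaging keeps a shadow copy of the weights, updated after each optimizer step as ema ← decay · ema + (1 − decay) · w, and evaluates with the shadow copy. A minimal sketch using the decay reported above:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step per tensor: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

# Toy loop: the shadow weights converge smoothly onto the live weights.
live, shadow = [1.0], [0.0]
for _ in range(2000):
    shadow = ema_update(shadow, live)
```

With decay 0.997 the averaging window is roughly 1 / (1 − 0.997) ≈ 333 steps, which smooths out late-training noise in the weights.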
Regularization
- LN Scale
- Weight decay (value: 0.04)
Evaluation
- Sliding window eval (stride: 64)
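Sliding-window evaluation re-scores the sequence in overlapping windows, counting only the last `stride` tokens of each window toward the loss so every scored token sees near-maximal left context. A sketch of the bookkeeping, assuming this standard scheme; the scorer here is a stand-in and the tiny window/stride are for illustration only:

```python
def sliding_window_nll(nll_fn, tokens, window=2048, stride=64):
    """Total NLL over tokens[1:], scoring at most `stride` new tokens per
    window so each prediction sees up to window-1 tokens of left context."""
    total = 0.0
    pos = 1                              # token 0 is pure context
    while pos < len(tokens):
        new = min(stride, len(tokens) - pos)
        ctx_start = max(0, pos + new - window)
        # nll_fn(seq) returns per-token NLLs for seq[1:] given the prefix
        nlls = nll_fn(tokens[ctx_start:pos + new])
        total += sum(nlls[-new:])        # count only the newly scored tokens
        pos += new
    return total

const_nll = lambda seq: [1.0] * (len(seq) - 1)   # stand-in scorer: 1 nat each
total = sliding_window_nll(const_nll, list(range(10)), window=6, stride=2)
```

A small stride like 64 trades more forward passes for better context per scored token, which lowers the measured bits per byte.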
Test-Time Training
- Online logit bias (learning_rate: 0.1, enabled: false)
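The online logit bias (reported here with enabled: false) maintains a per-token bias vector added to the model's logits and, after each observed token, takes a step along the exact cross-entropy gradient with respect to that bias. A minimal sketch of one step, with frozen logits standing in for the model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def olb_step(bias, logits, target, lr=0.1):
    """Observe one token, then take an exact cross-entropy gradient step
    on the bias vector: d(CE)/d(bias) = softmax(logits + bias) - one_hot."""
    p = softmax(logits + bias)
    nll = -np.log(p[target])             # loss actually paid at this position
    grad = p.copy()
    grad[target] -= 1.0
    return bias - lr * grad, nll

# Frozen logits, repeated target: the bias adapts and the loss falls.
bias, logits = np.zeros(5), np.zeros(5)
losses = []
for _ in range(50):
    bias, nll = olb_step(bias, logits, target=2)
    losses.append(nll)
```

Because the gradient with respect to an additive logit bias is exactly softmax minus one-hot, no backpropagation through the model is needed at test time.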
Sequence Length
- train_length: 2048
Compression
- zstd
Novel Contributions
- Online logit bias (OLB) evaluation technique that updates a per-token bias vector during sliding-window evaluation using the exact cross-entropy gradient
- Int6 per-row quantized model with zstd compression
- Sliding-window evaluation with stride 64
- Custom 11-layer architecture with SmearGate, BigramHash, XSA, partial RoPE, and tied embeddings