val_bpb: 1.1768
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,943,260 bytes
Training Techniques
Quantization: int8
- bits: 8
- scope: all weights except tok_emb.weight; selective coarsening on blocks.5.
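The entry does not spell out the quantization scheme, so below is a minimal sketch assuming symmetric per-tensor int8 quantization (w ≈ scale · q), with the tok_emb.weight exclusion applied by parameter name; all shapes and names besides tok_emb.weight are illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8: w ~= scale * q, with q in [-127, 127]."""
    m = float(np.abs(w).max())
    scale = m / 127.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Apply to every tensor except the token embedding, per the scope above.
weights = {
    "tok_emb.weight": np.ones((4, 2), dtype=np.float32),
    "blocks.0.attn.w": np.linspace(-1, 1, 8, dtype=np.float32).reshape(4, 2),
}
packed = {name: (w if name == "tok_emb.weight" else quantize_int8(w))
          for name, w in weights.items()}
```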
Architecture: tied embeddings
- Input and output embeddings are tied.
- parameters: {"tie_embeddings":1}
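Tying means a single matrix serves as both the input embedding lookup and, transposed, the output projection, halving the embedding parameter count. A minimal NumPy sketch of the idea (vocab and model sizes are illustrative):

```python
import numpy as np

vocab, d_model = 10, 4
emb = np.random.default_rng(0).normal(size=(vocab, d_model)).astype(np.float32)

def embed(token_ids):
    # Input side: look up rows of the shared table.
    return emb[token_ids]

def output_logits(hidden):
    # Output side: reuse the same table, transposed, as the unembedding.
    return hidden @ emb.T

h = embed(np.array([3]))        # (1, d_model)
scores = output_logits(h)       # (1, vocab)
```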
Architecture: KV head count
- Uses grouped-query attention (GQA) with fewer KV heads than attention heads.
- parameters: {"num_heads":8,"num_kv_heads":4}
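With num_heads=8 and num_kv_heads=4, each K/V head is shared by num_heads / num_kv_heads = 2 query heads, halving the KV cache. A NumPy sketch of the head expansion (the model's actual attention code is not part of this entry; head_dim and sequence length are illustrative):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, seq = 8, 4, 16, 5
group = num_heads // num_kv_heads            # query heads per KV head -> 2
rng = np.random.default_rng(0)

q = rng.normal(size=(num_heads, seq, head_dim))
k = rng.normal(size=(num_kv_heads, seq, head_dim))
v = rng.normal(size=(num_kv_heads, seq, head_dim))

# Repeat K/V so each group of query heads attends to the same KV head.
k_exp = np.repeat(k, group, axis=0)          # (num_heads, seq, head_dim)
v_exp = np.repeat(v, group, axis=0)

scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)     # softmax over key positions
out = attn @ v_exp                           # (num_heads, seq, head_dim)
```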
Optimizer: Muon
- weight_decay: null
- momentum: 0.99
- other_params: {"warmup_start":0.92,"warmup_steps":1500}
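Muon is usually described as SGD-momentum whose 2D weight update is orthogonalized with a Newton-Schulz iteration before being applied. The sketch below follows that public description; the iteration coefficients, step count, and learning rate are assumptions, not values recorded in this entry:

```python
import numpy as np

def newton_schulz(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315       # assumed quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)      # Frobenius-normalize first
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.99):
    """One Muon update: accumulate momentum, then orthogonalize the update."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz(buf), buf
```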
LR Schedule: warmdown
- parameters: {"warmdown_iters":3000}
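A warmdown schedule holds the learning rate flat and then decays it linearly over the final warmdown_iters steps. A sketch assuming a linear ramp to zero (the zero endpoint is an assumption):

```python
def lr_scale(step: int, total_steps: int, warmdown_iters: int = 3000) -> float:
    """Multiplier on the base LR: 1.0 until the warmdown starts,
    then a linear ramp down to 0.0 at the final step."""
    warmdown_start = total_steps - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_iters)
```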
Evaluation: sliding window eval
- parameters: {"stride":64,"window_length":4096,"batch_size":32}
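With window_length=4096 and stride=64, each forward pass slides the context by 64 tokens and scores only the tokens not already covered by the previous pass, so every token is evaluated exactly once with near-full context. A sketch of the span planning (the helper name is hypothetical):

```python
def sliding_window_spans(n_tokens: int, window: int = 4096, stride: int = 64):
    """Return (ctx_start, ctx_end, score_start, score_end) per forward pass:
    the model sees tokens [ctx_start, ctx_end) but only [score_start,
    score_end) contributes to the loss, so no token is scored twice."""
    spans, scored_to = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, scored_to, end))
        scored_to = end
        if end == n_tokens:
            break
    return spans
```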
Sequence Length
- train_length: 4096
- eval_length: 4096
Compression: zlib
- level: null
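level: null presumably falls back to zlib's default compression level (-1, which maps to level 6). With Python's stdlib that is just the plain call, paired with an exact roundtrip check:

```python
import zlib

payload = bytes(range(256)) * 64      # stand-in for the packed weight bytes
blob = zlib.compress(payload)         # no level argument -> default (-1)
restored = zlib.decompress(blob)
assert restored == payload            # exact roundtrip
```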
Novel Contributions
- Sets a new 10-minute 8xH100 sliding-window record.
- Uses stride-64 sliding-window evaluation after standard exact roundtrip checking.
- Keeps tok_emb.weight in fp16 while coarsening only blocks.5. to fit the artifact budget.
- Trains at sequence length 4096 with a tuned Muon schedule.