PR #1748

open

basic submission improving baseline

by elad-simbalista
val_bpb: 1.2098
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,872,012 bytes

Training Techniques

Sequence Length (sequence_length)
  train_length: 2048
  eval_length: null
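
A minimal sketch of how a flat token stream might be packed into 2048-token training sequences; the loader and names below are illustrative assumptions, not the submission's actual pipeline.

    import numpy as np

    train_length = 2048  # from the submission config; eval_length is unspecified (null)

    def make_batches(tokens: np.ndarray, seq_len: int = train_length, batch_size: int = 8):
        """Yield (inputs, targets) pairs of shape (batch_size, seq_len) for next-token prediction."""
        tokens_per_batch = batch_size * seq_len
        n_batches = (len(tokens) - 1) // tokens_per_batch
        for i in range(n_batches):
            chunk = tokens[i * tokens_per_batch : (i + 1) * tokens_per_batch + 1]
            yield chunk[:-1].reshape(batch_size, seq_len), chunk[1:].reshape(batch_size, seq_len)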

Optimizer: Muon
  weight_decay: null
  momentum: 0.985
  other_params: {"warmup_from":0.9,"warmup_steps":500}
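
A minimal sketch of the Muon momentum warmup, assuming a linear ramp from 0.9 to the final 0.985 over the first 500 steps and an optimizer whose param groups expose a "momentum" entry; the exact ramp shape is not stated in the card.

    def momentum_at(step: int, warmup_from: float = 0.9, target: float = 0.985,
                    warmup_steps: int = 500) -> float:
        """Linearly ramp momentum from warmup_from to target over warmup_steps."""
        if step >= warmup_steps:
            return target
        return warmup_from + (step / warmup_steps) * (target - warmup_from)

    # Inside the training loop (assumed optimizer interface):
    # for group in muon_optimizer.param_groups:
    #     group["momentum"] = momentum_at(step)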

LR Schedule: warmdown
  parameters: {"warmdown_steps":3000}
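
A minimal sketch of the warmdown schedule, assuming the learning rate is held constant and then decayed linearly to zero over the final 3000 steps; the precise decay shape is an assumption.

    def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3000) -> float:
        """Multiplier applied to the base learning rate at a given step."""
        warmdown_start = total_steps - warmdown_steps
        if step < warmdown_start:
            return 1.0
        return max(0.0, (total_steps - step) / warmdown_steps)

    # usage: lr = base_lr * lr_scale(step, total_steps)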

Weight Averaging: EMA
  parameters: {"decay":0.997}
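
A minimal sketch of EMA weight averaging with decay 0.997, assuming a plain per-parameter exponential moving average whose weights are used for evaluation and for the exported artifact.

    import copy
    import torch

    @torch.no_grad()
    def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.997):
        """ema <- decay * ema + (1 - decay) * current weights."""
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

    # setup: ema_model = copy.deepcopy(model)
    # call update_ema(ema_model, model) after every optimizer step; evaluate with ema_model.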

Quantization: GPTQ-lite
  bits: 8
  scope: per-row
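
A minimal sketch of per-row int8 weight quantization, assuming symmetric absmax scaling per output row; whatever error compensation "GPTQ-lite" adds on top is not spelled out in the card.

    import torch

    def quantize_per_row_int8(w: torch.Tensor):
        """w: (out_features, in_features) float weights -> int8 weights plus one scale per row."""
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
        return q, scale.squeeze(1)

    def dequantize_per_row(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        """Recover an approximate float weight matrix for inference."""
        return q.float() * scale.unsqueeze(1)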

Architecture: weight tying
  Tied input and output embeddings.
  parameters: null
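
A minimal sketch of tied input and output embeddings in PyTorch; the module names are illustrative. Sharing one matrix removes the separate output projection from the artifact.

    import torch.nn as nn

    class TinyLM(nn.Module):
        def __init__(self, vocab_size: int, d_model: int):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
            self.lm_head.weight = self.embed.weight  # weight tying: one shared (vocab, d_model) matrix

        def forward(self, idx):
            h = self.embed(idx)   # ...transformer blocks would go here...
            return self.lm_head(h)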

Architecture: KV head count
  Used grouped-query-style attention with fewer KV heads than query heads.
  parameters: {"num_heads":8,"num_kv_heads":4}
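
A minimal sketch of grouped-query attention with 8 query heads and 4 KV heads, assuming each KV head is shared by two query heads and expanded with repeat_interleave before standard scaled dot-product attention.

    import torch.nn.functional as F

    def gqa_attention(q, k, v, num_heads: int = 8, num_kv_heads: int = 4):
        """q: (B, num_heads, T, d); k, v: (B, num_kv_heads, T, d)."""
        groups = num_heads // num_kv_heads          # 2 query heads per KV head
        k = k.repeat_interleave(groups, dim=1)      # -> (B, num_heads, T, d)
        v = v.repeat_interleave(groups, dim=1)
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)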

Novel Contributions

  • Longer training context length
  • Muon momentum warmup
  • Extended warmdown schedule
  • EMA weight averaging
  • Per-row GPTQ-lite int8 quantization
  • Wallclock-aware training schedule
  • Tied embeddings
  • Reduced KV head count