PR #1983
Add submission: Int5/Int6 + BigramHash + SmearGate + SWA + LLMAdvisor…
by harborglowvintage-oss

val_bpb: 1.1586
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.72 MB

Training Techniques

Quantization: mixed int5/int6
- bits: null
- scope: MLP weights and attention weights
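
The PR leaves the bits field unfilled, so the exact int5/int6 assignment is not documented. Below is a minimal sketch of symmetric round-to-nearest quantization at a given bit width, with a hypothetical policy of int6 for attention weights and int5 for MLP weights; the per-tensor scale granularity is also an assumption.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric round-to-nearest quantization of a weight tensor."""
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6
    scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale (assumption)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                # int5/int6 codes fit in int8 storage

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

# Hypothetical mixed policy: int6 for attention weights, int5 for MLP weights.
q_attn, s_attn = quantize_symmetric(torch.randn(512, 512), bits=6)
q_mlp, s_mlp = quantize_symmetric(torch.randn(2048, 512), bits=5)
```
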
Architecture: BigramHash
- Bigram-hash embeddings used in place of standard embeddings.
- parameters: null
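
The PR does not include the embedding code. One common reading of "bigram-hash embeddings" is a lookup keyed by a hash of each (previous token, current token) pair into a fixed-size table; the hash function, table size, and handling of the first position below are all assumptions.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Embed each position via a hash of its (prev, cur) token bigram (sketch)."""

    def __init__(self, table_size: int, dim: int):
        super().__init__()
        self.table_size = table_size
        self.emb = nn.Embedding(table_size, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq). Pair each token with its predecessor.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position (assumption)
        # Cheap multiplicative hash of the bigram into the table.
        h = (prev * 1000003 + tokens) % self.table_size
        return self.emb(h)

emb = BigramHashEmbedding(table_size=1 << 18, dim=256)
x = emb(torch.randint(0, 50257, (2, 128)))  # -> (2, 128, 256)
```
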
Architecture: SmearGate
- Gate mechanism added to the model.
- parameters: null
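
"Gate mechanism added to the model" is all the PR says. One plausible reading of the name is a learned gate that smears a fraction of the previous position's representation into the current one; the per-channel sigmoid form below is purely an assumption.

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """y_t = x_t + sigmoid(g) * x_{t-1}: a learned per-channel gate that
    'smears' the previous position into the current one (speculative reading)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))  # sigmoid(0) = 0.5 at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = 0.0  # nothing to smear into the first position
        return x + torch.sigmoid(self.gate) * prev
```
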
Architecture: weight tying
- Input and output embeddings are tied.
- parameters: null
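
Weight tying is the standard GPT-2-style sharing of the input embedding and output projection matrices. A minimal sketch follows; the module names are hypothetical, and how tying interacts with the BigramHash table is not specified in the PR.

```python
import torch.nn as nn

vocab_size, dim = 50257, 768           # placeholder dimensions
wte = nn.Embedding(vocab_size, dim)    # input embedding
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = wte.weight            # tie: both layers share one (vocab, dim) matrix
```
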
Weight Averaging: SWA
- parameters: null
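
SWA presumably refers to stochastic weight averaging, for which PyTorch ships torch.optim.swa_utils. When averaging starts and how often parameters are folded in are not given in the PR; the schedule below is an assumption.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

model = nn.Linear(10, 1)                     # stand-in for the transformer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)             # keeps a running average of weights

for step in range(100):
    loss = model(torch.randn(8, 10)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step >= 50:                           # start averaging partway through (assumption)
        swa_model.update_parameters(model)

# Evaluate / export the averaged weights rather than the last iterate.
```
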
Optimizer: Muon
- weight_decay: null
- momentum: null
- other_params: {"adamw_used_for": "scalars/embeddings"}
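
Muon orthogonalizes momentum updates for 2-D weight matrices and is conventionally paired with AdamW for parameters it does not handle well, which matches other_params here. Muon is not part of torch; the import below stands in for an external implementation (e.g. the reference KellerJordan/Muon repo), its constructor signature may differ, and all hyperparameters are placeholders since the PR lists weight_decay and momentum as null.

```python
import torch
import torch.nn as nn
from muon import Muon  # assumed external implementation; signature may differ

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(50257, 768)
        self.proj = nn.Linear(768, 768)

model = TinyModel()

# Route 2-D weight matrices to Muon; send embeddings, biases, and other
# scalars/vectors to AdamW (per other_params in this submission).
muon_params, adamw_params = [], []
for name, p in model.named_parameters():
    if p.ndim == 2 and "embed" not in name:
        muon_params.append(p)
    else:
        adamw_params.append(p)

optimizers = [
    Muon(muon_params, lr=0.02, momentum=0.95),                   # placeholder values
    torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.0),  # placeholder values
]
# In the training loop, step both: for opt in optimizers: opt.step()
```
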
Compression: zstd
- level: 22
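
Level 22 is zstd's maximum compression level (the CLI equivalent needs the --ultra flag for levels above 19). A sketch using the python-zstandard bindings; the file names are hypothetical.

```python
import zstandard as zstd  # pip install zstandard

def compress_artifact(src: str, dst: str) -> None:
    """Compress a checkpoint file at zstd's maximum level (22)."""
    cctx = zstd.ZstdCompressor(level=22)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        cctx.copy_stream(fin, fout)

compress_artifact("model.bin", "model.bin.zst")  # hypothetical file names
```
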
LR Schedule: warmdown
- parameters: {"warmdown_iters": 3000}
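
"Warmdown" is read here as holding the learning rate constant and then decaying it linearly to zero over the final warmdown_iters steps, a common speedrun-style schedule. The total iteration count and the linear shape are assumptions; the PR only gives warmdown_iters: 3000.

```python
def warmdown_lr(step: int, total_iters: int, warmdown_iters: int = 3000,
                base_lr: float = 1.0) -> float:
    """Constant LR, then linear decay to 0 over the last `warmdown_iters` steps."""
    start = total_iters - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters

# e.g. with torch: LambdaLR(optimizer, lambda s: warmdown_lr(s, total_iters=10000))
```
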
Sequence Length
- train_length: 2048
- eval_length: null

Novel Contributions
- Mixed Int5/Int6 quantization
- BigramHash embeddings
- SmearGate
- SWA
- Muon optimizer with AdamW for scalars/embeddings
- zstd-22 artifact compression
- 3-seed ensemble-style reporting
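
The last bullet suggests the headline number is reported across three training seeds. A sketch of that reporting style follows; the per-seed values are placeholders, not numbers from the PR, which only reports the single figure 1.1586.

```python
import statistics

# Placeholder per-seed validation bits-per-byte (not from the PR).
val_bpb_by_seed = [1.1590, 1.1583, 1.1585]

mean = statistics.mean(val_bpb_by_seed)
std = statistics.stdev(val_bpb_by_seed)
print(f"val_bpb over {len(val_bpb_by_seed)} seeds: {mean:.4f} +/- {std:.4f}")
```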