PR #2042 (open)
Non-record 10min/16MB: BitNet-1.58 Ternary HydraMLP (1xH100, val_bpb 1.36407)
by FF-GardenFn
val_bpb
1.3641
Architecture
Hybrid
Optimizer
Muon
Artifact Size
12,886,324 bytes
Training Techniques
Quantization
STE QAT
bits: 1
scope: HydraMLP gate_up and down weights
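A minimal sketch of BitNet-1.58-style ternary fake quantization with a straight-through estimator (STE), assuming per-tensor absmean scaling applied to linear weights; the function and class names are illustrative, not the PR's code.

```python
# Illustrative ternary QAT with STE: forward uses the quantized weight, backward
# passes gradients through unchanged. Per-tensor absmean scaling is an assumption;
# the PR persists per-group scales (see the packing sketch at the end of this page).
import torch
import torch.nn.functional as F

def ternary_ste(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Fake-quantize w to {-1, 0, +1} times an absmean scale; gradients pass through."""
    scale = w.abs().mean().clamp(min=eps)            # absmean scale
    w_q = (w / scale).round().clamp(-1, 1) * scale   # ternary values back in weight range
    return w + (w_q - w).detach()                    # STE: forward = w_q, backward = identity

class TernaryLinear(torch.nn.Linear):
    """Linear layer whose weight is ternarized on the fly during training (QAT)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, ternary_ste(self.weight), self.bias)
```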
Architecture
weight tying
Tied embeddings are used in the bifurcated local-global recurrent LM.
parameters: null
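A one-line sketch of the weight tying listed above, assuming a standard embedding/LM-head pair; module names are illustrative.

```python
# Tied embeddings: the output head reuses the token embedding matrix as its weight.
import torch.nn as nn

class TiedHead(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # one shared parameter tensor
```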
GQA
Global branch uses grouped query attention.
parameters: {"global_heads":8,"global_kv_heads":4}
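A hedged sketch of grouped query attention for the global branch with the reported global_heads=8 and global_kv_heads=4; the model width, projection layout, and causal masking are assumptions, not the PR's exact module.

```python
# Grouped query attention: 8 query heads share 4 KV heads (each KV head serves 2 query heads).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalGQA(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_kv_heads: int = 4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.kv_proj = nn.Linear(d_model, 2 * n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).view(B, T, 2, self.n_kv_heads, self.head_dim).unbind(dim=2)
        k, v = k.transpose(1, 2), v.transpose(1, 2)
        # Expand the 4 KV heads to match the 8 query heads (the "grouping" in GQA).
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(B, T, -1))
```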
BigramHash
Hashed bigram priors are loaded and added to the model's output logits as additive biases.
parameters: null
TrigramHash
Trigram CP priors are loaded as additive logit biases.
parameters: null
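A hedged sketch of how hashed bigram/trigram additive logit priors can work: the previous one or two tokens are hashed together with each candidate next token into a bucketed bias table that is added to the logits. Bucket count, hash constants, and names are assumptions; the CP factorization of the trigram priors is not reproduced here.

```python
# Hashed n-gram additive logit prior: a scalar bias per (hashed context, next token) bucket.
import torch
import torch.nn as nn

class HashedNgramPrior(nn.Module):
    def __init__(self, vocab_size: int, n_buckets: int = 1 << 20):
        super().__init__()
        self.vocab_size, self.n_buckets = vocab_size, n_buckets
        self.table = nn.Parameter(torch.zeros(n_buckets))  # one scalar bias per bucket

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (B, T, n) ids of the n previous tokens per position (n=1 bigram, n=2 trigram)
        h = torch.zeros(ctx.shape[:-1], dtype=torch.long, device=ctx.device)
        for i in range(ctx.shape[-1]):
            h = (h * 1000003 + ctx[..., i]) % self.n_buckets  # rolling polynomial hash
        nxt = torch.arange(self.vocab_size, device=ctx.device)
        idx = (h.unsqueeze(-1) * 31 + nxt) % self.n_buckets
        return self.table[idx]  # (B, T, vocab_size) additive logit bias

# Usage sketch: logits = lm_logits + bigram_prior(prev1) + trigram_prior(prev2)
```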
depth recurrence
The bifurcated local-global LM applies its blocks recurrently over depth.
parameters: null
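A hedged sketch of depth recurrence over a bifurcated local-global block: one weight-shared block, containing a local branch (e.g. the HydraMLP) and a global branch (e.g. the GQA attention), is applied several times. The number of steps and the residual combination rule are assumptions.

```python
import torch.nn as nn

class BifurcatedBlock(nn.Module):
    """Residual block with a local (token-wise) branch and a global (attention) branch."""
    def __init__(self, d_model: int, local_branch: nn.Module, global_branch: nn.Module):
        super().__init__()
        self.norm_l, self.norm_g = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.local_branch, self.global_branch = local_branch, global_branch

    def forward(self, x):
        x = x + self.local_branch(self.norm_l(x))   # local / short-range path
        x = x + self.global_branch(self.norm_g(x))  # global / long-range path
        return x

class RecurrentDepth(nn.Module):
    """Apply the same block n_steps times: parameters are shared across depth."""
    def __init__(self, block: nn.Module, n_steps: int = 4):
        super().__init__()
        self.block, self.n_steps = block, n_steps

    def forward(self, x):
        for _ in range(self.n_steps):
            x = self.block(x)
        return x
```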
MLP3x
HydraMLP uses an expanded MLP multiplier in the local branch.
parameters: {"mlp_mult":3.25}
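A hedged sketch of a gated MLP with the reported mlp_mult=3.25; the fused gate_up projection and SiLU gating are assumptions inferred from the gate_up / down weight names in the quantization scope above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HydraMLPSketch(nn.Module):
    def __init__(self, d_model: int, mlp_mult: float = 3.25):
        super().__init__()
        d_hidden = int(d_model * mlp_mult)
        self.gate_up = nn.Linear(d_model, 2 * d_hidden, bias=False)  # fused gate + up projection
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up(x).chunk(2, dim=-1)
        return self.down(F.silu(gate) * up)  # gated expansion, then projection back to d_model
```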
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5,"safety_factor":1.05,"min_params":32768,"same_shape_batch":true}
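A hedged sketch of the Newton-Schulz orthogonalization at the core of Muon with the listed newton_schulz_steps=5, following the quintic coefficients of the public Muon implementation; the safety_factor, min_params, and same_shape_batch handling recorded above is not reproduced.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map a 2D gradient matrix onto the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic iteration coefficients
    x = g.bfloat16()
    x = x / (x.norm() + eps)               # scale so the spectral norm is at most ~1
    transposed = g.size(0) > g.size(1)
    if transposed:
        x = x.mT
    for _ in range(steps):
        A = x @ x.mT
        x = a * x + (b * A + c * A @ A) @ x
    if transposed:
        x = x.mT
    return x.to(g.dtype)
```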
Compression
zlib
level: null
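A minimal sketch, assuming the exported artifact is a serialized state dict that is zlib-compressed before the 16 MB size check; the compression level is not recorded above (null), so the maximum level here is an assumption.

```python
import io
import zlib
import torch

def export_artifact(state_dict: dict, path: str, level: int = 9) -> int:
    buf = io.BytesIO()
    torch.save(state_dict, buf)                      # serialize the (packed) weights
    payload = zlib.compress(buf.getvalue(), level)   # zlib-compress the serialized bytes
    with open(path, "wb") as f:
        f.write(payload)
    return len(payload)                              # on-disk size, must stay under 16 MB
```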
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Novel Contributions
- BitNet-1.58 ternary fake-quantization applied to HydraMLP gate_up and down projections
- Buffered absmean ternary scaling persisted through export and restore for exact roundtrip behavior
- Artifact size reduction from an int4 baseline to fit under the 16 MB cap
- Ternary export format packing each group as int8 signs plus an fp16 per-group scale (see the packing sketch after this list)
- Bifurcated local-global recurrent LM with additive bigram and trigram priors
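A hedged sketch of the ternary export format described above: per group, the {-1, 0, +1} signs are stored as int8 and the group's absmean scale as fp16, so restore reconstructs exactly sign times the stored scale. Group size, tensor layout, and function names are assumptions.

```python
import numpy as np
import torch

def pack_ternary(w: torch.Tensor, group_size: int = 256):
    """Pack a weight tensor as int8 ternary signs plus one fp16 absmean scale per group."""
    flat = w.detach().reshape(-1, group_size)                 # assumes numel % group_size == 0
    scale = flat.abs().mean(dim=1, keepdim=True).clamp(min=1e-5)
    signs = (flat / scale).round().clamp(-1, 1).to(torch.int8)
    return signs.numpy(), scale.squeeze(1).to(torch.float16).numpy()

def unpack_ternary(signs: np.ndarray, scale: np.ndarray, shape) -> torch.Tensor:
    """Reconstruct sign * stored fp16 per-group scale, an exact roundtrip of the packed values."""
    flat = torch.from_numpy(signs).float() * torch.from_numpy(scale).float().unsqueeze(1)
    return flat.reshape(shape)
```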