PR #2042 (open)
Non-record 10min/16MB: BitNet-1.58 Ternary HydraMLP (1xH100, val_bpb 1.36407)
by FF-GardenFn
val_bpb
1.3641
Architecture
Hybrid
Optimizer
Muon
Artifact Size
12,886,324 bytes
Training Techniques
Quantization
STE QAT
bits: 1
scope: HydraMLP gate_up and down weights
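A minimal sketch of BitNet-1.58-style ternary fake quantization with a straight-through estimator (STE), assuming per-tensor absmean scaling applied to linear weights; the function and class names are illustrative, not the PR's code.

```python
# Illustrative ternary QAT with STE: forward uses the quantized weight, backward
# passes gradients through unchanged. Per-tensor absmean scaling is an assumption;
# the PR persists per-group scales (see the packing sketch at the end of this page).
import torch
import torch.nn.functional as F

def ternary_ste(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Fake-quantize w to {-1, 0, +1} times an absmean scale; gradients pass through."""
    scale = w.abs().mean().clamp(min=eps)            # absmean scale
    w_q = (w / scale).round().clamp(-1, 1) * scale   # ternary values back in weight range
    return w + (w_q - w).detach()                    # STE: forward = w_q, backward = identity

class TernaryLinear(torch.nn.Linear):
    """Linear layer whose weight is ternarized on the fly during training (QAT)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, ternary_ste(self.weight), self.bias)
```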
Architecture
weight tying
Tied embeddings are used in the bifurcated local-global recurrent LM.
parameters: null
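A one-line sketch of the weight tying listed above, assuming a standard embedding/LM-head pair; module names are illustrative.

```python
# Tied embeddings: the output head reuses the token embedding matrix as its weight.
import torch.nn as nn

class TiedHead(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # one shared parameter tensor
```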
GQA
Global branch uses grouped query attention.
parameters: {"global_heads":8,"global_kv_heads":4}
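A hedged sketch of grouped query attention for the global branch with the reported global_heads=8 and global_kv_heads=4; the model width, projection layout, and causal masking are assumptions, not the PR's exact module.

```python
# Grouped query attention: 8 query heads share 4 KV heads (each KV head serves 2 query heads).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalGQA(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_kv_heads: int = 4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.kv_proj = nn.Linear(d_model, 2 * n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).view(B, T, 2, self.n_kv_heads, self.head_dim).unbind(dim=2)
        k, v = k.transpose(1, 2), v.transpose(1, 2)
        # Expand the 4 KV heads to match the 8 query heads (the "grouping" in GQA).
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(B, T, -1))
```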
BigramHash
Hashed bigram priors are loaded and added to the model's output logits as additive biases.
parameters: null
TrigramHash
Trigram CP priors are loaded as additive logit biases.
parameters: null
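A hedged sketch of how hashed bigram/trigram additive logit priors can work: the previous one or two tokens are hashed together with each candidate next token into a bucketed bias table that is added to the logits. Bucket count, hash constants, and names are assumptions; the CP factorization of the trigram priors is not reproduced here.

```python
# Hashed n-gram additive logit prior: a scalar bias per (hashed context, next token) bucket.
import torch
import torch.nn as nn

class HashedNgramPrior(nn.Module):
    def __init__(self, vocab_size: int, n_buckets: int = 1 << 20):
        super().__init__()
        self.vocab_size, self.n_buckets = vocab_size, n_buckets
        self.table = nn.Parameter(torch.zeros(n_buckets))  # one scalar bias per bucket

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        # ctx: (B, T, n) ids of the n previous tokens per position (n=1 bigram, n=2 trigram)
        h = torch.zeros(ctx.shape[:-1], dtype=torch.long, device=ctx.device)
        for i in range(ctx.shape[-1]):
            h = (h * 1000003 + ctx[..., i]) % self.n_buckets  # rolling polynomial hash
        nxt = torch.arange(self.vocab_size, device=ctx.device)
        idx = (h.unsqueeze(-1) * 31 + nxt) % self.n_buckets
        return self.table[idx]  # (B, T, vocab_size) additive logit bias

# Usage sketch: logits = lm_logits + bigram_prior(prev1) + trigram_prior(prev2)
```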
depth recurrence
The bifurcated local-global LM applies its blocks recurrently over depth.
parameters: null
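A hedged sketch of depth recurrence over a bifurcated local-global block: one weight-shared block, containing a local branch (e.g. the HydraMLP) and a global branch (e.g. the GQA attention), is applied several times. The number of steps and the residual combination rule are assumptions.

```python
import torch.nn as nn

class BifurcatedBlock(nn.Module):
    """Residual block with a local (token-wise) branch and a global (attention) branch."""
    def __init__(self, d_model: int, local_branch: nn.Module, global_branch: nn.Module):
        super().__init__()
        self.norm_l, self.norm_g = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.local_branch, self.global_branch = local_branch, global_branch

    def forward(self, x):
        x = x + self.local_branch(self.norm_l(x))   # local / short-range path
        x = x + self.global_branch(self.norm_g(x))  # global / long-range path
        return x

class RecurrentDepth(nn.Module):
    """Apply the same block n_steps times: parameters are shared across depth."""
    def __init__(self, block: nn.Module, n_steps: int = 4):
        super().__init__()
        self.block, self.n_steps = block, n_steps

    def forward(self, x):
        for _ in range(self.n_steps):
            x = self.block(x)
        return x
```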
MLP3x
HydraMLP uses an expanded MLP multiplier in the local branch.
parameters: {"mlp_mult":3.25}
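A hedged sketch of a gated MLP with the reported mlp_mult=3.25; the fused gate_up projection and SiLU gating are assumptions inferred from the gate_up / down weight names in the quantization scope above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HydraMLPSketch(nn.Module):
    def __init__(self, d_model: int, mlp_mult: float = 3.25):
        super().__init__()
        d_hidden = int(d_model * mlp_mult)
        self.gate_up = nn.Linear(d_model, 2 * d_hidden, bias=False)  # fused gate + up projection
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up(x).chunk(2, dim=-1)
        return self.down(F.silu(gate) * up)  # gated expansion, then projection back to d_model
```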
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_steps":5,"safety_factor":1.05,"min_params":32768,"same_shape_batch":true}
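A hedged sketch of the Newton-Schulz orthogonalization at the core of Muon with the listed newton_schulz_steps=5, following the quintic coefficients of the public Muon implementation; the safety_factor, min_params, and same_shape_batch handling recorded above is not reproduced.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map a 2D gradient matrix onto the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic iteration coefficients
    x = g.bfloat16()
    x = x / (x.norm() + eps)               # scale so the spectral norm is at most ~1
    transposed = g.size(0) > g.size(1)
    if transposed:
        x = x.mT
    for _ in range(steps):
        A = x @ x.mT
        x = a * x + (b * A + c * A @ A) @ x
    if transposed:
        x = x.mT
    return x.to(g.dtype)
```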
Compression
zlib
level: null
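A minimal sketch, assuming the exported artifact is a serialized state dict that is zlib-compressed before the 16 MB size check; the compression level is not recorded above (null), so the maximum level here is an assumption.

```python
import io
import zlib
import torch

def export_artifact(state_dict: dict, path: str, level: int = 9) -> int:
    buf = io.BytesIO()
    torch.save(state_dict, buf)                      # serialize the (packed) weights
    payload = zlib.compress(buf.getvalue(), level)   # zlib-compress the serialized bytes
    with open(path, "wb") as f:
        f.write(payload)
    return len(payload)                              # on-disk size, must stay under 16 MB
```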
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Novel Contributions
- BitNet-1.58 ternary fake-quantization applied to HydraMLP gate_up and down projections
- Buffered absmean ternary scaling persisted through export and restore for exact roundtrip behavior
- Artifact size reduction from an int4 baseline to fit under the 16 MB cap
- Ternary export format packing each group as int8 signs plus an fp16 per-group scale (see the packing sketch after this list)
- Bifurcated local-global recurrent LM with additive bigram and trigram priors
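A hedged sketch of the ternary export format described above: per group, the {-1, 0, +1} signs are stored as int8 and the group's absmean scale as fp16, so restore reconstructs exactly sign times the stored scale. Group size, tensor layout, and function names are assumptions.

```python
import numpy as np
import torch

def pack_ternary(w: torch.Tensor, group_size: int = 256):
    """Pack a weight tensor as int8 ternary signs plus one fp16 absmean scale per group."""
    flat = w.detach().reshape(-1, group_size)                 # assumes numel % group_size == 0
    scale = flat.abs().mean(dim=1, keepdim=True).clamp(min=1e-5)
    signs = (flat / scale).round().clamp(-1, 1).to(torch.int8)
    return signs.numpy(), scale.squeeze(1).to(torch.float16).numpy()

def unpack_ternary(signs: np.ndarray, scale: np.ndarray, shape) -> torch.Tensor:
    """Reconstruct sign * stored fp16 per-group scale, an exact roundtrip of the packed values."""
    flat = torch.from_numpy(signs).float() * torch.from_numpy(scale).float().unsqueeze(1)
    return flat.reshape(shape)
```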