val_bpb: 1.1448
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.6 MB
Training Techniques
Architecture
TrigramHash
Hash-based trigram embedding that XOR-hashes 3 consecutive token IDs into 2048 buckets and projects the bucket embedding to the model dimension.
parameters: {"vocab_size":2048,"trigram_dim":48,"project_dim":512}
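The bucketing step can be sketched as follows. Only the XOR combination and the 2048-bucket target come from the card; the specific mixing primes and hash layout are illustrative assumptions:

```python
def trigram_bucket(t0: int, t1: int, t2: int, num_buckets: int = 2048) -> int:
    """Map a trigram of token IDs to one of `num_buckets` embedding slots.

    Each token ID is mixed with a distinct prime multiplier, the three
    results are XOR-combined, and the hash is reduced modulo the bucket
    count. The primes here are placeholders, not the trained model's.
    """
    h = (t0 * 0x9E3779B1) ^ (t1 * 0x85EBCA77) ^ (t2 * 0xC2B2AE3D)
    return h % num_buckets
```

Each bucket would then own a learned 48-dim vector (trigram_dim) that a linear layer projects to the 512-dim model width (project_dim).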
LeakyReLU
Uses a squared LeakyReLU, LeakyReLU(x; 0.5)^2, as the MLP activation; the 0.5 negative slope preserves gradient flow for negative inputs.
parameters: {"negative_slope":0.5}
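A minimal scalar sketch of this activation (the card specifies only the negative slope of 0.5 and the squaring):

```python
def sq_leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    """LeakyReLU with slope 0.5 on the negative side, then squared.

    Squaring folds negative pre-activations back to positive values,
    but unlike plain ReLU^2 the nonzero slope keeps the gradient
    alive for x < 0.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```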
Quantization
GPTQ-lite
bits: 6
scope: model weights
Compression
lzma
level: null
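The compression stage is likely just Python's stdlib `lzma` over the packed weight bytes; `level: null` presumably means the library's default preset, which is what this sketch uses:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the packed quantized-weight stream with LZMA at the
    default preset (an assumption, since the card lists level: null)."""
    return lzma.compress(raw)
```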
Evaluation
sliding window eval
parameters: {"stride":64}
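With stride 64 and a 2048-token window, evaluation windows advance 64 tokens at a time, so each scored position sees close to the full context. A sketch of the window enumeration (the exact scoring bookkeeping is an assumption):

```python
def sliding_window_spans(n_tokens, window=2048, stride=64):
    """Enumerate (start, end) evaluation windows over a token stream.

    Each window is `window` tokens long and advances by `stride`;
    typically only the final `stride` tokens of each window are scored,
    so every token is evaluated with near-full left context.
    """
    return [(start, start + window)
            for start in range(0, n_tokens - window + 1, stride)]
```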
Weight Averaging
EMA + Tight SWA
parameters: {"decay":0.997,"swa_interval_steps":50}
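A scalar-parameter sketch of the combined scheme. The decay of 0.997 and the 50-step SWA interval come from the card; averaging EMA snapshots (rather than raw weights) and the class shape are assumptions:

```python
class EmaSwa:
    """Maintain an EMA of the parameters and, every `swa_interval`
    steps, fold the current EMA into a running SWA average."""

    def __init__(self, params, decay=0.997, swa_interval=50):
        self.decay = decay
        self.swa_interval = swa_interval
        self.ema = list(params)
        self.swa_sum = [0.0] * len(params)
        self.swa_count = 0
        self.step = 0

    def update(self, params):
        self.step += 1
        d = self.decay
        self.ema = [d * e + (1 - d) * p for e, p in zip(self.ema, params)]
        if self.step % self.swa_interval == 0:
            self.swa_sum = [s + e for s, e in zip(self.swa_sum, self.ema)]
            self.swa_count += 1

    def averaged(self):
        """SWA average of EMA snapshots; falls back to the EMA if no
        snapshot has been taken yet."""
        if self.swa_count == 0:
            return list(self.ema)
        return [s / self.swa_count for s in self.swa_sum]
```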
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500}
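The momentum warmup can be sketched as below. The card gives only the endpoints (0.92 to 0.99) and the 1500-step length; linear interpolation between them is an assumption:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Ramp Muon's momentum from `start` to `final` over the first
    `warmup_steps` optimizer steps, then hold it constant."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```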
Regularization
LN scale
parameters: {"schedule":"1/sqrt(layer+1)"}
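The per-layer scale schedule from the card, as a one-liner; whether the scale is a fixed multiplier or an initialization for the LayerNorm gain is not specified, so this only shows the formula:

```python
import math

def ln_scale(layer_index: int) -> float:
    """LayerNorm scale 1/sqrt(layer+1): layer 0 gets 1.0 and deeper
    layers get progressively smaller scales, damping their residual
    contributions."""
    return 1.0 / math.sqrt(layer_index + 1)
```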
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
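A warmdown schedule holds the LR constant and decays it only at the end of training. The 3500-step warmdown length is from the card; the linear shape of the decay is an assumption:

```python
def lr_multiplier(step, total_steps, warmdown_steps=3500):
    """Return the LR multiplier: 1.0 until the final `warmdown_steps`,
    then a linear ramp down to 0.0 at `total_steps`."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return 1.0
    return steps_left / warmdown_steps
```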
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Other
other
Uses gradient accumulation scaled by world size to keep effective batch size constant across 1-GPU and 8-GPU runs.
parameters: {"grad_accum_formula":"8 // world_size"}
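The formula from the card, spelled out: with a fixed per-GPU micro-batch, accumulating `8 // world_size` micro-batches per step keeps the effective batch size (per-GPU batch × world_size × accumulation steps) identical on 1 and 8 GPUs:

```python
def grad_accum_steps(world_size: int, base: int = 8) -> int:
    """Gradient accumulation steps per optimizer step, scaled so that
    world_size * accum_steps is constant (= base) across GPU counts."""
    return base // world_size
```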
Novel Contributions
- TrigramHashEmbedding extending BigramHash to 3-token context
- XOR prime hashing of trigrams into 2048 buckets
- LeakyReLU(0.5)^2 MLP activation
- Proportional wallclock validation on 1×H100 to match 8×H100 training trajectory
- EMA + Tight SWA with GPTQ-lite int6 and LZMA compression