PR #767
Status: open
Non-record 1xH100 backoff7gram zlib-fallback sign-of-life (val_bpb 0.9209)
by RichiiiTV
val_bpb
0.9209
Architecture
Transformer
Optimizer
Muon
Artifact Size
7,772,644 bytes
Training Techniques
Quantization
int6
bits: 6
scope: all
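As a rough illustration of what 6-bit quantization applied to all weights can look like, the sketch below uses symmetric per-row scaling into the signed range [-31, 31]. The per-row granularity, the symmetric range, and the function names are assumptions for illustration; the PR summary only states `bits: 6` and `scope: all`.

```python
def quantize_int6(row):
    """Symmetric per-row quantization to signed 6-bit values (assumed scheme)."""
    max_abs = max((abs(x) for x in row), default=0.0)
    # One float scale per row; fall back to 1.0 for an all-zero row.
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the assumed range [-31, 31].
    q = [max(-31, min(31, round(x / scale))) for x in row]
    return q, scale


def dequantize_int6(q, scale):
    """Map 6-bit integer levels back to approximate float weights."""
    return [v * scale for v in q]
```

Under this scheme the reconstruction error per weight is bounded by roughly half a quantization step, `scale / 2`.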
Architecture
weight tying
Uses tied embeddings / tied weights in the compact #753-style root lane.
parameters: null
RoPE
Sets the RoPE dimensions parameter to 24 in the model configuration.
parameters: {"dimensions":24}
XSA
Sets the XSA last_n parameter to 4 in the model configuration.
parameters: {"last_n":4}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
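The momentum warmup parameters above suggest a ramp from 0.92 to the final momentum of 0.99 over the first 1500 steps. A minimal sketch, assuming the ramp is linear (the summary does not state the interpolation shape) and using a hypothetical helper name:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given step, assuming linear warmup from start to end."""
    frac = min(step / warmup_steps, 1.0)
    return (1 - frac) * start + frac * end
```

For example, `muon_momentum(750)` lands halfway between the two endpoints, and any step at or beyond 1500 returns the final value.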
Compression
zlib
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
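One common way to run sliding-window evaluation with these settings is to advance a 2048-token window by 64 tokens at a time and score only the tokens that no earlier window has scored, so every token after the first window is predicted with close to full context. The sketch below computes the window boundaries only; the function name and the choice to score the whole first window are assumptions, not details taken from the PR.

```python
def sliding_windows(n_tokens, context_length=2048, stride=64):
    """Return (start, end, score_from) triples covering each token exactly once.

    The model would see tokens[start:end]; only positions score_from..end-1
    contribute to the loss for that window.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + context_length, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows
```

The small stride trades more forward passes for longer effective context per scored token.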
adaptive n-gram backoff eval
parameters: {"min_order":2,"max_order":7,"adaptive":1,"alpha":0.3,"alpha_min":0.05,"alpha_max":0.6,"buckets":4194304,"entropy_center":4,"entropy_scale":2,"min_count":2}
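The parameters above can be read as a hashed n-gram cache over orders 2 to 7 whose prediction is mixed into the model's probability with an entropy-dependent weight. The sketch below is one plausible reading, not the PR's implementation: the sigmoid form of the weight, the hashing scheme, and all names are assumptions, and the fixed `alpha` of 0.3 (presumably the non-adaptive weight) is not used. "Score-first" is taken to mean that each token is scored before it is added to the counts.

```python
import math
from collections import defaultdict


def adaptive_alpha(entropy, alpha_min=0.05, alpha_max=0.6, center=4.0, scale=2.0):
    """Assumed form: higher model entropy gives the n-gram cache more weight."""
    gate = 1.0 / (1.0 + math.exp(-scale * (entropy - center)))
    return alpha_min + (alpha_max - alpha_min) * gate


class BackoffNgram:
    def __init__(self, min_order=2, max_order=7, min_count=2, buckets=4194304):
        self.min_order, self.max_order = min_order, max_order
        self.min_count, self.buckets = min_count, buckets
        self.ctx = defaultdict(int)   # hashed context counts
        self.full = defaultdict(int)  # hashed (context + next token) counts

    def _key(self, order, items):
        return (order, hash(tuple(items)) % self.buckets)

    def prob(self, history, token):
        """Back off from the longest order to the first context seen often enough."""
        for order in range(self.max_order, self.min_order - 1, -1):
            if len(history) < order - 1:
                continue
            ctx = list(history[-(order - 1):])
            n = self.ctx.get(self._key(order, ctx), 0)
            if n >= self.min_count:
                hit = self.full.get(self._key(order, ctx + [token]), 0)
                return min(1.0, hit / n)  # clamp in case of hash collisions
        return None  # no usable context: caller keeps the model probability

    def update(self, history, token):
        """Add the token to the counts; call only after it has been scored."""
        for order in range(self.min_order, self.max_order + 1):
            if len(history) >= order - 1:
                ctx = list(history[-(order - 1):])
                self.ctx[self._key(order, ctx)] += 1
                self.full[self._key(order, ctx + [token])] += 1


def mixed_prob(p_model, p_ngram, entropy):
    """Linear interpolation of model and cache probabilities."""
    if p_ngram is None:
        return p_model
    a = adaptive_alpha(entropy)
    return (1 - a) * p_model + a * p_ngram
```

In an evaluation loop, each position would call `prob` and `mixed_prob` first and `update` afterwards, so the cache never sees a token before it is scored.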
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"warmup_steps":20}
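A warmup-then-warmdown schedule with these parameters can be sketched as a multiplier on the base learning rate: a short ramp over the first 20 steps, a flat phase, and a decay over the final 3500 iterations. The linear shape of both ramps and the `total_iters` argument are assumptions; the summary does not give the total step count.

```python
def lr_scale(step, total_iters, warmup_steps=20, warmdown_iters=3500):
    """Learning-rate multiplier in [0, 1], assuming linear warmup and warmdown."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    remaining = total_iters - step
    if remaining < warmdown_iters:
        return max(remaining, 0) / warmdown_iters
    return 1.0
```

With a hypothetical `total_iters` of 10000, the multiplier stays at 1.0 until step 6500 and then falls linearly to 0.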
Regularization
weight decay
parameters: {"adam_wd":0.04,"muon_wd":0.04}
Other
other
Used flash-attn when available, but fell back to zlib export because zstandard was missing on the pod.
parameters: {"flash_attn":true,"zstandard_missing":true}
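The fallback described here amounts to an optional import: use `zstandard` when the package is installed, otherwise compress with the standard-library `zlib`. A minimal sketch with hypothetical function names and default compression levels (the summary lists the level as null):

```python
import zlib

try:
    import zstandard  # optional third-party dependency
except ImportError:
    zstandard = None


def compress_artifact(payload: bytes):
    """Return (codec_name, compressed_bytes), preferring zstd when available."""
    if zstandard is not None:
        return "zstd", zstandard.ZstdCompressor().compress(payload)
    return "zlib", zlib.compress(payload)


def decompress_artifact(codec: str, blob: bytes) -> bytes:
    """Invert compress_artifact using the recorded codec name."""
    if codec == "zstd":
        return zstandard.ZstdDecompressor().decompress(blob)
    return zlib.decompress(blob)
```

Recording the codec name next to the blob lets a loader handle artifacts produced on machines with or without `zstandard`.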
Novel Contributions
- Non-record 1xH100 sign-of-life run of the compact #753-style root lane
- Demonstrates that legal, score-first adaptive 2..7-gram backoff evaluation reaches a val_bpb of 0.9209 even with an undertrained dense base
- Exports with int6 quantization, falling back to zlib compression when zstandard is unavailable
- Shows a compact submission artifact size of 7,772,644 bytes