val_bpb: 1.2286
Architecture: Transformer
Optimizer: —
Artifact Size: 15,850,915 bytes
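
For reference, val_bpb is a bits-per-byte measure. The sketch below shows the conventional computation, assuming the metric is summed validation cross-entropy in nats converted to bits and normalized by the UTF-8 byte count of the evaluation text; the function name is hypothetical and the run's actual evaluation code is not reproduced here.

```python
# Hedged sketch: bits-per-byte from summed cross-entropy (nats) and byte count.
# Assumes the conventional definition; the run's evaluator may differ.
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    # Convert nats -> bits (divide by ln 2), then normalize per byte of text.
    return total_nll_nats / (math.log(2) * total_bytes)
```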

Training Techniques

Architecture: GQA
Grouped query attention used in the 9x512 baseline.
parameters: {"layers": 9, "width": 512, "heads": 8, "kv_heads": 4}

Sequence Length: sequence_length
train_length: 1024, eval_length: 1024

Other
Saliency-guided training recipe with token prior, dynamic correction, phrase term, and attention bias enabled; bigram saliency disabled.
parameters: {"saliency_token_prior": true, "saliency_dynamic_correction": true, "saliency_phrase_term": true, "saliency_attention_bias": true, "saliency_bigram": false}

LR Schedule: warmdown
parameters: {"legacy_flat_plus_tail": true}

LR Schedule: cosine decay
parameters: {"warmup_steps": 64, "decay_start_frac": 0.65, "min_scale": 0.15}
Novel Contributions
- Saliency-guided sweep around a 9x512 GQA baseline
- Comparison of a fixed 5B-token run, a 24-hour continuation, and a 1-hour cosine-schedule proxy run
- Use of saliency token prior, dynamic correction, phrase term, and attention bias in the training recipe
- Local RTX 5090 continuation achieving the best full-eval post-quant val_bpb of 1.22864374