val_bpb: 1.3288
Architecture: Transformer
Optimizer: —
Artifact Size: 11,673,884 bytes
Training Techniques
Architecture: GQA
Uses grouped-query attention with 8 attention heads and 4 KV heads (2 query heads per KV head).
parameters: {"heads": 8, "kv_heads": 4}
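The card's actual attention code is not shown; the following is a minimal NumPy sketch of grouped-query attention with the stated 8 query heads and 4 KV heads, where adjacent query heads share a KV head. Function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (B, T, n_heads * hd); k, v: (B, T, n_kv_heads * hd)
    B, T, C = q.shape
    hd = C // n_heads  # per-head dimension
    q = q.reshape(B, T, n_heads, hd).transpose(0, 2, 1, 3)     # (B, 8, T, hd)
    k = k.reshape(B, T, n_kv_heads, hd).transpose(0, 2, 1, 3)  # (B, 4, T, hd)
    v = v.reshape(B, T, n_kv_heads, hd).transpose(0, 2, 1, 3)
    # Each KV head serves n_heads // n_kv_heads consecutive query heads.
    rep = n_heads // n_kv_heads
    k = np.repeat(k, rep, axis=1)                              # (B, 8, T, hd)
    v = np.repeat(v, rep, axis=1)
    att = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd))   # (B, 8, T, T)
    out = att @ v                                              # (B, 8, T, hd)
    return out.transpose(0, 2, 1, 3).reshape(B, T, C)
```

The KV projections are half the width of the query projection here (4 heads instead of 8), which is where GQA saves parameters and KV-cache memory.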
Weight tying
The input embedding and the output projection share a single weight matrix.
parameters: null
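A small sketch of what weight tying amounts to, assuming a standard setup (not taken from the package's code): the output logits are computed against the same matrix used for the input embedding lookup, so the vocab-by-dim projection is not stored twice.

```python
import numpy as np

# Illustrative sizes only.
vocab, d = 100, 16
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab, d))  # the single shared embedding matrix

tokens = np.array([3, 7, 42])
h = E[tokens]                    # input side: embedding lookup, (3, d)
logits = h @ E.T                 # output side: same weights, (3, vocab)
```

For a ~11.7 MB artifact, removing a separate output head is a meaningful fraction of the parameter budget.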
Attention modifications
Applies a Lorentz-style hyperbolic transform to only the attention q and k projections, controlled by the trainable parameters hyperbolic_qk_mix and hyperbolic_radius.
parameters: {"hyperbolic_qk_mix": 0.02, "hyperbolic_radius_init": 0.1}
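The card does not give the exact transform, so this is a hypothetical sketch of one plausible reading: lift q/k toward a hyperboloid-normalized direction using the radius, and blend that into the original Euclidean vectors with the small mix coefficient (init 0.02). The formula is an assumption; only the parameter names and inits come from the card.

```python
import numpy as np

def hyperbolic_qk(x, mix=0.02, radius=0.1):
    # Lorentz-style "time" component of a hyperboloid lift (assumed form):
    # t = sqrt(radius^2 + ||x||^2), computed per vector.
    t = np.sqrt(radius**2 + (x**2).sum(axis=-1, keepdims=True))
    lifted = x / t  # spatial part rescaled by the lift (assumption)
    # Small learnable blend keeps the transform a gentle perturbation at init.
    return (1.0 - mix) * x + mix * lifted
```

With mix near zero the transform is close to the identity, which matches applying it only to q and k as a lightweight modification rather than a full hyperbolic attention layer.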
Sequence Length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
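The card gives only warmdown_steps, so here is a sketch of the common speedrun-style warmdown schedule: hold the base LR, then decay linearly to zero over the final 3000 steps. The constant-then-linear shape and the base LR are assumptions, not values from the card.

```python
def lr_at(step, total_steps, base_lr=1e-3, warmdown_steps=3000):
    # Constant phase: everything before the final warmdown window.
    if step < total_steps - warmdown_steps:
        return base_lr
    # Warmdown phase: frac goes 1 -> 0 over the last warmdown_steps.
    frac = (total_steps - step) / warmdown_steps
    return base_lr * frac
```

For example, with a 10,000-step run the LR stays at base_lr through step 7,000 and reaches zero at step 10,000.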
Compression
zlib
level: null
Quantization
int8
bits: 8
scope: model weights
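Putting the two artifact steps together, a plausible sketch of the pipeline the card describes: symmetric per-tensor int8 quantization of the weights, then zlib compression of the raw int8 bytes. The scale handling, layout, and use of zlib's default level are assumptions (the card lists level as null); a real artifact would also store the scale and tensor shapes.

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map max |w| to 127 (assumed scheme).
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack_artifact(w):
    # Compress the raw int8 bytes; the scale would be stored alongside.
    q, scale = quantize_int8(w)
    return zlib.compress(q.tobytes()), scale
```

Decoding reverses the steps: zlib-decompress, reinterpret as int8, multiply by the stored scale.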
Novel Contributions
- Lightweight hyperbolic attention modification applied only to q/k projections
- Trainable hyperbolic_qk_mix and hyperbolic_radius parameters
- End-to-end runnable research package (not a record attempt) with smoke-test and ablation logs
- Demonstrates improved validation bpb over early smoke runs on a compact 1xH100 setup