PR #1074 (open)

Non-record: Hyperbolic Q/K Lite 1xH100 exploration package

val_bpb: 1.3288
Architecture: Transformer
Optimizer:
Artifact Size: 11,673,884 bytes

Training Techniques

Architecture
GQA
Uses grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
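
The GQA configuration above (8 query heads sharing 4 KV heads, so each KV head serves 2 query heads) can be sketched in NumPy; this is an illustrative sketch, not the PR's code, and all names are hypothetical.

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped query attention: each KV head is shared by
    n_heads // n_kv_heads query heads.
    q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_heads // n_kv_heads
    # Repeat each KV head so it lines up with its group of query heads.
    k = np.repeat(k, group, axis=0)                  # -> (n_heads, T, d)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (n_heads, T, T)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                     # (n_heads, T, d)
```

With 4 KV heads instead of 8, the KV cache and the k/v projection weights are halved while the query resolution is kept.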
weight tying
Input and output embeddings share a single weight matrix.
parameters: null
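
Weight tying means the output projection reuses the input embedding matrix; a minimal sketch (illustrative names, not the PR's code):

```python
import numpy as np

vocab, d_model = 1000, 64
rng = np.random.default_rng(0)
W_emb = rng.normal(scale=0.02, size=(vocab, d_model))  # one shared matrix

def embed(token_ids):
    # Input embedding: look up rows of the shared matrix.
    return W_emb[token_ids]

def unembed(hidden):
    # Output logits reuse the transpose of the same matrix (weight tying),
    # saving a separate vocab x d_model unembedding matrix.
    return hidden @ W_emb.T
```

For this vocabulary this removes 1000 x 64 parameters from the artifact, which matters for a size-constrained package like this one.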
attention modifications
Applies a Lorentz-style hyperbolic transform only to the attention q and k projections, controlled by trainable hyperbolic_qk_mix and hyperbolic_radius parameters.
parameters: {"hyperbolic_qk_mix":0.02,"hyperbolic_radius_init":0.1}
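
The PR's exact transform is not reproduced here; below is a plausible sketch of the idea, assuming a residual blend between the raw q/k projection and a version rescaled toward a hyperboloid whose curvature is set by the radius. The function name and the specific rescaling are assumptions.

```python
import numpy as np

def hyperbolic_qk(x, mix=0.02, radius=0.1):
    """Hypothetical Lorentz-style transform for q/k vectors.
    Blends x with a copy shrunk toward a hyperboloid cap; `mix` and
    `radius` correspond to the trainable hyperbolic_qk_mix and
    hyperbolic_radius parameters (initialized at 0.02 and 0.1)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    # Time-like Lorentz coordinate of each vector lifted onto the hyperboloid.
    t = np.sqrt(radius ** 2 + norm ** 2)
    lifted = x * (radius / t)
    return (1.0 - mix) * x + mix * lifted
```

With mix initialized at 0.02 the transform starts as a small perturbation of standard attention, so training is stable even if the hyperbolic term turns out to be unhelpful.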
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
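
A warmdown schedule holds the base learning rate and then decays it over the final warmdown_steps; a minimal sketch, assuming linear decay to zero (the decay shape is an assumption, the 3000-step window is from the parameters above):

```python
def lr_at(step, total_steps, base_lr=1.0, warmdown_steps=3000):
    """Warmdown schedule: constant base_lr, then a linear ramp to 0
    over the last `warmdown_steps` steps (linear decay assumed)."""
    if step < total_steps - warmdown_steps:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps
```

For a 10,000-step run this gives full LR through step 7,000 and half LR at step 8,500.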
Compression
zlib
level: null
Quantization
int8
bits: 8
scope: model weights
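
Packaging per the two entries above (symmetric per-tensor int8 quantization of the model weights, then zlib over the raw bytes) can be sketched as follows; this is a generic illustration of the technique, not the PR's actual packaging code, and all function names are hypothetical.

```python
import zlib
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization; scale from max |w|."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack_artifact(w):
    """Quantize to int8, then zlib-compress the raw bytes, roughly how
    a compact weight artifact like the ~11.7 MB one above is produced."""
    q, scale = quantize_int8(w)
    return zlib.compress(q.tobytes()), scale

def unpack_artifact(blob, scale, shape):
    """Inverse: decompress, reinterpret as int8, dequantize."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

The round-trip error is bounded by half the quantization step (scale / 2), which is usually tolerable for inference-time weights.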

Novel Contributions

  • Lightweight hyperbolic attention modification applied only to q/k projections
  • Trainable hyperbolic_qk_mix and hyperbolic_radius parameters
  • End-to-end runnable non-record research package with smoke and ablation logs
  • Demonstrates improved validation bpb over early smoke runs with a compact 1xH100 setup