val_bpb: 1.3288
Architecture: Transformer
Optimizer: —
Artifact Size: 11,673,884 bytes
Training Techniques
Architecture: GQA
Uses grouped-query attention with 8 attention heads and 4 KV heads (2 query heads per KV head).
parameters: {"heads": 8, "kv_heads": 4}
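The card's actual attention code is not shown; the following is a minimal NumPy sketch of grouped-query attention with the stated 8 query heads and 4 KV heads, where adjacent query heads share a KV head. Function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (B, T, n_heads * hd); k, v: (B, T, n_kv_heads * hd)
    B, T, C = q.shape
    hd = C // n_heads  # per-head dimension
    q = q.reshape(B, T, n_heads, hd).transpose(0, 2, 1, 3)     # (B, 8, T, hd)
    k = k.reshape(B, T, n_kv_heads, hd).transpose(0, 2, 1, 3)  # (B, 4, T, hd)
    v = v.reshape(B, T, n_kv_heads, hd).transpose(0, 2, 1, 3)
    # Each KV head serves n_heads // n_kv_heads consecutive query heads.
    rep = n_heads // n_kv_heads
    k = np.repeat(k, rep, axis=1)                              # (B, 8, T, hd)
    v = np.repeat(v, rep, axis=1)
    att = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd))   # (B, 8, T, T)
    out = att @ v                                              # (B, 8, T, hd)
    return out.transpose(0, 2, 1, 3).reshape(B, T, C)
```

The KV projections are half the width of the query projection here (4 heads instead of 8), which is where GQA saves parameters and KV-cache memory.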
Weight tying
The input embedding and the output projection share a single weight matrix.
parameters: null
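A small sketch of what weight tying amounts to, assuming a standard setup (not taken from the package's code): the output logits are computed against the same matrix used for the input embedding lookup, so the vocab-by-dim projection is not stored twice.

```python
import numpy as np

# Illustrative sizes only.
vocab, d = 100, 16
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab, d))  # the single shared embedding matrix

tokens = np.array([3, 7, 42])
h = E[tokens]                    # input side: embedding lookup, (3, d)
logits = h @ E.T                 # output side: same weights, (3, vocab)
```

For a ~11.7 MB artifact, removing a separate output head is a meaningful fraction of the parameter budget.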
Attention modifications
Applies a Lorentz-style hyperbolic transform to only the attention q and k projections, controlled by the trainable parameters hyperbolic_qk_mix and hyperbolic_radius.
parameters: {"hyperbolic_qk_mix": 0.02, "hyperbolic_radius_init": 0.1}
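The card does not give the exact transform, so this is a hypothetical sketch of one plausible reading: lift q/k toward a hyperboloid-normalized direction using the radius, and blend that into the original Euclidean vectors with the small mix coefficient (init 0.02). The formula is an assumption; only the parameter names and inits come from the card.

```python
import numpy as np

def hyperbolic_qk(x, mix=0.02, radius=0.1):
    # Lorentz-style "time" component of a hyperboloid lift (assumed form):
    # t = sqrt(radius^2 + ||x||^2), computed per vector.
    t = np.sqrt(radius**2 + (x**2).sum(axis=-1, keepdims=True))
    lifted = x / t  # spatial part rescaled by the lift (assumption)
    # Small learnable blend keeps the transform a gentle perturbation at init.
    return (1.0 - mix) * x + mix * lifted
```

With mix near zero the transform is close to the identity, which matches applying it only to q and k as a lightweight modification rather than a full hyperbolic attention layer.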
Sequence Length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
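The card gives only warmdown_steps, so here is a sketch of the common speedrun-style warmdown schedule: hold the base LR, then decay linearly to zero over the final 3000 steps. The constant-then-linear shape and the base LR are assumptions, not values from the card.

```python
def lr_at(step, total_steps, base_lr=1e-3, warmdown_steps=3000):
    # Constant phase: everything before the final warmdown window.
    if step < total_steps - warmdown_steps:
        return base_lr
    # Warmdown phase: frac goes 1 -> 0 over the last warmdown_steps.
    frac = (total_steps - step) / warmdown_steps
    return base_lr * frac
```

For example, with a 10,000-step run the LR stays at base_lr through step 7,000 and reaches zero at step 10,000.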
Compression
zlib
level: null
Quantization
int8
bits: 8
scope: model weights
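Putting the two artifact steps together, a plausible sketch of the pipeline the card describes: symmetric per-tensor int8 quantization of the weights, then zlib compression of the raw int8 bytes. The scale handling, layout, and use of zlib's default level are assumptions (the card lists level as null); a real artifact would also store the scale and tensor shapes.

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map max |w| to 127 (assumed scheme).
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack_artifact(w):
    # Compress the raw int8 bytes; the scale would be stored alongside.
    q, scale = quantize_int8(w)
    return zlib.compress(q.tobytes()), scale
```

Decoding reverses the steps: zlib-decompress, reinterpret as int8, multiply by the stored scale.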
Novel Contributions
- Lightweight hyperbolic attention modification applied only to q/k projections
- Trainable hyperbolic_qk_mix and hyperbolic_radius parameters
- End-to-end runnable research package (not a record attempt) with smoke-test and ablation logs
- Demonstrates improved validation bpb over early smoke runs on a compact 1xH100 setup