PR #435

open

Radial BitNet submission

val_bpb
1.6130
Architecture
Compressed dual-branch Transformer
Optimizer
FROStable + AdamW
Artifact Size
15,943,179 bytes

Training Techniques

Architecture
BigramHash
Adds a 1024-bucket hashed bigram embedding branch for short-horizon lexical context.
parameters: {"buckets":1024}
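A minimal sketch of what a 1024-bucket hashed bigram branch can look like. The hash constant, padding choice, and function names are assumptions for illustration, not the submission's code:

```python
import numpy as np

def bigram_hash_features(ids, table, buckets=1024):
    """Hashed bigram embedding lookup (illustrative).

    ids:   (seq,) integer token ids
    table: (buckets, dim) learned embedding rows
    Each (prev, cur) token pair is hashed into one of `buckets`
    rows, giving cheap short-horizon lexical context.
    """
    prev = np.roll(ids, 1)
    prev[0] = 0                          # no left context at position 0 (assumed padding)
    h = (prev * 1000003 + ids) % buckets # simple multiplicative hash (assumed)
    return table[h]                      # (seq, dim)
```

The resulting features would be fused with the main transformer stream; the fusion mechanism is not specified in the PR.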
Radial Token Branch
Adds a token-ID-derived radial geometric feature branch projected into the fusion space.
parameters: null
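The PR does not define the radial feature itself; the following is one purely illustrative reading (place each token id on a spiral and emit sin/cos features), and the actual branch may differ entirely:

```python
import numpy as np

def radial_token_features(ids, vocab_size=50257, dim=8):
    """Hypothetical token-ID-derived radial features.

    Radius grows with the id, the angle wraps on an id residue,
    and a few sin/cos frequencies are emitted. Every constant here
    is an assumption; the PR gives no parameters for this branch.
    """
    r = ids / vocab_size                       # radius in [0, 1)
    theta = 2 * np.pi * (ids % 257) / 257      # angle from id residue
    freqs = np.arange(1, dim // 2 + 1)
    ang = np.outer(theta, freqs)               # (seq, dim/2)
    return np.concatenate([r[:, None] * np.cos(ang),
                           r[:, None] * np.sin(ang)], axis=1)  # (seq, dim)
```

These features would then be linearly projected into the fusion space, per the description above.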
BitNet-style ternary projections
Uses ternary-weight forward behavior in major internal projections to reduce storage pressure.
parameters: null
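A sketch of ternary weight quantization in the BitNet b1.58 style (absmean scaling is an assumption; the PR does not state which rule the submission uses):

```python
import numpy as np

def ternarize(w):
    """Quantize a weight tensor to {-1, 0, +1} * alpha.

    alpha is the per-tensor absolute mean (assumed absmean rule).
    The forward pass then uses q * alpha in place of w, so only the
    ternary pattern plus one scale per tensor needs to be stored.
    """
    alpha = np.abs(w).mean()
    q = np.clip(np.round(w / (alpha + 1e-8)), -1, 1)
    return q, alpha
```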
Optimizer
AdamW
weight_decay: null
momentum: null
other_params: null
Weight Averaging
EMA
parameters: null
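A minimal EMA weight-averaging sketch; the decay value is an assumption, since the PR records no parameters:

```python
import numpy as np

class EMA:
    """Exponential moving average of model parameters (illustrative).

    After training, the shadow weights replace the raw weights for
    evaluation/export.
    """
    def __init__(self, params, decay=0.999):  # decay is assumed
        self.decay = decay
        self.shadow = [p.copy() for p in params]

    def update(self, params):
        for s, p in zip(self.shadow, params):
            s *= self.decay
            s += (1 - self.decay) * p
```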
Quantization
mixed int8/int6
bits: null
scope: selected weights
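The int8/int6 split and grouping are not specified; as a sketch, the int8 half could be simple symmetric per-tensor quantization like this (the int6 path would be analogous with a range of +/-31):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization for export (assumed scheme)."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```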
Compression
zlib
level: null
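A sketch of the export path implied here: serialize the weight arrays, then zlib-compress the blob. Level 9 is an assumption, since the PR leaves the level unspecified:

```python
import io
import zlib
import numpy as np

def compress_weights(arrays, level=9):  # level assumed, not stated in PR
    """Serialize arrays to an .npz blob and zlib-compress it."""
    buf = io.BytesIO()
    np.savez(buf, *arrays)
    return zlib.compress(buf.getvalue(), level)

def decompress_weights(blob):
    """Inverse of compress_weights."""
    buf = io.BytesIO(zlib.decompress(blob))
    data = np.load(buf)
    return [data[k] for k in data.files]
```

Quantizing and pruning before this step matter because runs of zeros and low-entropy int8/int6 values compress far better than raw float32.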
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
cosine decay
parameters: {"warmup":true}
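The PR records only warmup=true; a standard cosine-decay-with-linear-warmup schedule consistent with that looks like the following (warmup length and a floor of zero are assumptions):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine decay to 0 (assumed floor)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * t))
```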
Regularization
weight decay
parameters: null
Other
other
FRO (Fractal Resonant Optimization) used as the main optimizer on the compressed transformer core.
parameters: null
other
Light export-time pruning of values below 0.0025 before final artifact serialization.
parameters: {"threshold":0.0025}
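The export-time pruning step described above can be sketched in a few lines; zeroing by absolute value is an assumed reading of "values below 0.0025":

```python
import numpy as np

def prune_small(w, threshold=0.0025):
    """Zero out weights with |w| < threshold before serialization.

    The zeroed entries cost nothing in accuracy-critical layers if
    they were already negligible, and they compress much better.
    """
    out = w.copy()
    out[np.abs(out) < threshold] = 0.0
    return out
```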

Novel Contributions

  • FRO (Fractal Resonant Optimization) as the main optimizer
  • Radial Token Branch for token-level geometric features
  • 1024-bucket bigram hash branch for short-horizon lexical context
  • BitNet-style ternary-weight behavior in major internal projections
  • Mixed post-training export with int8/int6 serialization
  • Light export-time pruning