PR #276

open

Non-record: local RTX 4070 shared-depth RMS interface v0

by riatzukiza
val_bpb: 1.6577
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 5912023 bytes (~5.9 MB)

Training Techniques

Architecture
depth sharing / shared-depth
Uses 4 physical blocks to implement 8 logical layers, reducing parameter count while preserving multiple logical passes.
parameters: {"layers":8,"physical_layers":4}
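A minimal sketch of how 8 logical layers could be routed through 4 physical blocks. The cyclic `i % 4` reuse schedule and the toy blocks here are assumptions for illustration; the PR does not state which schedule it uses.

```python
# Hypothetical routing: logical layer i reuses physical block i % 4.
# Cyclic reuse is one common shared-depth choice; the PR's actual
# schedule is not specified.
PHYSICAL_LAYERS = 4
LOGICAL_LAYERS = 8

def forward(x, blocks):
    # blocks: list of PHYSICAL_LAYERS callables, each a transformer block
    for logical in range(LOGICAL_LAYERS):
        x = blocks[logical % PHYSICAL_LAYERS](x)
    return x

# Toy usage: "+1" stand-in blocks just to show the reuse pattern.
blocks = [lambda v: v + 1 for _ in range(PHYSICAL_LAYERS)]
forward(0, blocks)  # 8 logical passes over 4 physical blocks -> 8
```

Only 4 blocks' worth of parameters are stored, while the input still makes 8 passes.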
weight tying
Ties input and output embeddings.
parameters: null
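Weight tying in sketch form: the output projection reuses the input embedding table transposed, so no separate lm_head matrix is stored. The sizes are illustrative, not the PR's actual dimensions.

```python
import numpy as np

# Tied embeddings sketch: logits come from the same table E used to
# embed tokens, applied transposed. Dimensions are illustrative only.
vocab_size, d_model = 100, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model))  # input embedding table

def logits_from_hidden(h):
    # Output projection is E transposed (weight tying): one score per
    # vocab entry, no separate output matrix.
    return h @ E.T

h = rng.standard_normal((3, d_model))  # final hidden states for 3 tokens
logits = logits_from_hidden(h)         # shape (3, vocab_size)
```

Under the 16MB artifact cap, dropping the separate output matrix is a direct parameter saving of `vocab_size * d_model`.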
KV head count
Uses fewer key/value heads than query heads (grouped-query attention), shrinking the KV projections and cache.
parameters: {"num_heads":8,"num_kv_heads":4}
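A grouped-query attention sketch matching the listed head counts: 8 query heads share 4 K/V heads, each K/V head serving a group of 2 query heads. Head and sequence dimensions are illustrative, and the causal mask is omitted for brevity.

```python
import numpy as np

# GQA sketch: num_heads=8 query heads, num_kv_heads=4 K/V heads.
num_heads, num_kv_heads, head_dim, seq = 8, 4, 8, 5
rng = np.random.default_rng(0)
q = rng.standard_normal((num_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Expand each K/V head to cover its group of query heads.
group = num_heads // num_kv_heads            # 2 query heads per K/V head
k_rep = np.repeat(k, group, axis=0)          # (8, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)

# Scaled dot-product attention per head (causal mask omitted for brevity).
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep                        # (8, seq, head_dim)
```

Only the K/V projections shrink; query capacity stays at 8 heads.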
RMSNorm interface
Adds extra pre-projection RMSNorm in the shared-depth interface.
parameters: {"extra_proj_rmsnorm":1}
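RMSNorm itself is standard; a sketch is below. Which projection in the shared-depth interface the extra norm sits in front of is not stated in the PR, so the code only shows the normalization, not its placement.

```python
import numpy as np

def rms_norm(x, gain=1.0, eps=1e-6):
    # RMSNorm: rescale the last axis to unit root-mean-square, then
    # apply a learned gain (here a scalar for simplicity).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

# The PR inserts an extra RMSNorm before a projection in the shared-depth
# interface; the exact placement is not specified.
x = np.array([[3.0, 4.0]])
y = rms_norm(x)  # output has root-mean-square ~1 before the projection
```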
phase-conditioned scales
Adds tiny phase-conditioned scaling parameters to stabilize the shared-depth model.
parameters: {"phase_conditioned_scales":1,"phase_buckets":4}
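One plausible reading of phase-conditioned scales, sketched below: each of the 8 logical passes maps to one of 4 phase buckets, and the shared block's output is multiplied by a tiny learned per-bucket scale. The consecutive-pair bucket mapping and identity initialization are assumptions, not stated in the PR.

```python
# Hypothetical phase-conditioned scaling for the shared-depth model.
# Bucket mapping (consecutive pairs of logical layers) and the identity
# init are assumptions; the PR only gives phase_buckets=4.
LOGICAL_LAYERS, PHASE_BUCKETS = 8, 4
phase_scales = [1.0] * PHASE_BUCKETS   # learned, initialized at identity

def bucket(logical_layer):
    return logical_layer * PHASE_BUCKETS // LOGICAL_LAYERS

def scale_for(logical_layer):
    return phase_scales[bucket(logical_layer)]

[bucket(i) for i in range(LOGICAL_LAYERS)]  # [0, 0, 1, 1, 2, 2, 3, 3]
```

At 4 scalars (or 4 small vectors), the added parameter cost is negligible next to the blocks themselves.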
Quantization
int8
bits: 8
scope: final serialized model
Compression
zlib
level: null
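The int8 + zlib roundtrip can be sketched as below. Symmetric per-tensor absmax scaling is an assumption; the PR states only int8 quantization of the final serialized model followed by zlib.

```python
import numpy as np
import zlib

# Sketch of int8 + zlib artifact packaging. The symmetric per-tensor
# absmax scale is an assumed quantization scheme, not taken from the PR.
rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)  # one weight tensor

scale = float(np.abs(w).max()) / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
blob = zlib.compress(q.tobytes())        # bytes that go in the artifact

# Roundtrip on load: decompress, then dequantize with the stored scale.
q_back = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
w_back = q_back.astype(np.float32) * scale
```

The quantization error is bounded by half a quantization step (`scale / 2`), and zlib then squeezes the redundant int8 bytes toward the final artifact size.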
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmup
parameters: {"warmup_steps":4}
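With only `warmup_steps=4` listed, a linear warmup sketch looks like this. The base learning rate and the flat post-warmup schedule are hypothetical; the PR does not state either.

```python
# Linear warmup over the PR's warmup_steps=4. The base LR (4e-3) and
# the constant post-warmup schedule are assumptions for illustration.
def lr_at(step, base_lr=4e-3, warmup_steps=4):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr  # schedule after warmup not specified in the PR

[lr_at(s) for s in range(6)]  # ramps to base_lr by step 3, then flat
```

Four warmup steps out of 500 is a very short ramp, consistent with a tight wallclock budget.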
Other
other
Training was capped by a 900-second wallclock limit and stopped early at 471/500 steps.
parameters: {"max_wallclock_seconds":900,"stopped_step":471,"total_steps":500}
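The wallclock cap behaves like the loop below: check elapsed time each step and stop when the budget is exhausted (this run stopped at 471 of 500 steps). The structure is a sketch; the actual optimizer step is stubbed out.

```python
import time

# Wallclock-capped training loop sketch: stop early once the 900 s
# budget is spent, returning the number of completed steps.
def train(total_steps=500, max_wallclock_seconds=900.0):
    start = time.monotonic()
    for step in range(total_steps):
        if time.monotonic() - start > max_wallclock_seconds:
            return step            # steps completed before the cap hit
        pass                       # ... one optimizer step would go here ...
    return total_steps

train(total_steps=5)  # finishes all 5 stub steps inside the budget -> 5
```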

Novel Contributions

  • Non-record local consumer-GPU submission under the 16MB artifact cap
  • Shared-depth model with 8 logical layers implemented using 4 physical blocks
  • Extra pre-projection RMSNorm in the shared-depth interface
  • Tiny phase-conditioned scales with 4 phase buckets
  • Tied input/output embeddings with separate tied-embedding learning rate
  • Int8 plus zlib roundtrip artifact packaging