PR #276

open

Non-record: local RTX 4070 shared-depth RMS interface v0

by riatzukiza
val_bpb: 1.6577
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 5912023 bytes (~5.9 MB)

Training Techniques

Architecture
depth sharing / shared-depth
Uses 4 physical blocks to implement 8 logical layers, reducing parameter count while preserving multiple logical passes.
parameters: {"layers":8,"physical_layers":4}
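A minimal sketch of how 8 logical layers could be routed through 4 physical blocks. The cyclic `i % 4` reuse schedule and the toy blocks here are assumptions for illustration; the PR does not state which schedule it uses.

```python
# Hypothetical routing: logical layer i reuses physical block i % 4.
# Cyclic reuse is one common shared-depth choice; the PR's actual
# schedule is not specified.
PHYSICAL_LAYERS = 4
LOGICAL_LAYERS = 8

def forward(x, blocks):
    # blocks: list of PHYSICAL_LAYERS callables, each a transformer block
    for logical in range(LOGICAL_LAYERS):
        x = blocks[logical % PHYSICAL_LAYERS](x)
    return x

# Toy usage: "+1" stand-in blocks just to show the reuse pattern.
blocks = [lambda v: v + 1 for _ in range(PHYSICAL_LAYERS)]
forward(0, blocks)  # 8 logical passes over 4 physical blocks -> 8
```

Only 4 blocks' worth of parameters are stored, while the input still makes 8 passes.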
weight tying
Ties input and output embeddings.
parameters: null
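Weight tying in sketch form: the output projection reuses the input embedding table transposed, so no separate lm_head matrix is stored. The sizes are illustrative, not the PR's actual dimensions.

```python
import numpy as np

# Tied embeddings sketch: logits come from the same table E used to
# embed tokens, applied transposed. Dimensions are illustrative only.
vocab_size, d_model = 100, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model))  # input embedding table

def logits_from_hidden(h):
    # Output projection is E transposed (weight tying): one score per
    # vocab entry, no separate output matrix.
    return h @ E.T

h = rng.standard_normal((3, d_model))  # final hidden states for 3 tokens
logits = logits_from_hidden(h)         # shape (3, vocab_size)
```

Under the 16MB artifact cap, dropping the separate output matrix is a direct parameter saving of `vocab_size * d_model`.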
KV head count
Uses fewer key/value heads than query heads (grouped-query attention), shrinking the KV projections and cache.
parameters: {"num_heads":8,"num_kv_heads":4}
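A grouped-query attention sketch matching the listed head counts: 8 query heads share 4 K/V heads, each K/V head serving a group of 2 query heads. Head and sequence dimensions are illustrative, and the causal mask is omitted for brevity.

```python
import numpy as np

# GQA sketch: num_heads=8 query heads, num_kv_heads=4 K/V heads.
num_heads, num_kv_heads, head_dim, seq = 8, 4, 8, 5
rng = np.random.default_rng(0)
q = rng.standard_normal((num_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Expand each K/V head to cover its group of query heads.
group = num_heads // num_kv_heads            # 2 query heads per K/V head
k_rep = np.repeat(k, group, axis=0)          # (8, seq, head_dim)
v_rep = np.repeat(v, group, axis=0)

# Scaled dot-product attention per head (causal mask omitted for brevity).
scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_rep                        # (8, seq, head_dim)
```

Only the K/V projections shrink; query capacity stays at 8 heads.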
RMSNorm interface
Adds extra pre-projection RMSNorm in the shared-depth interface.
parameters: {"extra_proj_rmsnorm":1}
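RMSNorm itself is standard; a sketch is below. Which projection in the shared-depth interface the extra norm sits in front of is not stated in the PR, so the code only shows the normalization, not its placement.

```python
import numpy as np

def rms_norm(x, gain=1.0, eps=1e-6):
    # RMSNorm: rescale the last axis to unit root-mean-square, then
    # apply a learned gain (here a scalar for simplicity).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

# The PR inserts an extra RMSNorm before a projection in the shared-depth
# interface; the exact placement is not specified.
x = np.array([[3.0, 4.0]])
y = rms_norm(x)  # output has root-mean-square ~1 before the projection
```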
phase-conditioned scales
Adds tiny phase-conditioned scaling parameters to stabilize the shared-depth model.
parameters: {"phase_conditioned_scales":1,"phase_buckets":4}
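One plausible reading of phase-conditioned scales, sketched below: each of the 8 logical passes maps to one of 4 phase buckets, and the shared block's output is multiplied by a tiny learned per-bucket scale. The consecutive-pair bucket mapping and identity initialization are assumptions, not stated in the PR.

```python
# Hypothetical phase-conditioned scaling for the shared-depth model.
# Bucket mapping (consecutive pairs of logical layers) and the identity
# init are assumptions; the PR only gives phase_buckets=4.
LOGICAL_LAYERS, PHASE_BUCKETS = 8, 4
phase_scales = [1.0] * PHASE_BUCKETS   # learned, initialized at identity

def bucket(logical_layer):
    return logical_layer * PHASE_BUCKETS // LOGICAL_LAYERS

def scale_for(logical_layer):
    return phase_scales[bucket(logical_layer)]

[bucket(i) for i in range(LOGICAL_LAYERS)]  # [0, 0, 1, 1, 2, 2, 3, 3]
```

At 4 scalars (or 4 small vectors), the added parameter cost is negligible next to the blocks themselves.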
Quantization
int8
bits: 8
scope: final serialized model
Compression
zlib
level: null
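The int8 + zlib roundtrip can be sketched as below. Symmetric per-tensor absmax scaling is an assumption; the PR states only int8 quantization of the final serialized model followed by zlib.

```python
import numpy as np
import zlib

# Sketch of int8 + zlib artifact packaging. The symmetric per-tensor
# absmax scale is an assumed quantization scheme, not taken from the PR.
rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)  # one weight tensor

scale = float(np.abs(w).max()) / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
blob = zlib.compress(q.tobytes())        # bytes that go in the artifact

# Roundtrip on load: decompress, then dequantize with the stored scale.
q_back = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
w_back = q_back.astype(np.float32) * scale
```

The quantization error is bounded by half a quantization step (`scale / 2`), and zlib then squeezes the redundant int8 bytes toward the final artifact size.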
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmup
parameters: {"warmup_steps":4}
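With only `warmup_steps=4` listed, a linear warmup sketch looks like this. The base learning rate and the flat post-warmup schedule are hypothetical; the PR does not state either.

```python
# Linear warmup over the PR's warmup_steps=4. The base LR (4e-3) and
# the constant post-warmup schedule are assumptions for illustration.
def lr_at(step, base_lr=4e-3, warmup_steps=4):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr  # schedule after warmup not specified in the PR

[lr_at(s) for s in range(6)]  # ramps to base_lr by step 3, then flat
```

Four warmup steps out of 500 is a very short ramp, consistent with a tight wallclock budget.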
Other
other
Training was capped by a 900-second wallclock limit and stopped early at 471/500 steps.
parameters: {"max_wallclock_seconds":900,"stopped_step":471,"total_steps":500}
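The wallclock cap behaves like the loop below: check elapsed time each step and stop when the budget is exhausted (this run stopped at 471 of 500 steps). The structure is a sketch; the actual optimizer step is stubbed out.

```python
import time

# Wallclock-capped training loop sketch: stop early once the 900 s
# budget is spent, returning the number of completed steps.
def train(total_steps=500, max_wallclock_seconds=900.0):
    start = time.monotonic()
    for step in range(total_steps):
        if time.monotonic() - start > max_wallclock_seconds:
            return step            # steps completed before the cap hit
        pass                       # ... one optimizer step would go here ...
    return total_steps

train(total_steps=5)  # finishes all 5 stub steps inside the budget -> 5
```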

Novel Contributions

  • Non-record local consumer-GPU submission under the 16MB artifact cap
  • Shared-depth model with 8 logical layers implemented using 4 physical blocks
  • Extra pre-projection RMSNorm in the shared-depth interface
  • Tiny phase-conditioned scales with 4 phase buckets
  • Tied input/output embeddings with separate tied-embedding learning rate
  • Int8 plus zlib roundtrip artifact packaging