PR #1995

open

Record: Parcae px43 embed7 clip1300 (val_bpb = 1.0878)

by User123331
val_bpb: 1.0878
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 15,631,730 bytes

Training Techniques

Architecture
  • depth recurrence: recurrent-depth transformer loop structure over the middle blocks, with Parcae-style loop boundary injection (parameters: none recorded)
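A minimal sketch of what a recurrent-depth forward pass with loop boundary injection could look like. The block functions, the additive form of the injection, and the loop count are all assumptions for illustration; the submission does not specify the exact Parcae injection mechanism.

```python
def recurrent_depth_forward(x, early, middle, late, loops=4):
    # Apply the early blocks once, loop the middle blocks `loops` times,
    # then apply the late blocks. The Parcae-style "loop boundary
    # injection" is assumed here to mean re-adding the saved boundary
    # state at each loop entry; the real injection form may differ.
    h = early(x)
    boundary = h  # state captured at the loop boundary
    for _ in range(loops):
        h = h + boundary  # assumed additive injection at the boundary
        for block in middle:
            h = block(h)
    return late(h)
```

Looping the middle blocks reuses their parameters at each iteration, which is what lets a recurrent-depth model trade extra compute for a smaller artifact under the size limit.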
  • weight tying: tied embedding and output head path (parameters: none recorded)
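A toy sketch of weight tying over plain Python lists, showing why the tied matrix only needs to be stored once: the embedding lookup reads rows of the same matrix that the output head multiplies against. Function names and the list-of-lists representation are illustrative.

```python
def embed(shared, token_id):
    # Embedding lookup: row `token_id` of the shared matrix.
    return shared[token_id]

def output_logits(shared, hidden):
    # Output head reusing the SAME matrix: logit[v] = <shared[v], hidden>.
    # Tying means this matrix appears once in the artifact, which also
    # shrinks the compressed submission size.
    return [sum(w * h for w, h in zip(row, hidden)) for row in shared]
```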
  • Gated Attention: QK-gain attention initialization and the attention modifications mentioned in the submission (parameters: none recorded)
Weight Averaging
  • EMA (parameters: none recorded)
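A one-line EMA update as it is typically applied to model weights, sketched over flat parameter lists. The decay value is an assumption; the submission does not record the EMA hyperparameters.

```python
def ema_update(ema, current, decay=0.999):
    # One EMA step over flat parameter lists:
    #   ema <- decay * ema + (1 - decay) * current
    # decay=0.999 is an assumed value, not taken from the submission.
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]
```

The averaged weights, rather than the final training-step weights, would then be what gets quantized and compressed into the artifact.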
Quantization
  • GPTQ (bits: 6, scope: weights)
  • mixed int6/int7 (bits: not recorded, scope: matrix weights + embeddings)
Compression
  • Brotli (level: not recorded)
Evaluation
  • sliding window eval (parameters: none recorded)
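A sketch of how sliding-window evaluation spans are typically laid out: the first window scores all of its tokens, and each later window scores only the newly revealed tokens with up to `window` tokens of left context, so every token is scored exactly once. window=1300 matches the clip1300 setting; the stride value is an assumption.

```python
def sliding_window_spans(n_tokens, window=1300, stride=650):
    # Returns (ctx_start, score_start, end) triples: tokens in
    # [score_start, end) are scored using context from ctx_start.
    # stride=650 is an illustrative choice, not from the submission.
    end = min(window, n_tokens)
    spans = [(0, 0, end)]
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        spans.append((max(0, new_end - window), end, new_end))
        end = new_end
    return spans
```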
Sequence Length
  • sequence_length (train_length: not recorded, eval_length: 1300)
Initialization
  • QK-gain: attention initialization using QK-gain

Novel Contributions

  • Parcae-style loop boundary injection for recurrent-depth transformer blocks
  • Testing the Parcae loop-injection direction under the 8xH100 / 16MB setting
  • px43/embed7/clip1300 compression setup
  • Legal sliding-window evaluation path
  • EMA post-training weights with GPTQ-based compression