PR #976

open

Add 1.20 BPB submission with Legal TTT and Calibration (9L/448D)

val_bpb

1.2058

Architecture

Transformer

Optimizer

—

Artifact Size

15.87MB

Training Techniques

Architecture

weight tying

Input and output embeddings are tied.

parameters: null
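As an illustration of what tying input and output embeddings means, here is a minimal sketch in plain Python. It is not the submission's code: the class name `TiedEmbedding`, the toy sizes, and the initialization scale are all assumptions chosen for the example.

```python
import random


class TiedEmbedding:
    """One matrix used both to embed tokens and to produce output logits."""

    def __init__(self, vocab_size, model_dim, seed=0):
        rng = random.Random(seed)
        # Single shared parameter table of shape (vocab_size, model_dim).
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(model_dim)]
                       for _ in range(vocab_size)]

    def embed(self, token_id):
        # Input side: look up the row for this token.
        return self.weight[token_id]

    def logits(self, hidden):
        # Output side: dot the hidden state with every row of the same table.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.weight]


# Toy sizes (hypothetical, not the submission's vocabulary or width).
tied = TiedEmbedding(vocab_size=16, model_dim=8)
vec = tied.embed(3)
scores = tied.logits(vec)
```

Because both directions read the same table, the embedding parameters are stored once, which matters when the artifact size is constrained.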

Sequence Length

sequence_length

train_length: 2048

eval_length: null

Other

other

Uses a 9-layer Transformer with model dimension 512, 8 attention heads sharing 4 KV heads (grouped-query attention), and an MLP multiplier of 2.

parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2}
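To show how these fields translate into a weight count, the following sketch estimates parameters from the listed configuration. Several things are assumptions rather than facts from the PR: the MLP is taken to be two ungated matrices with hidden width `mlp_mult * model_dim`, norms and biases are ignored, and `vocab_size=1024` is a placeholder because the vocabulary size is not given here.

```python
def estimate_params(layers, model_dim, num_heads, num_kv_heads, mlp_mult, vocab_size):
    """Rough weight count for a decoder-only Transformer with tied embeddings."""
    head_dim = model_dim // num_heads
    kv_dim = num_kv_heads * head_dim
    # Attention: Q and output projections are model_dim x model_dim;
    # K and V project to the smaller kv_dim because KV heads are shared.
    attn = 2 * model_dim * model_dim + 2 * model_dim * kv_dim
    # MLP: assumed to be two matrices (up and down), no gating.
    mlp = 2 * model_dim * (mlp_mult * model_dim)
    # Tied embeddings are counted once.
    embed = vocab_size * model_dim
    return layers * (attn + mlp) + embed


config = {"layers": 9, "model_dim": 512, "num_heads": 8, "num_kv_heads": 4, "mlp_mult": 2}
total = estimate_params(**config, vocab_size=1024)  # vocab_size is hypothetical
per_layer = (total - 1024 * 512) // config["layers"]
```

With 4 KV heads instead of 8, the K and V projections are half the size of the Q projection, which is where grouped-query attention saves parameters.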

Quantization

int8

bits: 8

scope: model weights

parameters: null
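The card does not say which int8 scheme is used, so the sketch below shows one common choice, symmetric per-tensor quantization, purely as an illustration. The function names and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of a flat list of floats."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp into the signed 8-bit range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    # Recover approximate floats; error per weight is at most half a step.
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0, -0.25]  # toy values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

Each weight then costs one byte plus a shared scale. Per-channel or per-row scales are a frequent alternative that trades a little extra storage for lower error.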

Compression

zlib

level: null
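A minimal sketch of compressing int8 weights with Python's standard `zlib` module follows. Since the level is listed as null, the example uses the library default. The helpers `pack_int8` and `unpack_int8` and the toy data are assumptions for illustration, not the submission's serialization format.

```python
import zlib


def pack_int8(values):
    # Map signed int8 values to raw bytes (two's complement).
    return bytes(v & 0xFF for v in values)


def unpack_int8(data):
    # Reverse the mapping: bytes above 127 represent negative values.
    return [b - 256 if b > 127 else b for b in data]


quantized = [0, 0, 1, -1, 0, 0, 2, -2] * 512  # toy int8 weights, highly repetitive
raw = pack_int8(quantized)
compressed = zlib.compress(raw)  # default level
restored = unpack_int8(zlib.decompress(compressed))
```

Real quantized weights are far less repetitive than this toy input, so the achievable ratio on an actual checkpoint will be much smaller.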