PR #228

Status: open

Record: 10-Layer 4xMLP (val_bpb: 1.4444)

val_bpb: 1.4444
Architecture: Transformer
Optimizer:
Artifact Size: 14.68 MB

Training Techniques

Architecture: 10-layer 4xMLP
  Expanded the standard 9-layer architecture to 10 layers and increased the MLP multiplier from 2x to 4x.
  parameters: {"layers":10,"mlp_multiplier":4}
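As a rough illustration of what the two changes do to parameter count, here is a sketch; only layers=10 and mlp_multiplier=4 come from the PR, while the hidden width `d_model = 384` is an assumption for illustration.

```python
# Rough parameter-count sketch for the layer/MLP changes. The hidden
# width is assumed; only the layer count and MLP multiplier are from
# the PR. Embeddings and norms are ignored for simplicity.
d_model = 384      # assumed hidden width
n_layers = 10      # PR: expanded from 9 layers
mlp_mult = 4       # PR: raised from 2x

def layer_params(d: int, mult: int) -> int:
    attn = 4 * d * d          # Q, K, V, and output projections
    mlp = 2 * d * (mult * d)  # up- and down-projection matrices
    return attn + mlp

old = 9 * layer_params(d_model, 2)          # baseline: 9 layers, 2x MLP
new = n_layers * layer_params(d_model, mlp_mult)
print(old, new)
```

Under this assumed width, the block-stack parameter count grows by roughly two thirds, which is why the quantization and compression steps below matter for the artifact budget.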
Quantization: int8
  bits: 8
  scope: all weights
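A minimal sketch of the per-row int8 post-training quantization named in the contributions below; the absmax scale choice is the standard one, but the exact scheme used in the PR is an assumption.

```python
import numpy as np

# Per-row int8 post-training quantization sketch: each weight row gets
# its own scale so that its absmax maps to 127. The scale choice is an
# assumption; the PR only states int8, per-row, all weights.
np.random.seed(0)

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()   # worst-case rounding error
```

The worst-case per-element error is half a quantization step, i.e. each row's absmax divided by 254.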
Compression: zlib
  level: null
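The compression step is a lossless zlib round trip over the quantized weight bytes; `level: null` is read here as the library default, which is an interpretation rather than something the PR states.

```python
import zlib
import numpy as np

# Sketch: zlib round trip over an int8 weight blob at the default
# compression level ("level: null" is taken to mean unspecified/default,
# an assumption). zlib is lossless, so the artifact decompresses to the
# exact original bytes.
np.random.seed(0)
w = np.clip(np.round(np.random.randn(256, 256) * 20), -127, 127).astype(np.int8)

blob = zlib.compress(w.tobytes())    # default level
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
ratio = len(blob) / w.nbytes         # <1 when the weight bytes are redundant
```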
Evaluation: sliding window eval
  parameters: {"overlapping":true}
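An overlapping sliding-window evaluation re-feeds already-scored tokens as context but only counts each token's loss once. The loop below sketches that bookkeeping; `model_nll` is a hypothetical stand-in for the real model's per-token negative log-likelihood.

```python
# Overlapping sliding-window evaluation sketch: each window of length
# `window` advances by `stride`, re-feeding up to window - stride tokens
# of context, but only the not-yet-scored positions contribute to the
# total, so every token is counted exactly once. `model_nll` is a
# hypothetical per-position NLL function, not part of the PR.
def sliding_window_nll(tokens, model_nll, window=8, stride=4):
    total, counted = 0.0, 0
    for start in range(0, len(tokens), stride):
        end = start + stride
        chunk = tokens[max(0, end - window):end]  # context + new tokens
        nlls = model_nll(chunk)                   # one NLL per position
        n_new = len(tokens[start:end])            # positions to score now
        total += sum(nlls[-n_new:])
        counted += n_new
    return total / counted

# toy model: a constant 1.0 nats per token, so the average must be 1.0
avg = sliding_window_nll(list(range(10)), lambda c: [1.0] * len(c))
```

Compared with a non-overlapping chunked eval, this gives every scored token close to a full window of context, which typically lowers the measured bits per byte.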
Test-Time Training: LoRA TTT
  parameters: {"batched":true}
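A toy sketch of the batched LoRA test-time-training idea: the base weight stays frozen and only a low-rank update B @ A is fitted on a batch of test-time examples. The least-squares objective, shapes, and learning rate are illustrative assumptions, not details from the PR.

```python
import numpy as np

# Toy batched LoRA test-time training: W is frozen; only the low-rank
# factors A and B are updated on one batch of test-time data. Objective,
# dimensions, and learning rate are assumptions for illustration.
np.random.seed(0)
d, r, n = 16, 2, 32
W = np.random.randn(d, d) * 0.1             # frozen base weight
A = np.random.randn(r, d) * 0.1             # trainable low-rank factor
B = np.zeros((d, r))                        # LoRA convention: B starts at 0

x = np.random.randn(n, d)                   # one batch of test-time inputs
y = x @ (W + np.random.randn(d, d) * 0.05)  # behaviour to adapt toward

def loss(B, A):
    return np.mean((x @ (W + B @ A) - y) ** 2)

initial = loss(B, A)
lr = 0.1
for _ in range(200):
    err = (x @ (W + B @ A) - y) / n  # batch-averaged residual
    gB = (x.T @ err) @ A.T           # gradients touch only A and B;
    gA = B.T @ (x.T @ err)           # W itself stays frozen
    B -= lr * gB
    A -= lr * gA
final = loss(B, A)
```

Because B is initialized to zero, the adapted model starts out identical to the frozen one, and the batched update then pulls W + B @ A toward the test-time data.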

Novel Contributions

  • Expanded the baseline architecture from 9 layers to 10 layers
  • Increased the MLP multiplier from 2x to 4x
  • Used standard INT8 per-row post-training quantization
  • Applied zlib compression to fit within the 16MB limit
  • Evaluated with an overlapping sliding window
  • Used batched LoRA test-time training