PR #183

open

Non-record: Cache LM + LoRA TTT (negative result on cache, positive on TTT)

by anantdgoel
val_bpb
1.2529
Architecture
Transformer
Optimizer
Adam
Artifact Size
15.14 MB

Training Techniques

Quantization
int8
bits: 8
scope: model artifact
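The int8 quantization of the model artifact can be sketched as symmetric per-tensor rounding; the PR does not record the exact scheme, so the scale choice below (max-abs over 127) is an assumption:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization (illustrative scheme)."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# reconstruction error is bounded by half a quantization step
assert np.abs(w - w_hat).max() <= s / 2 + 1e-6
```

Since the scope is the model artifact, the int8 tensors (plus scales) are what get serialized and counted toward the 15.14 MB.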
Architecture
tied embeddings
Input and output embeddings are tied.
parameters: null
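Tied embeddings mean one shared matrix serves as both the input lookup table and the output projection, halving the embedding parameter count. A minimal sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 100, 16
E = rng.normal(size=(vocab, d)).astype(np.float32)  # the single shared matrix

def embed(token_ids):
    # input embedding: row lookup into E
    return E[token_ids]

def logits(hidden):
    # output projection reuses E (tied), so no separate lm_head matrix
    return hidden @ E.T

h = embed(np.array([3, 7]))
out = logits(h)
assert out.shape == (2, vocab)
```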
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.01,"betas":[0.9,0.95]}
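The recorded hyperparameters (lr=0.01, betas=(0.9, 0.95)) drive the standard Adam update below; eps and the absence of weight decay are assumptions, since the PR records weight_decay as null and does not list eps:

```python
import numpy as np

lr, b1, b2, eps = 0.01, 0.9, 0.95, 1e-8  # eps assumed; lr and betas from the PR

def adam_step(p, g, m, v, t):
    """One Adam update: p = params, g = gradient, (m, v) = moment estimates, t = step."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)   # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
g = np.array([1.0, -1.0, 0.5])
p, m, v = adam_step(p, g, m, v, t=1)
```

On the first step the bias-corrected update reduces to roughly `-lr * sign(g)`, so each parameter moves by about 0.01.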
Compression
zlib
level: null
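With the level recorded as null, zlib's default compression level is assumed. The round trip is lossless, so it shrinks the serialized artifact without affecting val_bpb:

```python
import zlib

# Stand-in bytes for the serialized (quantized) model artifact.
payload = b"weights " * 1000

compressed = zlib.compress(payload)            # default level, since none is recorded
assert zlib.decompress(compressed) == payload  # lossless round trip
assert len(compressed) < len(payload)          # repetitive data compresses well
```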
Evaluation
sliding window eval
parameters: {"stride":128}
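A sketch of sliding-window evaluation with the recorded stride of 128 and eval_length of 1024. The convention that only the newly exposed tokens of each window are scored (so every token is evaluated exactly once, with up to 1024 tokens of context) is an assumption, since the PR records only the parameters:

```python
eval_len, stride = 1024, 128  # values recorded in the PR

def windows(n_tokens):
    """Yield (start, end, score_from) triples: the model conditions on
    tokens[start:end] but only tokens[score_from:end] contribute to the loss."""
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - eval_len)   # cap the context at eval_len tokens
        spans.append((start, end, pos))
        pos = end
    return spans

spans = windows(2000)
# every token in the stream is scored exactly once
assert sum(end - score_from for _, end, score_from in spans) == 2000
```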
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01,"chunk_size":256,"batch_size":64}
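A minimal sketch of the rank-8 adapter math, assuming standard LoRA: a frozen base weight plus a low-rank update, with the B factor zero-initialized so each document starts from the base model. Dimensions are illustrative; the adapted matrices in the PR are Q, V, and lm_head:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 8  # rank 8 as recorded; hidden size d is illustrative

W = rng.normal(size=(d, d)).astype(np.float32)              # frozen base weight
A = rng.normal(scale=0.01, size=(r, d)).astype(np.float32)  # trainable LoRA factor
B = np.zeros((d, r), dtype=np.float32)                      # zero-init: adapter starts as a no-op

def forward(x):
    # Score-first: each chunk is scored with the current adapter state
    # before any TTT gradient step updates A and B on that chunk.
    return x @ (W + B @ A).T

x = rng.normal(size=(4, d)).astype(np.float32)
assert np.allclose(forward(x), x @ W.T)  # B = 0, so the first chunk sees the base model
```

Per the recorded parameters, TTT would process 256-token chunks in batches of 64 at learning rate 0.01, with the adapter reset at each document boundary.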
Other
unigram cache LM
Unigram cache language model interpolation during evaluation, using decayed per-document token frequency counts.
parameters: {"lambda":0.02,"decay":0.98}
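The cache LM interpolation can be sketched as follows, using the recorded lambda=0.02 and decay=0.98. The uniform fallback before any counts exist and the `model_probs` input (a stand-in for the base model's next-token probabilities) are assumptions for the sake of a self-contained example:

```python
import math
from collections import defaultdict

lam, decay = 0.02, 0.98  # parameters recorded in the PR

def interpolated_nll(tokens, model_probs, vocab_size):
    """Total NLL with the model probability interpolated against a decayed
    per-document unigram cache: p = (1 - lam) * p_model + lam * p_cache.
    model_probs[i] stands in for the base model's probability of tokens[i]."""
    counts = defaultdict(float)   # decayed frequency of each token seen so far
    total = 0.0                   # decayed count of all tokens seen so far
    nll = 0.0
    for tok, p_model in zip(tokens, model_probs):
        # before the cache has seen anything, fall back to a uniform prior
        p_cache = counts[tok] / total if total > 0 else 1.0 / vocab_size
        nll -= math.log((1 - lam) * p_model + lam * p_cache)
        # decay old counts (naive O(vocab) form for clarity), then record tok
        for k in counts:
            counts[k] *= decay
        total = total * decay + 1.0
        counts[tok] += 1.0
    return nll
```

On repeated tokens the cache raises the interpolated probability, which is the intended effect; the PR nonetheless reports that this hurts val_bpb on FineWeb.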
Sequence Length
sequence_length
train_length: null
eval_length: 1024

Novel Contributions

  • Per-document LoRA test-time training on Q/V/lm_head with score-first updates
  • Unigram cache language model interpolation during evaluation
  • Negative result showing the unigram cache LM hurts on FineWeb
  • Combination of LoRA TTT and cache LM in eval-time pipeline