text = `%%%%%%%% ICML 2022 EXAMPLE LATEX SUBMISSION FILE %%%%%%%%%%%%%%%%%
\\pdfoutput=1

\\documentclass{article}

% Recommended, but optional, packages for figures and better typesetting:
\\usepackage{microtype}
\\usepackage{graphicx}
\\usepackage{subfigure}
\\usepackage{booktabs} % for professional tables
\\usepackage{xspace}
\\usepackage{makecell}
\\usepackage{diagbox}
\\usepackage{nicefrac}
\\usepackage{amsmath}
\\usepackage{float}
% \\usepackage{minted}
\\newcommand{\\pushright}[1]{\\ifmeasuring@#1\\else\\omit\\hfill$\\displaystyle#1$\\fi\\ignorespaces}

% \\setminted{fontsize=\\small}
\\usepackage{listings}
\\lstset{
language=Python,
basicstyle=\\ttfamily\\small,
keywordstyle=\\color{blue},
commentstyle=\\color{green},
stringstyle=\\color{red},
backgroundcolor=\\color{gray!15},
morekeywords={GlobalLocalTransformerDecoder, forward, prepare_input, __init__},
showstringspaces=false,
breaklines=true,
}
\\newcommand{\\fastslow}[0]{\\textsc{MegaByte}\\xspace}
\\newcommand{\\megabyte}[0]{\\fastslow}
\\newcommand{\\bpb}[0]{bpb\\xspace}
\\newcommand{\\h}[1]{h^\\text{#1}}%_{#2}}
\\newcommand{\\inR}[1]{\\in\\mathbb{R}^{#1}}
\\newcommand{\\transformer}[2]{\\text{transformer}^{\\text{#1}}(#2)}
\\newcommand{\\Lglobal}[0]{L_{\\text{global}}}
\\newcommand{\\Llocal}[0]{L_{\\text{local}}}

\\newcommand{\\todo}[1]{\\textcolor{red}{[TODO: #1]}}
\\newcommand{\\wip}[1]{\\textcolor{purple}{#1}}
\\newcommand{\\mike}[1]{\\textcolor{blue}{[Mike: #1]}}
\\newcommand{\\lly}[1]{\\textcolor{magenta}{[LL: #1]}}



% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page)
% please comment out the following usepackage line and replace
% \\usepackage{icml2022} with \\usepackage[nohyperref]{icml2022} above.
\\usepackage{hyperref}


% Attempt to make hyperref and algorithmic work together better:
\\newcommand{\\theHalgorithm}{\\arabic{algorithm}}

% Use the following line for the initial blind version submitted for review:
% \\usepackage{icml2022}

% If accepted, instead use the following line for the camera-ready submission:
\\usepackage[accepted]{icml2022}

% For theorems and such
\\usepackage{amsmath}
\\usepackage{amssymb}
\\usepackage{mathtools}
\\usepackage{amsthm}

% if you use cleveref..
\\usepackage[capitalize,noabbrev]{cleveref}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\\theoremstyle{plain}
\\newtheorem{theorem}{Theorem}[section]
\\newtheorem{proposition}[theorem]{Proposition}
\\newtheorem{lemma}[theorem]{Lemma}
\\newtheorem{corollary}[theorem]{Corollary}
\\theoremstyle{definition}
\\newtheorem{definition}[theorem]{Definition}
\\newtheorem{assumption}[theorem]{Assumption}
\\theoremstyle{remark}
\\newtheorem{remark}[theorem]{Remark}

% Todonotes is useful during development; simply uncomment the next line
% and comment out the line below the next line to turn off comments
%\\usepackage[disable,textsize=tiny]{todonotes}
% \\usepackage[textsize=tiny]{todonotes}


% The \\icmltitle you define below is probably too long as a header.
% Therefore, a short form for the running title is supplied here:
%\\icmltitlerunning{Bytes In, Bytes Out: Efficient Byte-level Modeling with Multiscale Transformers}

\\begin{document}

\\twocolumn[
\\icmltitle{\\fastslow: Predicting Million-byte Sequences with Multiscale Transformers}

% It is OKAY to include author information, even for blind
% submissions: the style file will automatically remove it for you
% unless you've provided the [accepted] option to the icml2022
% package.

% List of affiliations: The first argument should be a (short)
% identifier you will use later to specify author affiliations
% Academic affiliations should list Department, University, City, Region, Country
% Industry affiliations should list Company, City, Region, Country

% You can specify symbols, otherwise they are numbered in order.
% Ideally, you should not use this facility. Affiliations will be numbered
% in order of appearance and this is the preferred way.
\\icmlsetsymbol{equal}{*}

\\begin{icmlauthorlist}
\\icmlauthor{Lili Yu}{equal,meta}
\\icmlauthor{Dániel Simig}{equal,meta}
\\icmlauthor{Colin Flaherty}{equal,augment}
\\icmlauthor{Armen Aghajanyan}{meta}
\\icmlauthor{Luke Zettlemoyer}{meta}
\\icmlauthor{Mike Lewis}{meta}

%\\icmlauthor{Firstname3 Lastname3}{comp}
%\\icmlauthor{Firstname4 Lastname4}{sch}
%\\icmlauthor{Firstname5 Lastname5}{yyy}
%\\icmlauthor{Firstname6 Lastname6}{sch,yyy,comp}
%\\icmlauthor{Firstname7 Lastname7}{comp}
%\\icmlauthor{}{sch}
%\\icmlauthor{Firstname8 Lastname8}{sch}
%\\icmlauthor{Firstname8 Lastname8}{yyy,comp}
%\\icmlauthor{}{sch}
%\\icmlauthor{}{sch}
\\end{icmlauthorlist}

\\icmlaffiliation{augment}{Augment Computing. Work performed while at Meta AI.}
\\icmlaffiliation{meta}{Meta AI.}
%\\icmlaffiliation{yyy}{Department of XXX, University of YYY, Location, Country}
%\\icmlaffiliation{comp}{Company Name, Location, Country}
%\\icmlaffiliation{sch}{School of ZZZ, Institute of WWW, Location, Country}

\\icmlcorrespondingauthor{Lili Yu}{liliyu@meta.com}
% \\icmlcorrespondingauthor{Daniel Simig}{simigd@gmail.com}
% \\icmlcorrespondingauthor{Colin Flaherty}{colin@augmentcode.com}
\\icmlcorrespondingauthor{Mike Lewis}{mikelewis@meta.com}


% You may provide any keywords that you
% find helpful for describing your paper; these are used to populate
% the "keywords" metadata in the PDF but will not be shown in the document
\\icmlkeywords{Machine Learning, ICML}

\\vskip 0.3in
]

% this must go after the closing bracket ] following \\twocolumn[ ...

% This command actually creates the footnote in the first column
% listing the affiliations and the copyright notice.
% The command takes one argument, which is text to display at the start of the footnote.
% The \\icmlEqualContribution command is standard text for equal contribution.
% Remove it (just {}) if you do not need this facility.

%\\printAffiliationsAndNotice{} % leave blank if no need to mention equal contribution
\\printAffiliationsAndNotice{\\icmlEqualContribution} % otherwise use the standard text.

\\begin{abstract}
Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose \\fastslow, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes.
\\fastslow segments sequences into patches and uses a \\emph{local} submodel within patches and a \\emph{global} model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding---unlocking better performance at reduced cost for both training and generation.
Extensive experiments show that \\fastslow allows byte-level models to perform competitively with subword models on long context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.
\\end{abstract}

\\section{Introduction}
\\label{sec:intro}
Sequences of millions of bytes are ubiquitous; for example, music, image, or video files typically consist of multiple megabytes.
However, large transformer decoders (LLMs) typically use only a few thousand tokens of context \\citep{gpt3,Zhang2022OPT:Models}, both because of the quadratic cost of self-attention and, more importantly, because of the cost of large per-position feedforward networks.
This severely limits the set of tasks where LLMs can be applied.

\\begin{figure}
\\centering
\\includegraphics[width=\\linewidth]{figs/fastslow3.pdf}
% \\caption{Illustration of \\megabyte model architecture and the modeling of "megabyte transformer" byte sequence. In this figure, padding is represented by underscores ("_"), spaces are represented by quotes ("' '"), and the addition operator is represented by a plus sign.}
\\vspace*{-5mm}
\\caption{Overview of \\megabyte with patch size $P=4$.
A small \\emph{local} model autoregressively predicts each patch byte-by-byte, using the output of a larger \\emph{global} model to condition on previous patches.
%A large \\emph{global} model performs self-attention over previous patches, and a smaller \\emph{local} model autoregressively predicts each patch byte-by-byte.
Global and Local inputs are padded by $P$ and $1$ token respectively to avoid leaking information about future tokens.
%Initial patch embeddings are made by concatenating byte embeddings.
%The \\emph{global} model enriches patch embeddings by performing self-attention over previous patches.
%We then add byte embeddings (padded by 1), and an instance of a smaller \\emph{local} model autoregressively
%The sequences is padded by P tokens and each byte is embedded ($h^{embed}$}. Byte embeddings are
\\label{fig:framework}
}
\\end{figure}
We introduce \\fastslow, a new approach to modeling long byte sequences.
First, byte sequences are segmented into fixed-size patches, loosely analogous to tokens.
Our model then consists of three parts: (1) a \\emph{patch embedder}, which simply encodes a patch by losslessly concatenating embeddings of each byte, (2) a \\emph{global} module, a large autoregressive transformer that inputs and outputs patch representations and (3) a \\emph{local} module, a small autoregressive model that predicts bytes within a patch.
Crucially, we observe that for many tasks, most byte predictions are relatively easy (for example, completing a word given its first few characters), so a large network per byte is unnecessary and a much smaller model can be used for intra-patch modelling.

%While many approaches to modelling long sequences have focused on optimizing attention layers \\citep{}, the large majority of compute in large language models is used by position-wise feedforward layers \\citep{}. To reduce this cost, we must either use smaller feedforward layers (reducing performance) or compressing the sequence length (which is challenging with decoder-only models, where an output is needed at each timestep). These constraints lead to our multi-scale design, in which long range and short range decisions are handled by separate modules.

The \\megabyte architecture gives three major improvements over Transformers for long sequence modelling:
\\begin{enumerate}
\\item \\textbf{Sub-quadratic self-attention} Most work on long sequence models has focused on mitigating the quadratic cost of self-attention. \\megabyte decomposes long sequences into two shorter sequences, and with an optimal patch size the self-attention cost is reduced to $O(N^{\\frac{4}{3}})$, which remains tractable even for long sequences (see the cost sketch following this list).
\\item \\textbf{Per-patch feedforward layers} In GPT3-size models, more than 98\\% of FLOPS are used in computing position-wise feedforward layers. \\megabyte uses large feedforward layers per-patch rather than per-position, enabling much larger and more expressive models for the same cost. With patch size $P$, where a baseline transformer would use the same feedforward layer with $m$ parameters $P$ times, \\megabyte can use a layer with $mP$ parameters once for the same cost.
\\item \\textbf{Parallelism in Decoding} Transformers must perform all computations serially during generation because the input to each timestep is the output from the previous timestep. By generating representations for patches in parallel, \\megabyte allows greater parallelism during generation. For example, a \\megabyte model with 1.5B parameters can generate sequences 40\\% \\emph{faster} than a standard 350M Transformer, whilst also improving perplexity when trained with the same compute. %, increasing generation speed by up to \\mike{XX}x.
\\end{enumerate}
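As a rough sketch of the first claim (a back-of-the-envelope estimate that ignores constant factors and the hidden dimension), write $T$ for the sequence length $N$ and $P$ for the patch size. The global model attends over $K=\\frac{T}{P}$ patches and the local model attends within each of the $K$ patches of length $P$, so the total attention cost scales as
\\begin{align*}
\\underbrace{\\left(\\tfrac{T}{P}\\right)^{2}}_{\\text{global}} + \\underbrace{\\tfrac{T}{P}\\cdot P^{2}}_{\\text{local}} = \\frac{T^{2}}{P^{2}} + TP,
\\end{align*}
which is minimized at $P\\propto T^{\\frac{1}{3}}$ and gives an overall cost of $O(T^{\\frac{4}{3}})$, compared to $O(T^{2})$ for a standard transformer.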

Together, these improvements allow us to train much larger and better-performing models for the same compute budget, scale to very long sequences, and improve generation speed during deployment.



\\fastslow also provides a strong contrast to existing autoregressive models that typically use some form of tokenization, where sequences of bytes are mapped to larger discrete tokens \\citep{bpe,dalle,hubert}.
%, remains the most common approach to scaling LLMs to longer sequences.
Tokenization complicates pre-processing, multi-modal modelling, and transfer to new domains, while hiding useful structure from the model. It also means that most state-of-the-art models are not truly end-to-end.
The most widely used approaches to tokenization require language-specific heuristics \\citep{gpt2} or lose information \\citep{dalle}.
Replacing tokenization with efficient and performant byte models would therefore have many advantages.

We conduct extensive experiments for both \\fastslow and strong baselines. We use a fixed compute and data budget across all models
to focus our comparisons solely on the model architecture rather than training resources, which are known to benefit all models. We find that \\fastslow allows byte-level models to perform competitively with subword models on long context language modeling, achieve state-of-the-art perplexities for density estimation on ImageNet, and allow audio modelling from raw audio files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.

\\section{\\megabyte Transformer}
\\subsection{Overview}
\\label{sec:fastslow}





%\\subsection{Patch Encoder}
%\\subsection{Slow Model}
%\\subsection{Fast Model}
%\\paragraph{Cross-Patch Self Attention}
%\\subsection{Interface Between Slow and Fast Models}

% At a high-level,






\\begin{figure*}[t]
\\begin{minipage}{0.7\\linewidth}
\\begin{align*}
%x^{\\text{padded}} &= ([\\text{PAD}]\\times P) + x_{0:(T-P)}\\\\
\\h{embed}_{t}\\,\\,\\,\\,\\,\\,\\,\\,&= E^\\text{global-embed}_{x_t}+E^\\text{pos}_t & t\\in [0..T), E^{\\text{global-embed}}\\inR{V\\times D_G},\\\\& &E^{\\text{pos}}\\inR{T\\times D_G}, \\h{embed}\\inR{T\\times D_G} \\\\
% \\h{global-in}_{0} &= [E^{\\text{pad}}_0,..,E^{\\text{pad}}_{P-1}] & E^{\\text{pad}}\\inR{D_G}, K=\\frac{T}{P}\\\\
% \\h{global-in}_{k} &= \\h{embed}_{((k-1)\\cdot P):(k\\cdot P)} & k\\in [1,..,K) \\\\
\\h{global-in}_{k} \\,\\,\\,\\,\\,&=
\\begin{cases}
E^{\\text{global-pad}}, & \\text{if } k = 0, \\\\
\\h{embed}_{((k-1)\\cdot P):(k \\cdot P)}, & k\\in [1,..,K), \\\\
\\end{cases} & E^{\\text{global-pad}}\\inR{P\\times D_G}, K=\\frac{T}{P}
\\\\
\\h{global-out}_{0:K} \\,\\,\\,&= \\transformer{global}{\\h{global-in}_{0:K}} & \\h{global-out}\\inR{K\\times P\\cdot D_G},\\h{global-in}\\inR{K\\times P\\cdot D_G}\\\\
%\\h{local-in}_{k,0} &= \\h{global-out}_{k,0:D_G} & \\h{local-in}\\inR{K\\times P \\times D_G} \\\\
%\\h{local-in}_{k,p} &= w^{\\text{GL}}\\h{global-out}_{k,(p\\cdot D_G):((p+1)\\cdot D_G)} + E^\\text{local-embed}_{x_{(k\\cdot P + p - 1)}} & w^{\\text{GL}}\\inR{D_G\\times D_L}, E^{\\text{local-embed}}\\inR{V\\times D_L}, p\\in [0,..,P), \\\\
\\h{local-in}_{k,p} \\,\\,\\,\\,\\,\\,\\,&= w^{\\text{GL}}\\h{global-out}_{k,(p\\cdot D_G):((p+1)\\cdot D_G)} +
\\begin{cases}
E^\\text{local-pad}, & \\text{if } p=0 \\\\
E^\\text{local-embed}_{x_{(k\\cdot P + p - 1)}}, & p\\in [1,..,P) \\\\
\\end{cases} &
\\begin{array}{r}
E^{\\text{local-pad}}\\inR{D_L} , w^{\\text{GL}}\\inR{D_G\\times D_L}\\\\ E^{\\text{local-embed}}\\inR{V\\times D_L}
\\end{array}\\hspace{-2mm}
\\\\
\\h{local-out}_{k,0:P} \\,\\,\\,\\,\\,&= \\transformer{local}{\\h{local-in}_{k,0:P}} & \\h{local-in}_{k,p}\\inR{D_L}, \\h{local-out}\\inR{K\\times P\\cdot D_L} \\\\
p(x_{t}|x_{0:t}) &= \\text{softmax}{(E^\\text{local-embed}\\h{local-out}_{k,p})}_{x_t} & t=k\\cdot P+p\\,
\\end{align*}
\\end{minipage}
\\caption{Summary of \\megabyte with vocabulary $V$, sequence length $T$, global and local dimensions $D_G$ and $D_L$, and $K$ patches of size $P$. Transformer layers use masked self-attention so that no position observes information from future timesteps.
}
\\end{figure*}
% K\\times P\\cdot D_G
% \\h{local-in}\\inR{K\\times P \\times D_L}



% , while possessing many of the desirable properties of the Transformer architecture; including support for parallelism techniques enabling efficient scaling, a lack of recurrence, permutation invariance across the entire context, a single cross-entropy objective enabling end-to-end differentiation, and general architectural simplicity.
\\megabyte is an autoregressive model for efficiently modeling long input sequences. It comprises three components: (1) a \\emph{patch embedder} that inputs a discrete sequence, embeds each element, and chunks the embeddings into patches of length $P$, (2) a large \\emph{global} Transformer that contextualizes patch representations by performing self-attention over previous patches, and (3) a smaller \\emph{local} Transformer that inputs a contextualized patch representation from the global model and autoregressively predicts the \\emph{next} patch.
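To make this data flow concrete, the sketch below re-implements the equations from the summary figure in PyTorch-style code. It is illustrative only: the module names, default dimensions, and the use of \\texttt{nn.TransformerEncoder} blocks as stand-ins (with causal masking omitted) are simplifying assumptions rather than the exact implementation.
\\begin{lstlisting}
import torch
import torch.nn as nn

class GlobalLocalSketch(nn.Module):
    # Hypothetical names and sizes; causal masks and other details omitted.
    def __init__(self, V=256, D_G=512, D_L=128, P=8, T_max=8192):
        super().__init__()
        self.P, self.D_G, self.D_L = P, D_G, D_L
        self.global_embed = nn.Embedding(V, D_G)    # E^global-embed
        self.pos_embed = nn.Embedding(T_max, D_G)   # E^pos
        self.local_embed = nn.Embedding(V, D_L)     # E^local-embed
        self.global_pad = nn.Parameter(torch.zeros(P, D_G))
        self.local_pad = nn.Parameter(torch.zeros(D_L))
        self.w_GL = nn.Linear(D_G, D_L)             # w^GL projection
        self.global_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(P * D_G, nhead=8, batch_first=True), 2)
        self.local_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_L, nhead=4, batch_first=True), 2)

    def forward(self, x):            # x: (B, T) byte ids, T divisible by P
        B, T = x.shape
        P, K = self.P, T // self.P
        pos = torch.arange(T, device=x.device)
        h = self.global_embed(x) + self.pos_embed(pos)      # (B, T, D_G)
        patches = h.reshape(B, K, P * self.D_G)              # patch embedder
        # Global input: patches shifted right by one, first patch is the pad
        g_in = torch.cat([self.global_pad.reshape(1, 1, -1).expand(B, 1, -1),
                          patches[:, :-1]], dim=1)
        g_out = self.global_tf(g_in).reshape(B, K, P, self.D_G)
        # Local input: projected global output plus byte embeddings shifted
        # right by one within each patch (position 0 gets the local pad)
        loc = self.local_embed(x).reshape(B, K, P, self.D_L)
        loc = torch.cat([self.local_pad.expand(B, K, 1, self.D_L),
                         loc[:, :, :-1]], dim=2)
        l_in = self.w_GL(g_out) + loc                         # (B, K, P, D_L)
        l_out = self.local_tf(l_in.reshape(B * K, P, self.D_L))
        logits = l_out @ self.local_embed.weight.t()          # (B*K, P, V)
        return logits.reshape(B, T, -1)   # next-byte logits at every position
\\end{lstlisting}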


% Specifically, consider that any document $X$ (text, image or speech) can be serialized as a sequence of $t$ bytes $X\\in \\mathbb{R}^t$. These bytes can then be partitioned into small “patches” of $p$ contiguous bytes each, resulting in a compressed document encoding $\\tilde{X}\\in \\mathbb{R}^{\\nicefrac{t}{p}}$. Increasing $p$ allows us to dramatically compress the sequence length of $\\tilde{X}$, which we can then efficiently model with a large language model.

%In general, any input of any modality can be represented as a sequence of $t$ bytes denoted by $X = \\{ \\mathbf{x}_i \\}_{i=1}^t$, where each $\\mathbf{x}_i$ belongs to the set $\\{ 0, 1, \\dots, 255\\}$. To compress $X$ for more efficient modeling, we partition the byte sequence into small, contiguous patches of length $p$, yielding a compressed representation denoted by $\\tilde{X} = \\{ \\mathbf{P}_i \\}_{i=1}^{\\nicefrac{t}{p}}$, where where $ \\mathbf{P}_i = \\{ \\mathbf{y}_{i,j} \\}_{j=1}^p$, and each $\\mathbf{y}_{i,j} \\in \\{ 0, 1, \\dots, 255\\} $. By increasing the patch length $p$, we can drastically reduce the sequence length of $\\tilde{X}$. This compressed representation can be efficiently modeled using a large language model.

% $\\textbf{Patch Encoder.}$ Consider endowing each byte with a small embedding of dimension $e$ (i.e. from a dictionary of 256 learnable embeddings), resulting in a sequence of byte embeddings $E\\in \\mathbb{R}^{t\\times e}$, as in the traditional byte-level modeling problem setup. In order to construct a sequence of larger patch embeddings, we simply apply a concatenation operation to each patch of byte embeddings:

\\subsection{Components}
The $\\textbf{Patch Embedder}$ with patch size $P$ maps a byte sequence $x_{0..T}$ to a sequence of patch embeddings of length $K=\\frac{T}{P}$ and dimension $P\\cdot D_G$.

First, each byte is embedded with a lookup table $E^{\\text{global-embed}}\\inR{V \\times D_G}$ to an embedding of size $D_G$ and positional embeddings are added.

\\begin{align}
\\h{embed}_{t}&= E^\\text{global-embed}_{x_t}+E^\\text{pos}_t & t\\in [0..T)
\\end{align}
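
As a concrete instance (values chosen purely for illustration), a sequence of $T=8192$ bytes with patch size $P=8$ and global dimension $D_G=512$ yields
\\begin{align*}
K = \\frac{T}{P} = \\frac{8192}{8} = 1024
\\end{align*}
patch embeddings, each of dimension $P\\cdot D_G = 8\\times 512 = 4096$.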

`
text.replace(/(\n?%+.+)/g, "")