Assoc. Prof. Sapienza (Rome) | Author: Alice in a differentiable wonderland (https://www.sscardapane.it/alice-book/)
by @junhongshen1.bsky.social @lukezettlemoyer.bsky.social et al.
They use an LLM to predict a "complexity score" for each image token, which in turn decides the size of its VAE latent representation.
arxiv.org/abs/2501.03120
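A minimal sketch of the allocation step, assuming a complexity score in [0, 1] per token/region; the bucket edges and latent sizes below are illustrative, not the paper's:

```python
import torch

def allocate_latent_sizes(complexity_scores, budgets=(4, 8, 16, 32)):
    # Map each predicted complexity score in [0, 1] to one of a few latent sizes,
    # so simple regions get small codes and detailed ones get large codes.
    edges = torch.linspace(0, 1, len(budgets) + 1)[1:-1]   # illustrative bucket edges
    buckets = torch.bucketize(complexity_scores, edges)
    return torch.as_tensor(budgets)[buckets]

scores = torch.tensor([0.05, 0.40, 0.60, 0.95])
print(allocate_latent_sizes(scores))   # tensor([ 4,  8, 16, 32])
```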
by Noah Hollmann et al.
A transformer for tabular data that takes an entire training set as input and provides predictions - trained on millions of synthetic datasets.
www.nature.com/articles/s41...
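The core interface, as a very stripped-down PyTorch sketch (the real model adds attention masking between test points, feature/label preprocessing, and the synthetic-prior training loop):

```python
import torch
import torch.nn as nn

class InContextTabularTransformer(nn.Module):
    # Sketch: labelled training rows and unlabelled query rows form one sequence; the
    # transformer outputs class logits for the query positions directly, with no
    # gradient-based fitting at test time.
    def __init__(self, n_features, n_classes, d_model=128):
        super().__init__()
        self.embed_x = nn.Linear(n_features, d_model)
        self.embed_y = nn.Embedding(n_classes + 1, d_model)   # last id = "label unknown"
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x_train, y_train, x_test):
        unk = torch.full((x_test.shape[0],), self.embed_y.num_embeddings - 1,
                         dtype=torch.long, device=x_test.device)
        tokens = torch.cat([self.embed_x(x_train) + self.embed_y(y_train),
                            self.embed_x(x_test) + self.embed_y(unk)], dim=0)
        h = self.backbone(tokens.unsqueeze(0)).squeeze(0)
        return self.head(h[x_train.shape[0]:])   # logits for the query rows
```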
by @jwuphysics.bsky.social
Integrates a sparse dictionary step on the last layer of a CNN to obtain a set of interpretable features on multiple astronomical prediction tasks.
arxiv.org/abs/2501.00089
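Roughly the kind of head involved, as a sketch (module names and hyperparameters are mine, not the paper's):

```python
import torch
import torch.nn as nn

class SparseDictionaryHead(nn.Module):
    # Sketch: replace the dense readout on the CNN's final features with a sparse code
    # over a learned dictionary, so each prediction is a sum of a few inspectable atoms.
    def __init__(self, d_feat, n_atoms, k=8, n_outputs=1):
        super().__init__()
        self.dictionary = nn.Linear(d_feat, n_atoms)              # features -> atom activations
        self.readout = nn.Linear(n_atoms, n_outputs, bias=False)  # atoms -> prediction
        self.k = k

    def forward(self, feats):   # feats: (batch, d_feat) from the CNN backbone
        acts = torch.relu(self.dictionary(feats))
        topk = torch.topk(acts, self.k, dim=-1)
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        return self.readout(sparse), sparse   # prediction + interpretable sparse code
```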
by @petar-v.bsky.social et al.
They show RoPE has distinct behavior for different rotation angles - high freq for position, low freq for semantics.
arxiv.org/abs/2410.06205
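For reference, the standard RoPE frequencies the analysis is about: the first pairs rotate quickly with position, the last ones barely rotate, which is where the positional vs. semantic split comes from.

```python
import torch

def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE: the pair of channels (2i, 2i+1) rotates by angle position * base**(-2i/head_dim).
    return base ** (-torch.arange(0, head_dim, 2).float() / head_dim)

def apply_rope(x, pos):
    # x: (seq, head_dim); rotate each consecutive 2D channel pair by pos * theta_i.
    theta = rope_frequencies(x.shape[-1])
    angles = pos[:, None].float() * theta[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

freqs = rope_frequencies(64)
print(freqs[:3])   # high-frequency pairs: spin fast with position (positional "pointer")
print(freqs[-3:])  # low-frequency pairs: nearly constant, free to carry semantic content
```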
by Liang et al.
Adding a simple masking operation to momentum-based optimizers can significantly boost their speed.
arxiv.org/abs/2411.16085
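A sketch of the trick on plain SGD with momentum (the paper applies the same idea to Adam-style optimizers; the rescaling of the mask is my reading of an optional detail):

```python
import torch

@torch.no_grad()
def cautious_momentum_step(params, grads, momenta, lr=0.01, beta=0.9):
    # Masking idea: only apply the update where the momentum and the current gradient
    # agree in sign, so the step never points against the fresh gradient.
    for p, g, m in zip(params, grads, momenta):
        m.mul_(beta).add_(g)
        mask = (m * g > 0).to(p.dtype)
        mask = mask * (mask.numel() / (mask.sum() + 1))   # optionally keep the average step size
        p.add_(m * mask, alpha=-lr)
```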
by @artidoro.bsky.social et al.
Trains a small encoder to dynamically aggregate bytes into tokens, which are input to a standard autoregressive model. Nice direction!
arxiv.org/abs/2412.09871
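One of the patching rules is entropy-based; a sketch of just that part, assuming a small autoregressive byte model provides next-byte logits (threshold illustrative):

```python
import torch

def patch_starts(next_byte_logits, threshold=2.0):
    # Start a new patch where the small byte-level LM is "surprised", i.e. the entropy
    # of its next-byte distribution exceeds a threshold (in nats, illustrative value).
    probs = torch.softmax(next_byte_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    return entropy > threshold   # boolean mask over positions: True = a new patch begins

logits = torch.randn(32, 256)    # stand-in for the byte model's outputs
print(patch_starts(logits).sum().item(), "patch boundaries over 32 bytes")
```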
by @norabelrose.bsky.social @eleutherai.bsky.social
Analyzes training through the spectrum of the "training Jacobian" (the Jacobian of the trained weights w.r.t. the initial weights), identifying a large inactive subspace.
arxiv.org/abs/2412.07003
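On a toy model you can compute this object directly; a sketch with a differentiable training loop on linear regression (the paper works with small networks and studies the singular-value spectrum):

```python
import torch

def train(w0, X, y, lr=0.1, steps=100):
    # Differentiable training map: initial weights -> trained weights (toy linear model).
    w = w0
    for _ in range(steps):
        loss = ((X @ w - y) ** 2).mean()
        g, = torch.autograd.grad(loss, w, create_graph=True)
        w = w - lr * g
    return w

torch.manual_seed(0)
X, y = torch.randn(32, 8), torch.randn(32)
w0 = torch.randn(8)
J = torch.autograd.functional.jacobian(lambda w: train(w, X, y), w0)   # (8, 8)
print(torch.linalg.svdvals(J))   # the paper studies this spectrum for small MLPs
```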
by Xu Owen He
Scales a MoE architecture up to millions of experts by implementing a fast retrieval method in the router, inspired by recent MoE scaling laws.
arxiv.org/abs/2407.04153
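The retrieval is done with product keys; a sketch of that routing step (sizes and names illustrative):

```python
import torch
import torch.nn as nn

class ProductKeyRouter(nn.Module):
    # Sketch of product-key retrieval over N = n*n experts: the query is split in two
    # halves, each scored against only n sub-keys, and the top experts come from
    # combining the two top-k lists (routing cost ~O(n) instead of O(N)).
    def __init__(self, d_model, n_sub_keys=1024, top_k=16):
        super().__init__()
        self.k1 = nn.Parameter(torch.randn(n_sub_keys, d_model // 2))
        self.k2 = nn.Parameter(torch.randn(n_sub_keys, d_model // 2))
        self.top_k, self.n = top_k, n_sub_keys

    def forward(self, q):   # q: (batch, d_model)
        q1, q2 = q.chunk(2, dim=-1)
        s1, i1 = (q1 @ self.k1.T).topk(self.top_k, dim=-1)   # (batch, k)
        s2, i2 = (q2 @ self.k2.T).topk(self.top_k, dim=-1)
        cand = s1[:, :, None] + s2[:, None, :]               # scores of the k*k candidates
        scores, flat = cand.flatten(1).topk(self.top_k, dim=-1)
        expert_ids = i1.gather(1, flat // self.top_k) * self.n + i2.gather(1, flat % self.top_k)
        return expert_ids, torch.softmax(scores, dim=-1)     # selected experts + their weights
```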
by Fifty et al.
Replaces the "closest codebook" operation in vector quantization with a rotation and rescaling operations to improve the back-propagation of gradients.
arxiv.org/abs/2410.06424
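A sketch of the behaviour, following my reading of the construction R = I − 2rrᵀ + 2q̂êᵀ with r = (ê+q̂)/‖ê+q̂‖, where the rotation and the rescaling are treated as constants in the backward pass:

```python
import torch

def rotate_to(e, q, eps=1e-8):
    # Map encoder output e onto its selected code q with a rotation + rescale instead of
    # the straight-through copy: the forward pass returns q, but gradients flow through e
    # with the (detached) linear map scale * (I - 2 r r^T + 2 q_hat e_hat^T).
    e_hat = e / (e.norm(dim=-1, keepdim=True) + eps)
    q_hat = q / (q.norm(dim=-1, keepdim=True) + eps)
    r = e_hat + q_hat
    r = (r / (r.norm(dim=-1, keepdim=True) + eps)).detach()
    scale = (q.norm(dim=-1, keepdim=True) / (e.norm(dim=-1, keepdim=True) + eps)).detach()
    rotated = (e
               - 2 * (e * r).sum(-1, keepdim=True) * r
               + 2 * (e * e_hat.detach()).sum(-1, keepdim=True) * q_hat.detach())
    return scale * rotated   # equals q in the forward pass (up to eps)
```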
for Vision Transformers*
by Li et al.
Shows that distilling attention patterns in ViTs is competitive with standard fine-tuning.
arxiv.org/abs/2411.09702
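The core loss is just a divergence between attention maps; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-9):
    # Match the student's attention maps to the (frozen) teacher's with a KL term.
    # Both tensors: (batch, heads, queries, keys), rows already softmax-normalised.
    return F.kl_div((student_attn + eps).log(), teacher_attn, reduction="batchmean")
```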
by Yu et al.
Identifies single weights in LLMs that destroy inference when deactivated. Tracks their mechanisms through the LLM and proposes quantization-specific techniques.
arxiv.org/abs/2411.07191
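The basic probe is easy to sketch: zero out one scalar weight and compare, e.g., perplexity before and after (layer name and indices here are placeholders):

```python
import torch

@torch.no_grad()
def ablate_single_weight(model, layer_name, row, col):
    # Zero out a single scalar weight; weights whose removal destroys generation quality
    # are the candidate "super weights". layer_name/row/col are placeholders.
    W = dict(model.named_parameters())[layer_name]
    original = W[row, col].item()
    W[row, col] = 0.0
    return original   # keep it so the weight can be restored afterwards
```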
by @ekinakyurek.bsky.social et al.
Shows that test-time training (fine-tuning at inference time) strongly improves performance on the ARC dataset.
arxiv.org/abs/2411.07279
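The mechanics, stripped of the ARC-specific augmentations and adapter details the paper uses; a sketch:

```python
import copy
import torch

def test_time_train(model, demos, lr=1e-4, steps=20):
    # Fine-tune a *copy* of the model on the task's few demonstration pairs before
    # predicting (the paper adds task augmentations and LoRA-style adapters on top).
    adapted = copy.deepcopy(model)
    opt = torch.optim.AdamW(adapted.parameters(), lr=lr)
    adapted.train()
    for _ in range(steps):
        for x, y in demos:   # demos: iterable of (input, target) tensors
            opt.zero_grad()
            torch.nn.functional.cross_entropy(adapted(x), y).backward()
            opt.step()
    return adapted.eval()
```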
by Barrault et al.
Builds an autoregressive model in a "concept" space by wrapping the LLM in a pre-trained sentence embedder (also works with diffusion models).
arxiv.org/abs/2412.08821
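The shape of the idea as a sketch: a frozen sentence embedder maps each sentence to a "concept" vector, a causal transformer predicts the next concept, and a frozen decoder maps it back to text. The embedder/decoder and the regression-style objective below are placeholders for the paper's choices.

```python
import torch
import torch.nn as nn

class ConceptLM(nn.Module):
    # Autoregression in concept space: inputs and outputs are sentence embeddings,
    # not tokens; the wrapping encoder/decoder are assumed to be pre-trained and frozen.
    def __init__(self, d_concept=1024, n_layers=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_concept, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_concept, d_concept)

    def forward(self, concepts):   # (batch, n_sentences, d_concept)
        n = concepts.shape[1]
        causal = torch.triu(torch.full((n, n), float("-inf"), device=concepts.device), diagonal=1)
        return self.proj(self.core(concepts, mask=causal))   # predicted next-concept vectors
```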
by @phillipisola.bsky.social et al.
An encoder to compress an image into a sequence of 1D tokens whose length can dynamically vary depending on the specific image.
arxiv.org/abs/2411.02393
by @alanjeffares.bsky.social @aliciacurth.bsky.social
Shows that tracking 1st-order approximations to the training dynamics provides insights into many phenomena (e.g., double descent, grokking).
arxiv.org/abs/2411.00247
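The 1st-order object in question is the network linearized around initialization; with torch.func you can track it alongside the real model. A sketch:

```python
import torch
from torch.func import functional_call, jvp

model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
params0 = {k: v.detach().clone() for k, v in model.named_parameters()}

def f_lin(params, x):
    # First-order Taylor expansion of the network around its initialisation:
    # f_lin(x; w) = f(x; w0) + J_w f(x; w0) (w - w0). Comparing its loss curve with the
    # real network's is (roughly) the kind of tracking the paper builds on.
    delta = {k: params[k] - params0[k] for k in params0}
    out0, jvp_out = jvp(lambda p: functional_call(model, p, (x,)), (params0,), (delta,))
    return out0 + jvp_out
```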
by @alessiodevoto.bsky.social @sgiagu.bsky.social et al.
We propose a MoE graph transformer for particle collision analysis, with many nice interpretability insights (e.g., expert specialization).
arxiv.org/abs/2501.03432
Part deep learning history, part overview on the vast landscape of "efficiency" in DL (hardware, compilers, architecture, ...). Fantastic post!
alexzhang13.github.io/blog/2024/ef...
Develops a theory of "modular duality" for designing principled optimizers that respect the "type semantics" of each layer.
arxiv.org/abs/2410.21265
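To give a flavour, and heavily simplified: for a linear layer the dualized gradient ends up being (up to a shape-dependent scale) the nearest semi-orthogonal matrix, which you can sketch with an SVD; the paper derives such maps per layer type and composes them through the network.

```python
import torch

def dualize_linear_gradient(G):
    # Snap the gradient of a linear layer to the nearest semi-orthogonal matrix via its
    # reduced SVD (scale factors and the maps for other layer types omitted).
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh

W = torch.randn(256, 128)
G = torch.randn_like(W)                       # stand-in for a gradient
W = W - 0.02 * dualize_linear_gradient(G)     # one dualized update step (step size illustrative)
```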
Lens of Complexity*
by @thomasfel.bsky.social @louisbethune.bsky.social @lampinen.bsky.social
Wonderful work! They rank features' complexity with a variant of mutual information, before analyzing their dynamics.
arxiv.org/abs/2407.06076
My friend FM Bianchi made an awesome introduction to GNNs and pooling techniques over graphs, full of nice visuals and details! 🔥
gnn-pooling.notion.site/1-3-pooling-...
by @talschuster.bsky.social et al.
Converts pre-trained transformers to a more efficient version by turning blocks of layers into a single layer which is iterated. Lots of interesting tricks!
arxiv.org/abs/2410.20672
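The core move, minus the relaxation and distillation tricks; a sketch:

```python
import torch.nn as nn

class RecursiveBlock(nn.Module):
    # Replace a block of K distinct layers with one shared layer applied K times (the
    # paper adds low-rank per-iteration "relaxation" and distillation on top of this).
    def __init__(self, shared_layer, n_iterations):
        super().__init__()
        self.layer = shared_layer
        self.n_iterations = n_iterations

    def forward(self, x):
        for _ in range(self.n_iterations):
            x = self.layer(x)
        return x

# e.g. swap transformer.layers[4:8] for RecursiveBlock(transformer.layers[4], 4)
```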
by @tkrusch.bsky.social @mmbronstein.bsky.social
They propose a regularization approach for exploiting symmetries in the data (penalizing predictions that vary across augmented copies of an input).
arxiv.org/abs/2410.17878
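A sketch of the penalty, assuming `augment` implements whatever group action / data augmentation is of interest:

```python
import torch

def symmetry_penalty(model, x, augment, n_views=4):
    # Penalise how much the model's predictions vary across augmented copies of the
    # same input; the penalty is zero iff the model is invariant on these views.
    preds = torch.stack([model(augment(x)) for _ in range(n_views)])   # (views, batch, out)
    return preds.var(dim=0).mean()

# total_loss = task_loss + lam * symmetry_penalty(model, x, augment)
```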
They show that using softmax in the attention computation upper-bounds the Frobenius norm of the attention matrix, and similar results can be obtained with a polynomial normalization.
arxiv.org/abs/2410.18613
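The softmax half of the claim is easy to see numerically: every row of a softmax attention matrix is a probability vector, so its L2 norm is at most 1 and ‖A‖_F ≤ √n no matter how large the logits get (the paper shows a suitably normalized polynomial activation gives a similar bound):

```python
import torch

# Check the bound ||A||_F <= sqrt(n) for softmax attention with deliberately large logits.
n, d = 128, 64
logits = (torch.randn(n, d) @ torch.randn(d, n)) * 10.0
A = torch.softmax(logits, dim=-1)
print(A.norm().item(), "<=", n ** 0.5)
```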
with @alessiodevoto.bsky.social @neuralnoise.com
Happy to share that our work on distilling efficient transformers with dynamic module activation was accepted at #AAAI2025. 🔥
arxiv.org/abs/2312.10193
by @colah.bsky.social @anthropic.com
Investigates stability & dynamics of "interpretable features" with cross-layer SAEs. Can also be used to investigate differences in fine-tuned models.
transformer-circuits.pub/2024/crossco...
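A sketch of the crosscoder shape: per-layer encoders feed one shared sparse code, and per-layer decoders read it back (sizes and the placement of the L1 penalty are illustrative):

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    # One shared sparse code reading from, and writing to, the activations of several
    # layers: per-layer encoders are summed into a single ReLU latent, and per-layer
    # decoders reconstruct each layer's activations from it.
    def __init__(self, n_layers, d_model, d_latent):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d_model, d_latent, bias=False) for _ in range(n_layers))
        self.decoders = nn.ModuleList(nn.Linear(d_latent, d_model, bias=False) for _ in range(n_layers))
        self.bias = nn.Parameter(torch.zeros(d_latent))

    def forward(self, acts):   # acts: list of (batch, d_model) activations, one per layer
        z = torch.relu(sum(enc(a) for enc, a in zip(self.encoders, acts)) + self.bias)
        recons = [dec(z) for dec in self.decoders]
        sparsity = z.abs().sum(-1).mean()   # L1 penalty on the shared code
        return recons, sparsity
```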