Yuki Asano
@yukimasano.bsky.social
Professor at University of Technology Nuremberg
Head of Fundamental AI Lab
Thanks for tagging. In addition, have a look at the NV-Embed paper (arxiv.org/abs/2405.17428): they do contrastive finetuning after turning on the bidirectional attention mask
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retri...
arxiv.org
November 28, 2024 at 5:53 PM
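For readers new to this training style, here is a minimal sketch (my paraphrase, not the NV-Embed code) of in-batch contrastive finetuning on top of such an embedding model. `encode` is a hypothetical stand-in for the decoder-only LLM with its causal mask switched off (bidirectional attention) and token states pooled into one vector per text.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb, p_emb, temperature=0.05):
    """In-batch negatives: the i-th query should match the i-th passage."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.t() / temperature                     # (B, B) similarity logits
    targets = torch.arange(q.size(0), device=q.device)   # diagonal pairs are positives
    return F.cross_entropy(logits, targets)

# usage sketch, assuming encode(texts) -> (B, D) embeddings from the bidirectional LLM:
# loss = info_nce_loss(encode(queries), encode(passages))
# loss.backward()
```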
and perhaps also interesting for you: probing the text representations of LLMs for CLIP-like zero-shot classification: arxiv.org/abs/2410.07173
Do better language models have crisper vision?
How well do text-only Large Language Models (LLMs) grasp the visual world? As LLMs are increasingly used in computer vision, addressing this question becomes both fundamental and pertinent. However, e...
arxiv.org
November 26, 2024 at 1:21 PM
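As a rough illustration of the kind of readout involved (a generic CLIP-style sketch, not necessarily the linked paper's exact recipe): class names are embedded by a text model, images by a vision encoder, and a class is picked by cosine similarity after projecting into a shared space.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_feats, class_text_feats, projection=None):
    # image_feats: (B, D_img); class_text_feats: (C, D_txt)
    # `projection` is an assumed (learned) linear map aligning the two spaces.
    if projection is not None:
        image_feats = image_feats @ projection     # (B, D_txt)
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(class_text_feats, dim=-1)
    return (img @ txt.t()).argmax(dim=-1)          # predicted class index per image
```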
Reposted by Yuki Asano
Sam next to his poster; I'm still very impressed he did all this for his MSc thesis! #BMVC2024
November 26, 2024 at 10:25 AM
exactly. Hence the new "post-(pre)training" term, perhaps? "Post-training" seems to be a good generic term for RLHF/preference tuning etc. in NLP (allenai.org/papers/tulu-...), so by saying "post-pretraining" we could emphasize the fact that it's unsupervised
allenai.org
November 26, 2024 at 8:30 AM
"Post-pretraining", "unsupervised domain adaptation" fits, but I think is used for different tasks
November 26, 2024 at 8:01 AM
This work was led by Jochem Loedeman during his MSc, and supervised by Maarten Stol, Tengda Han and myself.
📓: arxiv.org/abs/2210.06466
Visit BMVC poster 532 at 10am today!
Prompt Generation Networks for Input-Space Adaptation of Frozen Vision Transformers
With the introduction of the transformer architecture in computer vision, increasing model scale has been demonstrated as a clear path to achieving performance and robustness gains. However, with mode...
arxiv.org
November 26, 2024 at 7:28 AM
This means we can simply send an adapted RGB image to the server to get a personalised output.
We also show that the gains don't just come from adding a new learnable model, but from the interplay between the frozen pretrained model and the PGN.
November 26, 2024 at 7:28 AM
This CNN (e.g. running on a phone) outputs a softmax over a set of learned tokens. These are then combined and used for the adaptation. This allows for efficient learning, and also for moving the signal back into pixel space via a pseudo-inverse.
November 26, 2024 at 7:28 AM
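A minimal sketch of how I read that mechanism (module sizes and names are illustrative, not the authors' code): a small CNN predicts, per prompt slot, a softmax over a library of learned tokens, and the weighted combinations become the prompts for the frozen ViT. With a linear patch-embedding layer, those prompt tokens can then be pushed back into pixel space via its pseudo-inverse, as described above.

```python
import torch
import torch.nn as nn

class PromptGenerationNetwork(nn.Module):
    def __init__(self, n_library=256, n_prompts=16, dim=768):
        super().__init__()
        # lightweight CNN backbone (placeholder; the actual network is a design choice)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_logits = nn.Linear(64, n_prompts * n_library)
        self.token_library = nn.Parameter(torch.randn(n_library, dim))
        self.n_prompts, self.n_library = n_prompts, n_library

    def forward(self, images):                          # images: (B, 3, H, W)
        logits = self.to_logits(self.cnn(images))       # (B, n_prompts * n_library)
        logits = logits.view(-1, self.n_prompts, self.n_library)
        weights = logits.softmax(dim=-1)                 # softmax over the learned tokens
        prompts = weights @ self.token_library           # (B, n_prompts, dim) convex combinations
        return prompts                                   # used to adapt the frozen model's input
```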
Also known as reprogramming, works from @phillipisola.bsky.social showed that even adjusting individual pixels allows adapting a model. We take this one step further and make the input-only adaptation signal dependent on the image itself: we introduce a lightweight CNN, the Prompt Generation Network.
November 26, 2024 at 7:28 AM
LoRA is great, but one disadvantage is that serving thousands of these adapters efficiently is very difficult: GPUs are inefficient when, e.g., one adapter applies to only a single sample in a large batch. The solution is to adapt the model strictly in input space.
November 26, 2024 at 7:28 AM
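To make the serving argument concrete, a toy sketch (my illustration, not the paper's code): each client's adaptation lives entirely in its own pixels, so one frozen model can process a batch mixing arbitrarily many "adapters" in a single forward pass.

```python
import torch

def serve_batch(frozen_model, images, per_client_deltas):
    # images, per_client_deltas: (B, 3, H, W); one learned perturbation per client,
    # e.g. produced on-device by a PGN-like module
    adapted = (images + per_client_deltas).clamp(0, 1)   # adaptation happens in RGB space only
    with torch.no_grad():
        return frozen_model(adapted)                      # one shared forward pass, no per-client weights
```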