Pranam Chatterjee
@pranam.bsky.social
Designing peptides/proteins to program biology! 🧬💻🧫 Assistant Professor at Duke | Co-Founder of Gameto and UbiquiTx | MIT SB, SM, PhD
Yes, definitely. A learned tokenizer is always more complex. The nice thing about ESM-2 is that it uses per-residue tokenization, not BPE, SentencePiece, or another learned subword tokenizer. That lets us get good residue-level embeddings. :)
December 2, 2024 at 3:33 AM
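A minimal sketch of the per-residue tokenization idea mentioned above: each amino acid maps to exactly one token, so embeddings align one-to-one with residues. The vocabulary below is illustrative, not ESM-2's actual token IDs.

```python
# Per-residue tokenization as used by ESM-2-style protein LMs:
# one token per amino acid, so len(tokens) == len(sequence).
# This toy vocabulary is an assumption for illustration only.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize_per_residue(seq: str) -> list[int]:
    """Map each residue to its own token ID."""
    return [VOCAB[aa] for aa in seq]

tokens = tokenize_per_residue("MKTAY")
# Every residue gets exactly one token, unlike BPE,
# where token boundaries need not align with residues.
```

Because the mapping is one-to-one, the model's output hidden states can be read off directly as residue-level embeddings with no alignment step.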
I worry that during pre-training, the token embeddings ended up having quite expressive representations themselves. Using a special token would work, but you would need to really contextualize its representation, just as the &lt;mask&gt; token's was. Otherwise, I could imagine a dropoff in performance.
December 2, 2024 at 3:23 AM
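A toy illustration of the concern above: a freshly added special token's embedding is uninformative until the model contextualizes it, whereas &lt;mask&gt; was contextualized throughout pre-training. Here "contextualize" is a stand-in (mean of neighboring embeddings) for what attention layers actually do; dimensions and values are arbitrary.

```python
# Toy sketch: a new special token starts with a random, meaningless
# embedding; only after contextualization does its representation
# reflect the surrounding residues. Averaging neighbors here is a
# deliberately crude stand-in for attention.

import random

DIM = 4

def random_embedding() -> list[float]:
    """Uninformative initial embedding for a newly added special token."""
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

def contextualize(embeddings: list[list[float]], idx: int) -> list[float]:
    """Replace position idx with the mean of its neighbors (toy attention)."""
    neighbors = [e for i, e in enumerate(embeddings) if i != idx]
    return [sum(v[d] for v in neighbors) / len(neighbors) for d in range(DIM)]
```

The point being sketched: performance hinges on the special token's representation being driven by context, not by its (initially arbitrary) embedding.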
Yes we run most of the inference pipelines on A100s and H100s. Haven’t had a problem — A6000s have been fine as well.
November 24, 2024 at 11:04 PM
Ooh such a good idea!! I’ll try it! :)
November 23, 2024 at 9:40 PM
Great points! I actually never liked it either and most of the time, it’s hard to effectively debug with everyone watching. 😅
November 23, 2024 at 7:39 PM
Of course!! Will do! The biggest test will be when we down select generated molecules based on Boltz-1 metrics and we’ll see if they work in the wet lab. 🧫
November 21, 2024 at 1:03 AM
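The down-selection step described above can be sketched as a simple filter-and-rank over predicted confidence metrics. The field names ("iptm", "plddt") and thresholds below are assumptions for illustration, not Boltz-1's actual output schema or the lab's actual cutoffs.

```python
# Hedged sketch: down-select designed molecules by structure-prediction
# confidence before wet-lab testing. Metric names and cutoffs are
# illustrative assumptions, not Boltz-1's real interface.

def down_select(designs, min_iptm=0.6, min_plddt=70.0, top_k=10):
    """Keep designs passing confidence cutoffs, ranked by interface score."""
    passing = [d for d in designs
               if d["iptm"] >= min_iptm and d["plddt"] >= min_plddt]
    return sorted(passing, key=lambda d: d["iptm"], reverse=True)[:top_k]

candidates = [
    {"id": "pep_01", "iptm": 0.82, "plddt": 85.0},
    {"id": "pep_02", "iptm": 0.41, "plddt": 90.0},  # fails the ipTM cutoff
    {"id": "pep_03", "iptm": 0.66, "plddt": 72.0},
]
hits = down_select(candidates)  # pep_01 ranked first, then pep_03
```

Only the survivors of a filter like this would go on to experimental testing, which keeps the wet-lab burden proportional to model confidence rather than to the number of generated designs.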
Yeah same. The ByteDance one, Protenix, is quite good, and the engineering from them is always clean!
November 19, 2024 at 1:05 PM
Yeah nothing easy about it! And the throughput is so low that it's hard to get a good look at the hit rate of the algorithms without doing a mini display assay. Ahh such is life! 😅
November 19, 2024 at 11:25 AM
We usually do some hacky ELISAs via biotinylation of the analyte and then SPR the best ones. A horridly cumbersome set of experiments. 😣
November 19, 2024 at 11:19 AM
Ugh so true!! And as a lab that does peptides, the fact that it's so slow and expensive to synthesize an 18mer is insanity. 🤦🏾‍♂️ The only alternative is to His-tag purify, which also sucks. And don't get me started on Kd analysis…still no reliable high-throughput binding affinity measurement. 😣
November 19, 2024 at 11:13 AM
Agreed!! We're using the AF3 models to validate our language model-based binder designs against structured targets (and metals, DNA, etc.) prior to experimental testing, as a sort of hint on performance. But of course, the true test is in the lab for us!! 🧫
November 19, 2024 at 11:04 AM
A strategy that seems to be useful is using heterodimeric PDBs of single proteins and cutting interfaces — there’s a bit more conformational flexibility captured, and our LMs have done better with this noisier data.
December 31, 2023 at 12:29 PM
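The interface-cutting strategy above can be sketched as a distance filter: keep the residues of one chain that sit close to the partner chain. The coordinates below are toy stand-ins for parsed PDB Cα atoms, and the 8 Å cutoff is a commonly used but assumed value, not necessarily the lab's.

```python
# Sketch: cut an interface from a heterodimeric PDB by keeping chain-A
# residues whose C-alpha lies within a cutoff of any chain-B C-alpha.
# Records are (residue_id, (x, y, z)) tuples; real use would parse
# these from a PDB file. The 8.0 Å cutoff is an assumption.

def interface_residues(chain_a, chain_b, cutoff=8.0):
    """Return chain-A residue IDs within `cutoff` Å of chain B."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    c2 = cutoff ** 2  # compare squared distances to avoid sqrt
    return [rid for rid, xyz in chain_a
            if any(dist2(xyz, xyz_b) <= c2 for _, xyz_b in chain_b)]

a = [(1, (0.0, 0.0, 0.0)), (2, (20.0, 0.0, 0.0))]
b = [(10, (5.0, 0.0, 0.0))]
# Residue 1 is within 8 Å of chain B; residue 2 is not.
```

Cut interfaces like this give the language model noisier, more conformationally diverse training pairs than intact single-chain structures, which is the effect described in the post.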
We’ve worked to create a similar dataset with minimal leakage, but to do interface prediction from pLM residue embeddings. It’s super tough and we’ve yet to find a good train/test cluster-based split that would achieve this.
December 31, 2023 at 12:27 PM
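A minimal sketch of the cluster-based splitting being described: assign whole sequence clusters (e.g. from an MMseqs2-style clustering, assumed precomputed here) entirely to train or test, so no cluster straddles the boundary and homologs cannot leak across the split.

```python
# Sketch of a leakage-controlled split: whole clusters go to train or
# test, never both. `cluster_of` maps sequence ID -> cluster ID and is
# assumed to come from an external clustering step (e.g. MMseqs2).

import random

def cluster_split(cluster_of, test_frac=0.2, seed=0):
    """Split sequence IDs so no cluster appears in both train and test."""
    clusters = sorted(set(cluster_of.values()))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_frac))
    test_clusters = set(clusters[:n_test])
    train = [s for s, c in cluster_of.items() if c not in test_clusters]
    test = [s for s, c in cluster_of.items() if c in test_clusters]
    return train, test

# 20 sequences in 5 clusters; one whole cluster lands in the test set.
cluster_of = {f"s{i}": i % 5 for i in range(20)}
train_ids, test_ids = cluster_split(cluster_of)
```

This guarantees disjoint clusters across the split; the hard part the post alludes to is choosing a clustering granularity at which the held-out set is both leakage-free and large enough to be informative.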