Pranam Chatterjee
@pranam.bsky.social
Designing peptides/proteins to program biology! 🧬💻🧫 Assistant Professor at Duke | Co-Founder of Gameto and UbiquiTx | MIT SB, SM, PhD
Yes, definitely. A learned tokenizer is always more complex. The nice thing about ESM-2 is that it uses per-residue tokenization, not BPE, SentencePiece, or another learned subword tokenizer. That lets us get good residue-level embeddings. :)
December 2, 2024 at 3:33 AM
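A minimal sketch of the per-residue tokenization idea mentioned above: each amino acid maps to exactly one token, so embeddings align one-to-one with residues. The vocabulary below is illustrative, not ESM-2's actual token IDs.

```python
# Per-residue tokenization as used by ESM-2-style protein LMs:
# one token per amino acid, so len(tokens) == len(sequence).
# This toy vocabulary is an assumption for illustration only.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize_per_residue(seq: str) -> list[int]:
    """Map each residue to its own token ID."""
    return [VOCAB[aa] for aa in seq]

tokens = tokenize_per_residue("MKTAY")
# Every residue gets exactly one token, unlike BPE,
# where token boundaries need not align with residues.
```

Because the mapping is one-to-one, the model's output hidden states can be read off directly as residue-level embeddings with no alignment step.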
I worry that during pre-training, the token embeddings ended up having quite expressive representations themselves. Using a special token would work, but you would need to really contextualize its representation, just as the &lt;mask&gt; token's was. Otherwise, I could imagine a dropoff in performance.
December 2, 2024 at 3:23 AM
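A toy illustration of the concern above: a freshly added special token's embedding is uninformative until the model contextualizes it, whereas &lt;mask&gt; was contextualized throughout pre-training. Here "contextualize" is a stand-in (mean of neighboring embeddings) for what attention layers actually do; dimensions and values are arbitrary.

```python
# Toy sketch: a new special token starts with a random, meaningless
# embedding; only after contextualization does its representation
# reflect the surrounding residues. Averaging neighbors here is a
# deliberately crude stand-in for attention.

import random

DIM = 4

def random_embedding() -> list[float]:
    """Uninformative initial embedding for a newly added special token."""
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

def contextualize(embeddings: list[list[float]], idx: int) -> list[float]:
    """Replace position idx with the mean of its neighbors (toy attention)."""
    neighbors = [e for i, e in enumerate(embeddings) if i != idx]
    return [sum(v[d] for v in neighbors) / len(neighbors) for d in range(DIM)]
```

The point being sketched: performance hinges on the special token's representation being driven by context, not by its (initially arbitrary) embedding.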
Yes we run most of the inference pipelines on A100s and H100s. Haven’t had a problem — A6000s have been fine as well.
November 24, 2024 at 11:04 PM
Ooh such a good idea!! I’ll try it! :)
November 23, 2024 at 9:40 PM
Great points! I actually never liked it either and most of the time, it’s hard to effectively debug with everyone watching. 😅
November 23, 2024 at 7:39 PM
Of course!! Will do! The biggest test will be when we down select generated molecules based on Boltz-1 metrics and we’ll see if they work in the wet lab. 🧫
November 21, 2024 at 1:03 AM
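The down-selection step described above can be sketched as a simple filter-and-rank over predicted confidence metrics. The field names ("iptm", "plddt") and thresholds below are assumptions for illustration, not Boltz-1's actual output schema or the lab's actual cutoffs.

```python
# Hedged sketch: down-select designed molecules by structure-prediction
# confidence before wet-lab testing. Metric names and cutoffs are
# illustrative assumptions, not Boltz-1's real interface.

def down_select(designs, min_iptm=0.6, min_plddt=70.0, top_k=10):
    """Keep designs passing confidence cutoffs, ranked by interface score."""
    passing = [d for d in designs
               if d["iptm"] >= min_iptm and d["plddt"] >= min_plddt]
    return sorted(passing, key=lambda d: d["iptm"], reverse=True)[:top_k]

candidates = [
    {"id": "pep_01", "iptm": 0.82, "plddt": 85.0},
    {"id": "pep_02", "iptm": 0.41, "plddt": 90.0},  # fails the ipTM cutoff
    {"id": "pep_03", "iptm": 0.66, "plddt": 72.0},
]
hits = down_select(candidates)  # pep_01 ranked first, then pep_03
```

Only the survivors of a filter like this would go on to experimental testing, which keeps the wet-lab burden proportional to model confidence rather than to the number of generated designs.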
Yeah same. The ByteDance one, Protenix, is quite good, and the engineering from them is always clean!
November 19, 2024 at 1:05 PM
Yeah nothing easy about it! And the throughput is so low that it's hard to get a good look at the hit rate of the algorithms without doing a mini display assay. Ahh such is life! 😅
November 19, 2024 at 11:25 AM
We usually do some hacky ELISAs via biotinylation of the analyte and then SPR the best ones. A horridly cumbersome set of experiments. 😣
November 19, 2024 at 11:19 AM
Ugh so true!! And as a lab that does peptides, the fact that it's so slow and expensive to synthesize an 18mer is insanity. 🤦🏾‍♂️ The only alternative is to His-tag purify, which also sucks. And don't get me started on Kd analysis…still no reliable high-throughput binding affinity measurement. 😣
November 19, 2024 at 11:13 AM
Agreed!! We're using the AF3 models to validate our language model-based binder designs against structured targets (and metals, DNA, etc.) prior to experimental testing, as a sort of hint on performance. But of course, the true test is in the lab for us!! 🧫
November 19, 2024 at 11:04 AM
A strategy that seems to be useful is using heterodimeric PDBs of single proteins and cutting interfaces — there’s a bit more conformational flexibility captured, and our LMs have done better with this noisier data.
December 31, 2023 at 12:29 PM
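The interface-cutting strategy above can be sketched as a distance filter: keep the residues of one chain that sit close to the partner chain. The coordinates below are toy stand-ins for parsed PDB Cα atoms, and the 8 Å cutoff is a commonly used but assumed value, not necessarily the lab's.

```python
# Sketch: cut an interface from a heterodimeric PDB by keeping chain-A
# residues whose C-alpha lies within a cutoff of any chain-B C-alpha.
# Records are (residue_id, (x, y, z)) tuples; real use would parse
# these from a PDB file. The 8.0 Å cutoff is an assumption.

def interface_residues(chain_a, chain_b, cutoff=8.0):
    """Return chain-A residue IDs within `cutoff` Å of chain B."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    c2 = cutoff ** 2  # compare squared distances to avoid sqrt
    return [rid for rid, xyz in chain_a
            if any(dist2(xyz, xyz_b) <= c2 for _, xyz_b in chain_b)]

a = [(1, (0.0, 0.0, 0.0)), (2, (20.0, 0.0, 0.0))]
b = [(10, (5.0, 0.0, 0.0))]
# Residue 1 is within 8 Å of chain B; residue 2 is not.
```

Cut interfaces like this give the language model noisier, more conformationally diverse training pairs than intact single-chain structures, which is the effect described in the post.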
We’ve worked to create a similar dataset with minimal leakage, but to do interface prediction from pLM residue embeddings. It’s super tough and we’ve yet to find a good train/test cluster-based split that would achieve this.
December 31, 2023 at 12:27 PM
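A minimal sketch of the cluster-based splitting being described: assign whole sequence clusters (e.g. from an MMseqs2-style clustering, assumed precomputed here) entirely to train or test, so no cluster straddles the boundary and homologs cannot leak across the split.

```python
# Sketch of a leakage-controlled split: whole clusters go to train or
# test, never both. `cluster_of` maps sequence ID -> cluster ID and is
# assumed to come from an external clustering step (e.g. MMseqs2).

import random

def cluster_split(cluster_of, test_frac=0.2, seed=0):
    """Split sequence IDs so no cluster appears in both train and test."""
    clusters = sorted(set(cluster_of.values()))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_frac))
    test_clusters = set(clusters[:n_test])
    train = [s for s, c in cluster_of.items() if c not in test_clusters]
    test = [s for s, c in cluster_of.items() if c in test_clusters]
    return train, test

# 20 sequences in 5 clusters; one whole cluster lands in the test set.
cluster_of = {f"s{i}": i % 5 for i in range(20)}
train_ids, test_ids = cluster_split(cluster_of)
```

This guarantees disjoint clusters across the split; the hard part the post alludes to is choosing a clustering granularity at which the held-out set is both leakage-free and large enough to be informative.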