Jeremie Kalfon 👨‍💻🧬🤖🚀
jkobject.com
@jkobject.com
Doing a Ph.D. AI in Bio. | Ex @WhiteLabGx @BroadInstitute @MIT | Built @PiPleteam | ML, Cancer, Genomics, Data Sci, Entrepreneur, FullStack Dev | All views are mine
The first 1 million integers visualized in 2D according to their prime factors (UMAP)

Source: johnhw.github.io/uma...
September 25, 2025 at 8:05 AM
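A minimal sketch of the feature construction such a plot relies on: describing each number by the set of primes that divide it. The function names are hypothetical, the limit is kept tiny for readability, and the UMAP step itself is not run here. It is assumed that the linked visualization uses a comparable binary "divisible by prime p" encoding.

```python
def smallest_prime_factors(limit):
    """Sieve: spf[n] is the smallest prime dividing n, for 2 <= n <= limit."""
    spf = list(range(limit + 1))
    for p in range(2, int(limit ** 0.5) + 1):
        if spf[p] == p:  # p has no smaller divisor, so it is prime
            for multiple in range(p * p, limit + 1, p):
                if spf[multiple] == multiple:
                    spf[multiple] = p
    return spf


def prime_factor_set(n, spf):
    """Distinct primes dividing n (empty set for n < 2)."""
    factors = set()
    while n > 1:
        p = spf[n]
        factors.add(p)
        while n % p == 0:  # strip every power of p
            n //= p
    return factors


spf = smallest_prime_factors(100)
# One sparse feature set per number; a 2D embedding would be computed from these.
features = {n: prime_factor_set(n, spf) for n in range(1, 101)}
```

Each set can be read as a sparse binary vector with one column per prime, which is the kind of input a dimensionality-reduction tool would take.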
what they sell you, what you get...
September 20, 2025 at 9:54 AM
Using a common fine-tuning mechanism, we show how one can train from one scale to the next by back-propagating the signal to the compressed tokens and the lower-scale model.
June 20, 2025 at 9:06 AM
Each group of biologists is in its own niche, and so too are the models. But these models describe different steps of the same staircase.

We present ideas on how we might end up training models from atoms to organs by using transformers to compress 🔺 🔻 data into tokens used by larger-scale models
June 20, 2025 at 9:06 AM
Very happy to share that my new paper got accepted at the ICML workshop on Foundation Models for Life Sciences!!
www.biorxiv.org/cont...

Foundation Models are being trained from atoms to molecules ⚛️, molecule chains 🧬, entire cells 🦠, and even groups of cells across tissue slices 🫁
June 20, 2025 at 9:06 AM
As part of my research, I believe that scientific outreach is essential. Last week, I had the pleasure of presenting how AI can help us understand biology and the cell at Pint of Science 2025!

I also put together a short video recap (in French) for those curious: youtu.be/fc8L8Dn_7tw...
1/2
May 23, 2025 at 8:54 AM
Thanks again team and congrats on the First Place!!

6/6
December 9, 2024 at 9:18 AM
Nothing here is publication-worthy, of course, and many important problems were not solved, like batch effects, adaptive patch size, etc. Still, seeing what can be done with drive and elbow grease makes me optimistic about the future!

5/6
December 9, 2024 at 9:18 AM
Secondly, we ran scPRINT on gene-panel ST datasets like Xenium to predict cell-level cell type and disease, and to impute the remaining genes' expression, with a surprising ability to predict some cancerous cells in BRCA slides. Many other ideas were left unfinished.
4/6
December 9, 2024 at 9:18 AM
The goal: finding similar patients through their slides or slide subsets, to retrieve associated molecular profiles, disease subtypes, and treatment and health journeys. Our tool automatically downloads the HEST1K database and generates an AnnData with patch embeddings.
3/6
December 9, 2024 at 9:18 AM
We managed to introduce a pipeline mixing #scPRINT, #HEST1K, #CONCH, #SPATIALDATA and #NAPARI. Our first POC was STsimilarity: finding similar image patches across a large database of histopathologic slides.
2/6
December 9, 2024 at 9:18 AM
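A minimal sketch of what patch retrieval over embeddings can look like. This is not the STsimilarity code: cosine similarity, the function names, the patch identifiers, and the toy vectors are all assumptions made for illustration. Real embeddings would come from an image encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_patches(query, database, k=3):
    """Return the k (patch_id, similarity) pairs closest to `query`, best first."""
    scored = [
        (patch_id, cosine_similarity(query, embedding))
        for patch_id, embedding in database.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings, made up for this example.
patch_db = {
    "slide_A/patch_0": [1.0, 0.0, 0.0],
    "slide_A/patch_1": [0.9, 0.1, 0.0],
    "slide_B/patch_0": [0.0, 1.0, 0.0],
}
hits = most_similar_patches([2.0, 0.0, 0.0], patch_db, k=2)
```

A brute-force scan like this is fine for a demo; a large slide database would more likely use an approximate nearest-neighbor index.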
I thus decided to formalize it by creating GRnnData: First, it is a tool to import and store many different network formats in an AnnData. 💁 But it also contains more bells and whistles for working with gene networks (like subsetting genes, extracting targets, plotting 💹). 5/6
November 19, 2024 at 12:30 PM
Interestingly, there is a possible standard for it! 🎉 AnnData contains the .varp field, which is made to store var-to-var (e.g. gene-to-gene) relationships. However, not many people use it… 4/6
November 19, 2024 at 12:30 PM
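A minimal sketch of turning a gene-gene edge list into a square matrix aligned with the dataset's gene order, which is the shape .varp expects. This is not GRnnData's API: the helper name, the column layout (regulator, target, weight), the row-to-column direction convention, and the example genes are assumptions for illustration.

```python
import csv
import io


def edges_to_matrix(edge_rows, genes):
    """Dense genes x genes matrix; entry [i][j] is the weight of regulator i -> target j."""
    index = {gene: i for i, gene in enumerate(genes)}
    matrix = [[0.0] * len(genes) for _ in genes]
    for source, target, weight in edge_rows:
        if source in index and target in index:  # skip genes absent from the dataset
            matrix[index[source]][index[target]] = float(weight)
    return matrix


# Hypothetical tab-separated edge list: regulator, target, weight.
tsv_text = "TP53\tMDM2\t0.9\nMYC\tTP53\t0.4\nMYC\tUNKNOWN\t0.7\n"
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

genes = ["MDM2", "MYC", "TP53"]  # must follow the order of adata.var_names
grn = edges_to_matrix(rows, genes)

# With anndata and numpy installed, the matrix could then be stored as
# (the key name "GRN" is arbitrary):
#   adata.varp["GRN"] = numpy.array(grn)
```

Real networks are usually sparse, so a sparse matrix would be a more practical container than nested lists at genome scale.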
However, working with gene networks, I have seen various ways to store them across papers and benchmarks, often as some kind of tsv/csv/… file with some kind of gene-gene list. This lack of a standard makes it quite hard to work with gene networks🙉 3/6
November 19, 2024 at 12:30 PM
Multi-modal AI in Biology will certainly be some kind of multi-scale approach where each modality feeds into the next.
Transformers use embeddings of the elements they look at. Each can be produced by the previous-scale transformer model, going from molecules to whole tissues!
November 19, 2024 at 12:23 PM
💯 🙏 Also, I would like to acknowledge the important pioneering work from Geneformer, UCE, scFoundation and scGPT. Thanks to FlashAttention, PyTorch, Lightning, and scanpy for their toolkits. Thanks to Omnipath, Scenic+, Openproblems, Replogle et al. and McCalla et al.
November 19, 2024 at 12:21 PM
We propose to exploit the specificities of scRNAseq data and define a multi-task pre-training composed of expression denoising, bottleneck learning and classification.

We also propose a new hierarchical classification method to work with the rich hierarchical ontologies used to label cells in cellxgene.
November 19, 2024 at 12:21 PM
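To make the idea of classifying against a label hierarchy concrete, here is a generic sketch, not the method from the pre-print: probabilities predicted for the most specific labels are summed upward, so that every broader label in the ontology also gets a score. The toy ontology, the names, and the single-parent (tree) assumption are all made up for illustration; real cell ontologies are graphs in which a term can have several parents.

```python
def ancestors(label, parent_of):
    """All ancestors of `label`, nearest first."""
    chain = []
    while label in parent_of:
        label = parent_of[label]
        chain.append(label)
    return chain


def roll_up(leaf_probs, parent_of):
    """Score of each internal node = sum of the probabilities of the leaves below it."""
    totals = dict(leaf_probs)
    for leaf, prob in leaf_probs.items():
        for node in ancestors(leaf, parent_of):
            totals[node] = totals.get(node, 0.0) + prob
    return totals


# Toy ontology (hypothetical): child -> parent.
parent_of = {
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
}
leaf_probs = {"CD4 T cell": 0.5, "CD8 T cell": 0.25, "B cell": 0.25}
node_probs = roll_up(leaf_probs, parent_of)
```

With scores at every level, a cell annotated only as "T cell" can still be compared against a prediction made over the finer-grained labels.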
-> For now it is a pre-print and more is to come, but here are some of our results:

scPRINT is a transformer model trained on 50M cells 🦠 from the cellxgene database. It has novel expression encoding and decoding schemes and new pre-training methodologies 🤖.

www.biorxiv.org/content/10.1...
November 19, 2024 at 12:21 PM