Gaëtan Benoit
@gaetanbenoit.bsky.social
Postdoc researcher in bioinformatics at the Pasteur Institute. Scalable methods and software for metagenomics. https://github.com/GaetanBenoitDev
Reposted by Gaëtan Benoit
Now published in Algorithms for Molecular Biology: link.springer.com/article/10.1.... Key message: a tiny CNN model with 7k parameters can capture the main splice signals across vertebrates and insects, and halves the minimap2 and miniprot junction error rate. I always use this new feature now.
Preprint on "Improving spliced alignment by modeling splice sites with deep learning". It describes minisplice for modeling splice signals. Minimap2 and miniprot now optionally use the predicted scores to improve spliced alignment.
arxiv.org/abs/2506.12986
January 6, 2026 at 11:02 PM
Reposted by Gaëtan Benoit
Now published in Nature Biotechnology:
go.nature.com/44P7nSm
If you missed it, the TL;DR is in my April thread below
January 6, 2026 at 9:38 AM
Reposted by Gaëtan Benoit
💾 Prokka 1.15.6 is released!

This is the last major release of Prokka. But don't be sad, because @oschwengers.bsky.social already has an excellent replacement called Bakta you can migrate to.
#bioinformatics #microbiology #genomics

github.com/tseemann/pro...
Release Heading into the sunset · tseemann/prokka
The future This is probably the last release of Prokka. I won't be making any code changes except bug fixes. I will update the databases occasionally. I strongly recommend you use Bakta by @oschwen...
December 15, 2025 at 9:09 PM
Reposted by Gaëtan Benoit
Preprint Alert!
With @tmthrz.bsky.social and @rayanchikhi.bsky.social we aim to tackle practical unitig compression!
A thread:
Inverted colored de Bruijn Graph for practical kmer sets storage https://www.biorxiv.org/content/10.64898/2025.12.08.692073v1
December 15, 2025 at 3:19 PM
Reposted by Gaëtan Benoit
1/9 Just out:

k-mer indexes are the backbone of fast search in genomic data, but many degrade under small k, subsampling, or high diversity.

With Ondřej Sladký and @pavelvesely.bsky.social we asked: can we build one that works efficiently for any k-mer set?
🧮 Just out in Bioinformatics Advances: “FroM Superstring to Indexing: A space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)” 

Full article available: https://doi.org/10.1093/bioadv/vbaf290 

Authors include: @pavelvesely.bsky.social, @brinda.eu
December 5, 2025 at 5:42 PM
Reposted by Gaëtan Benoit
Preprint out! Check out our new long-read metagenomic SNP-caller, SNooPy 😀. Work with Chris Quince. Thread 🧵
👉 www.biorxiv.org/content/10.6...
December 4, 2025 at 1:18 PM
Reposted by Gaëtan Benoit
Preprint alert!

We introduce new ideas to revisit the notion of sampling with window guarantees, also known as minimizers.

A thread:
Minimizer Density revisited: Models and Multiminimizers https://www.biorxiv.org/content/10.1101/2025.11.21.689688v1
December 2, 2025 at 11:12 AM
Reposted by Gaëtan Benoit
We are thrilled to announce the first official release (v0.1.8) of #𝗯𝗲𝗱𝗱𝗲𝗿, the successor to one of our flagship tools, #𝗯𝗲𝗱𝘁𝗼𝗼𝗹𝘀! Based on ideas we conceived long ago (!), this was achieved thanks to the dedication of Brent Pedersen.

1/n
Intro to Bedder – The Quinlan Lab
quinlanlab.org
December 2, 2025 at 2:28 AM
Reposted by Gaëtan Benoit
Preprint Alert!
We present new strategies to accelerate large-scale document comparison using MinHash-like sketches.

A thread:
Compressed inverted indexes for scalable sequence similarity https://www.biorxiv.org/content/10.1101/2025.11.21.689685v1
December 1, 2025 at 2:59 PM
Reposted by Gaëtan Benoit
OK, the mim (github.com/COMBINE-lab/...) preprint is submitted! Excited for folks to see it and share thoughts. The key takeaway: mim allows the quick, one-time building of a small auxiliary index that then lets gzipped FASTQ parsing scale linearly with the number of threads. 1/2
GitHub - COMBINE-lab/mim: A small, auxiliary index to massively improve parallel fastq parsing
November 25, 2025 at 2:13 PM
Reposted by Gaëtan Benoit
Yohan Hernandez–Courbevoie presenting REINDEER2 at Seqbim!

For those who missed it, here is the introduction thread for REINDEER2:

bsky.app/profile/npma...
November 24, 2025 at 12:41 PM
Reposted by Gaëtan Benoit
@wytamma.bsky.social : so, it took a little bit of extra time (not the flight back from the CZI meeting), but I decided to just f#&$ing do it, and the basic code to build and parse with the auxiliary fastq index is working (github.com/COMBINE-lab/...). 1/2
GitHub - COMBINE-lab/mim: A small, auxiliary index to massively improve parallel fastq parsing
November 19, 2025 at 3:01 AM
Reposted by Gaëtan Benoit
“Bin Chicken” is now published in Nature Methods! It substantially improves genome recovery through rational coassembly 🧬🖥️. Applied to public 🌍 metagenomes, we recovered 24,000 novel species 🦠, including 6 new phyla.
doi.org/10.1038/s415...
@benjwoodcroft.bsky.social @rhysnewell.bsky.social
🧵1/6
November 13, 2025 at 10:09 AM
Reposted by Gaëtan Benoit
Metagenomics colleagues!

I'm looking for studies where both Illumina and ONT sequencing were performed on the same samples from soil, human, ruminant, and other sample types for comparison. Bonus if those studies include PacBio data.

Please help and share!
November 11, 2025 at 8:21 PM
Reposted by Gaëtan Benoit
Our method for genome size estimation from long-read overlaps is now published 🥳
academic.oup.com/bioinformati...
Genome size estimation from long read overlaps
Abstract. Motivation: Accurate genome size estimation is an important component of genomic analyses such as assembly and coverage calculation, though existin...
November 7, 2025 at 3:19 AM
Reposted by Gaëtan Benoit
1/6 Movi 2 is here: faster and more space-efficient for pangenome queries. Its fastest mode uses half the memory of Movi 1 while running ~30% faster. github.com/mohsenzakeri...
GitHub - mohsenzakeri/Movi: Fast, Cache-Efficient, and Scalable Queries on Pangenomes
October 21, 2025 at 8:00 PM
Reposted by Gaëtan Benoit
It doesn't happen that often: an article published in Nature puts my community (sequence bioinformatics) in the spotlight. Shall I tell you about it?
www.nature.com/articles/d41...
‘Google for DNA’ brings order to biology’s big data
MetaGraph compresses vast data archives into a search engine for scientists, opening up new frontiers of biological discovery.
October 9, 2025 at 3:00 PM
Reposted by Gaëtan Benoit
Our preprint on our new metagenomic HiFi assembler Alice is out 🥳 Based on a *new sketching method* (🧵1/6)
👉 Preprint www.biorxiv.org/content/10.1...
👉 Github github.com/rolandfaure/...
Alice: fast and haplotype-aware assembly of high-fidelity reads based on MSR sketching
We introduce Mapping-friendly Sequence Reduction (MSR) sketches, a sketching method for high-fidelity (HiFi) long reads, and Alice, an assembler that operates directly on these sketches. MSR produces ...
October 3, 2025 at 2:51 PM
Reposted by Gaëtan Benoit
New preprint from the Banfield lab, highlighting an interesting case of 1.5 Mb megaplasmids found in the human gut.

Plasmid genomes were resolved using #PacBio HiFi sequencing with hifiasm-meta for #metagenome assembly. Host association was detected using epigenetic signals.

doi.org/10.1101/2025...
Megaplasmids associate with Escherichia coli and other Enterobacteriaceae
Humans and animals are ubiquitously colonized by Enterobacteriaceae, a bacterial family that contains both commensals and clinically significant pathogens. Here, we report Enterobacteriaceae megaplas...
October 1, 2025 at 4:44 PM
Reposted by Gaëtan Benoit
Alice: fast and haplotype-aware assembly of high-fidelity reads based on MSR sketching https://www.biorxiv.org/content/10.1101/2025.09.29.679204v1
October 1, 2025 at 1:47 AM
Reposted by Gaëtan Benoit
Happy to share that the paper describing Autocycler is now 100% up:
doi.org/10.1093/bioi...
(1/3)
Autocycler: long-read consensus assembly for bacterial genomes
Abstract. Motivation: Long-read sequencing enables complete bacterial genome assemblies, but individual assemblers are imperfect and often produce sequence-l...
September 29, 2025 at 4:11 AM
Reposted by Gaëtan Benoit
Delighted to see our paper studying the evolution of plasmids over the last 100 years, now out! Years of work by Adrian Cazares, also Nick Thomson @sangerinstitute.bsky.social - this version much improved over the preprint. Final version should be open access, apols.
Thread 1/n
September 25, 2025 at 9:29 PM
Reposted by Gaëtan Benoit
Delighted to finally announce a preprint describing the Q100 project! “A complete diploid human genome benchmark for personalized genomics”, for which we finished HG002 to near-perfect accuracy: www.biorxiv.org/content/10.1... 🧵[1/14]
A complete diploid human genome benchmark for personalized genomics
Human genome resequencing typically involves mapping reads to a reference genome to call variants; however, this approach suffers from both technical and reference biases, leaving many duplicated and ...
September 22, 2025 at 5:01 PM
Reposted by Gaëtan Benoit
Preprint out for myloasm, our new nanopore / HiFi metagenome assembler!

Nanopore's getting accurate, but

1. Can this lead to better metagenome assemblies?
2. How, algorithmically, to leverage them?

with co-author Max Marin @mgmarin.bsky.social, supervised by Heng Li @lh3lh3.bsky.social

1 / N
High-resolution metagenome assembly for modern long reads with myloasm https://www.biorxiv.org/content/10.1101/2025.09.05.674543v1
September 7, 2025 at 11:35 PM
Reposted by Gaëtan Benoit
New blog post!

metaMDBG (@gaetanbenoit.bsky.social) and Myloasm (@jimshaw.bsky.social) have had recent releases, so I updated the benchmarks from the Autocycler paper:
rrwick.github.io/2025/09/23/a...

Both tools improved considerably! Time to update your conda environments 😄
Benchmark update: metaMDBG and Myloasm
a blog for miscellaneous bioinformatics stuff
September 23, 2025 at 1:53 AM