Arun Das
arun-das.bsky.social
Postdoc at @genomescience.bsky.social. Scientist working in computer science and genomics. More info: arundas.org.

PhD from Schatz Lab @ JHU. Previously: CS @ Brown. He/His/Him. #YNWA 🍉
Today, for no particular reason at all, it is worth sharing this, as a reminder of what one man's lies can do.

Taken from this resource from my alma mater: costsofwar.watson.brown.edu

(Specific page is: costsofwar.watson.brown.edu/costs/human/...)
November 4, 2025 at 4:15 PM
Small victories, but this doesn’t seem to apply to those currently on an H-1B visa.

Wish it had been made clear in the initial “proclamation”, before we spent the entire day panicking while trying to figure out a way to get a friend back to the US before midnight.
September 20, 2025 at 9:12 PM
Finally, we compare our placed contigs to loci associated with biomarker traits in the UK Biobank and East London Genes & Health Dataset, and find a number of positions where a placed contig is close to a significant locus.
May 15, 2025 at 2:19 PM
We are also able to align existing RNA-Seq data from 140 SAS individuals from MAGE directly to these contigs, allowing us to identify 200 contigs with a high density of RNA-Seq alignments.

BLAST shows that these contigs are highly similar to non-reference human and primate sequences.
May 15, 2025 at 2:19 PM
We show that the majority of the placements we make are missed by traditional insertion-calling tools, but are consistent with calls from tools built specifically to detect large non-reference sequences.

For the unplaced contigs, BLAST shows that the majority have high similarity to non-reference human and primate sequences.
May 15, 2025 at 2:19 PM
We are able to place ~20K contigs against CHM13 through a combination of alignment, mate-pair read information, and linkage disequilibrium (LD).

We find >8,000 instances of a placed contig intersecting one of 106 protein coding genes, and >6,000 placements within 1 Kb of a known GWAS site.
May 15, 2025 at 2:19 PM
We validate >80% of these contigs in a subset of 21 SAS individuals using auxiliary long read data.

We repeat the linear pipeline with the HPRC v1 draft pangenomes, and see further improvements in alignment but only small reductions in the amount of assembled sequence.
May 15, 2025 at 2:19 PM
Despite improvements in alignment compared to GRCh38, we assemble ~600 Kb of sequence in >1 Kb contigs per individual from unmapped reads against T2T-CHM13.

Across the whole set, we assemble 410 Mb of sequence in 199K contigs (which collapses down to 50 Mb when accounting for shared sequence).
May 15, 2025 at 2:19 PM
To do this, we align existing short read data from 640 South Asian (SAS) individuals from 1KGP and SGDP against linear & pangenome references, and assemble the unmapped reads into large contigs.

We then attempt to analyze the functional impact of these sequences.
May 15, 2025 at 2:19 PM
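
The first step above (pulling out the reads that failed to align) can be illustrated with a minimal sketch. It assumes alignments are available as plain SAM text and relies only on the SAM FLAG bit 0x4 ("segment unmapped"). The function name `unmapped_reads` and the toy records are hypothetical, and the thread does not say which tools were actually used. Assembly of the extracted reads is not shown.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_reads(sam_lines):
    """Yield (read_name, sequence) for every unmapped record in SAM text."""
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & UNMAPPED:
            yield name, seq


# Toy records made up for illustration; they are not real data.
toy_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCATTCA\tIIIIIIII",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tTTAGGCAT\tIIIIIIII",
]

for name, seq in unmapped_reads(toy_sam):
    print(name, seq)
```

In practice the extracted reads would be handed to an assembler to build contigs; that step depends on tooling not named here.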
South Asians are severely underrepresented in genomics, and this lack of representation makes it difficult to catalog and understand the variation present in these communities.

Our goal was to investigate the variation present in these populations that is missing in widely used reference genomes.
May 15, 2025 at 2:19 PM
In that work, we proposed a range of sketching and sampling approaches for classifying reads from metagenomic experiments without the overhead traditionally associated with alignment- or index-based approaches, and demonstrated that our approaches achieved comparable accuracy to those tools.
April 21, 2025 at 9:24 PM
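
To give a flavor of sampling-based read classification, here is a minimal sketch that hashes k-mers, keeps only those whose hash falls below a cutoff, and assigns a read to the reference sharing the largest fraction of its sampled k-mers. This is a generic illustration, not necessarily any of the approaches from that work. The names `sampled_hashes` and `classify`, the choice of BLAKE2b, and the values of `k` and `fraction` are all assumptions.

```python
import hashlib

MAX_HASH = 2 ** 64  # hashes are 8-byte unsigned integers


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sampled_hashes(seq, k=8, fraction=0.25):
    """Hash every k-mer and keep those below a cutoff (roughly `fraction` of them)."""
    cutoff = int(MAX_HASH * fraction)
    hashes = (kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {hv for hv in hashes if hv < cutoff}


def classify(read, reference_sketches, k=8, fraction=0.25):
    """Return (best_label, score); score is the share of sampled read k-mers found."""
    read_hashes = sampled_hashes(read, k, fraction)
    if not read_hashes:
        return None, 0.0  # nothing sampled from this read, so no call is made
    best_label, best_score = None, 0.0
    for label, ref_hashes in reference_sketches.items():
        score = len(read_hashes & ref_hashes) / len(read_hashes)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy sequences made up for illustration; fraction=1.0 keeps every k-mer.
genomes = {
    "genome_a": "ACGTTGCAGGCTAAGCTTAGGCATCGA",
    "genome_b": "TTTTCCCCGGGGAAAATTTTCCCCGGGG",
}
sketches = {name: sampled_hashes(seq, k=8, fraction=1.0) for name, seq in genomes.items()}
print(classify("GCAGGCTAAGCTTAG", sketches, k=8, fraction=1.0))
```

Lower values of `fraction` shrink the sketches at the cost of leaving some short reads with no sampled k-mers at all.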
Sapling utilizes learned index structures to predict the location of a query string within a suffix array, allowing for fast and accurate predictions that bypass the slow lookups often encountered in binary search.
April 21, 2025 at 9:24 PM
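
The idea of predicting a suffix array position and then correcting it locally can be sketched as follows. This is a simplified stand-in, not Sapling's actual implementation: a single least-squares line replaces whatever model Sapling uses, the suffix array is built naively, and the names `K`, `encode`, and `LearnedSuffixIndex` are hypothetical. The sketch assumes an A/C/G/T-only text and queries of at least `K` bases.

```python
K = 4  # hypothetical number of leading bases used as the model key
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(s, k=K):
    """Map the first k bases to an integer that preserves lexicographic order."""
    value = 0
    for ch in s[:k].ljust(k, "A"):  # pad short suffixes with the smallest base
        value = value * 4 + CODE[ch]
    return value


def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var
    return slope, mean_y - slope * mean_x


class LearnedSuffixIndex:
    def __init__(self, text):
        self.text = text
        # Naive construction; fine for a toy example only.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        keys = [encode(text[i:]) for i in self.sa]
        self.slope, self.intercept = fit_line(keys, list(range(len(keys))))
        # Largest prediction error seen on the indexed suffixes.
        self.max_err = max(abs(r - self._predict(x)) for r, x in enumerate(keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def find(self, pattern):
        """Return a text position where pattern occurs, or -1 if absent."""
        if len(pattern) < K:
            raise ValueError("pattern must have at least K bases")
        guess = self._predict(encode(pattern))
        # Binary search only inside the window implied by the error bound.
        lo = max(0, guess - self.max_err)
        hi = min(len(self.sa), guess + self.max_err + 1)
        while lo < hi:
            mid = (lo + hi) // 2
            start = self.sa[mid]
            if self.text[start:start + len(pattern)] < pattern:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(self.sa) and self.text.startswith(pattern, self.sa[lo]):
            return self.sa[lo]
        return -1


index = LearnedSuffixIndex("ACGTACGTTGCAACGT")
print(index.find("TTGCA"), index.find("GGGG"))
```

The search window has width `2 * max_err + 1`, so the better the model fits the suffix array, the fewer comparisons each lookup needs compared with a full binary search.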