Lightnews — Scholar-powered news

Arun Das

@arun-das.bsky.social

Postdoc at @genomescience.bsky.social. Scientist working in computer science and genomics. More info: arundas.org.

PhD from Schatz Lab @ JHU. Previously: CS @ Brown. He/His/Him. #YNWA 🍉


Arun Das

@arun-das.bsky.social

Today, for no particular reason at all, it is worth sharing this, as a reminder of what one man's lies can do.

Taken from this resource from my alma mater: costsofwar.watson.brown.edu

(Specific page is: costsofwar.watson.brown.edu/costs/human/...)

The human cost of the post 9/11 wars on people in Afghanistan, Pakistan, Iraq, Syria and Yemen. Between indirect and direct deaths, 4.5-4.7 million people have died, and tens of millions have been displaced.

November 4, 2025 at 4:15 PM

Arun Das

@arun-das.bsky.social

Small victories, but this doesn’t seem to apply to those currently on an H-1B visa.

Wish it was made clear in the initial “proclamation”, before we spent the entire day panicking while trying to figure out a way to get a friend back to the US before midnight.

September 20, 2025 at 9:12 PM

Arun Das

@arun-das.bsky.social

Finally, we compare our placed contigs to loci associated with biomarker traits in the UK Biobank and East London Genes & Health Dataset, and find a number of positions where a placed contig is close to a significant locus.

Comparison of the significant loci associated with HDL cholesterol in the South Asian set in the UK Biobank to that of the East London Genes and Health cohort, alongside our placed contigs against GRCh38.

May 15, 2025 at 2:19 PM

Arun Das

@arun-das.bsky.social

We are also able to align existing RNA-Seq data from 140 SAS individuals from MAGE directly to these contigs, allowing us to identify 200 contigs with a high density of RNA-Seq alignments.

BLAST shows that these contigs are highly similar to non-reference human and primate sequences.

Plot showing the distribution of the most aligned-to contigs in our RNA-seq contigs, against the length of the contigs and colored by their population. The contigs vary widely in terms of their RNA-seq alignment density and their lengths.

May 15, 2025 at 2:19 PM

Arun Das

@arun-das.bsky.social

We show that the majority of the placements we make are missed by traditional insertion calling tools, but in line with specific large non-reference sequence detection ones.

For the unplaced contigs, BLAST shows that the majority have high similarity to non-reference human and primate sequences.

Comparison of our insertions to Manta. We find an order of magnitude more large insertions than this tool, and comparable amounts to existing large insertion callers.

May 15, 2025 at 2:19 PM

Arun Das

@arun-das.bsky.social

We are able to place ~20K contigs against CHM13 through a combination of alignment, mate pair read information and LD.

We find >8,000 instances of a placed contig intersecting one of 106 protein coding genes, and >6,000 placements within 1 Kb of a known GWAS site.

Visualization of placements throughout CHM13. We place contigs all over the genome.

Plot of the number of unique gene intersections per chromosome. We see the number of intersections are largely correlated with chromosome size.

Key genes we find intersected with contigs.

May 15, 2025 at 2:19 PM

Arun Das

@arun-das.bsky.social

We validate >80% of these contigs in a subset of 21 SAS individuals using auxiliary long read data.

We repeat the linear pipeline with the HPRC v1 draft pangenomes, and see further improvements in alignment but only small reductions in the amount of assembled sequence.

Plot of the fraction of assembled contigs from each of 21 SAS individuals that are validated by their long read assembly. ~85% of the contigs per individual are validated.

Plot of the amount of assembled sequence per individual across two linear and two pangenome references. Massive reductions in sequence are seen as we go from GRCh38 to CHM13, but the HPRC pangenomes offer only a slight improvement after that.

May 15, 2025 at 2:19 PM

Arun Das

@arun-das.bsky.social

Despite improvements in alignment compared to GRCh38, we assemble ~600 Kb of sequence in >1 Kb contigs per individual from unmapped reads against T2T-CHM13.

Across the whole set, we assemble 410 Mb of sequence in 199K contigs (which collapses down to 50 Mb when accounting for shared sequence).

Comparison of alignment rate against GRCh38 and CHM13. We see a 0.5-1% improvement against CHM13.

Histogram of amount of assembled sequence across 640 SAS individuals. We assemble on average 550-600 Kb per individual from unmapped reads.

May 15, 2025 at 2:19 PM

Arun Das

@arun-das.bsky.social

To do this, we align existing short read data from 640 South Asian (SAS) individuals from 1KGP and SGDP against linear & pangenome references, and assemble the unmapped reads into large contigs.

We then attempt to analyze the functional impact of these sequences.

Our analysis pipeline, which consists of 1) Aligning existing reads against reference genomes, 2) assembling unaligned or poorly aligned reads, 3) placing the large assembled contigs back into the reference, 4) calling variants and novel sequence, and 5) evaluating the functional impact of this variation.

Source of our data. 601 individuals come from 5 1KGP populations, 39 come from 19 SGDP populations.

May 15, 2025 at 2:19 PM

Arun Das

@arun-das.bsky.social

South Asians are severely underrepresented in genomics, and this lack of representation makes it difficult to catalog and understand the variation present in these communities.

Our goal was to investigate the variation present in these populations that is missing in widely used reference genomes.

Comparison of the fraction of individuals in the GWAS catalog of different ancestries to the breakdown of the global population in terms of those ancestries. South Asians accounted for 2% of the GWAS catalog in 2019, but for >25% of the global population.

May 15, 2025 at 2:19 PM

Arun Das

@arun-das.bsky.social

In that work, we proposed a range of sketching and sampling approaches for classifying reads from metagenomic experiments without the overhead traditionally associated with alignment- or index-based approaches, and demonstrated that our approaches achieved comparable accuracy to those tools.

April 21, 2025 at 9:24 PM
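
As a rough illustration of the sketching idea in the post above (not the method from that work), here is a minimal Python sketch that classifies a read by comparing its k-mer hashes against small bottom-s sketches of each reference, with no alignment step. The function names, the choice of bottom-s MinHash, and the parameters `k` and `size` are all assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, size):
    """Bottom-`size` sketch: keep only the smallest `size` k-mer hashes."""
    return set(sorted(hash_kmer(x) for x in kmers(seq, k))[:size])


def classify(read, reference_sketches, k):
    """Assign the read to the reference whose sketch shares the most hashes.

    Returns None when the read shares no hashes with any sketch.
    """
    read_hashes = {hash_kmer(x) for x in kmers(read, k)}
    scores = {name: len(read_hashes & sk) for name, sk in reference_sketches.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical toy references; real genomes would use much larger k and size.
references = {
    "genome_a": "ACGTACGTTGCAACGGT",
    "genome_b": "TTTTGGGGCCCCAAAATTGG",
}
ref_sketches = {name: sketch(seq, k=4, size=64) for name, seq in references.items()}
label = classify("CGTTGCA", ref_sketches, k=4)
```

Because each reference is reduced to a fixed number of hashes, memory stays bounded regardless of genome size. The trade-off is that a small sketch can miss k-mers that a full index would contain.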

Arun Das

@arun-das.bsky.social

Sapling utilizes learned index structures to predict the location of a query string within a suffix array, allowing for fast and accurate predictions that bypass the slow lookups often encountered in binary search.

April 21, 2025 at 9:24 PM
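
To make the learned-index idea in the post above concrete, here is a minimal Python sketch. It is not Sapling's implementation: it assumes a single linear model over base-4 encoded k-mers, a naively built suffix array, and a recorded maximum prediction error that bounds a small local search. The class and method names are hypothetical, and the text is assumed to contain only A, C, G, T and to be at least `k` long.

```python
import math
from bisect import bisect_left

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_int(kmer):
    """Encode a DNA k-mer in base 4 so numeric order matches lexicographic order."""
    value = 0
    for ch in kmer:
        value = value * 4 + CODE[ch]
    return value


class LearnedSuffixIndex:
    def __init__(self, text, k):
        # Naive suffix array construction; fine for small illustrative inputs.
        suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
        # Keep only suffixes long enough to hold a full k-mer.
        self.sa = [i for i in suffix_array if i + k <= len(text)]
        self.keys = [kmer_to_int(text[i:i + k]) for i in self.sa]
        self.kmin, self.kmax = self.keys[0], self.keys[-1]
        # Worst-case gap between predicted and true rank over all indexed k-mers.
        self.max_err = math.ceil(
            max(abs(self._predict(key) - rank) for rank, key in enumerate(self.keys))
        )

    def _predict(self, key):
        """Linear model: map a k-mer value to an estimated suffix array rank."""
        if self.kmax == self.kmin:
            return 0.0
        frac = (key - self.kmin) / (self.kmax - self.kmin)
        return min(max(frac, 0.0), 1.0) * (len(self.keys) - 1)

    def lookup(self, kmer):
        """Return a text position where `kmer` occurs, or None if it is absent."""
        key = kmer_to_int(kmer)
        guess = self._predict(key)
        # Search only the window that the recorded error bound allows.
        lo = max(0, math.floor(guess) - self.max_err)
        hi = min(len(self.keys), math.ceil(guess) + self.max_err + 1)
        idx = bisect_left(self.keys, key, lo, hi)
        if idx < len(self.keys) and self.keys[idx] == key:
            return self.sa[idx]
        return None


index = LearnedSuffixIndex("ACGTACGTTGCA", k=3)
position = index.lookup("GTT")
```

The search window is `2 * max_err + 1` ranks wide at most, so the better the model fits the data, the fewer suffix array entries a query has to touch compared with a binary search over the whole array.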
