Europe PMC
@europepmc.org
Europe PMC provides comprehensive access to life sciences literature from trusted sources. It's available to anyone, anywhere for free. https://europepmc.org/
December 4, 2025 at 11:56 AM
The Europe PMC and @opentargets.org collab, Lit-OTAR, powers drug discovery by mining biomedical literature at scale.

It has found ~48.5M unique links, enriches key databases, and is updated daily to fuel therapeutic R&D.

Read more 👉 europepmc.org/article/MED/...
#AI4Science #BioNLP #DrugDiscovery
Lit-OTAR framework for extracting biological evidences from literature.
Free full text in Europe PMC
europepmc.org
December 3, 2025 at 9:24 AM
Still, normalisation was strong overall 💪

Lit-OTAR even helped identify new disease synonyms like ‘T2D’, enhance data processing, and improve analyses with FAERS 🎯

But disease names remain tricky: more variation = more missed matches, highlighting areas for future work!

#AI4Science #BioNLP
December 3, 2025 at 9:24 AM
Now onto entity normalisation

We mapped entities:
🧬 Genes → Ensembl
🦠 Diseases → EFO
💊 Drugs → ChEMBL
Over 220M disease mentions were tagged, and ~76.6% were successfully normalised. But that’s only 7.6% of unique terms... the long tail of rare or variant terms is real 😅

#AI4Science #BioNLP
December 3, 2025 at 9:24 AM
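The gap between mention-level and unique-term coverage in the normalisation numbers is easy to reproduce on toy data. The sketch below is only an illustration: the lexicon, the placeholder identifiers and the mention list are all invented, and exact lowercase lookup stands in for whatever matching the real pipeline uses.

```python
from collections import Counter

# Toy lookup table, invented for illustration (placeholder IDs, not real EFO IDs).
LEXICON = {
    "asthma": "EFO:PLACEHOLDER_1",
    "type 2 diabetes": "EFO:PLACEHOLDER_2",
}

def coverage(mentions, lexicon):
    """Return (share of mentions normalised, share of unique terms normalised)."""
    counts = Counter(m.lower() for m in mentions)
    total_mentions = sum(counts.values())
    matched_mentions = sum(c for term, c in counts.items() if term in lexicon)
    matched_terms = sum(1 for term in counts if term in lexicon)
    return matched_mentions / total_mentions, matched_terms / len(counts)

# A few frequent terms plus a long tail of rare variants.
mentions = ["asthma"] * 6 + ["type 2 diabetes"] * 2 + ["T2D", "adult-onset diabetes"]
mention_cov, term_cov = coverage(mentions, LEXICON)
print(f"mentions normalised: {mention_cov:.0%}, unique terms normalised: {term_cov:.0%}")
```

Frequent terms dominate the mention count, so a lexicon can normalise most mentions while covering only a small share of the distinct strings.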
What about other models?

The spaCy model was faster but slightly less precise, though still a solid choice for lightweight applications.

The old dictionary method? High recall in some spots, but much lower precision.

QEB8L struck the best balance, with high overlap with our gold standard.

#AI4Science #BioNLP
December 3, 2025 at 9:24 AM
Let’s talk results!

First up: Entity Recognition
BioBERT topped the charts for precision (0.90–0.93) 🔬
But it’s computationally heavy…

So we optimised Bioformer-8L into QEB8L = 10× faster, 77MB model size, and still scoring 0.85–0.94 precision and 0.88–0.89 F1 🙌
#AI4Science #BioNLP #TextMining
December 3, 2025 at 9:24 AM
But there’s a trade-off.

This approach can miss associations that span sentences, for example through coreference or inferred context, which can reduce comprehensiveness.

On the plus side, it is flexible, scalable, and customisable by you!

#AI4Science #BioNLP #TextMining
December 3, 2025 at 9:24 AM
But what counts as an association? That’s tricky 🤔

We ran a study with expert curators, but they disagreed often (Cohen’s Kappa = low), showing how subjective this can be.

So we treat co-occurrence in a sentence as a potential association, and you can apply post-processing to fit your needs!
December 3, 2025 at 9:24 AM
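For readers unfamiliar with Cohen's kappa: it corrects raw agreement for the agreement expected by chance, κ = (p_o − p_e) / (1 − p_e). The sketch below is a minimal standard-library version with invented curator labels; it does not reproduce the study's data or its actual kappa value.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Invented example: "is this sentence a real association?" judged by two curators.
curator_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
curator_2 = ["yes", "no", "no", "yes", "yes", "no", "no", "yes"]
kappa = cohens_kappa(curator_1, curator_2)
print(kappa)
```

In this toy case the curators agree on half the items, which is exactly what chance would give, so kappa is 0 despite 50% raw agreement.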
For entity recognition, we trained models using a combined dataset from Europe PMC and CHEMDNER to detect:

🧬 genes/proteins
🦠 diseases
💊 chemicals/drugs
🧫 organisms
We tested BioBERT, Bioformer & a custom spaCy model
#AI4Science #BioNLP
December 3, 2025 at 9:24 AM
So how does Lit-OTAR actually work?
We use 39M abstracts and 4.5M full-text articles from Europe PMC, limited to CC0/CC-BY original research.

A deep learning model tags key biomedical entities daily 💊

If two terms appear in the same sentence, we treat them as an association. More on that in a sec 👇
December 3, 2025 at 9:24 AM
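The same-sentence rule can be sketched in a few lines. This is a simplified illustration, not the Lit-OTAR code: the `(text, label)` tuples, the `GENE`/`DISEASE`/`DRUG` label names and the choice to pair only genes with diseases are assumptions made for the example.

```python
from itertools import product

def cooccurrence_pairs(tagged_sentences):
    """Return (gene, disease, sentence_index) for each pair tagged in one sentence."""
    pairs = []
    for idx, entities in enumerate(tagged_sentences):
        genes = sorted({text for text, label in entities if label == "GENE"})
        diseases = sorted({text for text, label in entities if label == "DISEASE"})
        # Every gene/disease combination inside the sentence is a candidate association.
        pairs.extend((g, d, idx) for g, d in product(genes, diseases))
    return pairs

# Hypothetical tagger output: one list of (text, label) tuples per sentence.
tagged = [
    [("TCF7L2", "GENE"), ("type 2 diabetes", "DISEASE")],
    [("BRCA1", "GENE")],                                   # no disease, so no pair
    [("insulin", "DRUG"), ("type 2 diabetes", "DISEASE")],  # no gene, so no pair
]
pairs = cooccurrence_pairs(tagged)
print(pairs)
```

Keeping the sentence index with each pair means downstream filtering or ranking can go back to the supporting sentence.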
Lit-OTAR has two main parts working together:

1. Europe PMC text-mining articles for evidence

2. @opentargets.org normalising and mapping the evidence to key databases like Ensembl and ChEMBL, and ranking associations

The goal? A scalable service speeding up target validation 🔬
#DrugDiscovery
December 3, 2025 at 9:24 AM
A key part of this? Extracting evidence directly from the scientific literature 🔎

That’s where #EuropePMC comes in: we used text mining to extract evidence from over 39M abstracts and 4.5M full-text articles.

#DrugDiscovery #OpenScience #TextMining
December 3, 2025 at 9:24 AM
That’s where @opentargets.org steps in, combining data from 20+ sources to provide target-disease associations with linked evidence all in one place!

Including evidence from:
🧬genetic associations
🧫somatic mutations
💊known drugs
🔬differential expression
🐀animal models
🖥️Text-mining and more!
December 3, 2025 at 9:24 AM
Identifying drug targets is one of the toughest and most important steps in drug discovery!

You need to connect the dots between genes/proteins, diseases and drugs using evidence from various sources, including genetic studies and clinical trials… yep, that's millions of papers you need to search 🤯
December 3, 2025 at 9:24 AM