@ensembl.org
@enasequence.bsky.social
@pdbeurope.bsky.social
@pride-ebi.bsky.social
@interprodb.bsky.social
@chembl.bsky.social
@intact-ebi.bsky.social
@gwascatalog.bsky.social
We found ~48.5M unique links, which enrich key databases and are updated daily to fuel therapeutic R&D.
Read more 👉 europepmc.org/article/MED/...
#AI4Science #BioNLP #DrugDiscovery
Lit-OTAR even helped identify new disease synonyms, like 'T2D', enhance data processing, and improve analyses with FAERS 🎯
But disease names remain tricky: more variation = more missed matches, highlighting areas for future work!
#AI4Science #BioNLP
We mapped entities:
🧬 Genes → Ensembl
🦠 Diseases → EFO
💊 Drugs → ChEMBL
Over 220M disease mentions were tagged, ~76.6% successfully normalised. But that’s only 7.6% of unique terms... the long tail of rare or variant terms is real 😅
#AI4Science #BioNLP
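Why can mention-level coverage be high while unique-term coverage stays low? A minimal sketch with made-up toy data (the mentions, the `known_terms` set and the `coverage` helper are all hypothetical, not the real pipeline): a few frequent terms account for most mentions, while the many rare variants each count once.

```python
from collections import Counter

def coverage(mentions, known_terms):
    """Return (share of mentions normalised, share of unique terms normalised)."""
    counts = Counter(mentions)
    matched_mentions = sum(n for term, n in counts.items() if term in known_terms)
    matched_terms = sum(1 for term in counts if term in known_terms)
    return matched_mentions / len(mentions), matched_terms / len(counts)

# Toy data (assumption): two common terms plus a long tail of rare variants.
mentions = (
    ["diabetes"] * 6
    + ["asthma"] * 3
    + ["sugar disease", "asthmatic wheeze", "diabetic-like syndrome"]
)
known_terms = {"diabetes", "asthma"}  # terms our toy "ontology" can normalise

mention_cov, term_cov = coverage(mentions, known_terms)
print(f"mentions normalised: {mention_cov:.0%}, unique terms normalised: {term_cov:.0%}")
```

In this toy case 9 of 12 mentions are normalised but only 2 of 5 unique terms, which is the same long-tail shape on a tiny scale.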
spaCy was faster but slightly less precise, though still a solid choice for lightweight applications.
The old dictionary method? High recall in some spots, but much lower precision.
QEB8L struck the best balance, with high overlap to our gold standard.
#AI4Science #BioNLP
First up: Entity Recognition
BioBERT topped the charts for precision (0.90–0.93) 🔬
But it’s computationally heavy…
So we optimised Bioformer-8L into QEB8L = 10× faster, 77MB model size, and still scoring 0.85–0.94 precision and 0.88–0.89 F1 🙌
#AI4Science #BioNLP #TextMining
This approach can miss associations that span sentences, for example through coreference or inferred context, which can reduce comprehensiveness.
But this approach is flexible, scalable, and customisable by you!
#AI4Science #BioNLP #TextMining
We ran a study with expert curators, but they disagreed often (Cohen’s Kappa = low), showing how subjective this can be.
So we treat co-occurrence in a sentence as a potential association, and you can apply post-processing to fit your needs!
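For anyone unfamiliar with Cohen's Kappa: it measures how much two annotators agree beyond what chance alone would give. A minimal sketch with invented labels (the curator judgements below are toy data, not from the study; `cohen_kappa` is a hypothetical helper):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy judgements (assumption): is this sentence a real gene-disease association?
curator_1 = ["yes", "yes", "no", "no", "yes", "no"]
curator_2 = ["yes", "no", "no", "yes", "yes", "no"]

kappa = cohen_kappa(curator_1, curator_2)
print(f"kappa = {kappa:.2f}")
```

Here the curators agree on 4 of 6 items, but chance alone would give 3 of 6, so kappa comes out at about 0.33: raw agreement can look decent while kappa stays low.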
🧬 genes/proteins
🦠 diseases
💊 chemicals/drugs
🧫 organisms
We tested three models: BioBERT, Bioformer & a custom spaCy model
#AI4Science #BioNLP
We use 39M abstracts and 4.5M full-texts from Europe PMC, limited to CC0/CC-BY original research.
A deep learning model tags key biomedical entities daily 💊
If two terms appear in the same sentence, we treat them as an association. More on that in a sec 👇
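A rough sketch of what sentence-level co-occurrence means in practice. Everything here is a simplification for illustration: the real pipeline uses a deep learning tagger, whereas this toy version uses hypothetical mini-dictionaries (`GENES`, `DISEASES`), a naive regex sentence splitter and plain substring matching.

```python
import re
from itertools import product

# Hypothetical toy dictionaries standing in for the real entity tagger.
GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "type 2 diabetes"}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def cooccurrences(text):
    """Pair every gene with every disease found in the same sentence."""
    pairs = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        genes = sorted(g for g in GENES if g.lower() in lowered)
        diseases = sorted(d for d in DISEASES if d in lowered)
        pairs.extend(product(genes, diseases))
    return pairs

text = (
    "BRCA1 variants raise the risk of breast cancer. "
    "TP53 was also sequenced. "
    "Type 2 diabetes was an exclusion criterion."
)
print(cooccurrences(text))
```

Only the first sentence contains both a gene and a disease, so only BRCA1 and breast cancer are paired. TP53 and type 2 diabetes sit in separate sentences and are not linked.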
1. Europe PMC text-mining articles for evidence
2. @opentargets.org normalising and mapping the evidence to key databases like Ensembl and ChEMBL, and ranking associations
The goal? A scalable service speeding up target validation 🔬
#DrugDiscovery
That’s where #EuropePMC comes in: we used text-mining to extract evidence from over 39M abstracts and 4.5M full-text articles.
#DrugDiscovery #OpenScience #TextMining
Including evidence from:
🧬genetic associations
🧫somatic mutations
💊known drugs
🔬differential expression
🐀animal models
🖥️Text-mining and more!
You need to connect the dots between genes/proteins, diseases and drugs using evidence from various sources, including genetic studies and clinical trials… yep, that's millions of papers you need to search 🤯