Lightnews — Scholar-powered news

Michael Tress

@michaeltress.bsky.social

In addition to the 35 new coding ORFs, we also found evidence for 279 alternative isoforms and 99 translated upstream regions. The vast majority of the upstream translations were validated by their peptides. Translation from upstream regions is more common than is currently thought (see paper above)

Workflow from the paper showing the numbers and types of regions not in GENCODE that had PeptideAtlas support.

November 24, 2025 at 4:06 PM

Michael Tress

@michaeltress.bsky.social

We report our results to GENCODE, so 10 genes were already annotated as coding prior to the paper. However, we believe that only 7 are coding. The annotation of LINC03040 and MYH16 as coding was premature and no POM121L1P repeats should have been annotated because peptides map to at least 8 regions.

Numbers of genes in each of the three groups. In yellow the numbers of genes we found annotated as coding by GENCODE in v48 (G48). In reality we disagree with more than just 3 genes, since GENCODE annotated eight or nine copies of the POM121L1P repeats as coding along with the three paralogues of ENSG00000293661, at least two of which are highly likely to be pseudogenes.

November 24, 2025 at 4:03 PM

Michael Tress

@michaeltress.bsky.social

None of these new genes are entirely novel because they all had to have been “discovered” at some point to be included in the PeptideAtlas search database. None of the 16 genes we believe are likely to be coding were annotated in RefSeq either, but 8 were included in the UniProtKB human proteome.

November 24, 2025 at 3:59 PM

Michael Tress

@michaeltress.bsky.social

Finally, peptides for 5 predicted proteins mapped to multiple regions in the genome. We believe that most of these peptides were also produced by aberrant translation. LINE-1 ORF1 would be a good example, present in hundreds of regions and with more than 50 peptides in cancers.

November 24, 2025 at 3:58 PM

Michael Tress

@michaeltress.bsky.social

None of the 16 genes that we found peptide evidence for had evolved ab initio within primates.

On the left, the breakdown of the origin of the 16 likely coding genes: 10 gene duplications (GD), six retroviral ORFs (TE) and none de novo. Contrast this with the other 19 regions with peptide support.

November 24, 2025 at 3:53 PM

Michael Tress

@michaeltress.bsky.social

Another 6 coding genes derived from retroviruses, and 3 of these were detected exclusively in placenta. This is remarkable because up to now all well-known co-opted retroviral genes in human placenta were derived from env ORFs. All three PeptideAtlas-supported novel ORFs were ERV gag ORFs.

Alignment of the three placenta-expressed ERV gag ORFs supported by PeptideAtlas peptides. Residues are coloured by structural domains from the related AlphaFold model of ERVFRD-2 from the paper.

November 24, 2025 at 3:49 PM

Michael Tress

@michaeltress.bsky.social

Like several other gene duplications, TrEMBL entry Q3ZM62 (now ENSG00000293661) is expressed in testis. It is a eutherian paralogue of ETDA, and this pair of genes has three more copies on chromosome X in human. However, there is no evidence to suggest that any of the other six genes are coding.

The relative positions on chromosome X of the four ETDA paralogues (in blue) along with the four paralogues of ENSG00000293661 (yellow).

November 24, 2025 at 3:46 PM

Michael Tress

@michaeltress.bsky.social

The other 14 genes can be split into two groups. Eight derived from gene duplications. Many of these have undergone considerable changes and may have been pseudogenes prior to gaining novel function. These genes include C5orf60 (now SPATA31J1), CFAP144P1, MSL3P1 (now MSL3B) and ZNF840P.

Predicted structures from AlphaFold for genes ZNF840P and CFAP144P1 with the detected PeptideAtlas peptides mapped in yellow. On the right, an analysis of one of the CFAP144P1 peptides by Vseq.

November 24, 2025 at 3:43 PM

Michael Tress

@michaeltress.bsky.social

These 35 potential new coding genes were found by examining PeptideAtlas peptides that did not map to GENCODE isoforms. Where those peptides mapped to regions outside of known coding gene loci, we carried out detailed analyses of the PSM, conservation, expression, potential function …

CodAlignView analysis of one of the 16 coding genes

November 24, 2025 at 3:38 PM

Michael Tress

@michaeltress.bsky.social

I like this one! You are absolutely right that there is more evidence for the 805aa isoform. But there are two NAGNAG splice events in RASAL1 and the extra amino acid exon has way more support in splice events. Which makes the 806aa isoform the principal. APPRIS will change to reflect this.

October 6, 2025 at 11:30 AM
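
For readers unfamiliar with the NAGNAG events mentioned in the RASAL1 post above: two acceptor sites sit 3 nt apart, so choosing one or the other adds or removes a single amino acid without shifting the reading frame, which is how an 805 aa and an 806 aa isoform can coexist. The sketch below only scans a nucleotide string for the motif. The function name and the example sequence are made up for illustration, and this is not how APPRIS selects principal isoforms.

```python
import re

# NAGNAG: two candidate acceptor AG dinucleotides spaced 3 nt apart.
# A lookahead is used so that overlapping motifs are all reported.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")


def find_nagnag(seq: str) -> list[tuple[int, str]]:
    """Return (0-based start, motif) for every NAGNAG motif in seq."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in NAGNAG.finditer(seq)]


if __name__ == "__main__":
    # Hypothetical intron/exon junction, not a real RASAL1 sequence.
    print(find_nagnag("ttttCAGCAGgt"))
```

Finding the motif is only the first step. Which of the two acceptors is the major one has to come from splice junction read counts, as in the post.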

Michael Tress

@michaeltress.bsky.social

Not everyone as it turns out. APPRIS has the 353 aa isoform as principal

October 3, 2025 at 8:29 PM

Michael Tress

@michaeltress.bsky.social

This one had me stumped for a while. They are both wrong, as is the other 183 amino acid isoform annotated by @ensembl - none of the upstream exons or ATGs are conserved across primates.

APPRIS does get it right, but only in RefSeq because RefSeq annotates the downstream ATG (169 aa isoform):

October 3, 2025 at 7:54 PM

Michael Tress

@michaeltress.bsky.social

Not seeing this case. There are peptides for the extra mini-exon in the APPRIS principal isoform. Besides, this last exon codes for a crucial internal strand, which, if missing, would probably make ADH6 a pseudogene.

ADH6 seems to be an Old World monkey duplication of ADH7, which has mutated a LOT.

October 3, 2025 at 5:59 PM

Michael Tress

@michaeltress.bsky.social

So, still a lot of work to be done to get down to a final agreed set of coding genes. One big step would be to eliminate all readthrough genes. Let's see ...

And that winds up our work in @gencodegenes.bsky.social

It has been fun.

Work carried out by Miguel Maquedano and @danielcerdan.bsky.social

September 30, 2025 at 3:39 PM

Michael Tress

@michaeltress.bsky.social

... and there are also many genes in the intersection between the three sets that are "legacy" coding genes and should be scrubbed from all three reference sets. These include HEPN1, BLID, PBOV1, GNG14, GJE1, HIGD2B, FTCDNL1 and the TP53TG3 family, as we show in detail in the paper.

A. Aligned exons for FTCDNL1 genes from distinct primate species indicating whether the exon has a frame shift (orange), premature stop codon (red), a missing ATG (blue), missing splice site (yellow), has been lost entirely (purple) or is intact (green). B. The AlphaFold model of cow FTCDNL1 (intact). C and D. AlphaFold models of human FTCDNL1 isoforms are missing a whole lobe of the cow FTCDNL1 structure and have poor LDDT scores for what would be the core of the protein (these core strands are marked in yellow).

September 30, 2025 at 3:26 PM

Michael Tress

@michaeltress.bsky.social

We also generated a set of 9 potential non-coding features for this paper.

With the exception of the agreements between UniProtKB and Ensembl/GENCODE (mostly Ig/TcR fragments), most genes outside of the 3-way intersection were tagged as potential non-coding.

These are probably not coding genes.

Almost 100% of GENCODE-unique genes are potential non-coding (PNC). UniProtKB and RefSeq unique genes have fewer PNC genes, but only because the PNC features were designed for Ensembl/GENCODE genes.

September 30, 2025 at 3:12 PM

Michael Tress

@michaeltress.bsky.social

Ensembl/GENCODE is the biggest beneficiary of the changes - 99.5% of all the annotated coding genes in GENCODE v45 were in agreement with both RefSeq and UniProtKB.

With the caveat that the Ensembl/GENCODE set is also the worst reference set for readthrough "coding" genes (see start of thread).

GENCODE v45 had 655 readthrough "coding" genes (yikes!)

September 30, 2025 at 2:57 PM

Michael Tress

@michaeltress.bsky.social

Without these genes, it is clear that all three reference sets annotate fewer coding genes, yet agree on almost 250 more coding genes than in 2018.

A comparison between the three reference set merges carried out in 2018 (left) and 2024 (right) with all readthrough genes and Ig and TcR fragments removed. The 3 reference sets agree on 250 more coding genes.

September 30, 2025 at 2:51 PM

Michael Tress

@michaeltress.bsky.social

So, our definitive paper on the human reference gene set is out this week in Database (Oxford).

We merged and compared @ensembl.org / @gencodegenes.bsky.social , RefSeq and UniProtKB coding genes and investigated the agreements and discrepancies.

Details of what we found in the thread ...

Counts of the merged Ensembl/GENCODE, RefSeq and UniProtKB coding genes with the most likely explanations for the outliers for each gene set. For example, most Ensembl/GENCODE singletons are readthrough genes.

September 30, 2025 at 2:25 PM
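
The three-way comparison described in the post above can be pictured as a set operation once genes from the three resources have been mapped to a shared identifier. The sketch below is only that picture: the function name and the gene labels are hypothetical, and it assumes the cross-database mapping (the hard part of a real merge) has already been done. It is not the pipeline used in the paper.

```python
def compare_reference_sets(gencode, refseq, uniprot):
    """Partition genes by which of the three reference sets annotate them.

    Each argument is an iterable of gene identifiers that have ALREADY
    been mapped to a common namespace (an assumption of this sketch).
    """
    g, r, u = set(gencode), set(refseq), set(uniprot)
    return {
        "all_three": g & r & u,
        "gencode_refseq_only": (g & r) - u,
        "gencode_uniprot_only": (g & u) - r,
        "refseq_uniprot_only": (r & u) - g,
        "gencode_only": g - r - u,
        "refseq_only": r - g - u,
        "uniprot_only": u - g - r,
    }


if __name__ == "__main__":
    # Toy identifiers for illustration, not real annotations.
    groups = compare_reference_sets(
        gencode={"A", "B", "C", "RT1"},
        refseq={"A", "B", "D"},
        uniprot={"A", "C", "E"},
    )
    for name, genes in groups.items():
        print(name, sorted(genes))
```

The singleton groups (`gencode_only`, `refseq_only`, `uniprot_only`) correspond to the outliers that the thread goes on to explain, for example readthrough genes among the Ensembl/GENCODE singletons.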

Michael Tress

@michaeltress.bsky.social

Hi, if you find any more of these genes, can you tag @appris.bsky.social so that we can look at them? We have a manually annotated section now and we can make changes almost immediately. We agree with UniProt on this one:

APPRIS chooses the 172 amino acid isoform as the principal

September 29, 2025 at 4:52 PM

Michael Tress

@michaeltress.bsky.social

We also find a novel GPRIN2 gene, which has more support than its paralogue. In this case, both genes may be coding.

Segmental duplications on chromosome 10. Duplication and translocation in the q arm of chromosome 10. At least four gene blocks have duplicated in this region since the last common ancestor of humans and chimpanzees. Genes are shown as arrows indicating direction of strand, coding genes are orange arrows, pseudogenes pink. The GPRIN2L paralogue uncovered by the T2T-CHM13 assembly is shown in light blue. The four gene blocks that have duplicated are colour coded. The light green background blocks include the WASH2 genes, the light blue background the GPRIN2 paralogues. The AGAP-FAM25 block (orange background) appears to have duplicated three times with multiple genes in this region alone. Genes in the yellow background block are all pseudogenes. Several of the gene blocks are contiguous. There is no gap between the gene blocks when the blocks are contiguous. The approximate coordinates of the contiguous blocks are indicated above the blocks.

January 7, 2025 at 11:15 AM

Michael Tress

@michaeltress.bsky.social

All 12 WASH1C paralogues, the 5 annotated by UniProtKB as coding and 7 more in the T2T regions of distinct chromosomes, were shown to have duplicated in the human lineage and all could be traced back to chr. 8 WASH1C (a pseudogene) and not to WASH1-20p13.

All 12 WASH1C paralogues are pseudogenes.

Phylogenetic tree of great ape and human genes showing that the WASH1-20p13 localises with other great ape WASH1C genes, while the 12 paralogues bunch together in a separate part of the tree. Genes newly annotated in T2T-CHM13 assembly are labelled with their RefSeq gene names, the likely WASH1 gene, LOC124908094, is highlighted. Great ape WASH1 genes are labelled with the chromosome number in which they are annotated.

January 7, 2025 at 11:09 AM

Michael Tress

@michaeltress.bsky.social

On top of that, we show that all 12 WASH1C paralogues of the chromosome 20 p arm gene (WASH1-20p13) have multiple mutations in conserved regions of WASH1C suggesting that only WASH1-20p13 conserves WASHC1's role in the WASH complex.

The number of non-conserved amino acids (single amino acid variations, SAAVs) and deleted regions in the five full length WASH1 protein isoforms that differ from amino acids that are conserved across primates, mammals and tetrapods. WASH1-20q33 is the predicted protein on the q arm of chromosome 20. WASH1-20p13 has no mutations in conserved regions.

January 7, 2025 at 11:01 AM

Michael Tress

@michaeltress.bsky.social

Peptide evidence supports the WASHC1 isoform found in the subtelomeric region of the p arm of chromosome 20, and does not support the translation of any of the other 12 WASHC1 paralogues, including the WASHC1 gene on chromosome 9, which was thought to be coding.

@bioinfoadv.bsky.social

Alignment of WASH1C paralogues annotated as coding by UniProtKB and the WASH1C isoform from chromosome 20. The chromosome 20 gene and known SNPs capture all the PeptideAtlas peptides. Mapping peptides to WASH1-20p13 and WASH1 sequences in UniProtKB. Peptides are mapped to alignments between the five UniProtKB annotated WASH1 sequences and the newly annotated WASH1-20p13 protein. Residues that peptides map to are colour coded by the number of observations detected for that protein. Dark red, red, and orange fonts indicate the most observed peptides. Residues with a yellow background indicate the position of stop codons and frameshifts in WASH2P and WASH3P. Grey sequence indicates the parts of the predicted sequence of WASH2P and WASH3P that cannot be translated because of these stop codons and frameshifts. Peptides with a light green background and text in bold show those peptides that map uniquely to one of the six sequences. Amino acids columns with a blue background indicate the position of single amino acid differences between the predicted proteins. Deletions relative to ancestral sequences are marked with a pink background.

January 7, 2025 at 10:52 AM

Michael Tress

@michaeltress.bsky.social

This time we also tagged the RefSeq and UniProtKB singleton coding genes with PNC features. Even though these features were designed to label GENCODE coding genes, we still tagged 63.1% of RefSeq genes not in the intersection and 54.7% of UniProtKB genes as potential non-coding.

The percentage of the genes in different gene sets that are tagged as potential non-coding genes. The genes in each set are those shown in figure 1B, “GENCODE [G]” are all Ensembl/GENCODE genes, “RefSeq [R]” are all RefSeq genes, and “UniProtKB [U]” are all UniProtKB genes. “G. not intersect” are all Ensembl/GENCODE genes that are not in the intersection between the three sets. “R. not intersect” are all RefSeq genes that are not in the intersection between the three sets. “U. not intersect” are all UniProtKB genes that are not in the intersection between the three sets.

December 10, 2024 at 12:58 PM
