Hello,
I also cross-posted here: https://github.com/hbctraining/Intro-to-rnaseq-hpc-salmon-flipped/issues/30 .
I use salmon for transcript-level quantification on RNA-seq data which I then want to reduce to gene-level for further analysis. I do so using tximport.
I noticed that after tximport, some genes with the same name are assigned to different gene ids in different replicas. That makes differential expression on this data hard, as lot of the top upregulated genes found by deseq2 actually stem from an unlucky assignment.
To make this more clear, one of my most upregulated genes is ENSG00000232326 (TAP2), but there are only counts for that gene in a single sample. During further inspection, I noticed that there are also counts assigned to ENSG00000228582, ENSG00000206235, ENSG00000223481, ENSG00000204267 - all of which have the gene name TAP2! I'm not quite sure why there is this level of duplication, but the genes seem to exist on different (virtual?) chromosomes. I'd like to have those assigned to the same gene before differential expression. I think that ENSG00000204267 is a sensible target gene for TAP2 as that one is on the primary assembly.
I had some ideas but I am unsure how to proceed.
1) In the quant.sf , I found that transcript ENST00000457634 was assigned to ENSG00000232326. I see that transcript in the cdna.all.fa (that I used as suggested in 08_quasi_alignment_salmon.md, downloaded from http://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
), so I know where that's coming from:
$ grep -i 457634 Homo_sapiens.GRCh38.cdna.all.fa
>ENST00000457634.6 cdna scaffold:GRCh38:HSCHR6_MHC_SSTO_CTG1:4224836:4238030:-1 gene:ENSG00000232326.9 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:TAP2 description:transporter 2, ATP binding cassette subfamily B member [Source:HGNC Symbol;Acc:HGNC:44]
Assembling the annotations from AnnotationHub (as suggested here: https://hbctraining.github.io/DGE_workshop_salmon/lessons/genomic_annotation.html), I could also find that mapping, so that's consistent.
I do not see that transcript in the gtf/gff3 though (obtained from http://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz
and https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/Homo_sapiens.GRCh38.111.gff3.gz respectively
).
$ grep -i 00000457634 Homo_sapiens.GRCh38.111.gff3
$ grep -i 00000457634 Homo_sapiens.GRCh38.111.gtf
So I assume that this transcript and gene are not in the primary assembly and are therefore variants. So it appears easiest to me to just use a "cdna.all.fa" that only contains cDNA from the primary assembly. Is my reasoning correct and if so, where can I obtain such a file?
Then, I should be able to obtain my index from the gtf as suggested here: transcripts missing from tx2gene
2) A different idea is to take the salmon output as is and use a different way of transcript-gene mapping.
I thought about obtaining the mapping "normally" (as in https://hbctraining.github.io/DGE_workshop_salmon/lessons/genomic_annotation.html), then the resulting data frame looks like this:
| tx_id | gene_id | gene_name | entrezid
<...>
34379 | ENST00000612574 | ENSG00000206299 | TAP2 | 6891
34380 | ENST00000383239 | ENSG00000206299 | TAP2 | 6891
34388 | ENST00000485516 | ENSG00000206299 | TAP2 | 6891
34407 | ENST00000469427 | ENSG00000206299 | TAP2 | 6891
34494 | ENST00000451907 | ENSG00000223481 | TAP2 | 6891
34498 | ENST00000452371 | ENSG00000223481 | TAP2 | 6891
34504 | ENST00000462883 | ENSG00000223481 | TAP2 | 6891
34511 | ENST00000468636 | ENSG00000223481 | TAP2 | 6891
34624 | ENST00000445986 | ENSG00000237599 | TAP2 | 6891
<...>
... and I would change the gene_id to a common one - favorably the one on the primary assembly. Is that "okay" and how would I find out which gene_id is the correct one (i.e. which one is on the primary assembly)?
3) I could also build the salmon index from the primary assembly, not the cDNA. In the hbctraining tutorials that I followed this is done like this for the STAR index and mapping (https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html). Do you see issues with that regarding salmon? This way, I'd make sure that only the primary assembly and no variants are available for Salmon to map to.
4) Other ideas are welcome! If you use Salmon yourself, have you also run into this problem?