Question

Tximport mixing annotations

0

Entering edit mode

7.0 years ago

Tania ▴ 180

Hi Everyone

I used the annotation from "GRCh37_latest_genomic.gff" to build tx2gene while my transcripts are from "GRCh37_latest_rna.fna.gz"

txdb <-makeTxDbFromGFF("~/Desktop/work/databases/GRCh37_latest_genomic.gff.gz")
k <- keys(txdb, keytype = "GENEID")
df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "GENEID")
tx2gene <- df[, 2:1]

I am getting this error, aren't those files correspond to each other?

Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) : 

  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

RNA-Seq tximport txdb • 4.7k views

ADD COMMENT • link 7.0 years ago by Tania ▴ 180

0

Entering edit mode

Hi Tania,

have you tried this command on an unzipped GFF ? Cheers, Michael

ADD REPLY • link 7.0 years ago by michael.ante ★ 3.9k

0

Entering edit mode

Tried that now, doesn't work too. Thanks though !

ADD REPLY • link 7.0 years ago by Tania ▴ 180

0

Entering edit mode

Looking at both of these files, the IDs from the FASTA file that you used are in RefSeq format. Thus, you may have to specify

df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "Name")

or

df <- select(txdb, keys = k,  columns = "TXNAME", keytype = "transcript_id")

I have used tximport in the past but I always curate my own annotation reference. I always use the GENCODE transcripts for alignment and annotation, as these contain HGNC IDs in the FASTA headers.

head GRCh37_latest_rna.fna

NM_000014.5 Homo sapiens alpha-2-macroglobulin (A2M), transcript variant 1, mRNA

ATACAAGAGATGTGAGAAGCACCATAAAAGGCGTTGTGAGGAGTTGTGGGGGAGTGAGGGAGAGAAGAGG TTGAAAAGCTTATTAGCTGCTGTACGGTAAAAGTGAGCTCTTACGGGAATGGGAATGTAGTTTTAGCCCT CCAGGGATTCTATTTAGCCCGCCAGGAATTAACCTTGACTATAAATAGGCCATCAATGACCTTTCCAGAG AATGTTCAGAGACCTCAACTTTGTTTAGAGATCTTGTGTGGGTGGAACTTCCTGTTTGCACACAGAGCAG CATAAAGCCCAGTTGCTTTGGGAAGTGTTTGGGACCAGATGGATTGTAGGGAGTAGGGTACAATACAGTC TGTTCTCCTCCAGCTCCTTCTTTCTGCAACATGGGGAAGAACAAACTCCTTCATCCAAGTCTGGTTCTTC TCCTCTTGGTCCTCCTGCCCACAGACGCCTCAGTCTCTGGAAAACCGCAGTATATGGTTCTGGTCCCCTC

grep -e "NM_000014.5" GRCh37_latest_genomic.gff | head

NC_000012.11 BestRefSeq mRNA 9220304 9268825 . - . ID=rna36219;Parent=gene24947;Dbxref=GeneID:2,Genbank:NM_000014.5,HGNC:HGNC:7,MIM:103950;Name=NM_000014.5;Note=The RefSeq transcript has 1 substitution compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=A2M;product=alpha-2-macroglobulin%2C transcript variant 1;transcript_id=NM_000014.5

NC_000012.11 BestRefSeq exon 9268360 9268825 . - . ID=id345908;Parent=rna36219;Dbxref=GeneID:2,Genbank:NM_000014.5,HGNC:HGNC:7,MIM:103950;Note=The RefSeq transcript has 1 substitution compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=A2M;product=alpha-2-macroglobulin%2C transcript variant 1;transcript_id=NM_000014.5

NC_000012.11 BestRefSeq exon 9265956 9266139 . - . ID=id345909;Parent=rna36219;Dbxref=GeneID:2,Genbank:NM_000014.5,HGNC:HGNC:7,MIM:103950;Note=The RefSeq transcript has 1 substitution compared to this genomic sequence;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=A2M;product=alpha-2-macroglobulin%2C transcript variant 1;transcript_id=NM_000014.5

ADD REPLY • link 7.0 years ago by Kevin Blighe 88k

0

Entering edit mode

When I search for the transcripts manually in the quantifications , they do exist in tx2gene, but not sure why tximport did not see them. I think I will try re-annotating with Genecode and see if this will work.

Thanks

ADD REPLY • link 7.0 years ago by Tania ▴ 180

1

Entering edit mode

Okay, if you run into issues, then you can just download my own GENCODE tximport file from here.

I normally used this like:

TranscriptToGeneMap <- read.table("Library/TranscriptsTxImport.tsv", header=TRUE, sep="\t", skip=0, stringsAsFactors=FALSE)

#Read in the counts
#   txIn, input counts are transcript level?
#   txOut, output counts are transcript level?
txi <- tximport(files, type="kallisto", txIn=TRUE, txOut=FALSE, tx2gene=TranscriptToGeneMap, countsFromAbundance="no") #, reader=fread)

When I last used this, I was not aware of the makeTxDbFromGFF function, but it looks useful and may have been added since I first used tximport (2 year ago).

ADD REPLY • link 7.0 years ago by Kevin Blighe 88k

1

Entering edit mode

Thanks Kevin so much. I will try this and see hopefully

ADD REPLY • link 7.0 years ago by Tania ▴ 180

0

Entering edit mode

Hi @Kevin, The output looks like this, what is protein coding or rna associated with the gene. Do you have explanation? Sorry don't have too much biology background.

ADH7.protein_coding",-9.35112818723107,5.08564645523806,7.26455080231657e-24,8.23800060982699e-21
"TNR.protein_coding",6.76699157467661,5.56020033264359,5.98150908346311e-22,3.39151565032358e-19
"SCN3B.protein_coding",5.68195884233326,6.04515812873865,2.17776757060652e-21,8.23196141689263e-19
"CECR2.protein_coding",4.56376413767869,4.94697309522689,9.84755015258283e-21,2.79178046825723e-18
"SNORA70.snoRNA",2.46479498252603,8.222964993941,3.74758538408327e-08,1.41658727518348e-06
"RN7SKP106.misc_RNA",7.02192630891482,1.41996667770098,3.87278897378951e-08,1.41669119234752e-06
"CSRP1.protein_coding",-1.83461664661011,8.23257557071387,4.84852163343817e-08,1.71819485384965e-06

ADD REPLY • link 7.0 years ago by Tania ▴ 180

1

Entering edit mode

Oh, well, the GENCODE files to which I pointed you contain all of the RNA genes identified by the ENCODE project, which totals up to ~200,000 different RNAs. Only ~21,000 of these RNAs are actually further translated into proteins; hence, the 'protein_coding' suffix. These protein coding RNA genes comprise the most well known genes, such as the HOX gene group, TP53, BRCA1, CD20, CD79A/MS4A1, CR2, etc.

All of the other genes do not encode for a protein but may still have functionality, i.e., non-coding RNAs (ncRNAs). Well-known ones include XIST, a non-coding RNA which in females helps to deactivate one X chromosome through epigenetic silencing. Many other ncRNAs are transcribed from the antisense strand of another RNA and may interfere with the expression of the sense RNA. TSIX, for example, is transcribed on the opposite strand of XIST.

There is further information on all 'biotypes' in GENCODE here: https://www.gencodegenes.org/gencode_biotypes.html

You will most likely not be interested in all of these. For example, there are ~40,000 pseudogenes alone.