Question

tx2gene.txt : transcript-to-gene mapping file

0

Entering edit mode

19 months ago

tvibhaps ▴ 10

Hi, I am trying to quantify gene count from transcript abundance (from kallisto, salmon etc.) using Tximport. For that i have to create a transcript to gene mapping file. How can i create this? I created one with from GCF_013265735.2_USDA_OmykA_1.1_rna.fasta (Rainbow trout) fro ncbi as following-

grep '>'  GCF_013265735.2_USDA_OmykA_1.1_rna.fasta | awk -F '),' \ 
              '{if (NF==2) {print $1} else {print$1, $(NF-1)}}' | \
               awk '{print $1, $NF}' > tx2gene.txt

its head 5 counts are as below-

XM_036964134.1  LOC110513060
XM_036964135.1  LOC110513060
XM_036939654.1  LOC110539143
XM_036952300.1  LOC110495859

I was able to generate 119270 transcripts with their corresponding geneid(s) .

How can i know that i created it correctly? since there are several genes with multiple loci and several genes with multiple transcripts? Thanks in advance.

tx2gene mapping • 1.7k views

ADD COMMENT • link updated 19 months ago by Istvan Albert 101k • written 19 months ago by tvibhaps ▴ 10

0

Entering edit mode

Not sure what the question is. A gene should have multiple transcripts which is the reason of merging data at the gene level.

It basically sums up the transcripts for each gene.

Now if the question is whether your awk script is correct, that depends on what the FASTA header looks like there. You should post what you are parsing out, otherwise we can't tell.

ADD REPLY • link 19 months ago by Istvan Albert 101k

0

Entering edit mode

I understand that my awk script is correct. My fasta header header looks like -

NM_001124183.1 Oncorhynchus mykiss SRY-box transcription factor 11b (sox11b), mRNA , so i parsed accordingly what my script showed.

My main question is

whether i should extract trascript_ids and their mapped gene_ids from the transcriptome (rna.fasta) file or annotation (.gtf) file of genome? and
How do we make sure its correctness? since I used this (generated from rna.fasta having 119720 transcript ids and geneids) genemapping file for kallisto and got ~ 50,000 geecounts using tximport. On the other hand, ht-seq count and tximport from RSEM provides approx 70,000 genecounts for the same sample. RSEM does not need this maaping file and ht-seq count uses annotation file for the count.

Moreover, i created this mapping file using another way from annotation file of rainbow trout and have 131970 transcript ids and their geneids. Here is its process and output preview-

>gtf <- rtracklayer::import("GCF_013265735.2_USDA_OmykA_1.1_genomic.gtf.gz")
##Get a unique transcript2gene mapping table-
>tx2gene <- unique(data.frame(gtf[gtf$type=="exon"])[,c("transcript_id", "gene_id")])

Here is the overview of file- each row tells the number of counts of the corresponding ids. transcript_id gene_id

1  XM_036976852.1 LOC110523811
9  XM_036976771.1 LOC110523811
18 XM_036976810.1 LOC110523811

I am going to run it for tximport run of my kallisto output and will see how may genecount i get comapred to ht-seq and tximport run of rsem output of the same sample. Thank you for your feedback and suggestions if any.

ADD REPLY • link updated 19 months ago by Istvan Albert 101k • written 19 months ago by tvibhaps ▴ 10

1

Entering edit mode

Download genome.fasta and genome.gtf (from ensembl), then run:

pip install kb_python

kb ref -i index.idx -g t2g.txt -f1 cdna.fa genome.fasta genome.gtf

The mapping file produced (t2g.txt) has always been reliable in my experience.

ADD REPLY • link 19 months ago by dsull ★ 6.9k

1

Entering edit mode

parsing the GTF is probably the most reliable way to connect the gene id to transcript id, you should probably make the gene_id, transcript_id pair unique (appear a single time)

ADD REPLY • link 19 months ago by Istvan Albert 101k