Entering edit mode
19 months ago
tvibhaps
▴
10
Hi, I am trying to quantify gene count from transcript abundance (from kallisto, salmon etc.) using Tximport. For that i have to create a transcript to gene mapping file. How can i create this? I created one with from GCF_013265735.2_USDA_OmykA_1.1_rna.fasta (Rainbow trout) fro ncbi as following-
grep '>' GCF_013265735.2_USDA_OmykA_1.1_rna.fasta | awk -F '),' \
'{if (NF==2) {print $1} else {print$1, $(NF-1)}}' | \
awk '{print $1, $NF}' > tx2gene.txt
its head 5 counts are as below-
XM_036964134.1 LOC110513060
XM_036964135.1 LOC110513060
XM_036939654.1 LOC110539143
XM_036952300.1 LOC110495859
I was able to generate 119270 transcripts with their corresponding geneid(s) .
How can i know that i created it correctly? since there are several genes with multiple loci and several genes with multiple transcripts? Thanks in advance.
Not sure what the question is. A gene should have multiple transcripts which is the reason of merging data at the gene level.
It basically sums up the transcripts for each gene.
Now if the question is whether your awk script is correct, that depends on what the FASTA header looks like there. You should post what you are parsing out, otherwise we can't tell.
I understand that my awk script is correct. My fasta header header looks like -
My main question is
Moreover, i created this mapping file using another way from annotation file of rainbow trout and have 131970 transcript ids and their geneids. Here is its process and output preview-
Here is the overview of file- each row tells the number of counts of the corresponding ids. transcript_id gene_id
I am going to run it for tximport run of my kallisto output and will see how may genecount i get comapred to ht-seq and tximport run of rsem output of the same sample. Thank you for your feedback and suggestions if any.
Download genome.fasta and genome.gtf (from ensembl), then run:
The mapping file produced (t2g.txt) has always been reliable in my experience.
parsing the GTF is probably the most reliable way to connect the gene id to transcript id, you should probably make the gene_id, transcript_id pair unique (appear a single time)