Hi all,
I am following hisat2, stringtie and Deseq2 pipeline. I obtained gene counts using HTSeq by aligning reads against stringtie merged gtf. I aligned reads with HISAT2 without reference GFF file (to avoid biasing alignments to annotated splice junctions). While mapping using Stringtie I used reference gff file.
When I see the results I could see that stringtie has assigned transcripts of two adjacent genes to same MSTRG ID.
For eg;
Genome GFF file from NCBI
NC_037550.1 Gnomon lnc_RNA 107569536 107577599 . - . ID=rna-XR_003110259.1;Parent=gene-LOC112585719;Dbxref=GeneID:112585719,Genbank:XR_003110259.1;Name=XR_003110259.1;gbkey=ncRNA;
NC_037550.1 Gnomon gene 107580147 107581316 . - . ID=gene-C6H1orf122;Dbxref=GeneID:102404938;Name=C6H1orf122;gbkey=Gene;gene=C6H1orf122;gene_biotype=protein_coding
Stringtie_merged GTF
NC_037550.1 StringTie transcript 107569504 107581424 1000 - . gene_id "MSTRG.10147"; transcript_id "MSTRG.10147.1";
So 1. Why does stringtie assign same ID to adjacent loci? Is that because I did not use reference annotation while aligning?
- How to quantify, when stringtie assigns reads of two adjacent loci to same ID?
Thank you for the link. I used
prepDE.py -i ballgown -g gene_count_matrix.csv -t transcript_count_matrix.csv
to calculate transcript-level expression count.Now I guess I should use
summarizeToGene
oftximport
package to summarize the counts to each gene. Can you please tell or attach a link on how to import transcript_count_matrix.csv into tximport to summarize the transcript counts?. There are no examples on how to import stringtie generated transcript-count-matrix.It is a 4 step opperation: 1) read the "transcript_count_matrix.csv" file into R with
read.table()
orread_tsv()
. 2) Read in the annotation (either using the same function as in step 1 or you can use rtracklayer::import() to import the merged GTF file. 3) Manually modify the annotation to split up the "merged" genes by giving them new gene-ids. 4) UseIsoformSwitchAnalyzeR::isoformToGeneExp()
ortximport::summarizeToGene()
to summarize the isoform expression to gene expression.Doubt on assigning ids while manually modifying annotation to split the merged genes. For eg; for the transcript mentioned in my question, I planned to give MSTRG.10147 for transcripts falling between 107569536-107577599 and MSTRG.10147a for transcripts falling between 107580147-107581316
But there seems to be overlapping transcripts also in the merged gtf and I don't know how to name them.
Pl. suggest me how to handle those transcripts
I do not know how to handle transcripts that span both of them - which is the reason why StringTie merged the genes in the first place. There are two possibilities: random assignment or removal but before you start doing something like that I'd suggest you take a look at the raw data. A efficient way of doing that is loading it into a genome browser - there are many out there but I usually use the IGV browser since it is local and directly supports loading of bam and GTF files so you can take a closer look at the data to figure out if it is a problem with StringTie or the data actually supports this novel transcript. Also note that StringTie2 was just released and should be a good deal better so re-running with StringTie2 is also a possibility.
There is a same geneid for a region spanning 53kb.
There are different gene_ids in individual sample gtfs. Only in the merged gtf file the MSTRG IDS are same for few loci.
Previously I had used the following commands to generate sample gtfs and merged gtf. Kindly tell me if the options I used are ok.
StringTie version 1.3.5 Read length 101 bp.
./stringtie -G UOA_WB_1_genomic.gff sam_sorted.bam -o output.gtf -A gene_abund.tab -C cov_ref.gtf -v
./stringtie --merge -G UOA_WB_1_genomic.gff -c 3 -T 1 -o stringtiemerged.gtf mergelist.txt
Loos good to me - the only thing you could try was increasing -f to 0.05 in the --merge run. But again it might be that there are RNAseq reads joining the two regions together - you will not know unless you look at the raw - which also mean that you still actually don't know if it is an error or not - could simply be the known annotation that was lacking...
Thank you for your advise. I viewed the raw data in IGV. I found that there are splice junctions connecting the two loci and in some cases there are overlapping genes. There are around 121 such loci. Is it ok to perform the differential expression analysis and report these merged loci with co-ordinates if they are differentially expressed?
Nice you figured it out. And yes that is okay :-)
Thank you for your patient suggestions.