Hello everyone, I hope you are well.
I am writing this post because I have a question, or rather a problem, with my workflow.
I ran an RNA-seq processing workflow as follows:
quality control - HISAT2 - StringTie - DESeq2
It is a simple, standard workflow, and it gave me meaningful differential expression results. HISAT2 produces .sam files, which I convert to .bam with SAMtools so that StringTie can work with them. StringTie then generates GTF output files.
The GTF annotation file that StringTie produces naturally contains no sequences for the genes it annotates. StringTie assigns IDs to these genes, and DESeq2 keeps using those IDs further along in my workflow.
Unfortunately, annotation files can be incomplete, and in that case StringTie simply assigns an ID to a putative gene.
In DESeq2 I can run the differential expression analysis, and it tells me which genes are overexpressed and which are not. But when I look at the genes with the most activity, they appear under the IDs assigned by StringTie.
I would like to extract the FASTA sequences for those IDs and run an alignment (for example with BLAST) that tells me which gene each one is likely to be.
I hope I'm not crazy and that what I'm describing can be done.
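Before extracting sequences, it may be worth checking whether the StringTie GTF already names some of these genes. When StringTie is run with a reference annotation, transcripts that match a known one usually carry `ref_gene_id` / `ref_gene_name` attributes. Here is a minimal sketch, assuming that is how StringTie was run; the function name, the `MSTRG.*` IDs and `GENE1` are made up for illustration.

```python
import re


def map_stringtie_ids(gtf_lines):
    """Map each StringTie gene_id to a reference gene name, or None if no name is present."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only look at well-formed "transcript" records (9 tab-separated columns).
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        # Column 9 holds key "value"; pairs.
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        name = attrs.get("ref_gene_name") or attrs.get("ref_gene_id")
        # Never overwrite a name that was already found with None.
        if name or gene_id not in mapping:
            mapping[gene_id] = name
    return mapping


# Toy records (hypothetical IDs and gene name), not real StringTie output.
toy_records = [
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; ref_gene_name "GENE1";',
    'chr1\tStringTie\ttranscript\t2000\t2600\t1000\t-\t.\t'
    'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
]
id_to_name = map_stringtie_ids(toy_records)
print(id_to_name)
```

IDs that map to `None` here are the ones that would actually need a sequence extraction and a BLAST search.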
What is the reference (FASTA) file? You can probably follow the approach below:
It would help to post the data rather than describe it, so that we can understand the issue.
The reference FASTA file is version 3 of the Canis lupus familiaris genome. It is available on the UCSC portal, along with the annotation file listed there.
OK, I'm going to read up on getfasta to see how it goes, and I'll let you know.
Thanks ;D
I think I could take the coordinates that StringTie reports and use them in getfasta with the reference genome. I'll try that. Thanks
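A rough sketch of that coordinate idea in plain Python, in case it helps to see the bookkeeping: GTF coordinates are 1-based and inclusive, and features on the minus strand need to be reverse-complemented. The function names, the toy genome and the `MSTRG.*` IDs are made up for illustration. This is not how getfasta itself is implemented, and it returns the whole transcript span, introns included.

```python
import re

# Translation table for complementing DNA (upper and lower case, plus N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(path):
    """Load a FASTA file into {sequence name: sequence}. Holds everything in memory."""
    genome, name, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    genome[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


def extract_transcript_spans(gtf_lines, genome, wanted_gene_ids):
    """Return {transcript_id: genomic sequence} for transcripts of the wanted genes."""
    sequences = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        gene = re.search(r'gene_id "([^"]+)"', fields[8])
        tx = re.search(r'transcript_id "([^"]+)"', fields[8])
        if not gene or not tx or gene.group(1) not in wanted_gene_ids:
            continue
        # GTF is 1-based inclusive; Python slices are 0-based, end-exclusive.
        seq = genome[chrom][start - 1:end]
        if strand == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        sequences[tx.group(1)] = seq
    return sequences


# Toy data (hypothetical), standing in for the real genome and StringTie GTF.
toy_genome = {"chr1": "AAACCCGGGTTT"}
toy_gtf = [
    'chr1\tStringTie\ttranscript\t4\t9\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
    'chr1\tStringTie\ttranscript\t1\t6\t1000\t-\t.\tgene_id "MSTRG.2"; transcript_id "MSTRG.2.1";',
    'chr1\tStringTie\ttranscript\t1\t3\t1000\t+\t.\tgene_id "MSTRG.3"; transcript_id "MSTRG.3.1";',
]
spans = extract_transcript_spans(toy_gtf, toy_genome, {"MSTRG.1", "MSTRG.2"})
for tx_id, seq in spans.items():
    print(f">{tx_id}\n{seq}")
```

For a full mammalian genome, loading every chromosome into a dictionary like this is memory-hungry, so an indexed reader or a dedicated tool is the more practical route.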
Then you can use the gffread (LINK) utility to extract transcript sequences with the GTF file you get.
Wow, thank you, I had not seen that application on the CCB website. Thanks, I will also try this option.
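For anyone landing here later, the usual gffread call for this is `gffread -w <out.fa> -g <genome.fa> <in.gtf>`, where `-w` writes the spliced exon sequence of each transcript and `-g` points at the genome FASTA. Below is a small hedged wrapper that builds and runs it from Python. The file names are placeholders, the helper names are made up, and it assumes gffread is installed and on the PATH.

```python
import shutil
import subprocess


def build_gffread_command(genome_fasta, gtf_path, out_fasta):
    """Build the argument list for extracting spliced transcript sequences."""
    return ["gffread", "-w", out_fasta, "-g", genome_fasta, gtf_path]


def run_gffread(genome_fasta, gtf_path, out_fasta):
    """Run gffread, failing early with a clear message if it is not installed."""
    if shutil.which("gffread") is None:
        raise RuntimeError("gffread was not found on PATH")
    subprocess.run(build_gffread_command(genome_fasta, gtf_path, out_fasta), check=True)


# Placeholder file names; replace with the real genome FASTA and StringTie GTF.
command = build_gffread_command("genome.fa", "stringtie.gtf", "transcripts.fa")
print(" ".join(command))
```

The resulting FASTA headers carry the transcript IDs from the GTF, so the sequences for the IDs of interest from DESeq2 can be picked out and sent to BLAST.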
The same can be done with AGAT: Extracting genomic feature sequences from GTF/GFF files with AGAT
There are many tools to perform this task