5.3 years ago
John
Hi
I am using RSEM (with bowtie2) for alignment and then gene counting. I am using the RefSeq annotation (GFF3) and the genomic.fna reference FASTA file from NCBI. RSEM can convert the GFF3 file to GTF.
How can I subset the GTF (or GFF3) file by gene name? I want to extract the annotation (GTF) for a particular gene, extract that gene's sequence from the reference FASTA file, and then align against only that sequence.
The goal is mainly to save time by not aligning against the whole genome.
Thanks in anticipation.
This could force some reads to align to your gene that would normally have aligned somewhere else.
That's what happened. There are more reads than I expected.
You should not do that! Aligning to only your gene will bias the analysis, because your RNA-seq experiment reflects the entire transcriptome, not just your gene.
Yes, just switch to a pseudo-aligner if you want to increase the speed. That's sufficient for gene expression analysis.
Can't you just grep for the gene name of interest and redirect the output to a file? All the lines relevant to that gene should have the ID, and this would select and place all lines with the given gene id into a single file.
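One caveat with a plain grep is that it matches substrings, so a search for one gene name can also pull in lines for genes whose names merely contain it. A minimal Python sketch of an exact-match filter is below. It assumes a standard tab-separated GTF whose ninth column holds `key "value";` pairs; the function name `filter_gtf_by_gene`, the attribute keys it checks (`gene_id`, `gene_name`, `gene`) and the sample lines are illustrative assumptions, so check which key your RSEM-converted GTF actually uses.

```python
import re

# Matches GTF attribute pairs such as: gene_id "ABC1";
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def filter_gtf_by_gene(lines, gene, keys=("gene_id", "gene_name", "gene")):
    """Yield GTF lines whose attribute column names `gene` exactly.

    `keys` lists the attribute names to compare against (an assumption;
    adjust it to match the attributes present in your file).
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        if any(attrs.get(key) == gene for key in keys):
            yield line


# Hypothetical records: a substring search for ABC1 would also hit ABC10.
sample = [
    'chr1\tRefSeq\texon\t100\t200\t.\t+\t.\tgene_id "ABC1"; transcript_id "NM_1";\n',
    'chr1\tRefSeq\texon\t300\t400\t.\t+\t.\tgene_id "ABC10"; transcript_id "NM_2";\n',
]
hits = list(filter_gtf_by_gene(sample, "ABC1"))
print(len(hits))  # prints 1: only the ABC1 record is kept
```

For a real file, pass an open file handle instead of `sample` and write the yielded lines to a new GTF. The comments above about alignment bias still apply to whatever you do with the subset.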
If you are only interested in gene expression, you could speed up your analysis by using a pseudo-aligner like salmon, which is much faster than "real" aligner programs.
Or, if you really need nucleotide-precise alignment, then I would use STAR, which is a little faster and has higher fidelity.
Edit: I moved it into the comments, but I addressed the issue of running time, since the overall question was how to speed up the alignment process.
Could you rewrite this answer to address the OP's question about GTF files?