A few genes are poorly annotated in the database and their status is marked as predicted. I want to check the actual annotation of these genes. I also want to see whether any of them have isoforms.
To address these points, I have generated RNA-seq data from the developing tissues where the genes are expressed. Let us call the RNA-seq read files:
1_R1, 1_R2
2_R1, 2_R2
3_R1, 3_R2
4_R1, 4_R2
5_R1, 5_R2
I have downloaded the genome file and the GTF file, say genome.fa and genes.gtf.
I have manually added the predicted annotation of the genes of interest to the GTF file in the proper format.
Next, I want to do the mapping and assembly with TopHat and Cufflinks to address the points above.
My TopHat command will be:
tophat -p 40 -G genes.gtf -o <tophat_output_file> <indexed-genome.fa> 1_R1 1_R2
(Same for the other four)
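To make the "same for the other four" step explicit, here is a minimal Python sketch that only builds and prints the five TopHat command lines; it does not run anything. The flags are the ones from my command above. The index basename genome_index and the tophat_out_<n> output directory names are placeholders I made up for illustration.

```python
import shlex

# Sample prefixes as named in this post: 1_R1/1_R2 ... 5_R1/5_R2.
SAMPLES = ["1", "2", "3", "4", "5"]


def tophat_command(sample, index="genome_index", gtf="genes.gtf", threads=40):
    """Return the TopHat argument list for one paired-end sample.

    `index` and the output directory name are hypothetical placeholders.
    """
    return [
        "tophat",
        "-p", str(threads),
        "-G", gtf,
        "-o", f"tophat_out_{sample}",  # hypothetical output directory
        index,                          # hypothetical Bowtie index basename
        f"{sample}_R1",
        f"{sample}_R2",
    ]


if __name__ == "__main__":
    # Dry run: print each command so it can be checked before running it.
    for sample in SAMPLES:
        print(" ".join(shlex.quote(arg) for arg in tophat_command(sample)))
```

Each printed line can be checked by eye and then pasted into the shell, or the argument list can be passed to subprocess.run once the names are correct.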
My Cufflinks command will be:
cufflinks \
-o <cufflink_output_file> \
-p 12 \
-g <genes.gtf> \
-b <genome_file> \
--max-bundle-frags 1000000000000 \
--multi-read-correct <tophat_output_.file-accepted.bam>
I will use the IGV browser to visualize the mapped reads (tophat_output_.file-accepted.bam). I will use it again to visualize and compare the original GTF file and the Cufflinks GTF file. I will use the TopHat junctions.bed file to visualize the exon-exon junctions. I hope that comparing the original GTF with the Cufflinks GTF will help me annotate the genes correctly, and that the junctions will give a clue to any isoforms of the genes. The expression of the genes will be confirmed by the Cufflinks genes.fpkm_tracking file. The expression of the isoforms, if any, will be confirmed by the isoforms.fpkm_tracking file.
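For the isoform check, this is roughly how I plan to read the isoform-level FPKM table that Cufflinks writes (isoforms.fpkm_tracking). It is a minimal sketch that assumes the standard tab-separated header with the tracking_id, gene_id, FPKM and FPKM_status columns. The function name and the min_fpkm cutoff of 1.0 are my own arbitrary choices, not anything recommended by Cufflinks.

```python
import csv
from collections import defaultdict


def expressed_isoforms(tracking_lines, min_fpkm=1.0):
    """Group expressed transcript IDs by gene from fpkm_tracking lines.

    `tracking_lines` is any iterable of text lines (an open file works).
    `min_fpkm` is an arbitrary illustrative cutoff.
    """
    reader = csv.DictReader(tracking_lines, delimiter="\t")
    per_gene = defaultdict(list)
    for row in reader:
        # Skip transcripts whose quantification did not finish cleanly.
        if row["FPKM_status"] != "OK":
            continue
        if float(row["FPKM"]) >= min_fpkm:
            per_gene[row["gene_id"]].append(row["tracking_id"])
    return dict(per_gene)
```

Calling it on the open file, for example with open("isoforms.fpkm_tracking", newline="") as handle, returns a dictionary of gene IDs to expressed transcript IDs. Genes that end up with two or more transcripts are the candidates I would then inspect in IGV.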
Are the TopHat and Cufflinks commands appropriate for correctly annotating the genes and finding their isoforms? Any suggestions will be highly appreciated.
Thanks. Also, do you have any idea how to filter false junctions from the junctions.bed file produced by TopHat?
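The only filter I have thought of so far is a simple cutoff on read support, sketched below in Python. It assumes the usual TopHat junctions.bed layout: column 5 (score) is the number of alignments spanning the junction, and column 11 (blockSizes) holds the two flanking overhang lengths. The thresholds min_reads=5 and min_overhang=8 are arbitrary values I picked for illustration, and I do not know whether this is a sound way to separate true junctions from false ones.

```python
def filter_junctions(lines, min_reads=5, min_overhang=8):
    """Keep BED junction lines with enough support and overhang.

    Both thresholds are illustrative assumptions, not recommended values.
    Header lines (track/browser/#) are passed through unchanged.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(("track", "browser", "#")):
            kept.append(line)
            continue
        fields = line.split("\t")
        score = int(fields[4])  # alignments spanning the junction
        # blockSizes: overhang on each side of the junction
        block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
        if score >= min_reads and min(block_sizes) >= min_overhang:
            kept.append(line)
    return kept
```

The filtered lines can be written to a new BED file and loaded in IGV next to the original junctions.bed, to see which junctions were dropped.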