What we get from a GTF annotation file looks like this:
chr1 hg19_refFlat exon 11874 12227 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1";
chr1 hg19_refFlat exon 12613 12721 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1";
chr1 hg19_refFlat exon 13221 14409 0.000000 + . gene_id "DDX11L1"; transcript_id "DDX11L1";
chr1 hg19_refFlat exon 14362 14829 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 14970 15038 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
chr1 hg19_refFlat exon 15796 15947 0.000000 - . gene_id "WASH7P"; transcript_id "WASH7P";
My question is about the strand information. Does "+" mean that the reference sequence is the forward strand (and is that also the same strand as the mRNA, coding sequence, or sense strand)? And does "-" mean that the reference sequence is the reverse strand, so that the 'real' gene sequence is its reverse complement?
In other words, is the reference sequence always given on one strand (+?), or is it on the forward or reverse strand depending on the "+" and "-" in the annotation?
I'm interested in this because I want to know whether the SNPs in the VCF file are on the forward or reverse strand. For example, for a T > C conversion, does this happen on the sense strand or the antisense strand?
Thanks for any help,
Jun
Thanks!
Is the sequence given by the hg19 reference genome always the forward (+) strand, or is it the same as the coding sequence (mRNA sequence)?
Basically, if I see a T in the reference genome within a CDS, does this mean that the mRNA sequence also has a T at this position (ignoring SNPs in this case), or could it also be an A, depending on the "+" or "-" annotation in the GTF file?
I'm not sure I understand what you mean. If there is a + behind a gene in the GTF file then that means that the gene / mRNA / protein is on the forward strand of the genome. As the sequence of the genome is normally only given as the forward strand, in this case gene sequence and genome sequence will be identical. If there is a - behind a gene in the GTF file then that means that the gene / mRNA / protein is on the reverse strand of the genome. In this case you have to reverse complement the genome sequence to get the gene sequence. Is that what you meant?
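In case it helps, here is a minimal Python sketch of that rule. The function names (reverse_complement, sense_sequence, sense_alleles) are made up for illustration and do not come from any library. It assumes the input sequence is taken from the forward strand of the reference, and that the VCF alleles are reported on the forward strand of the reference as well, which is the usual VCF convention.

```python
# Complement table for DNA bases (upper and lower case; N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sense_sequence(genome_seq, strand):
    """Convert a forward-strand reference sequence to the gene (sense) strand.

    genome_seq: sequence as it appears in the reference FASTA (forward strand).
    strand: the strand column of the GTF line, "+" or "-".
    """
    if strand == "+":
        # Gene is on the forward strand: genome and gene sequence are identical.
        return genome_seq
    if strand == "-":
        # Gene is on the reverse strand: reverse complement the genome sequence.
        return reverse_complement(genome_seq)
    raise ValueError("strand must be '+' or '-', got %r" % strand)


def sense_alleles(ref, alt, strand):
    """Express forward-strand REF/ALT alleles relative to the gene's strand.

    Assumes ref and alt are given on the forward strand of the reference.
    """
    return sense_sequence(ref, strand), sense_sequence(alt, strand)


if __name__ == "__main__":
    print(sense_sequence("ATGC", "+"))    # ATGC
    print(sense_sequence("ATGC", "-"))    # GCAT
    print(sense_alleles("T", "C", "+"))   # ('T', 'C')
    print(sense_alleles("T", "C", "-"))   # ('A', 'G')
```

So a T > C in the VCF that falls in a "-" gene such as WASH7P would read as A > G on the sense strand of that gene, while in a "+" gene such as DDX11L1 it stays T > C.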
Thanks! That is exactly the answer I have been looking for everywhere. Thanks again.