Hi guys,
I want some figures of tss/tes distribution so i mapped my reads from rna-seq on transcripts (22k transcripts). The transcripts fasta is downloaded from gencode and for each gene i selected the longest transcript isoform. I used transcript header as ID, so i could keep some useful information for downstream analysis.
But my pb is, when i check the .bed file (converted by bedtools), i found some mapping coordinates exceed the length of transcript, the transcript ENST00000623083.3
has 1687 bps and reads were mapped on position 2546-2621
:
ENST00000623083.3|ENSG00000279457.3|-|-|FO538757.1-202|FO538757.1|**1687**|protein_coding| **2546 2621** NB500892:13:HKCM7BGXX:1:12103:14291:17185 1 - 2546 2621 255,0,0 1 75 0
ENST00000623083.3|ENSG00000279457.3|-|-|FO538757.1-202|FO538757.1|**1687**|protein_coding| **2550 2625** NB500892:13:HKCM7BGXX:1:11204:2849:19931 1 - 2550 2625 255,0,0 1 75 0
ENST00000623083.3|ENSG00000279457.3|-|-|FO538757.1-202|FO538757.1|**1687**|protein_coding| **2564 2639** NB500892:13:HKCM7BGXX:2:13210:21760:16153 1 - 2564 2639 255,0,0 1 75 0
How could this happen? I am totally lost
rna-seq is single-end and i used default parameter for alignment.
Thank you in advance
DId you align in
--local
mode? I guess not, if you use default parameters, but worth asking.If not, perhaps you could check, for all the reads that map on that transcript into your SAM file, the rightmost mapping position (field 4+read_length). If the numbers exceed the transcript length also there, then it's a mapping problem. If they don't, it's a conversion problem.
To do so, a quick way would be:
(remove the "<>" signs when you insert values)
If the output number is 0, no read mapped out of bounds and the problem is your conversion. If the output number is > 0, then there are some mapping problems.
Thank you Macspider, I check my data, as you said it's some mapping problems. but I re-mapped read by using --local, problems remain the same.
Ok. Then I have a question: are you sure you are mapping your reads against the mRNA sequences, and not aganist the gene region sequence that codes for that mRNA? In the latter case, you will have a longer sequence because introns.
you are right, problem is from my reference transcript. The transcript i got was incorrect, some header was missing
thank you again Macspider