Dear all,
I used Bowtie to map RNA-seq reads on the genome, and got a sam file. I also have a LIST file with many ranges, below shows an example. I would like to exact the reads that fall into the ranges in the LIST file from the sam file.
LIST file
Chr1 + 2-999
Chr1 + 2999-5000
Chr3 + 90-2000
......
I am new user of Bowtie,and can't figure out the format of sam file. If I can get the read start and stop position, I can write a simple script to check if the reads locate at the ranges in the LIST file. Could anyone help? Certainly if there exist available tools, that will simplify the job. Thanks in advance!
Thanks a lot Brian, your comment is really helpful. Column 3 is reference ID, and column 4 is the read start position on the ref. I have a further question: what' s the read end position on the ref in SAM file? Thanks!
It's not in the required columns. You have to calculate it by parsing the cigar string for insertion and deletion symbols.
If you map using BBMap, you can use the
stoptag
flag to add the stop position to each read (prefixed byYS:i:
).Hi Brian, thanks for helpful answer. Can I ask one more question: I noticed Cigar strings have: M, I, D, P, N. To my understanding M means match, all the others are mismatches, INDELs. So I just need to make sum of the digits in front of the letters other than M, then plus the length of read and start position, this gives me the end position. Am I right? Thanks!
no. you need no skip somes bases with the clipping operator 'S' and 'H'. 'M' means that bases are aligned but there could be some mismatches (A vs G, G vs G...).
Hi, Pierre, Thanks for your comments, I need to be careful.