exact sam file information
2
0
Entering edit mode
10.1 years ago
biolab ★ 1.4k

Dear all,

I used Bowtie to map RNA-seq reads on the genome, and got a sam file. I also have a LIST file with many ranges, below shows an example. I would like to exact the reads that fall into the ranges in the LIST file from the sam file.

LIST file

Chr1  +  2-999
Chr1  +  2999-5000
Chr3  +  90-2000
......

I am new user of Bowtie,and can't figure out the format of sam file. If I can get the read start and stop position, I can write a simple script to check if the reads locate at the ranges in the LIST file. Could anyone help? Certainly if there exist available tools, that will simplify the job. Thanks in advance!

sam bowtie • 3.1k views
ADD COMMENT
1
Entering edit mode
10.1 years ago

http://samtools.github.io/hts-specs/SAMv1.pdf

To be succinct - look at column 3 and 4 for the position information.

ADD COMMENT
0
Entering edit mode

Thanks a lot Brian, your comment is really helpful. Column 3 is reference ID, and column 4 is the read start position on the ref. I have a further question: what' s the read end position on the ref in SAM file? Thanks!

ADD REPLY
1
Entering edit mode

It's not in the required columns. You have to calculate it by parsing the cigar string for insertion and deletion symbols.

If you map using BBMap, you can use the stoptag flag to add the stop position to each read (prefixed by YS:i:).

ADD REPLY
0
Entering edit mode

Hi Brian, thanks for helpful answer. Can I ask one more question: I noticed Cigar strings have: M, I, D, P, N. To my understanding M means match, all the others are mismatches, INDELs. So I just need to make sum of the digits in front of the letters other than M, then plus the length of read and start position, this gives me the end position. Am I right? Thanks!

ADD REPLY
1
Entering edit mode

no. you need no skip somes bases with the clipping operator 'S' and 'H'. 'M' means that bases are aligned but there could be some mismatches (A vs G, G vs G...).

ADD REPLY
0
Entering edit mode

Hi, Pierre, Thanks for your comments, I need to be careful.

ADD REPLY
0
Entering edit mode
10.1 years ago

Just change the format of your list file to match BED format (this is untested):

sed -e 's/\:/\t/' list.txt | awk 'BEGIN{OFS="\t"}{print $1, $3-1, $4,".",".",$2}' > list.bed

Convert your SAM file to BAM, sort and index it with samtools and then just:

samtools view -b -L list.bed sorted.bam > filtered.bam

filtered.bam now contains all of the alignments overlapping the regions in your list.

BTW, don't use bowtie to map possibly spliced reads to the genome, the results won't be very good.

ADD COMMENT
0
Entering edit mode

Hi Devon Thanks for your suggestions that are very helpful.

ADD REPLY

Login before adding your answer.

Traffic: 3368 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6