Question

exact sam file information

0

Entering edit mode

10.1 years ago

biolab ★ 1.4k

Dear all,

I used Bowtie to map RNA-seq reads on the genome, and got a sam file. I also have a LIST file with many ranges, below shows an example. I would like to exact the reads that fall into the ranges in the LIST file from the sam file.

LIST file

Chr1  +  2-999
Chr1  +  2999-5000
Chr3  +  90-2000
......

I am new user of Bowtie,and can't figure out the format of sam file. If I can get the read start and stop position, I can write a simple script to check if the reads locate at the ranges in the LIST file. Could anyone help? Certainly if there exist available tools, that will simplify the job. Thanks in advance!

sam bowtie • 3.1k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by biolab ★ 1.4k

Ram · Accepted Answer · 2014-11-25

1

Entering edit mode

10.1 years ago

Brian Bushnell 20k

http://samtools.github.io/hts-specs/SAMv1.pdf

To be succinct - look at column 3 and 4 for the position information.

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks a lot Brian, your comment is really helpful. Column 3 is reference ID, and column 4 is the read start position on the ref. I have a further question: what' s the read end position on the ref in SAM file? Thanks!

ADD REPLY • link 10.1 years ago by biolab ★ 1.4k

1

Entering edit mode

It's not in the required columns. You have to calculate it by parsing the cigar string for insertion and deletion symbols.

If you map using BBMap, you can use the stoptag flag to add the stop position to each read (prefixed by YS:i:).

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 10.1 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi Brian, thanks for helpful answer. Can I ask one more question: I noticed Cigar strings have: M, I, D, P, N. To my understanding M means match, all the others are mismatches, INDELs. So I just need to make sum of the digits in front of the letters other than M, then plus the length of read and start position, this gives me the end position. Am I right? Thanks!

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.1 years ago by biolab ★ 1.4k

1

Entering edit mode

no. you need no skip somes bases with the clipping operator 'S' and 'H'. 'M' means that bases are aligned but there could be some mismatches (A vs G, G vs G...).

ADD REPLY • link 10.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hi, Pierre, Thanks for your comments, I need to be careful.

ADD REPLY • link 10.1 years ago by biolab ★ 1.4k

Ram · Accepted Answer · 2014-11-26

0

Entering edit mode

10.1 years ago

Devon Ryan 105k

Just change the format of your list file to match BED format (this is untested):

sed -e 's/\:/\t/' list.txt | awk 'BEGIN{OFS="\t"}{print $1, $3-1, $4,".",".",$2}' > list.bed

Convert your SAM file to BAM, sort and index it with samtools and then just:

samtools view -b -L list.bed sorted.bam > filtered.bam

filtered.bam now contains all of the alignments overlapping the regions in your list.

BTW, don't use bowtie to map possibly spliced reads to the genome, the results won't be very good.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 10.1 years ago by Devon Ryan 105k

0

Entering edit mode

Hi Devon Thanks for your suggestions that are very helpful.

ADD REPLY • link 10.1 years ago by biolab ★ 1.4k