Question

samtools sort bam file

3

Entering edit mode

6.9 years ago

mikysyc2016 ▴ 120

When I use samtools sort bam file, do i need to use samtool index the sorted bam file? what kind situation i need to use samtools index? Thanks!

ChIP-Seq • 82k views

ADD COMMENT • link updated 6.9 years ago by WouterDeCoster 48k • written 6.9 years ago by mikysyc2016 ▴ 120

0

Entering edit mode

6.9 years ago

h.mon 35k

Please read How To Ask Good Questions On Technical And Scientific Forums. It will help you formulate your questions better, and maybe even it will help you to solve your problems during a more complete elaboration of your questions.

From the samtools man page:

 index     samtools index [-bc] [-m INT] aln.bam|aln.cram [out.index]

     Index  a  coordinate-sorted  BAM or CRAM file for fast random
     access.  (Note that this does not work with SAM files even if
     they are bgzip compressed — to index such files, use tabix(1)
     instead.)

     This index is needed when region arguments are used to  limit
     samtools  view  and similar commands to particular regions of
     interest.

ADD COMMENT • link 6.9 years ago by h.mon 35k

0

Entering edit mode

when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file. Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right? How to define " coordinate sorted"?

ADD REPLY • link 6.9 years ago by mikysyc2016 ▴ 120

score 28 · Accepted Answer · 2018-06-09

28

Entering edit mode

6.9 years ago

WouterDeCoster 48k

when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file

Based on your explanation you do, the following, or something similar:

samtools view myfile.sam -o myfile.bam
samtools sort myfile.bam -o myfile_sorted.bam

That's not wrong, but it's also not necessary. Note that you can do the following in one go:

samtools sort myfile.sam -o myfile_sorted.bam

Finally, often you can also have your aligner write directly to samtools sort:

bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam

As such you can avoid quite some wasted time and intermediate files.

Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right?

A bam is a binary (compressed) format. For some applications, e.g. visualization in IGV or variant calling the tool needs 'random access' to the bam file, it needs to be able to find reads without iterating through the entire file. That's what the index is for.

We don't know what your "later analysis" is (please try to be as informative as possible when asking questions), but either it will work (no index required) or it will fail (and tell you it needs an index). Index files are generally small and created quickly, so there is no need to worry about them. Just index your bam.

How to define " coordinate sorted"?

A bam is coordinate sorted if the reads are sorted by coordinates. So reads from the beginning of the first chromosome are first in the file. That's what samtools sort is for. The opposite would be 'sorted by name' in which reads are sorted by their read ID. Aligners also generate unsorted bams.

ADD COMMENT • link 6.9 years ago by WouterDeCoster 48k

0

Entering edit mode

Thank you for your nice reply. If i am right, once i get the "myfile_sorted.bam", I can filter it to get the mapped bam file? My later analysis is using picard to mark pcr duplicates.

ADD REPLY • link 6.9 years ago by mikysyc2016 ▴ 120

0

Entering edit mode

I can filter it to get the mapped bam file?

That would be possible, but that's probably not necessary. I don't know what the aim is of your analysis though.

ADD REPLY • link 6.9 years ago by WouterDeCoster 48k

0

Entering edit mode

I do ChIP-seq analysis.Later i need the bam file to do peak calling.

ADD REPLY • link 6.9 years ago by mikysyc2016 ▴ 120

1

Entering edit mode

You don't need to filter out unmapped reads, they'll be ignored by the peak caller (MACS2 will also ignore duplicates for you, so you don't actually need to mark them).

ADD REPLY • link 6.9 years ago by Devon Ryan 105k

0

Entering edit mode

Yes, I prepare to use MACS2 to do peak calling. So your meaning is I just need sorted bam file(as below) to use MACS2?

$ bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam

And do not underdstand why MACS2 can remove unmapped reads and duplicates, if i do not remove umapped reads and not mark the dups^^

I have another question. I see MACS2 process in two webpage. as follow : 1. https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionV/lessons/03_peak_calling_macs.html 2. https://github.com/taoliu/MACS/wiki/Advanced:-Call-peaks-using-MACS2-subcommands. do you think the first one can do all the things as the second one(the second has several subcommands and need to change certain parameters)?

ADD REPLY • link 6.9 years ago by mikysyc2016 ▴ 120

1

Entering edit mode

Unmapped reads are marked in the file already, almost every program knows to skip them when appropriate. MACS2 keeps track of where reads start and internally tracks duplicates accordingly.

The link (callpeaks) is the only one you should typically use. It is itself going through the process described in the second link, which is something you should only ever do manually if absolutely needed.

Whether you filter blacklisted regions beforehand or not is up to you. You can also filter peaks overlapping blacklisted regions, which tends to be quicker.

ADD REPLY • link 6.9 years ago by Devon Ryan 105k

0

Entering edit mode

Do you remove blacklist before you use MACS2?

ADD REPLY • link 6.9 years ago by mikysyc2016 ▴ 120

0

Entering edit mode

I don't expect unmapped reads would interfere with that but I don't know the usual procedure of Chip-seq.

ADD REPLY • link 6.9 years ago by WouterDeCoster 48k