When I use samtools sort bam file, do i need to use samtool index the sorted bam file? what kind situation i need to use samtools index? Thanks!
When I use samtools sort bam file, do i need to use samtool index the sorted bam file? what kind situation i need to use samtools index? Thanks!
when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file
Based on your explanation you do, the following, or something similar:
samtools view myfile.sam -o myfile.bam
samtools sort myfile.bam -o myfile_sorted.bam
That's not wrong, but it's also not necessary. Note that you can do the following in one go:
samtools sort myfile.sam -o myfile_sorted.bam
Finally, often you can also have your aligner write directly to samtools sort:
bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam
As such you can avoid quite some wasted time and intermediate files.
Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right?
A bam is a binary (compressed) format. For some applications, e.g. visualization in IGV or variant calling the tool needs 'random access' to the bam file, it needs to be able to find reads without iterating through the entire file. That's what the index is for.
We don't know what your "later analysis" is (please try to be as informative as possible when asking questions), but either it will work (no index required) or it will fail (and tell you it needs an index). Index files are generally small and created quickly, so there is no need to worry about them. Just index your bam.
How to define " coordinate sorted"?
A bam is coordinate sorted if the reads are sorted by coordinates. So reads from the beginning of the first chromosome are first in the file. That's what samtools sort
is for. The opposite would be 'sorted by name' in which reads are sorted by their read ID. Aligners also generate unsorted bams.
Please read How To Ask Good Questions On Technical And Scientific Forums. It will help you formulate your questions better, and maybe even it will help you to solve your problems during a more complete elaboration of your questions.
From the samtools man page:
index samtools index [-bc] [-m INT] aln.bam|aln.cram [out.index]
Index a coordinate-sorted BAM or CRAM file for fast random
access. (Note that this does not work with SAM files even if
they are bgzip compressed — to index such files, use tabix(1)
instead.)
This index is needed when region arguments are used to limit
samtools view and similar commands to particular regions of
interest.
when i get the mapped sam file, i use samtools view to transfer this sam file to bam file, then i use samtools sort this bam file. Depending on your reply, if i do not index the bam file it will not effect my later analysis( first condition), do i understand right? How to define " coordinate sorted"?
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thank you for your nice reply. If i am right, once i get the "myfile_sorted.bam", I can filter it to get the mapped bam file? My later analysis is using picard to mark pcr duplicates.
That would be possible, but that's probably not necessary. I don't know what the aim is of your analysis though.
I do ChIP-seq analysis.Later i need the bam file to do peak calling.
You don't need to filter out unmapped reads, they'll be ignored by the peak caller (MACS2 will also ignore duplicates for you, so you don't actually need to mark them).
Yes, I prepare to use MACS2 to do peak calling. So your meaning is I just need sorted bam file(as below) to use MACS2?
$ bwa mem genome.fa reads.fastq | samtools sort -o myfile_sorted.bam
And do not underdstand why MACS2 can remove unmapped reads and duplicates, if i do not remove umapped reads and not mark the dups^^
I have another question. I see MACS2 process in two webpage. as follow : 1. https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionV/lessons/03_peak_calling_macs.html 2. https://github.com/taoliu/MACS/wiki/Advanced:-Call-peaks-using-MACS2-subcommands. do you think the first one can do all the things as the second one(the second has several subcommands and need to change certain parameters)?
Unmapped reads are marked in the file already, almost every program knows to skip them when appropriate. MACS2 keeps track of where reads start and internally tracks duplicates accordingly.
The link (callpeaks) is the only one you should typically use. It is itself going through the process described in the second link, which is something you should only ever do manually if absolutely needed.
Whether you filter blacklisted regions beforehand or not is up to you. You can also filter peaks overlapping blacklisted regions, which tends to be quicker.
Do you remove blacklist before you use MACS2?
I don't expect unmapped reads would interfere with that but I don't know the usual procedure of Chip-seq.