After recently assembling a transcriptome, I took the reads used to generate the assembly and aligned them to the assembled contigs using BWA. The output of this process is a SAM file. I'd like to parse this SAM file to get a list of the # of reads map to each assembled contig. Before rolling my own script, I wanted to see if anyone is aware of a tool or easy method by which this information can be extracted.
Hello, (hopefully you will see this comment I know it's been like 8 years now since your answer was posted) but I wanted to ask you how did you set up your bwa-mem reference file so that the contigs (chromosomes in your case) had different lengths?
I'm running into a weird situation where I have a reference file with various CDS sequences all different lengths, yet when I run the samtools idxstats sp1.bam | head -n 5 command, the contigs are all the same length and always the length is for the longest CDS sequence in the file. Any ideas??
The contig ID is presumably in column 3 in the SAM file, so something like
cut -f3 file.sam | sort | uniq -c
should suffice, unless I misunderstand your question.
Now, this doesn't consider mapping qualities, SAM flags and so on, of course, but you can add a samtools call to the beginning and pipe the output into the "cut" command.
EDIT: If you have a BAM file (and a BAM index file) you can use "samtools idxstats" instead.
I use this all the time for sam files (pretty much the same as Mikael suggested)
grep -v "^@" file.sam | cut -f3 | sort | uniq -c | awk '{print $1 " " $2}'
You can pipe a sort after the awk to print by # matches: | sort -nk 1
Ah, I see what you mean. I was probably thinking of a SAM file without a header, which would produce only the desired output (number of reads per contig). However, if the SAM file has a header section, this would unfortunately also pick up the contig lengths from the header. It is easy to get around this as well on the command line, but probably simpler and safer to just use samtools idxstats or any of the other tools mentioned here. Sorry about that.
Agree with the above answers. But if you are working with BAM files generated using older versions of samtools (pre r595, circa June 2010), be aware that "samtools idxstats" and Picard's BamIdxStats will not correctly count reads. For example BamIndexStats reports zero reads:
Hello, (hopefully you will see this comment I know it's been like 8 years now since your answer was posted) but I wanted to ask you how did you set up your bwa-mem reference file so that the contigs (chromosomes in your case) had different lengths?
I'm running into a weird situation where I have a reference file with various CDS sequences all different lengths, yet when I run the samtools idxstats sp1.bam | head -n 5 command, the contigs are all the same length and always the length is for the longest CDS sequence in the file. Any ideas??