I am wondering whether there is a way to filter the reads in the sam/bam files by the alignment direction using samtools or other tools? for example, I want to filter out all the reads facing to each other (-> <-) in the mapping result of a mate pair library.
yes, since in principle, the reads in the mate pair library should be <- ->. However, some paired end reads with direction of -> <- were sequenced. For my research, I just want to use the 'true mate pair reads'. I want to filter out those reads are PE reads based on their direction
I am not sure what you mean here. I used bwa sampe for the mapping and the results from bwa were first converted by bam format and sorted by default parameters of samtools
Well you cannot rely on the flags, that we know. not if you do samtools sort on read names, not coordinate, you will have every pair with its mate immediately after it. Then you can figure out which are within reasonable distance plus opposite strands.
There's no need to resort for that. Just check the flag (for orientation), rnext (to ensure the same chromosome) and pnext (mapping position) fields. This would be trivial to script.
yes, it is quite tricky to estimate the insertion size, for some libraries, the insertion size estimated by bwa is around 3 kb which fits our expectation. However, for some libraries, the range of the insertion size is quite big, from few hundreds bp to 5 kb. This may indicate the mate pair library has lots of PE reads.
The following program is an example of an efficient approach to use, using the samtools C API. The program itself is just a quick hack, but you can use it to get started. The program takes a SAM or BAM input and will write accepted reads to a file called "filtered.bam". Reads are accepted according to the following criterion: (1) a read and its mate must map to the same chromosome/contig, (2) the mates must have opposite orientations, and (3) the mates must point away from each other. I've added some comments, in case you're not very familiar with the API or C. You'd want to make this more robust if you were to actually use it!
#include "sam.h"
int main(int argc, char* argv[]) {
samfile_t *ifile = NULL, *ofile = NULL;
bam1_t *read = bam_init1();
int keep = 0;
char *p = NULL;
//Open input file, either SAM or BAM
p = strrchr(argv[1], '.');
if(strcmp(p, ".bam") == 0) {
ifile = samopen(argv[1], "rb", NULL);
} else {
ifile = samopen(argv[1], "r", NULL);
}
//Open output file
ofile = samopen("filtered.bam", "wb", ifile->header);
//Iterate through the lines
while(samread(ifile, read) > 1) {
keep = 0;
//Is the read's mate on the same chromosome/contig?
if(read->core.tid == read->core.mtid) {
//Are the mates on opposite strands?
if(read->core.flag & BAM_FREVERSE && !(read->core.flag & BAM_FMREVERSE)) {
if(read->core.pos < read->core.mpos) keep=1;
} else if(!(read->core.flag & BAM_FREVERSE) && read->core.flag & BAM_FMREVERSE) {
if(read->core.mpos < read->core.pos) keep=1;
}
}
if(keep) samwrite(ofile, read);
}
bam_destroy1(read);
samclose(ifile);
samclose(ofile);
return 0;
}
Thank you for the script. I think this will serve my purpose to filter the reads from bam file. But I am not able to run the script. Can you tell me what are the command line arguments we need to pass to run this program.
you mean, face each other and they are pairs ? Is your bam file sorted by read name ?
yes, since in principle, the reads in the mate pair library should be <- ->. However, some paired end reads with direction of -> <- were sequenced. For my research, I just want to use the 'true mate pair reads'. I want to filter out those reads are PE reads based on their direction
have they been sorted on a per name basis ?
I am not sure what you mean here. I used bwa sampe for the mapping and the results from bwa were first converted by bam format and sorted by default parameters of samtools
Well you cannot rely on the flags, that we know. not if you do samtools sort on read names, not coordinate, you will have every pair with its mate immediately after it. Then you can figure out which are within reasonable distance plus opposite strands.
There's no need to resort for that. Just check the flag (for orientation), rnext (to ensure the same chromosome) and pnext (mapping position) fields. This would be trivial to script.
How do you get the orientation of the mate ?
The 0x20 flag status.
yes ! forgot that one !
@Shaohua can you script in pysam or code in bamtools ?
If the aligner that you used sets the "properly paired" flag correctly, then just filter by that "samtools -f 2 ...".
that depends on the size of the insert in between. I do not know how an aligner will do insert estimate for a mate pair lib.
yes, it is quite tricky to estimate the insertion size, for some libraries, the insertion size estimated by bwa is around 3 kb which fits our expectation. However, for some libraries, the range of the insertion size is quite big, from few hundreds bp to 5 kb. This may indicate the mate pair library has lots of PE reads.