I an trying to filter low quality reads in bam file using python's pysam. I have used the code from here.
I have modified this code a little and the whole code is shown below, but the code is not giving any bam file ,
import argparse,pysam,re,sys
def FilterReads(in_file, out_file):
def read_ok(read):
"""
read_ok - reject reads with a low quality (<30) base call
read - a PySam AlignedRead object
returns: True if the read is ok
"""
if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
return False
else:
return True
_BASE_QUAL_CUTOFF = 30
bam_in = pysam.Samfile(in_file, 'rb')
bam_out = pysam.Samfile(out_file, 'wb', template=bam_in)
out_count = 0
for read in bam_in.fetch():
if read_ok(read):
bam_out.write(read)
out_count += 1
print 'reads_written =', out_count
bam_out.close()
bam_in.close()
def GetArgs():
"""
GetArgs - read the command line
returns - an input bam file name and teh output filtered bam file
"""
def ParseArgs(parser):
class Parser(argparse.ArgumentParser):
def error(self, message):
sys.stderr.write('error: %s\n' % message)
self.print_help()
sys.exit(2)
parser = Parser(description='Calculate PhiX Context Specific Error Rates.')
parser.add_argument('-b', '--bam_file',
type=str,
required=True,
help='Input Bam file.')
parser.add_argument('-o', '--output_file',
type=str,
required=True,
help='Output Bam file.')
return parser.parse_args()
parser = argparse.ArgumentParser()
args = ParseArgs(parser)
return args.bam_file, args.output_file
def Main():
bam_file, output_file = GetArgs()
FilterReads(bam_file, output_file)
if __name__ == '__main__':
Main()
I think you need to explain in more detail what you are trying to accomplish. Are you trying to reject any read with any base that has a quality score under 30? If so, I suggest you rethink your approach, because I can't imagine a scenario where that's a good idea.
Thanks Bushnell, Yes I want to reject reads having quality score < 30. I want to do it using python's pysam. I you have a better approach please share it.
No, I don't have a better approach, because I think removing any read with any base under Q30 is a terrible idea and will lead to extreme sequence-specific coverage bias. I have written tools to remove reads with average quality (based on expected error rates from the quality scores) below a certain level, but that should be used conservatively to avoid sequence-specific bias. I cannot imagine a scenario where it would be a good idea to remove every read that has a single base below a certain quality score, so I won't suggest a way to do it unless you can explain why you want to do so. I suggest you quality-trim the reads and reject reads with length under a specified value prior to mapping. You can do so like this, with the BBMap package (assuming interleaved reads):
bbduk.sh in=reads.fq out=trimmed.fq qtrim=r trimq=12 minlen=125
You can set trimq to 30 if you want, but again, I cannot imagine a situation where that would be a good idea. For this command, minlen should be set to some number less than your read length; e.g. for 150bp reads, maybe set it to 100.
I am developing a variant caller for detecting heteroplasmy in ngs datasets which will take bam file as input according to workflow. As the link is not working so I have added an image , see the galaxy naive variant caller portion in that figure For calculating heteroplasmy they filtered bases having <30 score.
It might be a terrible idea in some cases, but when working with ancient DNA, which might have sequencing errors due to oxidative and hidrolytic damage as to contamination, you might want to trim bases below certain quality (usually <30), anyways, i just write this in order to enrich the discussion, i don't know what kind of data ammarsabir is (was) working with.
if in further data handling you have a vcf step, it might be easier to just keep all the data in the bam file, and then trim the reads you want to using vcftools, like following:
there's no need to put any extention in the last file name, since the tool writes a default extention.
cheers.