Question

Filter low quality reads in bam files

1

Entering edit mode

8.4 years ago

ammarsabir15 ▴ 70

I an trying to filter low quality reads in bam file using python's pysam. I have used the code from here.

I have modified this code a little and the whole code is shown below, but the code is not giving any bam file ,

import argparse,pysam,re,sys

def FilterReads(in_file, out_file):

    def read_ok(read):
          """
          read_ok - reject reads with a low quality (<30) base call
          read - a PySam AlignedRead object
          returns: True if the read is ok
          """
          if any([ord(c)-33 < _BASE_QUAL_CUTOFF for c in list(read.qual)]):
                 return False
          else:
                return True

_BASE_QUAL_CUTOFF = 30

bam_in = pysam.Samfile(in_file, 'rb')
bam_out = pysam.Samfile(out_file, 'wb', template=bam_in)

out_count = 0
for read in bam_in.fetch():
    if read_ok(read):
        bam_out.write(read)
        out_count += 1

print 'reads_written =', out_count

bam_out.close()
bam_in.close()


def GetArgs():
"""
GetArgs - read the command line
returns - an input bam file name and teh output filtered bam file
"""

def ParseArgs(parser):
    class Parser(argparse.ArgumentParser):
        def error(self, message):
            sys.stderr.write('error: %s\n' % message)
            self.print_help()
            sys.exit(2)

    parser = Parser(description='Calculate PhiX Context Specific Error Rates.')
    parser.add_argument('-b', '--bam_file',
                        type=str,
                        required=True,
                        help='Input Bam file.')
    parser.add_argument('-o', '--output_file',
                        type=str,
                        required=True,
                        help='Output Bam file.')
    return parser.parse_args()

parser = argparse.ArgumentParser()
args = ParseArgs(parser)

return args.bam_file, args.output_file


def Main():
        bam_file, output_file = GetArgs()
FilterReads(bam_file, output_file)

 if __name__ == '__main__':
       Main()

next-gen python pysam • 4.8k views

ADD COMMENT • link updated 8.4 years ago by Brian Bushnell 20k • written 8.4 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

I think you need to explain in more detail what you are trying to accomplish. Are you trying to reject any read with any base that has a quality score under 30? If so, I suggest you rethink your approach, because I can't imagine a scenario where that's a good idea.

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks Bushnell, Yes I want to reject reads having quality score < 30. I want to do it using python's pysam. I you have a better approach please share it.

ADD REPLY • link 8.4 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

No, I don't have a better approach, because I think removing any read with any base under Q30 is a terrible idea and will lead to extreme sequence-specific coverage bias. I have written tools to remove reads with average quality (based on expected error rates from the quality scores) below a certain level, but that should be used conservatively to avoid sequence-specific bias. I cannot imagine a scenario where it would be a good idea to remove every read that has a single base below a certain quality score, so I won't suggest a way to do it unless you can explain why you want to do so. I suggest you quality-trim the reads and reject reads with length under a specified value prior to mapping. You can do so like this, with the BBMap package (assuming interleaved reads):

bbduk.sh in=reads.fq out=trimmed.fq qtrim=r trimq=12 minlen=125

You can set trimq to 30 if you want, but again, I cannot imagine a situation where that would be a good idea. For this command, minlen should be set to some number less than your read length; e.g. for 150bp reads, maybe set it to 100.

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

0

Entering edit mode

I am developing a variant caller for detecting heteroplasmy in ngs datasets which will take bam file as input according to workflow. As the link is not working so I have added an image , see the galaxy naive variant caller portion in that figure For calculating heteroplasmy they filtered bases having <30 score.

ADD REPLY • link 8.4 years ago by ammarsabir15 ▴ 70

0

Entering edit mode

It might be a terrible idea in some cases, but when working with ancient DNA, which might have sequencing errors due to oxidative and hidrolytic damage as to contamination, you might want to trim bases below certain quality (usually <30), anyways, i just write this in order to enrich the discussion, i don't know what kind of data ammarsabir is (was) working with.

if in further data handling you have a vcf step, it might be easier to just keep all the data in the bam file, and then trim the reads you want to using vcftools, like following:

vcftools --vcf yourfile.vcf --minQ 30 --recode --recode-INFO-all --out yourtrimmedfile

there's no need to put any extention in the last file name, since the tool writes a default extention.

cheers.

ADD REPLY • link 7.2 years ago by ricfoz ▴ 100

score 1 · Answer 1 · 2016-12-07

I recommend completely ignoring the Academy of Sciences, if they suggest ignoring all reads with a single base under Q30. That's a recipe for disaster, since all platforms have sequence-related quality; so, your filtered reads would be extremely biased.

You can, if you wish, trim low-quality reads with the BBMap package like this:

bbduk.sh in=reads.fq out=trimmed.fq qtrim=r trimq=12

That will optimally trim the low-quality tails of reads, while biasing the output as little as possible. It takes no special precautions to avoid bias, but none of the operations add nearly as much bias as a simple filter for reads with a single base under Q30 would. You can, of course, set trimq to 30 and get bad data, but I don't recommend it.