I am working with paleogenomic data. I have BAM files from two datasets that were processed and sequenced differently (one was partially UDG-treated, the other was not UDG-treated at all; the latter was sequenced on a newer machine).
Most of my comparisons between the two datasets are plagued by batch effects. One idea my adviser asked me to try is to clip the first 5-10 bp off all reads in every BAM file. The problem is that neither of us knows how to do that.
Anyone here know of a way to trim reads within a BAM file?
I do have the FASTQs, and adapters were already trimmed. The idea here is to uniformly clip additional bases to make sure the damaged bases are removed.
We are not quite sure what is causing the batch effect. It could be quality-score related; there is a lot more variation in the partially UDG-treated set than in the later untreated set. When I look at rates of allele sharing, I see more sharing within a batch than between individuals with similar ancestry across the two batches.
Try something like this on your FASTQ files: fastx-toolkit's fastx_trimmer will let you remove the first/last N bases with -f and -l. Then realign into BAM files and see if it helps.
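If installing fastx-toolkit is a hassle, the clipping step itself is easy to script. Here is a minimal Python sketch of the same idea (the function names are my own, not from any library) that hard-clips the first N bases, and the matching quality values, from each FASTQ record:

```python
# Hard-clip the first N bases (and matching quality scores) from every
# FASTQ record -- the same thing fastx_trimmer does with -f N+1
# (its -f flag is the first base to KEEP).

def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a FASTQ file handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        yield (header,
               handle.readline().rstrip(),
               handle.readline().rstrip(),
               handle.readline().rstrip())

def clip_records(records, n=10):
    """Drop the first n bases and quality values from each record."""
    for header, seq, plus, qual in records:
        yield header, seq[n:], plus, qual[n:]

# Example on a single in-memory record:
rec = ("@read1", "ACGTACGTACGTTTTT", "+", "FFFFFFFFFFFFFFFF")
clipped = next(clip_records([rec], n=10))
# clipped[1] is now "GTTTTT"
```

Then realign the clipped FASTQs and rebuild your BAMs the same way you did before, so the two batches stay comparable.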
Also, you should run your FASTQ files through a tool like FastQC and aggregate the results with MultiQC. That will give you statistics on things like per-base sequence content, over-represented sequences, and quality-score distributions.
An additional suggestion: identify a few variants that are shared within a batch (batch-specific alleles), load your BAM files in IGV, and go to those positions. Do the variant-supporting bases tend to fall in the first 5-10 bp of the reads, or are they spread all over? That would help you decide whether clipping is appropriate. You could also script it: pull all reads spanning a batch-specific variant out of the BAM files, compute the distance between the allele position and the aligned read start, and see whether the distribution is flat or skewed toward the read ends.
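The scripted version of that check could look something like this — a rough sketch, assuming you have already extracted the aligned start/end and strand of each variant-supporting read from the BAMs (e.g. with pysam or samtools view); the function names and the skew threshold are made up for illustration:

```python
def offsets_from_read_start(variant_pos, reads):
    """
    reads: list of (aligned_start, aligned_end, is_reverse) tuples,
    0-based with end-exclusive coordinates, as you'd get from a BAM.
    Returns the distance from the variant to each spanning read's
    5' end. For reverse-strand reads the 5' end is the *higher*
    aligned coordinate, which matters because aDNA deamination
    damage concentrates at the 5' ends of molecules.
    """
    offs = []
    for start, end, is_reverse in reads:
        if not (start <= variant_pos < end):
            continue  # read doesn't span the variant
        if is_reverse:
            offs.append(end - 1 - variant_pos)
        else:
            offs.append(variant_pos - start)
    return offs

def is_skewed_to_read_ends(offsets, window=10, frac=0.5):
    """Crude check: does more than `frac` of the variant support sit
    within the first `window` bases of the reads?"""
    if not offsets:
        return False
    near_end = sum(1 for o in offsets if o < window)
    return near_end / len(offsets) > frac
```

If the offsets pile up below ~10, clipping should remove most of the batch-specific signal; if they are spread evenly along the reads, the batch effect is probably not damage-driven and clipping alone won't fix it.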