Question

How to prepare a bam file containing reads of same length?

0

Entering edit mode

9.7 years ago

mmitra ▴ 60

Hi all,

I am new to genomic analysis and to this website, so sorry in advance if this question has been answered before.

I have aligned a fastq file to the human genome using tophat. For the downstream analysis, I need to generate a bam file containing the reads of same length. How could I generate this file from the tophat output file (accepted_reads.bam)?

Thanks so much!

RNA-Seq tophat Bam • 3.7k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 9.7 years ago by mmitra ▴ 60

Ram · Accepted Answer · 2015-08-29

2

Entering edit mode

9.7 years ago

Ashutosh Pandey 12k

I would discourage trimming reads in the bam file (post alignment). This is because it wont be as straightforward as trimming the read and the corresponding quality string, but you will also have to update the CIGAR string. I would suggest you to trim your reads to a fixed length before the alignment and realign them. That would be much easier. That being said, GATK ClipReads (see --cyclesToTrim option) may help you but I would still advise you to trim reads and realign. See below for the link to GATK: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_readutils_ClipReads.php

PS: I guess this is some alternative splicing detection software that requires bam files with fixed length sequences.

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 9.7 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Thanks a lot Ashutosh for your thoughts on this and for providing information about GATK. You are right about the alternative splicing software (rMATS). I actually did start with a fastq file where all the reads were of same length. But when I looked at the CIGAR values for bam file generated by tophat, they were not the same. Do you think this should be fine?

ADD REPLY • link 9.7 years ago by mmitra ▴ 60

0

Entering edit mode

If you started with reads of same length then you should be fine. CIGAR strings may be different for reads with same length. For example, a 60 nt read may be represented by a cigar string of 1) "60M" in case no indel were introduced in the alignments or pure match or mostly matches and a few mismatches OR 2) "50M1I9M" in case there was an extra nucleotide present in the read wrt to the reference genome. So don't worry about it. You should be totally fine. I am moving all comments to answer, please accept them if you think you have got your answers.

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 9.7 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Sounds good! Thanks again for all the information.

ADD REPLY • link 9.7 years ago by mmitra ▴ 60