Hi all,
I am new to genomic analysis and to this website, so sorry in advance if this question has been answered before.
I have aligned a FASTQ file to the human genome using TopHat. For the downstream analysis, I need to generate a BAM file containing only reads of the same length. How can I generate this file from the TopHat output file (accepted_hits.bam)?
Thanks so much!
Thanks a lot, Ashutosh, for your thoughts on this and for the information about GATK. You are right about the alternative splicing software (rMATS). I did start with a FASTQ file in which all the reads were of the same length. But when I looked at the CIGAR values in the BAM file generated by TopHat, they were not all the same. Do you think this is fine?
If you started with reads of the same length, then you should be fine. CIGAR strings may differ between reads of the same length. For example, a 60 nt read may be represented by 1) "60M" if the alignment contains no indels (M covers both matches and mismatches), or 2) "50M1I9M" if the read contains one extra nucleotide relative to the reference genome. In both cases the operations that consume read bases add up to 60. So don't worry about it. I am moving all comments to an answer; please accept it if you think your question has been answered.
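To check this yourself, here is a minimal Python sketch that computes the read length implied by a CIGAR string and keeps only SAM records of a given length. The function names are made up for illustration, and it assumes the alignments are available as SAM text lines (for example, from samtools view -h). Per the SAM specification, only the M, I, S, = and X operations consume read bases, so the N (intron) operations that a spliced aligner writes do not change the read length.

```python
import re

# CIGAR operations that consume bases of the read (query), per the SAM spec.
QUERY_CONSUMING_OPS = set("MIS=X")
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def read_length_from_cigar(cigar):
    """Return the read length implied by a CIGAR string such as '50M1I9M'."""
    return sum(
        int(count)
        for count, op in CIGAR_TOKEN.findall(cigar)
        if op in QUERY_CONSUMING_OPS
    )


def keep_reads_of_length(sam_lines, length):
    """Yield SAM header lines plus the alignment lines whose SEQ has `length` bases.

    Hypothetical helper: records with SEQ stored as '*' are dropped unless length is 1.
    """
    for line in sam_lines:
        if line.startswith("@"):  # header lines are passed through unchanged
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields[9]) == length:  # column 10 of a SAM record is SEQ
            yield line


# Different CIGAR strings, same 60 nt read length.
for cigar in ("60M", "50M1I9M", "30M500N30M", "5S55M"):
    assert read_length_from_cigar(cigar) == 60
```

If some reads did turn out to have a different length, keep_reads_of_length could be applied to the SAM text stream and the result converted back to BAM with samtools view -b.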
Sounds good! Thanks again for all the information.