One of the situation when bam to fastq conversion might be required:
Say, I have a problem where I want to take the reads that are aligned to chromosome 2 of one species. And, I want to see how much (fraction) of these same reads align to chr 2 of another species or population and with how much mismatches.
I used several method but wasn't able to get the number of reads in fastq files
as indicated in flagstat statistics
output. This bitwise method seems to be more appropriate, if people want to extract the exact amount of reads designated by samtools flagstat
; but if others have different opinion please suggest.
Step 01: Select reads from Chromosome 2
$ samtools view -h -b realigned_2ms04h.bam 2 > chr2.2ms04h.bam
# sort the bam alignment
$ samtools sort -n chr2.2ms04h.bam chr2.sorted.2ms04h.bam
# fix mate pairs/info
$ samtools fixmate chr2.sorted.2ms04h.bam chr2.fixmate.2ms04h.bam
# Let's check the simple statistics of the bam file
$ samtools flagstat chr2.fixmate.2ms04h.bam > chr2.fixmate.stats
# Stats output from "chr2.fixmate.stats" file:
7549722 + 0 in total (QC-passed reads + QC-failed reads)
1421823 + 0 duplicates
7549722 + 0 mapped (100.00%:-nan%)
7549722 + 0 paired in sequencing
3771662 + 0 read1
3778060 + 0 read2
6936646 + 0 properly paired (91.88%:-nan%)
6937102 + 0 with itself and mate mapped
612620 + 0 singletons (8.11%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Step 02: Convert the mate-fixed bam to fastq using one of the below methods
# A) using bam2fastq
$ bamToFastq -i chr2.fixmate.2ms04h.bam -fq chr2.2ms04h.R1 -fq2 chr2.2ms04h.R2
# B) using Picard
$ java -jar picard2.9.jar SamToFastq I=chr2.fixmate.2ms04h.bam FASTQ=R1.fq SECOND_END_FASTQ = R2.fq UNPAIRED_FASTQ = unPaired.fq
But, bam2fastq and picard
didn't work for me.
So, contd... Step 02 - C
If, you want to select only the properly paired reads use `bitwise flag -f 2`. if you want to remove duplicated reads use `bitwise flag -F 1024`
# 2-C (optional extra step) : Select properly paired (bitwise flag: -f 2) and remove duplicates (flag: -F 1024)
$ samtools view -b -f 2 -F 1024 chr2.fixmate.2ms04h.bam > chr2.Paired_DeDup.bam
$ samtools flagstat chr2.Paired_DeDup.bam > chr2.Paired_DeDup.stats
# Stats output from "chr2.Paired_DeDup.stats" file:
5629430 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
5629430 + 0 mapped (100.00%:-nan%)
5629430 + 0 paired in sequencing
3061503 + 0 read1
2567927 + 0 read2
5629430 + 0 properly paired (100.00%:-nan%)
5629430 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
# 2-C: Now, convert the bam files into Read1/2 matepairs, using bitwise flag
$ samtools view -uf64 chr2.Paired_DeDup.bam |samtools bam2fq - |gzip > chr2.2ms04h_R1.fq.gz
$ samtools view -uf128 chr2.Paired_DeDup.bam |samtools bam2fq - |gzip > chr2.2ms04h_R2.fq.gz
# Unzip the files and count the number of reads in R1, R2 read pairs
# Count
$ cat chr2.2ms04h_R1.fq | echo $((`wc -l`/4))
3061503
$ cat chr2.2ms04h_R2.fq | echo $((`wc -l`/4))
2567927
- You can see that the number of reads reported by
samtools flagstat
and reads (R1 and R2) obtained usingbitwise method
are the same. - If you didn't remove the the duplicates, singletons reads, those can also be extracted using bitwise flag and will be equal to the numbers reported in
samtools flagstat
file chr2.fixmate.stats.
Thanks for this tutorial! But do you know a way i can generate a file of only unpaired reads, based on the chr2.2ms04h_R1.fq and chr2.2ms04h_R2.fq files?