Firs of all, I'm aware I have seen this topic in several web pages but I don't understand the correct procedure and the output differs from between tools.
Could someone give me a hand to understand it or give me hints on what I am doing wrong, please?
My data: BAM files with overlaying (desired) regions from a BED file, extracted from another BAM file previously mapped to a reference and sorted, Paired-end data. I used samtools view -h -b -L regions_of_interest.bed sample_mapped_sorted.bam > Sample_regions_selected.bam
I took one file as example. Using samtools flagstat
on this file the output is:
26169 + 0 in total (QC-passed reads + QC-failed reads)
955 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
25876 + 0 mapped (98.88% : N/A)
25214 + 0 paired in sequencing
12590 + 0 read1
12624 + 0 read2
21966 + 0 properly paired (87.12% : N/A)
24628 + 0 with itself and mate mapped
293 + 0 singletons (1.16% : N/A)
2311 + 0 with mate mapped to a different chr
1019 + 0 with mate mapped to a different chr (mapQ>=5)
Now, I want to obtain fastq sequences from this file. I have tried 4 tools, so far:
1.Bamtools
bamtools convert
But, It only put all the read in one file. Although it put All reads.
I used egrep '@' Sample1_regions_selected.bam | wc -l
= 26169. Same number reported by samtools.
2. Samtools
samtools bam2fq -1 Pair1_samtools.fastq -2 Pair2_samtools.fastq -s UNpaired_samtools.fastq Sample_regions_selected.bam
The results from each file are:
egrep '@' Pair1_samtools.fastq | wc -l => 379
egrep '@' Pair2_samtools.fastq | wc -l => 379
egrep '@' UNpaired_samtools.fastq | wc -l => 24456
The sum of this is 25214, then I have 955 missing reads, that I don't understand how to obtain.
3.BEDtools bedtools bamtofastq -i Sample_regions_selected_SbyNAme.bam -fq Pair1_bedtools.fastq -fq2 Pair2_bedtools.fastq
Previously, BAM file was sorted by name , as it is required.samtools sort -n Sample_regions_selected.bam -o Sample_regions_selected_SbyNAme.bam
The error file is still showing a warning note in 25376 lines e.g.:
*WARNING: Query D00564:54:C93FEANXX:4:1210:13236:29478 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
And, It doesn't have the option to output unpaired reads (or I don't know how to do it).
Now, the data obtained in fastq files:
egrep '@' Pair1_bedtools.fastq | wc -l => 11470
egrep '@' Pair2_bedtools.fastq | wc -l => 11470
Total number of reads = 22940. Again reads missing = 3229.
4.Picard
I used:
java -jar -Xmx17G picard.jar SamToFastq \
INPUT=Sample_regions_selected.bam \
FASTQ=Pair1_picard.fastq \
SECOND_END_FASTQ=Pair2_picard.fastq \
UNPAIRED_FASTQ=UNpaired_picard.fastq \
INCLUDE_NON_PF_READS=true \
VALIDATION_STRINGENCY=LENIENT`
So: because of the last flag the "Error" got was "Ignoring SAM validation error: ERROR: Found 2524 unpaired mates"
And the reads obtained were:
egrep '@' Pair1_picard.fastq | wc -l => 11345
egrep '@' Pair2_picard.fastq | wc -l => 11345
egrep '@' UNpaired_picard.fastq | wc -l => 0
For a total of 22690 paired-end reads, and if you sum the ERROR unpaired mates this will give 25214 so there is still the 955 reads "missing". And but I don't know how to obtain them... Perhaps, I should also do a sorting by name before input the BAM file?
Thanks in advance to anyone how can help me to understand if this procedure was right, I miss something, or just this is the output....
Cheers.
Hi Chris,
Just to check whether I understand the procedure well. What I should do is: 1). Obtain the ID of the reads in the desired region from the bam file. 2). Look for (extract) those IDs from the original bam file that contains all the mapped reads. 3). And finally use one of the conversion tools that I tried above to obtain the fastq files (but you recommended bedtools, with a bam name-sorted)
If so... I did it... then the new bam file with
samtools flagstat
is:30712 + 0 in total (QC-passed reads + QC-failed reads) 1938 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 30398 + 0 mapped (98.98% : N/A) 28774 + 0 paired in sequencing 14387 + 0 read1 14387 + 0 read2 22510 + 0 properly paired (78.23% : N/A) 28146 + 0 with itself and mate mapped 314 + 0 singletons (1.09% : N/A) 5196 + 0 with mate mapped to a different chr 2352 + 0 with mate mapped to a different chr (mapQ>=5)
As you recommended I used bedtools, It still show the error. But I checked the files and the reads are included.
And the same number for the R2.
I think this is a GREAT approach to include all the paired-end in a desired region. And thank you!!! for pushing me in this direction.
However, I'm still wonder why still the number of reads obtained with flagstat differs from the number of reads in the fastaq file!
Sounds like you are still skipping reads that have only one of the pairs, then. You can either put those back in manually (by finding read names that only appear once, then converting to a non-paired fastq file to go along with your paired data), or try to figure out why they're still missing. Are those read names paired in your original bam file?
Hi Chis, I think, I found the issue in bedtools. It seems that, if the read was mapped in multiple locations, and the number of times this read appears is higher than double (e.i 4, 5..etc). Bedtools interpret them as "another" paired-end couple. And therefore, It appears duplicated in the sample.
In my case, the maximum number of times a read was counted (aligned) was 5, and I found 138 duplicated reads. I'm pretty sure that, if I would had a read aligned 6 or 7 times, this would appear double duplicated in the bedtools file.
I'm going to use picard or TopHat to convert from bam, to fastq.
Again, Thanks so much for your help!!
Makes sense - good detective work!
Yes, it weird.
I check the bam file with
samtools view subset_complete_sorted.bam | cut -f1 > readnames
and then sort anduniq -u readnames
and It didn't throw out anything.I also run Picard and this time the "unpaired mates ERROR" didn't showed up. But the Number of reads are the same as the samtools flagstat said 14387 in both files. There are 138 reads that BEDtools is "over-recognizing".