BAM to Fastq using picard and others from a desired regions
1
1
Entering edit mode
8.0 years ago

Firs of all, I'm aware I have seen this topic in several web pages but I don't understand the correct procedure and the output differs from between tools.

Could someone give me a hand to understand it or give me hints on what I am doing wrong, please?


My data: BAM files with overlaying (desired) regions from a BED file, extracted from another BAM file previously mapped to a reference and sorted, Paired-end data. I used samtools view -h -b -L regions_of_interest.bed sample_mapped_sorted.bam > Sample_regions_selected.bam

I took one file as example. Using samtools flagstat on this file the output is:

26169 + 0 in total (QC-passed reads + QC-failed reads)
955 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
25876 + 0 mapped (98.88% : N/A)
25214 + 0 paired in sequencing
12590 + 0 read1
12624 + 0 read2
21966 + 0 properly paired (87.12% : N/A)
24628 + 0 with itself and mate mapped
293 + 0 singletons (1.16% : N/A)
2311 + 0 with mate mapped to a different chr
1019 + 0 with mate mapped to a different chr (mapQ>=5)


Now, I want to obtain fastq sequences from this file. I have tried 4 tools, so far:

1.Bamtools
bamtools convert
But, It only put all the read in one file. Although it put All reads.
I used egrep '@' Sample1_regions_selected.bam | wc -l = 26169. Same number reported by samtools.

2. Samtools samtools bam2fq -1 Pair1_samtools.fastq -2 Pair2_samtools.fastq -s UNpaired_samtools.fastq Sample_regions_selected.bam The results from each file are:

egrep '@' Pair1_samtools.fastq | wc -l            => 379
egrep '@' Pair2_samtools.fastq | wc -l            => 379
egrep '@' UNpaired_samtools.fastq | wc -l         => 24456

The sum of this is 25214, then I have 955 missing reads, that I don't understand how to obtain.

3.BEDtools bedtools bamtofastq -i Sample_regions_selected_SbyNAme.bam -fq Pair1_bedtools.fastq -fq2 Pair2_bedtools.fastq Previously, BAM file was sorted by name , as it is required.samtools sort -n Sample_regions_selected.bam -o Sample_regions_selected_SbyNAme.bam The error file is still showing a warning note in 25376 lines e.g.:

*WARNING: Query D00564:54:C93FEANXX:4:1210:13236:29478 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

And, It doesn't have the option to output unpaired reads (or I don't know how to do it).
Now, the data obtained in fastq files:

egrep '@' Pair1_bedtools.fastq | wc -l              => 11470
egrep '@' Pair2_bedtools.fastq | wc -l              => 11470

Total number of reads = 22940. Again reads missing = 3229.

4.Picard
I used:

java -jar -Xmx17G picard.jar SamToFastq \
INPUT=Sample_regions_selected.bam \
FASTQ=Pair1_picard.fastq \
SECOND_END_FASTQ=Pair2_picard.fastq \
UNPAIRED_FASTQ=UNpaired_picard.fastq \
INCLUDE_NON_PF_READS=true \
VALIDATION_STRINGENCY=LENIENT`

So: because of the last flag the "Error" got was "Ignoring SAM validation error: ERROR: Found 2524 unpaired mates"
And the reads obtained were:

egrep '@' Pair1_picard.fastq | wc -l          => 11345
egrep '@' Pair2_picard.fastq | wc -l          => 11345
egrep '@' UNpaired_picard.fastq | wc -l          => 0

For a total of 22690 paired-end reads, and if you sum the ERROR unpaired mates this will give 25214 so there is still the 955 reads "missing". And but I don't know how to obtain them... Perhaps, I should also do a sorting by name before input the BAM file?

Thanks in advance to anyone how can help me to understand if this procedure was right, I miss something, or just this is the output....

Cheers.

next-gen alignment gene • 3.8k views
ADD COMMENT
2
Entering edit mode
8.0 years ago

If I understand your question correctly, it boils down to "How can I extract all reads within a region and their paired reads (which may be mapped elsewhere)?"

See these two previous answers:

Once you have that bam (and have name-sorted it), I find that bedtools is often the simplest way to convert bam to fastq (with bedtools bamtofastq). It should not throw errors because you will have both members of each pair.

ADD COMMENT
0
Entering edit mode

Hi Chris,

Just to check whether I understand the procedure well. What I should do is: 1). Obtain the ID of the reads in the desired region from the bam file. 2). Look for (extract) those IDs from the original bam file that contains all the mapped reads. 3). And finally use one of the conversion tools that I tried above to obtain the fastq files (but you recommended bedtools, with a bam name-sorted)

If so... I did it... then the new bam file with samtools flagstat is:

30712 + 0 in total (QC-passed reads + QC-failed reads) 1938 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 30398 + 0 mapped (98.98% : N/A) 28774 + 0 paired in sequencing 14387 + 0 read1 14387 + 0 read2 22510 + 0 properly paired (78.23% : N/A) 28146 + 0 with itself and mate mapped 314 + 0 singletons (1.09% : N/A) 5196 + 0 with mate mapped to a different chr 2352 + 0 with mate mapped to a different chr (mapQ>=5)

As you recommended I used bedtools, It still show the error. But I checked the files and the reads are included.

egrep '@' subset_bedtools_R1.fastq | wc -l          => 14525

And the same number for the R2.

I think this is a GREAT approach to include all the paired-end in a desired region. And thank you!!! for pushing me in this direction.

However, I'm still wonder why still the number of reads obtained with flagstat differs from the number of reads in the fastaq file!

ADD REPLY
0
Entering edit mode

Sounds like you are still skipping reads that have only one of the pairs, then. You can either put those back in manually (by finding read names that only appear once, then converting to a non-paired fastq file to go along with your paired data), or try to figure out why they're still missing. Are those read names paired in your original bam file?

ADD REPLY
1
Entering edit mode

Hi Chis, I think, I found the issue in bedtools. It seems that, if the read was mapped in multiple locations, and the number of times this read appears is higher than double (e.i 4, 5..etc). Bedtools interpret them as "another" paired-end couple. And therefore, It appears duplicated in the sample.

In my case, the maximum number of times a read was counted (aligned) was 5, and I found 138 duplicated reads. I'm pretty sure that, if I would had a read aligned 6 or 7 times, this would appear double duplicated in the bedtools file.

I'm going to use picard or TopHat to convert from bam, to fastq.

Again, Thanks so much for your help!!

ADD REPLY
0
Entering edit mode

Makes sense - good detective work!

ADD REPLY
0
Entering edit mode

Yes, it weird.

I check the bam file with samtools view subset_complete_sorted.bam | cut -f1 > readnames and then sort and uniq -u readnames and It didn't throw out anything.

I also run Picard and this time the "unpaired mates ERROR" didn't showed up. But the Number of reads are the same as the samtools flagstat said 14387 in both files. There are 138 reads that BEDtools is "over-recognizing".

ADD REPLY

Login before adding your answer.

Traffic: 1641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6