Losing reads from bam to fastq
2
0
Entering edit mode
5.2 years ago
JC • 0

Hello everyone,

Nowadays I am using bowtie to align some small sequences (27 - 31 nucleotides) to a reference sequence. When I write:

grep "^>" Sequences.fa  -c
52772

I get 52772 sequences in Sequences.fa However after mapping to the reference sequence and extracting all the reads I obtain a file with fewer reads:

grep "^>" Sequences_after_mapping.fa  -c
52503

The commands that I wrote to align Sequences.fa was the following:

bowtie -f -a --best --strata -v 0 Inputs/Reference_sequence -f Sequences.fa --sam Sequences.sam
samtools view -Sb Sequences.sam  > Sequences.bam
rm -rf Sequences.sam
samtools bam2fq Sequences.bam | seqtk seq -A - > Sequences_after_mapping.fa

And finally the report that I get from the script it's the following:

# reads processed: 52772
# reads with at least one reported alignment: 827 (1.57%)
# reads that failed to align: 51945 (98.43%)
Reported 827 alignments to 1 output stream(s)
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 52772 reads

I am not able to find out what happened for getting fewer sequences. I would appreciate any help that you can bring me.

PD: The input file is a fasta file, my bowtie version is 1.1.2 and samtools 1.9.

PD2: If I remove the option --strata the number of sequences do not change.

ANSWER: Finally, I noticed that the fasta sequences in Sequences.fa should have different names for each sequence if not samtools extract whatever it wants. Thank you to everyone for all.

alignment • 2.6k views
ADD COMMENT
0
Entering edit mode

I think bowtie is not the issue because 827+51945=52772

To find what cause the problem, how many reads do you get from samtools bam2fq input.bam > output.fastq

And did you tried:

samtools flagstat aln.sorted.bam

ADD REPLY
0
Entering edit mode

From samtools bam2fq input.bam > output.fastq I obtain 52503 reads. And here it's the output of samtools flagstat:

52772 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
827 + 0 mapped (1.57% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLY
0
Entering edit mode

How do you get to that 52503 number? what's the command you use to count them?

ADD REPLY
0
Entering edit mode

The input file is a fasta file

Then why are you using samtools bam2fq? Why not samtools fasta?

Have you also looked at this:

 To get all non-supplementary/secondary reads in a single file, redirect the output:
   samtools bam2fq -F 0x900 in.bam > all_reads.fq
ADD REPLY
0
Entering edit mode

Because I didn't know that tool. Anyway samtools fasta and samtools fasta -F 0x900 in.bam > all_reads.fa outputs the same number of reads (52503) that are less than the original fasta (52772).

ADD REPLY
0
Entering edit mode

Are you missing reads that did not align from the output? You could try this option to see if you recover the reads in that file?

--un <filename> Write all reads that could not be aligned to a file with name <filename>. Written reads will appear as they did in the input, without any of the trimming or translation of quality values that may have taken place within Bowtie. Paired-end reads will be written to two parallel files with _1 and _2 inserted in the filename, e.g., if <filename> is unaligned.fq, the #1 and #2 mates that fail to align will be written to unaligned_1.fq and unaligned_2.fq respectively. Unless --max is also specified, reads with a number of valid alignments exceeding the limit set with the -m option are also written to <filename>.

ADD REPLY
0
Entering edit mode

As gb said before I think that bowtie it is not the problem. Anyway I did what you said:

bowtie -f -a --un --best -v 0 Inputs/Reference_sequence -f Sequences.fa --sam Sequences.sam
samtools fasta Sequences.sam > All_Sequences.fa

And outputs 52503 again.

ADD REPLY
0
Entering edit mode

I think you need to provide a file name for the --un option to save those reads. something like --un unmapped_reads.fa. See if reads end up in that file.

ADD REPLY
1
Entering edit mode

maybe it is this?

-s FILE Write singleton reads to FILE

For samtools fasta

http://www.htslib.org/doc/samtools.html

ADD REPLY
0
Entering edit mode
5.2 years ago

How about:

--strata           hits in sub-optimal strata aren't reported (requires --best)
ADD COMMENT
0
Entering edit mode

Removing --strata do not change the number of sequences.

ADD REPLY
0
Entering edit mode
5.2 years ago
jean.elbers ★ 1.7k

Maybe try bazam https://github.com/ssadedin/bazam

java -jar /path/to/bazam/jar/bazam.jar -bam  Sequences.bam | /path/to/seqtk/seqtk seq -A - > Sequences_after_mapping.fa
ADD COMMENT

Login before adding your answer.

Traffic: 1288 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6