Question

I have more reads after extracting those that are mapped...

0

Entering edit mode

4.1 years ago

agtbeeman • 0

Hello here,

I have 10 samples :

   file1_R1.fq
   file1_R2.fq
   ...
   file10_R1.fq
   file10_R2.fq

And also a Trinity.fasta file. I would like to extract from my reads file, the paired reads that are mapped to Trinity.fasta.

I first ran bowtie (that shows the right amount of reads 33 983 651 *2 for file1)

Then I have sorted my 10 bam files using samtools, getting 10 sorted.bam files (like file1.sorted.bam ...)

Eventually I have run bam2fastq :

 for i in $(ls *.sorted.bam|sed 's/.sorted.bam//g'|sort -u)
 do
 bam2fastq --aligned --force --strict -o "$i"_mapped'#.fq' "$i".sorted.bam
 done

However when I look to the output file1_mapped.fq I have about 100 000 000 reads ! that is to say 4 times more than originally !! Does anyone can help me to solve this sorcery?

software error rna-seq alignment sequence • 2.6k views

ADD COMMENT • link updated 4.1 years ago by finswimmer 16k • written 4.1 years ago by agtbeeman • 0

score 1 · Answer 1 · 2020-10-30

1

Entering edit mode

4.1 years ago

GenoMax 147k

A read can multi-map to more than one location. Those are called secondary alignments. If you are not careful about excluding them when reconverting BAM files back to fastq reads you will end up with multiple copies of reads which can explain the phenomenon you observe.

Use appropriate option for bam2fastq to select only primary alignments for reads when doing the conversion. Or use A: Regarding filter primary alignment to pre-filter the BAM to get only primary alignments before conversion.

ADD COMMENT • link 4.1 years ago by GenoMax 147k

0

Entering edit mode

Ok thank you so I m doing : samtools view -F 256 input.bam > output.bam then I sort the bam and run bam2fastq.

Do you know if can use this command directly on my sorted.bam files ? That would make be gain a lot of time !

ADD REPLY • link 4.1 years ago by agtbeeman • 0

0

Entering edit mode

Mmmh I don't understand my output bam files are heavier than the input file ... And when I try to sort it I get error: "samtools sort: truncated file. Aborting" (edit it 's because the output is a sam fil i fixed it with -b option)

ADD REPLY • link 4.1 years ago by agtbeeman • 0

0

Entering edit mode

Most conversion programs require the reads to be name sorted when you do the BAM -> fastq conversion. You should be able to do all of this via samtools

samtools view -F 256 input.bam | samtools sort -n - | samtools fastq -1 R1.fastq.gz -2 R2.fastq.gz -

(check other options you need for fastq conversion)

ADD REPLY • link 4.1 years ago by GenoMax 147k

0

Entering edit mode

Thanks I have tried this exact command.

When i do count the number of reads for file3, I get 35354937 reads for R1 and 35328712 for R2 (shouldn't it be the same number ?).

Do you know to which reads it corresponds to ?( cf my bowtie stats bowtie.stats_file3 ).

Ideally I would like to only get the pairs that aligned exactly one time or more ! But I don't know how to modify my code for that.

ADD REPLY • link 4.1 years ago by agtbeeman • 0

0

Entering edit mode

Looks like you have some R1 reads in your file where the mate had not mapped. If you only want reads that are a pair then you can use repair.sh to remove singletons from R1 file. Here is a guide to use repair.sh.

ADD REPLY • link 4.1 years ago by GenoMax 147k

0

Entering edit mode

Honestly I am a bit lost. For me there is a problem in both R1 and R2 since I expect 62,31% +22,41 % of the total that is to say 22510215+8106913 = 30617128 reads in each R1 and R2 fq files.

I ran also a flagtest on the sorted bam (previous running samtools fastq) :

0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
70682805 + 0 mapped (100.00% : N/A)
70682805 + 0 paired in sequencing
35354093 + 0 read1
35328712 + 0 read2
61234256 + 0 properly paired (86.63% : N/A)
70248252 + 0 with itself and mate mapped
434553 + 0 singletons (0.61% : N/A)
7530870 + 0 with mate mapped to a different chr
3184550 + 0 with mate mapped to a different chr (mapQ>=5)

That is to say I feel like the last command extracts all reads and not only the paired. Besides in the properly paired line there is exactly the number of reads i expect (61234256 /2 = 30617128 ) !!

ADD REPLY • link 4.1 years ago by agtbeeman • 0

score 1 · Answer 2 · 2020-10-30

1

Entering edit mode

4.1 years ago

finswimmer 16k

Hello,

according to this docs bam2fastq assumes that each read is aligned only once in the bam file. This is rarely the case. There are always a bunch of reads that are mapped multiple times. So I guess bam2fastq will extract the same read, that is mapped multiple time, multiple times.

I would suggest you give samtools fastq a try.

fin swimmer

ADD COMMENT • link 4.1 years ago by finswimmer 16k

0

Entering edit mode

Thank you for the explanation ! However I don't see any mention of aligned options nor primary alignment in the doc you gave me !

ADD REPLY • link 4.1 years ago by agtbeeman • 0