Reformatting Merged and Unmerged Paired-End Files
1
0
Entering edit mode
8 months ago

Hello everyone,

I had paired-end data which I merged using fastp and performed several prepocessing analysis on them. As a result, I ended up with merged, unmerged_paired_1, and unmerged_paired_2 files.

Now, I want to reformat these processed files back into paired-end format. How can I do this correctly?

So assuming for one sample:

merged_reads : 1,200,000
unmerged_reads_1 : 1,800,000
unmerged_reads_2 : 1,800,000

I want to get paired-end format again.

reformat.sh in=merged_reads out1=forward.reads out2=reverse.reads
cat forward.reads unmerged_reads_1  > final_paired_1.reads
cat forward.reads unmerged_reads_2  > final_paired_2.reads

Is that technically correct? If not, do you have any suggestions to do it in a correct way. Thank you.

illumina fastp fastqc bbmap • 555 views
ADD COMMENT
0
Entering edit mode
8 months ago
GenoMax 147k

In theory this is correct. I would add a repair.sh step at the end to make sure reads are in sync (in terms of order) across the two final files.

ADD COMMENT
0
Entering edit mode

Thank you GenoMax for your answer. I want to add some details concerning this puzzling and extend the topic. Maybe It is a bit out of the topic but related to my question anyway.

Here, I follow the procedure what I explained and I mapped two different format to a reference using bowtie2 and salmon. Here are the results for different format for the real data.

For merged and unmerged format:
bowtie2 -x reference.fa -U merged.reads -S merged.sam 
bowtie2 -x reference.fa -1 unmerged_reads_1 -2 unmerged_reads_2 -S unmerged_paired.sam
samtools sort -o merged_sorted.bam merged.sam
samtools sort -o paired_sorted.bam paired.sam
bamtools -in merged_sorted.bam -in paired_sorted.bam -out merged_paired_sorted.bam

flagstat results of merged_paired_sorted.bam:
97540614 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
52284671 + 0 mapped (53.60% : N/A)
65027076 + 0 paired in sequencing
32513538 + 0 read1
32513538 + 0 read2
24309546 + 0 properly paired (37.38% : N/A)
31709744 + 0 with itself and mate mapped
2677439 + 0 singletons (4.12% : N/A)
3827400 + 0 with mate mapped to a different chr
1828106 + 0 with mate mapped to a different chr (mapQ>=5)

and here same reformat paired end reads I used for salmon-mapper. I know bowtie2 and salmon are totally different aligner but I am confused at that point. Here only one paired was considered: 48770307

salmon quant -i reference.index --meta --libType IU -1 final_paired_1.reads -2 final_paired_2.reads -o output

Log File:

Observed 48770307 total fragments (48770307 in most recent round)   
[2024-01-18 19:57:45.215] [jointLog] [info] Computed 2,960,309 rich equivalence classes for further processing
[2024-01-18 19:57:45.215] [jointLog] [info] Counted 26,417,982 total reads in the equivalence classes
[2024-01-18 19:57:45.257] [jointLog] [warning] 0.000508506% of fragments were shorter than the k used to build the index.
If this fraction is too large, consider re-building the index with a smaller k.
The minimum read size found was 15.


[2024-01-18 19:57:45.257] [jointLog] [info] Number of mappings discarded because of alignment score : 30,255,045
[2024-01-18 19:57:45.257] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 12,287,142
[2024-01-18 19:57:45.257] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2024-01-18 19:57:45.257] [jointLog] [info] Number of fragments discarded because they have only dovetail (discordant) mappings to valid targets : 1,348,537
[2024-01-18 19:57:45.257] [jointLog] [info] Mapping rate = 54.1682%

In bowtie2 result, when I divided total reads,97540614/2, I get the same number as Salmon says. So, why should I expect Salmon not give me 97540614 for initial reads but bowtie2 does? That's why I was afraid I made a mistake about reformating the reads but I think It is not about reformatting issue. What do you think?

ADD REPLY
0
Entering edit mode

This is indeed off topic for original question. Programs may account of input reads in different ways.

I suggest that you verify that reads are in sync in your files first. If they are not then the entire results above is null and void.

ADD REPLY

Login before adding your answer.

Traffic: 2797 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6