Question

Quantify the number of sequences in common paired-end sequencing data

0

15 months ago

sil_bioinfo ▴ 50

Hello,

I have two fastq files with the forward (Read1) and reverse (Read2) paired reads. How could I count the number of sequences in common between the Read1.fastq and Read2.fastq files? (I mean sequences that have the same SeqID.)

And, how could I count the number of sequences with Read1/Read2 overlapping?

Thank you in advance,

best regards,
Silvia

fastq paired-end • 1.4k views

ADD COMMENT • link updated 15 months ago by Ram 44k • written 15 months ago by sil_bioinfo ▴ 50

score 2 · Accepted Answer · 2023-09-15

2

15 months ago

GenoMax 148k

count the number of sequences in common between Read1.fastq and Read2.fastq files?

Are you looking for overlap by alignment? In theory there can be no common sequences, since R1 and R2 sample the two ends of each library fragment.

how could I count the number of sequences with Read1/Read2 overlapping?

You can use bbmerge.sh from the BBMap suite, FLASH, or similar tools, which merge the R1/R2 reads into a single longer read (if the reads actually overlap).
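
To make the idea of "overlapping" reads concrete, here is a minimal Python sketch. It is not how bbmerge.sh or FLASH work internally: real mergers tolerate mismatches and use quality scores, while this version only looks for an exact match between the end of R1 and the start of the reverse complement of R2. The function names and the `min_overlap` default are assumptions made for illustration.

```python
# Illustrative only: exact-match overlap detection between a read pair.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (uppercase A/C/G/T/N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_length(r1_seq, r2_seq, min_overlap=10):
    """Longest exact overlap between the 3' end of R1 and revcomp(R2).

    Returns 0 when no overlap of at least min_overlap bases is found.
    """
    r2_rc = reverse_complement(r2_seq)
    longest = min(len(r1_seq), len(r2_rc))
    # Try the longest candidate first, so the first hit is the best one.
    for n in range(longest, min_overlap - 1, -1):
        if r1_seq[-n:] == r2_rc[:n]:
            return n
    return 0


def count_overlapping(pairs, min_overlap=10):
    """Count (R1, R2) sequence pairs that overlap by at least min_overlap bases."""
    return sum(1 for r1, r2 in pairs if overlap_length(r1, r2, min_overlap) > 0)
```

A small `min_overlap` will report chance matches, so for real data a dedicated merging tool is the safer choice.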

ADD COMMENT • link 15 months ago by GenoMax 148k

0

By "sequences in common" I mean sequences whose SeqID appears in both Read1 and Read2. I don't know if I'm explaining myself well, sorry. I was thinking of a short Python script that, for each sequence in Read1.fastq, looks for it in Read2.fastq and counts it if it is found... But since I'm a bit new to bioinformatics I was wondering if someone could help me... thank you!

ADD REPLY • link 15 months ago by sil_bioinfo ▴ 50

1

Actually, each sequence in the R1 file should have a corresponding sequence in the R2 file. If they don't, then your files are out of sync and the two files may contain different numbers of sequences.

If you look at the Illumina fastq headers, the only thing that differs between the two headers is the read number, highlighted here with asterisks. The first sequence is from R1 and the second one is from R2.

@test:test:1:1101:49570:1019 *1*:N:0:AAGTACAG+GACGTGAC
GGGTCTTCTCGTCTTTTAAATAAATTTTAGCTTTTTGACTAAAAAATAAAATTCTATAAAAATTTTAAATGAAACA 


@test:test:1:1101:49570:1019 *2*:N:0:AAGTACAG+GACGTGAC
GGGTCTTCGCTAGCTAGCTAGCGCGAGCGCGATCGAGCTACGACTACAGCATTCTATAAAAATTTTAAATGAAACA
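
Since the part of the header before the first space is identical for both mates, shared reads can be counted by comparing those IDs. Below is a minimal Python sketch. It assumes plain four-line fastq records (no wrapped sequences or blank lines inside the file), and the function names are made up for illustration.

```python
def read_ids(lines):
    """Collect read IDs from an iterable of fastq lines (4 lines per record)."""
    ids = set()
    for i, line in enumerate(lines):
        if i % 4 != 0:  # only the first line of each record is a header
            continue
        line = line.strip()
        if not line:
            continue
        # Drop the leading '@' and everything after the first whitespace,
        # i.e. the " 1:N:0:..." / " 2:N:0:..." part that differs between mates.
        read_id = line[1:].split()[0]
        # Older header styles mark the mate with a trailing /1 or /2 instead.
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        ids.add(read_id)
    return ids


def compare_ids(r1_lines, r2_lines):
    """Count IDs present in both files and IDs present in only one of them."""
    r1, r2 = read_ids(r1_lines), read_ids(r2_lines)
    return {
        "shared": len(r1 & r2),
        "only_r1": len(r1 - r2),
        "only_r2": len(r2 - r1),
    }
```

With real files, open file handles can be passed directly, for example `compare_ids(open("Read1.fastq"), open("Read2.fastq"))`. For gzipped files, `gzip.open(path, "rt")` can be used instead of `open`. Non-zero `only_r1` or `only_r2` counts mean the files are out of sync.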

ADD REPLY • link 15 months ago by GenoMax 148k

0

Well, I forgot to say that the Read1 and Read2 fastq files were filtered by quality (in Read1, reads with a mean Phred quality score lower than 20 were removed; in Read2, reads with a mean score lower than 15). So the Read1 and Read2 fastq files now contain different numbers of sequences, and I would like to know how many sequences they still have in common.

ADD REPLY • link 15 months ago by sil_bioinfo ▴ 50

1

I figured we were going in this direction but wanted to be sure. You need to use a different utility from the BBMap suite, called repair.sh, which will bring your R1/R2 files back in sync and remove the singletons.

See how to use it here: How to use BBtools repair.sh on multiple files

In future, please scan/trim paired-end data files together so you don't end up in this situation.

ADD REPLY • link 15 months ago by GenoMax 148k

0

Oh, I understand.

Then, just to be sure: to get the "sequences in common", running BBMap/repair.sh is enough, right? And to get the "sequences where R1/R2 overlap", I can run BBMap/repair.sh first and then BBMap/bbmerge.sh, right?

Sorry for the confusion, I'm new here. Next time I will share some data to make it easier. Thank you in advance

ADD REPLY • link 15 months ago by sil_bioinfo ▴ 50

2

Correct. Repair the files in the first step to get the R1/R2 sequences back in sync, setting the singletons aside. Then run the merge on the "re-paired" files to merge the pairs that are able to merge.
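
The two steps could be scripted roughly as follows. This is a sketch under assumptions: repair.sh and bbmerge.sh must be on the PATH, the output file names and the `prefix` argument are hypothetical, and the parameters used (`in`, `in2`, `out`, `out2`, `outs`, `outu`, `outu2`) should be checked against the help text of your installed BBTools version.

```python
import subprocess


def repair_cmd(r1, r2, prefix="repaired"):
    """Build the repair.sh call: re-synced pairs plus a singletons file."""
    return [
        "repair.sh",
        f"in={r1}",
        f"in2={r2}",
        f"out={prefix}_R1.fastq",
        f"out2={prefix}_R2.fastq",
        f"outs={prefix}_singletons.fastq",
    ]


def merge_cmd(prefix="repaired"):
    """Build the bbmerge.sh call that runs on the re-paired files."""
    return [
        "bbmerge.sh",
        f"in={prefix}_R1.fastq",
        f"in2={prefix}_R2.fastq",
        f"out={prefix}_merged.fastq",
        f"outu={prefix}_unmerged_R1.fastq",
        f"outu2={prefix}_unmerged_R2.fastq",
    ]


def run_pipeline(r1, r2, prefix="repaired"):
    """Run repair first, then merge; stop if either tool exits with an error."""
    for cmd in (repair_cmd(r1, r2, prefix), merge_cmd(prefix)):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("Read1.fastq", "Read2.fastq")` would run both steps in order. The log output of each tool should include the summary counts of interest (pairs and singletons from the repair step, joined reads from the merge step).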