Hello,
I have two FASTQ files with the forward (Read1) and reverse (Read2) paired reads. How can I count the number of sequences in common between the Read1.fastq and Read2.fastq files (that is, sequences that have the same SeqID in both)?
And how can I count the number of sequences where Read1 and Read2 overlap?
Thank you in advance,
best regards,
Silvia
By "sequences in common" I mean sequences with a SeqID in Read1 that also appears in Read2. I don't know if I'm explaining myself well, sorry. I think it could be done with a small Python script that, for each sequence in Read1.fastq, looks for it in Read2.fastq and counts it if it is found. But since I'm a bit new to bioinformatics, I was wondering if someone could help me. Thank you!
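The lookup described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: both files are uncompressed, every record is exactly four lines long, and a read's ID is the first whitespace-delimited token of its header, with any trailing `/1` or `/2` removed. The function names `read_ids` and `count_common_ids` are made up for this illustration.

```python
from itertools import islice


def read_ids(fastq_path):
    """Return the set of read IDs in a FASTQ file (assumes 4 lines per record)."""
    ids = set()
    with open(fastq_path) as handle:
        # In a strict 4-line FASTQ, headers are lines 0, 4, 8, ...
        for header in islice(handle, 0, None, 4):
            header = header.strip()
            if not header:
                continue  # tolerate a trailing blank line
            # Drop the leading '@' and everything after the first whitespace,
            # so "@id 1:N:0:1" and "@id 2:N:0:1" give the same ID.
            read_id = header[1:].split()[0]
            # Older headers mark the mate with a /1 or /2 suffix instead.
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            ids.add(read_id)
    return ids


def count_common_ids(r1_path, r2_path):
    """Count read IDs that are present in both FASTQ files."""
    return len(read_ids(r1_path) & read_ids(r2_path))
```

Calling `count_common_ids("Read1.fastq", "Read2.fastq")` would return the number of shared IDs. Both ID sets are held in memory, so very large files may need a different approach.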
Actually, each sequence in the R1 file should have a corresponding sequence in the R2 file. If they don't, then your files are out of sync and there may be an unequal number of sequences in the two files.
If you look at the Illumina FASTQ headers, the only thing that differs between the two headers of a pair is the read number: 1 in the header from R1 and 2 in the header from R2.
Well, I forgot to say that the Read1 and Read2 FASTQ files were filtered by quality (reads with a mean Phred quality score lower than 20 were removed from Read 1, and reads with a score lower than 15 were removed from Read 2). So now the Read1 and Read2 FASTQ files contain different numbers of sequences, and I would like to know how many sequences they still have in common.
I figured we were going in this direction but wanted to be sure. You need to use a different utility from the BBMap suite called repair.sh, which will bring your R1/R2 files back in sync and remove the singletons. See how to use it here: How to use BBtools repair.sh on multiple files
In future, please scan/trim paired-end data files together so you don't end up in this situation.
Oh, I understand.
Then, just to be sure: to get the "sequences in common", running BBMap/repair.sh is enough, right? And to get the "sequences where R1/R2 overlap", I can run BBMap/repair.sh first and then BBMap/bbmerge.sh, right?
Sorry for the confusion, I'm new here. Next time I will share some data to make it easier. Thank you in advance
Correct. Repair the files in the first step to get the R1/R2 sequences back in sync, setting singletons aside. Then merge the ones that are able to merge, using the "re-paired" files.
Perfect, thank you so much for your help!!
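For anyone landing on this thread later, the repair-then-merge workflow could be driven from Python roughly as follows. This is a hedged sketch rather than a tested recipe: it assumes repair.sh and bbmerge.sh are on the PATH, that the inputs are uncompressed four-line FASTQ files, and that the `in1`/`in2`/`out1`/`out2`/`outs` and `out`/`outu1`/`outu2` parameters behave as described in each tool's help text (worth checking against your installed version). The output file names and the helper names `build_commands`, `run_pipeline` and `count_records` are invented for this example.

```python
import subprocess


def build_commands(r1, r2, prefix="repaired"):
    """Return the repair.sh and bbmerge.sh command lines as argument lists."""
    fixed1 = f"{prefix}_R1.fastq"
    fixed2 = f"{prefix}_R2.fastq"
    # Step 1: put R1/R2 back in sync and set reads without a mate aside.
    repair = [
        "repair.sh", f"in1={r1}", f"in2={r2}",
        f"out1={fixed1}", f"out2={fixed2}",
        f"outs={prefix}_singletons.fastq",
    ]
    # Step 2: merge the re-paired reads that overlap; keep the rest unmerged.
    merge = [
        "bbmerge.sh", f"in1={fixed1}", f"in2={fixed2}",
        f"out={prefix}_merged.fastq",
        f"outu1={prefix}_unmerged_R1.fastq",
        f"outu2={prefix}_unmerged_R2.fastq",
    ]
    return repair, merge


def run_pipeline(r1, r2, prefix="repaired"):
    """Run both steps in order, stopping if either tool exits with an error."""
    for cmd in build_commands(r1, r2, prefix):
        subprocess.run(cmd, check=True)


def count_records(fastq_path):
    """Count records in an uncompressed FASTQ file with 4 lines per record."""
    with open(fastq_path) as handle:
        return sum(1 for _ in handle) // 4
```

After `run_pipeline("Read1.fastq", "Read2.fastq")`, `count_records("repaired_R1.fastq")` would give the number of pairs still in common, and `count_records("repaired_merged.fastq")` the number of pairs that bbmerge.sh was able to merge.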