Hello all,
I have a large Illumina dataset which consists of an approximately 300 bp forward run and an approximately 300 bp reverse run which, in theory, should be concatenated to form 600 bp reads. My plan is to reverse complement the reverse reads and concatenate them to the forward reads, but I am having a difficult time achieving this. I am primarily an R programmer; however, this dataset is too large for an R script (it takes too much computing time). I have also considered using the bash paste command, but I am unable to get the script to work properly. The same goes for Python.
I also have concerns about the quality of the reads at each end, as Illumina reads tend to be of poorer quality toward the ends of the sequences. I have attempted to use Trimmomatic, but if I am not mistaken, this tool is designed for overlapping paired-end reads. The same goes for fastq-join and other tools. I have found a thread or two on this topic via googling, but nothing has produced proper results.
Basically, I have two fastq files (R1=forward and R2=reverse). How can I concatenate these together to create one fastq file (R3) which is the result of each read from R1 concatenated to the same read in R2? Is this even the proper approach?
You can't merge those reads unless they have a real overlap in the middle (R1 ---<-->---- R2). Depending on the length of the overlap, you would get an extended read that would be 600 - (overlap bases) in length. There are premade tools that do this merging for you; BBMerge (from BBMap) and FLASH are examples you should try.
If those tools are unable to merge the reads then it is possible that you had inserts that were longer than the sequencing length (> 600 bp) and the reads can't be merged directly.
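To make the length arithmetic above concrete, here is a minimal Python sketch. It is not how BBMerge or FLASH work internally: it only looks for the longest exact match between the end of R1 and the start of the reverse-complemented R2, whereas real mergers tolerate mismatches and weigh base qualities. The function names and the `min_overlap` default are made up for illustration.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_exact_overlap(r1, r2, min_overlap=10):
    """Merge R1 with reverse-complemented R2 on their longest exact overlap.

    Returns the merged sequence, or None when no exact overlap of at
    least min_overlap bases exists (for example, insert longer than
    the combined read length).
    """
    r2_rc = reverse_complement(r2)
    longest_possible = min(len(r1), len(r2_rc))
    # Try the longest candidate overlap first, then shrink.
    for k in range(longest_possible, min_overlap - 1, -1):
        if r1[-k:] == r2_rc[:k]:
            # Merged length is len(r1) + len(r2) - k.
            return r1 + r2_rc[k:]
    return None
```

With two 300 bp reads and a 50 bp overlap, the merged read would be 300 + 300 - 50 = 550 bp. When the function returns None, the pair behaves like the "insert longer than the sequencing length" case described above.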
1) What is your reason for concatenating the R1 and R2 sequences?
2) Trimmomatic (and other read trimmers) do not require overlapping reads. The data can be trimmed by a variety of criteria (base quality, adapter removal, fixed length).
3) Have you assessed the quality of your data using FastQC or similar?
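To illustrate point 2, quality trimming operates on each read independently, so no overlap between R1 and R2 is needed. The sketch below is a simplified stand-in, not Trimmomatic's actual code: it removes bases from the 3' end of a single read while their Phred score is below a threshold, assuming Phred+33 encoded qualities. The function name and the default threshold of 20 are illustrative assumptions.

```python
def trim_trailing(seq, qual, min_q=20, offset=33):
    """Drop 3' bases whose Phred score is below min_q.

    seq and qual are the sequence and quality strings of one FASTQ
    record; offset=33 assumes Phred+33 encoding.
    """
    end = len(seq)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

The same function would be applied to the R1 and R2 records separately, before any attempt to merge or concatenate them.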
That depends on what you want to do downstream with the data, but most often there is no need to concatenate R1 and R2...
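If a downstream tool really does require concatenated reads, the plan from the original question can be sketched in plain Python as below. This is only a sketch under several assumptions: both files use the standard four-lines-per-record FASTQ layout, the reads are in the same order in R1 and R2, and files ending in `.gz` are gzip-compressed. All function names are hypothetical. It streams record by record, so memory use stays small regardless of file size.

```python
import gzip
from itertools import zip_longest

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def open_fastq(path, mode="rt"):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def read_id(header):
    """Read name without a trailing /1 or /2 mate suffix."""
    token = header.split()[0]
    return token[:-2] if token.endswith(("/1", "/2")) else token


def concatenate_pairs(r1_path, r2_path, out_path):
    """Write R1 + reverse-complemented R2 for every pair to out_path."""
    with open_fastq(r1_path) as f1, open_fastq(r2_path) as f2, \
            open_fastq(out_path, "wt") as out:
        for rec1, rec2 in zip_longest(read_fastq(f1), read_fastq(f2)):
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 have different numbers of reads")
            (h1, s1, q1), (h2, s2, q2) = rec1, rec2
            if read_id(h1) != read_id(h2):
                raise ValueError(f"Read names differ: {h1} vs {h2}")
            joined_seq = s1 + s2.translate(_COMPLEMENT)[::-1]
            # Qualities of R2 are reversed to follow the reversed bases.
            joined_qual = q1 + q2[::-1]
            out.write(f"{h1}\n{joined_seq}\n+\n{joined_qual}\n")
```

A call such as `concatenate_pairs("R1.fastq.gz", "R2.fastq.gz", "R3.fastq.gz")` (file names are placeholders) would produce the combined file. Keep in mind that when the reads do not overlap, the junction in each concatenated read hides an unsequenced gap of unknown length, which is one reason concatenation is usually avoided.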