Hi, I have received fastq files containing the reads from Illumina MiSeq. Since they are paired-end, there is an R1 and an R2 file for each sample. So I expected to find reads beginning with our forward primer in the R1 files, and reads beginning with our reverse primer in the R2 (or vice versa). However, I find both in both; i.e. about half of the reads in the R1 files begin with the forward primer, and half with the reverse primer; and same with the R2s. I tried merging them, but this results in about half of the reads being reverse complemented, and this makes things more complicated downstream, so I would like them to all go in the same direction. I thought to grep for each of the primers, but because of ambiguities and some still having short tags on the beginning, I don't think it's going to work--plus I thought they weren't supposed to be mixed anyway...??? Maybe I don't understand this as well as I thought. Any ideas? Thanks.
Could you elaborate on how the library was prepared?
It sounds like your amplicon library was constructed by the standard Illumina method (i.e., adaptor ligation) and sequenced with standard Illumina (adaptor) primers. If so, then you'd expect a 50/50 mix of amplicon orientations. But @WouterDeCoster is correct, we'll need more details about library prep (e.g., what are the short tags to which you refer) to help you parse the data.
The primers used in the sequencing step anneal to the adaptors that you ligate to your fragmented DNA or cDNA.
The joining of these adaptors to the DNA fragments is fully random (it does not take direction into account), except when you are using a stranded transcriptomic protocol.
For genomic sequences, though, I am not aware of a protocol that will give you directional libraries.
Just plain (multiplex) PCR based enrichment & library prep can be directional.
So what are the other methods of amplicon library prep? What if I do not want the 50/50 mix of amplicon orientations?
PCR-based methods (as opposed to ligation) will produce directional libraries. You can either incorporate the Illumina adapter sequences into your amplicon primers, or add them via two rounds of PCR (first round with amplicon primers, second round with Illumina adapters + amplicon overhang).
Have you tried to scan the data with a trimming program? I suggest bbduk.sh from the BBMap suite. You may have inserts that are shorter than the read length. While you are at it, you could also use bbmerge.sh from the same suite to see what you get in terms of merging of R1/R2 reads.
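If the 50/50 orientation turns out to be expected from the library prep, one possible way to get every pair going in the same direction is to check which mate starts with the forward primer and swap the mates when it is R2. Below is a minimal Python sketch of that idea, not a tested pipeline. It handles the two things that make plain grep awkward in the original question: IUPAC ambiguity codes in the primers and a short tag in front of the primer. The function names, the `max_tag=8` limit, and the primer sequences in the usage note are all hypothetical, and the sketch assumes uncompressed FASTQ files with exactly four lines per record and mates in the same order in both files.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGTN]",
}


def primer_regex(primer, max_tag=8):
    """Compile a regex for the primer preceded by 0..max_tag tag bases.

    max_tag is an assumed upper bound on the tag length; adjust to your design.
    """
    body = "".join(IUPAC[base] for base in primer.upper())
    return re.compile(r"^[ACGTN]{0,%d}?%s" % (max_tag, body))


def orient_pair(seq1, seq2, fwd_re, rev_re):
    """Return (forward_mate, reverse_mate, swapped), or None if the
    expected primers are not found at the start of both mates."""
    if fwd_re.search(seq1) and rev_re.search(seq2):
        return seq1, seq2, False
    if rev_re.search(seq1) and fwd_re.search(seq2):
        return seq2, seq1, True
    return None


def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a 4-line-per-record FASTQ."""
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:
            return
        yield tuple(record)


def orient_files(r1_path, r2_path, out1_path, out2_path, fwd, rev):
    """Write pairs so that out1 always holds the forward-primer mate.

    Pairs in which the primers cannot be found are skipped and counted.
    Returns (kept_as_is, swapped, skipped).
    """
    fwd_re, rev_re = primer_regex(fwd), primer_regex(rev)
    kept = swapped = skipped = 0
    with open(r1_path) as r1, open(r2_path) as r2, \
            open(out1_path, "w") as o1, open(out2_path, "w") as o2:
        for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
            result = orient_pair(rec1[1], rec2[1], fwd_re, rev_re)
            if result is None:
                skipped += 1
                continue
            if result[2]:
                rec1, rec2 = rec2, rec1  # swap whole records, keeping qualities
                swapped += 1
            else:
                kept += 1
            o1.write("\n".join(rec1) + "\n")
            o2.write("\n".join(rec2) + "\n")
    return kept, swapped, skipped
```

It could be called as `orient_files("sample_R1.fastq", "sample_R2.fastq", "oriented_R1.fastq", "oriented_R2.fastq", "ACGYTGRA", "TTGMCANC")`, where the file names and both primers are placeholders for your own. Swapping the mates rather than reverse complementing them leaves each read as it came off the sequencer, so the quality strings stay valid and the merged reads should then all come out in the forward-primer orientation. The sketch requires exact matches apart from the ambiguity codes, so reads with sequencing errors in the primer region end up in the skipped count.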