Hi all,
I've got a bit of a weird result with some reads I've simulated. I've used BBMap's randomreads.sh to simulate some Illumina paired-end reads, and as I'm playing around with introducing errors I decided to run FastQC and MultiQC on the data. Weirdly, my reads are showing a large amount of Illumina universal adaptor on the 3' end of the reads, yet I haven't added any in at any stage. I've searched the randomreads.sh docs and it doesn't seem to add anything in automatically, so I'm left pondering where these adaptors have come from.
Does anyone know if randomreads.sh adds any in automatically during simulation? Or is there a chance the reference genome from which I'm simulating the reads contains adaptors and that's where they're coming from? The reference is very high quality and was put together by the IWGSC so I doubt it, but you never know I guess. Or is there a chance this is a false positive hit by FastQC?
Any insight is greatly appreciated.
Hi GenoMax,
Yeah, when I saw those parameters I figured randomreads.sh was unlikely to be the culprit, but I'm completely at a loss. Here's an image from the MultiQC report for a bunch of reads I simulated.
If that plot is for 6 samples then it is suspiciously uniform/overlapping. Can you take the sequence of the Illumina universal adapter and do a quick check with your reference to see if there are identical hits for that? (seqkit grep would be useful.)
It's for 3 samples but 6 read files as they're paired-end. I'll go ahead and run that check now and get back to you when it's done.
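For anyone who wants to see what that check amounts to without seqkit, here is a rough Python sketch of the same idea: scan each reference sequence for the adaptor motif on both strands and report where it occurs. The motif below is an assumption (the 12-base sequence that, as far as I know, FastQC lists as "Illumina Universal Adapter"), so swap in whichever sequence you are actually testing. The function names are made up for this example.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: the motif FastQC reports as "Illumina Universal Adapter".
# Replace it with the exact sequence you want to check.
ADAPTER = "AGATCGGAAGAG"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_all(seq: str, motif: str) -> List[int]:
    """Return 0-based start positions of every (overlapping) exact match."""
    hits, start = [], seq.find(motif)
    while start != -1:
        hits.append(start)
        start = seq.find(motif, start + 1)
    return hits


def adapter_hits(lines: Iterable[str], motif: str = ADAPTER) -> Dict[str, Dict[str, List[int]]]:
    """Map each record name to motif positions on the + and - strands."""
    motif = motif.upper()
    rc = reverse_complement(motif)
    result = {}
    for name, seq in read_fasta(lines):
        seq = seq.upper()  # references are often soft-masked in lowercase
        result[name] = {"+": find_all(seq, motif), "-": find_all(seq, rc)}
    return result
```

To run it on a real file you could pass an open file handle, e.g. `with open("reference.fa") as fh: hits = adapter_hits(fh)`, where `reference.fa` is a placeholder name. It holds one whole sequence in memory at a time, so for a large reference a dedicated tool like seqkit will be far quicker; this is only meant to show the logic.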
Hi genomax,
Thanks for the seqkit grep suggestion. I've never used it before but it's rapid! So I can find the Illumina universal adaptor sequence within the assembled reference chromosomes, and looking at the reference genome publication it's clear they performed Illumina adaptor removal on the raw reads. As such, I can only assume that either a small amount of adaptor sequence somehow made it past their filters and got assembled, or it's a real part of the genome that happens to share those bases with Illumina adaptors. If the latter is true, that sounds problematic for adaptor checking and removal on real data, as a chunk of real genomic sequence could erroneously be removed.
Either way, my mystery is solved so thank you very much!
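On the "real part of the genome" possibility: a short motif is expected to turn up by chance in a large genome, and a back-of-the-envelope estimate gives a feel for how often. The sketch below assumes a 12-base motif, a genome of roughly 14.5 Gb (my assumption for the size of the IWGSC assembly, which is not stated in this thread), and a crude model where every base is independent and equally likely. Real genomes, especially repeat-rich ones, can deviate a lot from that model, so treat the number as an order of magnitude only.

```python
def expected_chance_hits(genome_length: int, motif_length: int, both_strands: bool = True) -> float:
    """Expected number of exact matches of one fixed motif in random sequence.

    Assumes independent, uniformly distributed bases (a crude model) and a
    motif that is not its own reverse complement.
    """
    # Number of start positions at which the motif could fit.
    positions = max(genome_length - motif_length + 1, 0)
    # Probability of an exact match at any one position is (1/4)^k.
    per_strand = positions / 4 ** motif_length
    return per_strand * (2 if both_strands else 1)


# Assumed values: ~14.5 Gb genome, 12-base adaptor motif.
estimate = expected_chance_hits(14_500_000_000, 12)
print(f"Expected chance occurrences: {estimate:.0f}")
```

Under those assumptions the estimate comes out in the low thousands, so finding exact matches to a 12-base adaptor motif in a genome that size is not surprising on its own and does not by itself show adaptor contamination in the assembly.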