Hello all. This seems to be a routinely discussed question with many answers here, but I could not use the answers to other questions to solve my problem. I have some miRNA-seq data from an Illumina HiSeq platform. That's about all the information I have. I have not been able to identify the vendor who did the sequencing, so approaching them is out of the question.
My problem is as follows: I have single-end sequencing reads 54 bases long, and I am trying to find a good way to trim them. I have no idea what adapter to use for read trimming, so I have been looking at other posts here trying to make sense of it. Long story short, as suggested in some posts, I checked the FastQC overrepresented sequences output, which flags these two sequences as containing adapter in one sample:
AGCCGCCTGGATACCGCAGCTAGGAATAATGGAATTCTCGGGTGCCAAGG 189653 0.410031497 Illumina Small RNA Adapter 2 (100% over 21bp)
CGCGACCTCAGATCAGACGTGGCGACCCGTGGAATTCTCGGGTGCCAAGG 184505 0.398901475 Illumina Small RNA Adapter 2 (100% over 21bp)
and these three sequences in a different sample:
AGCCGCCTGGATACCGCAGCTAGGAATAATGGAATTCTCGGGTGCCAAGG 189653 0.410031497 Illumina Small RNA Adapter 2 (100% over 21bp)
CGCGACCTCAGATCAGACGTGGCGACCCGTGGAATTCTCGGGTGCCAAGG 184505 0.398901475 Illumina Small RNA Adapter 2 (100% over 21bp)
TTGCTGTGATGACTATCTTAGGACACCTTTGGAATTCTCGGGTGCCAAGG 50032 0.108169635 Illumina Small RNA Adapter 2 (100% over 21bp)
Now, these are two different samples run in different lanes. I do not know whether the sequencing was pooled with an indexing adapter (although that is very likely, given that the total number of reads is small). After comparing these sequences I have deduced that TGGAATTCTCGGGTGCCAAGG is my Illumina adapter sequence. The problem is that I cannot find any mention of this being an adapter sequence in any of Illumina's official documents on their FTP, other than this document: http://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/basespace/small-rna-v1-0-release-notes-15061994-a.pdf. Is this the correct sequence?
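The "matching over the sequences" step can be sketched as taking the longest suffix shared by the overrepresented sequences. This is only an illustration in Python: the function name `longest_common_suffix` is made up for this sketch, and the input is just the three FastQC sequences quoted above, not the full read set.

```python
def longest_common_suffix(seqs):
    """Return the longest suffix shared by every sequence in seqs."""
    if not seqs:
        return ""
    shortest = min(len(s) for s in seqs)
    n = 0
    # Walk backwards from the 3' end while all sequences agree.
    while n < shortest and len({s[-(n + 1)] for s in seqs}) == 1:
        n += 1
    first = seqs[0]
    return first[len(first) - n:]


# The three overrepresented sequences reported by FastQC in this thread.
overrepresented = [
    "AGCCGCCTGGATACCGCAGCTAGGAATAATGGAATTCTCGGGTGCCAAGG",
    "CGCGACCTCAGATCAGACGTGGCGACCCGTGGAATTCTCGGGTGCCAAGG",
    "TTGCTGTGATGACTATCTTAGGACACCTTTGGAATTCTCGGGTGCCAAGG",
]

shared = longest_common_suffix(overrepresented)
print(shared, len(shared))
```

On these three sequences the shared 3' end is the 21-base string TGGAATTCTCGGGTGCCAAGG, which agrees with the "100% over 21bp" match that FastQC reports.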
TGGAATTCTCGG is the start of the standard Illumina small RNA-seq adapter. If you just give that to Trimmomatic (or run "Trim Galore!" with the --small_rna option) then you should be fine.
Hey Devon, thanks a lot for chiming in. I did try using Trim Galore with the sequences mentioned in its manual (which include the one that you suggested!). The problem is that it hardly trims any of my data, with the length peak still at 50 bases! This is the same result that I get after using Trimmomatic. Am I looking at primer dimers here?
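One way to sanity-check a trimmer's output independently is to cut each read at the first exact occurrence of the adapter start and tabulate the resulting lengths. The Python below is a rough sketch under that assumption: `trim_exact` and `length_histogram` are hypothetical helper names, and exact string matching is a simplification, since Trimmomatic, cutadapt and Trim Galore also tolerate mismatches and partial adapters at the 3' end.

```python
from collections import Counter

# Start of the Illumina small RNA 3' adapter quoted in this thread.
ADAPTER_PREFIX = "TGGAATTCTCGG"


def trim_exact(read, adapter=ADAPTER_PREFIX):
    """Cut the read at the first exact adapter match; leave it unchanged otherwise."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


def length_histogram(reads, adapter=ADAPTER_PREFIX):
    """Count reads by their length after exact-match trimming."""
    return Counter(len(trim_exact(r, adapter)) for r in reads)


# Illustration only: the FastQC sequences from this thread plus one
# made-up read that contains no adapter at all.
reads = [
    "AGCCGCCTGGATACCGCAGCTAGGAATAATGGAATTCTCGGGTGCCAAGG",
    "CGCGACCTCAGATCAGACGTGGCGACCCGTGGAATTCTCGGGTGCCAAGG",
    "TTGCTGTGATGACTATCTTAGGACACCTTTGGAATTCTCGGGTGCCAAGG",
    "ACGTACGTACGTACGTACGT",
]

hist = length_histogram(reads)
for length, count in sorted(hist.items()):
    print(length, count)
```

If most real reads come out of a check like this much shorter than they come out of the trimming tool, the adapter is present and the tool's settings are the likelier problem.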
What does the FastQC adapter contamination plot look like? There should be a huge jump up in the percentage ~50% of the way through if this is really smallRNAseq.
Here's the QC after Trimmomatic read trimming using the adapter:
https://s32.postimg.org/3zo4uuz8l/sequence_length_distribution.png
I tried something new. Using the default settings of CAP-miRSeq, which uses cutadapt for trimming, I gave it the same adapter. This is the post-trimming sequence length distribution:
https://s31.postimg.org/l77gt8t5n/sequence_length_distribution.png
Now I am wondering what the peak at 33 signifies :/
If you want trimmomatic to trim more just reduce the size of the sliding window and increase the quality required.
I remember spending a lot of time on QC, but when I actually did the alignment I found it wasn't such a big deal. The alignment algorithm usually takes quality scores into account, so unless you have something really weird going on with your data, you should be OK.
If a sequence is vastly overrepresented I just remove it, because what else could it be? I have found Trimmomatic to be a fast tool because it is multi-core enabled; you may have your own solution, however.
Thanks a ton, Chris! I tried doing the same using Trimmomatic, with the following parameters:
java -jar trimmomatic-0.33.jar SE -phred33 21A.fastq 21A_clipped.fastq ILLUMINACLIP:final_adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:18
Doing this gives me an output which still has a length distribution peak at 50 bases, which implies that the majority of my reads have not really been trimmed. Am I doing something really, really stupid here? Should I have not been just born in this world? Many thanks for taking the time out and helping a fellow distressed soul!
It depends on the quality and how you set the sliding window. You will have to read the manual and play around until you get it to work. I remember it taking a while for me, but once I did it was nice and fast.
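To see why the window size and required quality matter, here is a simplified Python sketch of sliding-window quality trimming in the spirit of SLIDINGWINDOW:4:15. It is an assumption-laden illustration, not Trimmomatic's actual implementation (which differs in details): the helper names `phred33` and `sliding_window_keep` are invented here, and the rule used is simply "cut at the start of the first window whose mean quality falls below the threshold".

```python
def phred33(qual_string):
    """Decode a Phred+33 FASTQ quality string into integer scores."""
    return [ord(c) - 33 for c in qual_string]


def sliding_window_keep(quals, window=4, min_mean=15):
    """Return how many 5' bases to keep under a simplified sliding-window rule.

    Scans from the 5' end and cuts at the start of the first window whose
    mean quality is below min_mean. Reads shorter than the window are kept.
    """
    if len(quals) < window:
        return len(quals)
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) < min_mean * window:
            return start
    return len(quals)


# Made-up quality profile: ten good bases (Q30) followed by six poor ones (Q2).
quals = phred33("?" * 10 + "#" * 6)

print(sliding_window_keep(quals, window=4, min_mean=15))
print(sliding_window_keep(quals, window=4, min_mean=20))
```

In this toy profile, raising the required mean from 15 to 20 moves the cut one base earlier. Note that quality trimming only reacts to low-quality bases; it does not remove adapter sequence that was read at high quality, which is the job of the adapter-clipping step.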