Hello everyone
I have performed quality analysis of RAW miRNA-seq data using FastQC. As a result, I got the following output and the encoding according to FastQC appears to be illumina v1.5 encoding. I next want to perform adaptor removal and trimming of low quality reads. Based on this, could anyone please tell me which is the Phred score value to use in trimming step in tools such as fastx toolkit or cutadapt or any other better tool?
Regards
Zammy
Thank you for your reply Brian
I have seen another post where you suggested BBDuk. Well it makes sense as you are the developer of it and the comparative analysis you performed between different tools also favors BBDuk :). I don't have access to linux machine right now so I hope it will run on windows 7. Could you please tell me how to run BBDuk in windows as it is stated on the download page that it is platform independent. The intended use of data is to perform differential miRNA expression analysis between two groups. So the next step after adapter removal is to perform mapping to hg19 and miRBase. Its not paired-end data
For the sake of my understanding, could you please explain if I have to define Phred score in trimming step, what would that be according to above figures? This is what I am trying to figure out. Someone suggested me to use 33 but I am not sure how exactly it works as in the logic behind.
Regards
Zammy,
The ASCII offset for modern Illumina reads is 33, meaning a base with Q0 will have ASCII code 33 (!) and are generally Ns; there is no lower quality score (except in some of Illumina's old formats when they used negative values, just to be different). A normal Illumina run currently produces Q0 for N, Q2 for bases that the machine thinks are junk, and 5-41 for real qualities of real bases. Trimming to Q10 means you discard bases that the machine thinks have a greater than 10% chance of being wrong; Q20 means greater than 1%; and Q30 means greater than 0.1%. High levels throw away useful data, and low levels retain more errors. If you are doing something highly tolerant of errors, quality-trimming (beyond, perhaps, Q1) is not needed or necessarily useful. For assembly, trimming to higher values like 10 often is useful.
To run BBDuk in Windows:
Unzip it and untar it to somewhere; let's say
C:\bbmap\
. You can unzip and untar it with 7-zip, my favorite Windows compression utility.Then run:
That will do quality and adapter trimming. You can adjust the
trimq
to a different value depending on the stringency; for mapping, Q6 should be fine. If you have paired end reads in two files, addin2=
andout2=
with additional file names. BBMap also runs in Windows, by the way, though it needs 24GB for hg19.Oh, and I certainly encourage you to try other trimmers, and post back here if they did a better job :)
Thanks, I am gonna try it now...My apologies but I am not very expert with command line things. The command you have written above is supposed to be run in cmd or windows command command prompt, correct? Furthermore, is there a manual or
-h
command to explain the script ? in and out are making sense but others such as ref, hdist etc and how to give input of my adapter sequence.Regards
bbduk.sh -h
will work in Linux, but not in Windows. You can openbbduk.sh
(which would be inC:\bbmap\
) and view it with a normal editor, though, like Wordpad or Notepad++ (NOT Notepad which cannot handle newlines correctly); it explains the definitions of all the flags in plain text.hdist=1
means "allow 1 mismatch";hdist
is short for "Hamming distance".ref
is the adapter files; the ones I listed will work if you used normal Illumina TrueSeq or Nextera adapters, but if you have custom adapters, then point it to those instead. They can either be in a fasta-formatted file, or just the literal adapter sequence if you use theliteral
flag, e.g.literal=ACCGTGTCCAGTATTGGTCAGGCCT
.ref
means "reference sequence" (in this case, adapters);qtrim=rl
means quality-trim the right and left sides;trimq=6
means trim regions with average quality below Q6;ktrim=r
means "look for adapter kmers, and when found, trim to the right";k=23
means "try to match 23-mers between the read and reference";mink=11
means "use kmers as short as 11 at the end of the read";tbo
andtpe
allow better adapter trimming of reads that are paired and will have no effect on single-ended reads.As for how you run it - you can go to start->run-> and type
cmd
then hit enter, which will give you a command prompt in windows. In Windows 7 "run" is the little text box at the bottom of the start menu. Then usecd
to navigate to the appropriate location; e.g.cd C:\data\
if your reads are there.Please note that BBTools require Java which is free but not preinstalled on Windows, so if you don't have it, you'll need to download and install from here.
Following is the output after running above command. Trying to figure out if its okay.....
Since your adapter sequence is only 20bp, you need to set "k=20" or lower - otherwise, there cannot possibly be any kmer matches.
I was wondering is it possible that FastQC incorrectly detects encoding method? and is the sequence of adapter also relates to the encoding version? for example:
The illumina v1.0 small RNA 3p adapter "TCGTATGCCGTCTTCTGCTTG" and the
The illumina v1.5 small RNA 3p adapter "ATCTCGTATGCCGTCTTCTGCTTG"
Now in my case, FastQC tells me that encoding method is 1.5 whereas the adapter sequence in my RAW dataset is of v1.0. I also checked this via manual Find option in Kate text editor that is showing the v1.0 adapter at the end of my reads. However, there are 44717 reads which have 3' sequence matching to v1.5 adapter in the sequence file. Attached is the screenshot of first few lines of RAW file. So what could be the actual encoding method here and again how it relates to the selection of quality value
The encoding method is talking about quality encoding, not adapter sequences.