Hello everyone. I have mate-pair sequencing data with 150 bp reads. I checked the read quality with FastQC and found that it is good:
#Base Mean Median Lower Quartile Upper Quartile 10th Percentile 90th Percentile
1 33.93390648005911 35.0 35.0 35.0 31.0 35.0
2 33.95907029321767 35.0 35.0 35.0 31.0 35.0
3 33.98077552343845 35.0 35.0 35.0 31.0 35.0
4 33.98753499777637 35.0 35.0 35.0 31.0 35.0
5 33.984543748770804 35.0 35.0 35.0 31.0 35.0
6 38.591631930233014 40.0 39.0 40.0 36.0 40.0
7 38.56701433221178 40.0 39.0 40.0 36.0 40.0
8 38.55072589489948 40.0 39.0 40.0 36.0 40.0
9 38.53976502189077 40.0 39.0 40.0 36.0 40.0
10-14 38.50712970571924 40.0 39.0 40.0 36.0 40.0
15-19 38.4177180935979 40.0 39.0 40.0 36.0 40.0
20-24 38.38606423538208 40.0 39.0 40.0 36.0 40.0
25-29 38.29027050300057 40.0 39.0 40.0 35.2 40.0
30-34 38.20144945809037 40.0 39.0 40.0 35.0 40.0
35-39 38.11811796390104 40.0 39.0 40.0 34.0 40.0
40-44 38.0175315710861 40.0 39.0 40.0 34.0 40.0
45-49 37.935398773549636 40.0 39.0 40.0 34.0 40.0
50-54 37.84706168928388 40.0 39.0 40.0 34.0 40.0
55-59 37.74265376076599 40.0 38.4 40.0 34.0 40.0
60-64 37.64332664650341 40.0 38.0 40.0 34.0 40.0
65-69 37.52005951881047 39.8 38.0 40.0 34.0 40.0
70-74 37.38430764378722 39.0 38.0 40.0 31.6 40.0
75-79 37.26014222004879 39.0 38.0 40.0 31.0 40.0
80-84 37.12272526413333 39.0 37.2 40.0 31.0 40.0
85-89 36.964182674129546 39.0 37.0 40.0 30.2 40.0
90-94 36.81203894449435 39.0 37.0 40.0 30.0 40.0
95-99 36.65255631448299 39.0 36.4 40.0 29.2 40.0
100-104 35.66682387627061 38.0 34.6 39.2 27.0 39.6
105-109 36.83309679602688 39.0 36.8 40.0 30.0 40.0
110-114 36.80732298238993 39.0 37.0 40.0 30.0 40.0
115-119 36.60025361051608 39.0 36.4 40.0 28.2 40.0
120-124 36.388457503902416 39.0 36.0 40.0 27.0 40.0
125-129 36.12575631519171 39.0 36.0 40.0 27.0 40.0
130-134 35.90247099450205 39.0 36.0 40.0 27.0 40.0
135-139 35.6381126201069 39.0 35.2 40.0 27.0 40.0
140-144 35.37785752835347 39.0 34.6 40.0 26.8 40.0
145-149 35.04320800576903 39.0 34.0 40.0 19.2 40.0
150-151 33.066291537988604 37.0 30.5 39.5 16.0 40.0
Now I want to assemble the data, but I am confused because I do not know whether adapters have been removed from this dataset. Can anyone tell me the steps for preprocessing the sequencing data? I would also like to know whether the reads in the R1 and R2 files still have the same length after preprocessing.
Thanks
Here is the complete list of adapters commonly used on Illumina platforms: illumina adapters. You can then trim them with any of several Linux command-line tools, such as Biopieces.
OK, I will check it. Can you please tell me the steps for preprocessing genomic data? Thanks
It depends on what you are looking for. De novo assembly? Reference-based assembly? SNP calling? Something else?
I am looking to do de novo assembly and SNP analysis.
Mate-pair sequencing is not the same as paired-end sequencing. Mate-pair data requires different handling.
Just wanted to make sure.
Use bbduk.sh from the BBMap suite for trimming. Adapter sequences for most common commercial kits are included in the adapters.fa file in the resources directory of the software bundle. See: How to use BBDuk.
No, the reads do not need to be (and may not be) the same length after they are scanned and trimmed. You do want to trim the R1/R2 reads together, since the order of the reads in the two files is important. If you lose a read from one of the files, its mate needs to be removed from the other. bbduk.sh is paired-end aware and will take care of this for you.
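To illustrate why read order matters, here is a minimal Python sketch that checks whether two FASTQ files still list the same reads in the same order after trimming. It is not part of BBMap; the function names are made up for this example, and it assumes plain 4-line FASTQ records whose read names match once any trailing /1 or /2 is dropped.

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the read name from each 4-line FASTQ record in a text handle."""
    for i, line in enumerate(handle):
        if i % 4 != 0 or not line.strip():
            continue
        # Drop the leading '@' and anything after the first whitespace.
        name = line[1:].split()[0]
        # Drop an old-style /1 or /2 mate suffix if present.
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def pairs_in_sync(r1_handle, r2_handle):
    """Return True if both files hold the same read names in the same order."""
    for n1, n2 in zip_longest(read_names(r1_handle), read_names(r2_handle)):
        if n1 != n2:
            return False
    return True


# Toy data: read lengths differ between mates, but names and order agree.
r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nACGTAC\n+\nIIIIII\n")
r2 = io.StringIO("@read1/2\nTTGA\n+\nIIII\n@read2/2\nGG\n+\nII\n")
print(pairs_in_sync(r1, r2))
```

For real files you would pass open file handles (for example from gzip.open(path, "rt") for compressed FASTQ) instead of the toy strings.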
Thank you for your reply. Here is my PRINSEQ job ID: 31343934383336313632. Could you please look at this data and tell me whether it contains adapter sequences? I am surprised by the results of preprocessing the data with different tools: every time, I find that each base position has good quality, with no tag sequences and no duplication, so I do not understand what I should do next.
Thanks
Check whether all your reads are the same length as the sequencing run. If they are not, there is a good chance that they have been pre-trimmed. If your reads have already been scanned and trimmed, you can move on to the next step in the analysis.
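One way to do that length check is sketched below in Python. This is only an illustration: the function names are hypothetical, and it assumes plain 4-line FASTQ records.

```python
import io
from collections import Counter


def read_length_counts(handle):
    """Count sequence lengths in a FASTQ text handle (4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            counts[len(line.rstrip("\r\n"))] += 1
    return counts


def all_reads_full_length(counts, run_length):
    """True if every read has exactly the sequencing run length."""
    return set(counts) == {run_length}


# Toy data with a run length of 6: one read is shorter, which hints at trimming.
toy = io.StringIO("@r1\nACGTAC\n+\nIIIIII\n@r2\nACGT\n+\nIIII\n")
counts = read_length_counts(toy)
print(sorted(counts.items()))
print(all_reads_full_length(counts, 6))
```

If every read comes back at the full run length, the file has probably not been trimmed, although that alone does not show that adapters are present.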
Tag sequences are relocated to the fastq header when Illumina reads are demultiplexed.
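As a rough illustration, the Python sketch below pulls the index field out of a read header. It assumes the CASAVA 1.8+ header layout, in which the part after the space is read:is_filtered:control_number:index; your headers may differ (some pipelines write a sample number there instead of the index sequence), and the function name is made up for this example.

```python
def index_from_header(header):
    """Return the index field from a CASAVA 1.8+ style header, or None."""
    parts = header.strip().split(maxsplit=1)
    if len(parts) < 2:
        return None  # no comment part after the read name
    fields = parts[1].split(":")
    # Expected comment layout: read:is_filtered:control_number:index
    if len(fields) != 4:
        return None
    return fields[3] or None


# Hypothetical header, not taken from the data in this thread.
print(index_from_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCCGA"))
```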
Yes, all reads have the same length (150 bp), so I think the adapters have not been removed from the sequencing data. So the first step should be to remove the adapter sequences, followed by quality filtering. Am I right?
Your data will not necessarily have adapter contamination (for example, if your libraries are exceptionally well made).
You will only see adapter sequence on the 3'-end of reads if (some of) your library inserts happen to be shorter than the length of sequencing. Can you run bbduk.sh based on the link I had posted in my first response above?
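For a quick first look before running a proper trimmer, a small Python sketch like the one below can count how many reads contain the start of a known adapter. The two sequences are the commonly cited starts of the Illumina TruSeq and Nextera adapters and are an assumption here, not something taken from this dataset; the adapters.fa file in BBMap is the better reference. The check uses exact matching only, so it will miss partial adapters at the very end of a read and any hits with sequencing errors.

```python
import io

# Assumption: commonly cited adapter starts, not confirmed for this library.
ADAPTER_PREFIXES = {
    "TruSeq": "AGATCGGAAGAGC",
    "Nextera": "CTGTCTCTTATACACATCT",
}


def adapter_hit_fraction(handle, adapters=ADAPTER_PREFIXES):
    """Return {adapter name: fraction of reads containing that sequence}."""
    hits = {name: 0 for name in adapters}
    total = 0
    for i, line in enumerate(handle):
        if i % 4 != 1:  # only look at the sequence line of each record
            continue
        total += 1
        seq = line.strip().upper()
        for name, adapter in adapters.items():
            if adapter in seq:
                hits[name] += 1
    return {name: (n / total if total else 0.0) for name, n in hits.items()}


# Toy data: the first read runs into a TruSeq adapter near its 3' end.
toy = io.StringIO(
    "@r1\nACGTACGTAGATCGGAAGAGCAC\n+\nIIIIIIIIIIIIIIIIIIIIIII\n"
    "@r2\nACGTACGTACGTACGTACGTACG\n+\nIIIIIIIIIIIIIIIIIIIIIII\n"
)
print(adapter_hit_fraction(toy))
```

A fraction near zero for every adapter is consistent with the point above that well-made libraries may show no adapter contamination.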