I checked my FASTQ files with FastQC, which revealed several problems: 1) per base GC content and per base sequence content at the initial part of the 100 bp paired-end reads, and 2) several overrepresented sequences and kmer profiles. I then used Trimmomatic to remove the first 10 bases (HEADCROP:10), since that is where the reads showed problems (is that the right approach?), and also supplied the Illumina adapters to ILLUMINACLIP to remove the overrepresented sequences and kmer profiles. The overrepresented sequences report now looks good, but the kmer profiles are still there.
How should I remove those kmer profiles? Is it fine to go ahead and do the alignment to the reference genome without correcting for the kmers?
Thank you in advance!
I wanted to share the images/HTML files I have, but I cannot find any option to attach them on this forum. Are attachments not allowed on Biostars?
- Bishwa K.
Please upload things somewhere and link to them. Also, what kind of experiment was this (e.g., RNAseq)?
Hi Devon,
I have shared the links via Google Drive. I think the reports will open in the browser once you download them. The data are genomic resequencing data.
Thanks
I am attaching links to the files, which are in HTML format. I think they will open in the browser after downloading.
This is the FastQC report for the raw files (genomic resequencing data, paired-end reads). It shows several problems: 1) per base GC content and per base sequence content at the initial part of the 100 bp paired-end reads, and 2) several overrepresented sequences and kmer profiles.
I then head cropped (10 bases) and removed adapters using Trimmomatic.
adapters: https://drive.google.com/file/d/0B9YUBnYGAr1AS0hrc2lMbE43ZUU/view?usp=sharing
https://drive.google.com/file/d/0B9YUBnYGAr1ANEFZc3FleDRob3M/view?usp=sharing
Adapter trimming only improved the kmer profiles; it did not fix most of the per base sequence content and GC content issues in the first 10 bp of the reads.
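For reference, the adapter clipping plus head crop described above could be expressed roughly as follows. This is only a sketch: the jar path, input and output file names, the adapter FASTA name, and the `2:30:10` ILLUMINACLIP thresholds are placeholders and assumptions, not the settings actually used in this thread. Trimmomatic applies its steps in the order they are listed, so the adapter clip is placed before the head crop here.

```python
import shlex


def trimmomatic_pe_command(jar, r1, r2, out_prefix, adapters, headcrop=0):
    """Build a Trimmomatic paired-end command as an argument list.

    All paths are hypothetical placeholders. The ILLUMINACLIP thresholds
    (2:30:10) are assumed values, not taken from the original post.
    """
    cmd = [
        "java", "-jar", jar, "PE", "-phred33",
        r1, r2,
        f"{out_prefix}_R1_paired.fastq.gz",
        f"{out_prefix}_R1_unpaired.fastq.gz",
        f"{out_prefix}_R2_paired.fastq.gz",
        f"{out_prefix}_R2_unpaired.fastq.gz",
        # Adapter clipping first: steps run in the order given.
        f"ILLUMINACLIP:{adapters}:2:30:10",
    ]
    if headcrop > 0:
        # Optional: drop the first `headcrop` bases of every read.
        cmd.append(f"HEADCROP:{headcrop}")
    return cmd


cmd = trimmomatic_pe_command(
    "trimmomatic-0.32.jar",      # hypothetical jar location
    "sample_R1.fastq.gz",        # hypothetical input files
    "sample_R2.fastq.gz",
    "sample_trimmed",
    "adapters.fa",               # hypothetical adapter FASTA
    headcrop=10,
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

Calling the function with `headcrop=0` gives the adapter-only variant, which makes it easy to produce both trimmed sets and compare their FastQC reports side by side. The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.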
The new FastQC 0.32 reports kmer profiles for the fasta files that were not reported by the FastQC version available on iPlant.
Also, the RNAseq data has the following FastQC report: no kmer or adapter contamination, but the GC and base content show more variation in the first 10 bp.
I am thinking of proceeding with adapter trimming but no head crop, but I would like to know why there is such variation in the first 10 bases of the reads (for both the RNAseq and the genomic resequencing data; they were sequenced at different facilities).
Thanks
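To look at the per base sequence content at the start of the reads outside of FastQC, something like the sketch below could be used. It is an illustration rather than FastQC's own implementation: it assumes plain 4-line FASTQ records, and the function names and the toy reads are made up for the example.

```python
from collections import Counter


def read_fastq_sequences(handle):
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def per_base_composition(reads, n_positions=15):
    """Return, for each of the first n_positions, the fraction of A/C/G/T."""
    counts = [Counter() for _ in range(n_positions)]
    for seq in reads:
        for i, base in enumerate(seq[:n_positions].upper()):
            counts[i][base] += 1
    table = []
    for c in counts:
        total = sum(c.values())  # includes N or other symbols, if any
        table.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return table


# Toy reads, only to show the shape of the result.
toy_reads = ["ACGT", "ACGA", "TCGA", "ACGT"]
for pos, fractions in enumerate(per_base_composition(toy_reads, 4), start=1):
    print(pos, fractions)
```

For a real file, the reads would come from `read_fastq_sequences(open("sample_R1.fastq"))` (hypothetical file name; gzipped files would need `gzip.open(..., "rt")`). Running this on the raw and the trimmed reads shows directly whether the first 10 positions still deviate from the rest of the read.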
Can someone comment on my report?
Thanks,