Hi,
I have transcriptome data from two samples sequenced with paired-end chemistry. I used Trimmomatic and NGS QC Toolkit for quality trimming, but after trimming there are still many overrepresented sequences (50 bp long) in both reads. Can anyone suggest a procedure for removing these overrepresented sequences? I have also checked that they are not adapters.
Have you tried BLASTing a few representative sequences at NCBI?
As long as they come from the species you are interested in, you should be able to proceed with the analysis. If they appear to be contaminants, you would want to investigate the extent of that contamination and then decide whether the experiment needs to be repeated.
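One way to pick the sequences to BLAST is to count identical reads and take the most frequent ones. The following is only a minimal sketch: it assumes plain, uncompressed FASTQ with exactly four lines per record, and the function names `top_sequences` and `to_fasta` are made up for illustration.

```python
from collections import Counter
from itertools import islice

def top_sequences(fastq_lines, n=10):
    """Return the n most frequent read sequences as (sequence, count) pairs."""
    # Assumes 4-line FASTQ records, so the sequence is every 4th line
    # starting from the 2nd line of the file.
    seqs = (line.strip() for line in islice(fastq_lines, 1, None, 4))
    return Counter(seqs).most_common(n)

def to_fasta(pairs):
    """Format (sequence, count) pairs as FASTA text that can be pasted into BLAST."""
    return "".join(
        f">seq{i}_count{count}\n{seq}\n"
        for i, (seq, count) in enumerate(pairs, 1)
    )
```

With an open file handle `fh` on one of the trimmed read files, `print(to_fasta(top_sequences(fh)))` would give a small FASTA file of candidates. Note that this holds every distinct sequence in memory, so for a large run it may be worth running it on a subsample.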
I used BLAST. Some sequences show similarity to chloroplast genome sequences, but my sample is from a bacterium.
Some sequences also show similarity to distantly related bacterial species.
Should I remove these reads from the FASTQ file before moving on to assembly?
Chloroplasts are thought to have originated from cyanobacteria (that were engulfed by a eukaryotic cell), so that result in itself may not be surprising. Do you know what fraction of your reads represents data you know comes from the bacterium you are working with, and what fraction goes into the "other" (chloroplast, etc.) bin?
I hesitate to recommend that you throw any reads away without a full understanding of what bacterial species you are working with and what this experiment is about.
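The fractions asked about above can be tallied once each classified read has a label. This is a rough sketch under stated assumptions: `read_to_bin` is a hypothetical dictionary you would build yourself (for example from BLAST hits), mapping read IDs to labels such as "target" or "chloroplast", and any read without an entry is counted as "unassigned".

```python
from collections import Counter

def bin_fractions(total_reads, read_to_bin):
    """Return the fraction of all reads that falls into each bin.

    total_reads: number of reads in the file.
    read_to_bin: dict of read ID -> bin label (hypothetical, user-supplied).
    """
    counts = Counter(read_to_bin.values())
    # Reads that were never classified go into their own bin.
    counts["unassigned"] = total_reads - len(read_to_bin)
    return {name: n / total_reads for name, n in counts.items()}
```

If the "other" bins turn out to be a small share of the data, that is a different situation from one where they dominate the library.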
Are those reads by chance from rRNA sequences?