Maximum Over-Represented Sequence in Bacterial Transcriptome Data
8.6 years ago

Hi,

I have transcriptome data from two samples sequenced with paired-end chemistry. I used Trimmomatic and NGS QC Toolkit for quality trimming, but after trimming there are still many overrepresented sequences (50 bp in length) in both reads. Can anyone suggest a procedure for removing these overrepresented sequences? I have also checked that they are not adapters.

RNA-Seq • 2.1k views
ADD COMMENT

Have you tried BLASTing a few representative sequences at NCBI?
As long as they come from your species of interest, you should be able to proceed with the analysis. If they appear to be contaminants, you would want to investigate the extent of that contamination and then decide whether the experiment needs to be repeated.
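
To pick representative sequences for such a search, one option is to count identical reads and export the most frequent ones as FASTA. The following is only a minimal sketch: it assumes a plain four-line FASTQ layout (no wrapped sequence lines), and the names `top_sequences` and `to_fasta` are made up for illustration.

```python
from collections import Counter
from itertools import islice


def top_sequences(fastq_lines, n=5):
    """Return the n most frequent read sequences as (sequence, count) pairs."""
    # Assumption: four lines per record, so the sequence is every 4th line
    # starting from the second line of the input.
    seqs = (line.strip() for line in islice(fastq_lines, 1, None, 4))
    return Counter(seqs).most_common(n)


def to_fasta(counted):
    """Format (sequence, count) pairs as FASTA text usable as a BLAST query."""
    return "".join(
        f">seq{i}_count{count}\n{seq}\n"
        for i, (seq, count) in enumerate(counted, 1)
    )


# Tiny in-memory example; for real data, pass an open file handle instead.
demo = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTTT", "+", "IIII",
]
top = top_sequences(demo, n=2)
fasta_text = to_fasta(top)
print(fasta_text)
```

The resulting FASTA text can be pasted into the NCBI BLAST web form or saved to a file for a local search.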

ADD REPLY

I used BLAST. Some sequences show similarity to chloroplast genome sequences, but my sample is from bacteria.

Some sequences also show similarity to distantly related bacterial species.

Should I remove these reads from the FASTQ file before moving on to assembly?

ADD REPLY

Chloroplasts are thought to have originated from cyanobacteria (that were engulfed by a eukaryotic cell), so that result in itself may not be surprising. Do you know what fraction of your reads comes from the bacterium you are working with and what fraction falls into an "other" (chloroplast etc.) bin?
I hesitate to recommend throwing any reads away without a full understanding of which bacterial species you are working with and what this experiment is about.
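
One rough way to estimate those fractions is to bin each read by its best BLAST hit. The sketch below assumes BLAST tabular output (`-outfmt 6`, with query ID in column 1, subject ID in column 2, and bit score in column 12). The function `bin_fractions` and the `subject_to_bin` lookup are hypothetical: you would have to build that mapping yourself from the subjects in your own database. Reads with no hit do not appear in tabular output, so the fractions are relative to reads that had at least one hit.

```python
from collections import Counter


def bin_fractions(blast_tab_lines, subject_to_bin):
    """Fraction of queries per bin, using each query's highest-bit-score hit."""
    best = {}  # query id -> (bit score, subject id)
    for line in blast_tab_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)
    # Subjects missing from the lookup are counted as "other".
    counts = Counter(
        subject_to_bin.get(subject, "other") for _, subject in best.values()
    )
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}


# Made-up hits for four reads; the subject IDs are placeholders.
demo_hits = [
    "r1\tbactA\t99.0\t50\t0\t0\t1\t50\t1\t50\t1e-20\t90.0",
    "r1\tchlB\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-10\t60.0",
    "r2\tchlB\t98.0\t50\t1\t0\t1\t50\t1\t50\t1e-19\t88.0",
    "r3\tbactA\t100.0\t50\t0\t0\t1\t50\t1\t50\t1e-21\t92.0",
    "r4\tunkC\t95.0\t50\t2\t0\t1\t50\t1\t50\t1e-15\t80.0",
]
bins = {"bactA": "target", "chlB": "chloroplast"}
fractions = bin_fractions(demo_hits, bins)
print(fractions)
```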

ADD REPLY

Are those reads by chance from rRNA sequences?

ADD REPLY
8.6 years ago

Hi,

If your data is RNA-seq, then you probably don't have to worry about it, because these sequences might simply come from highly expressed genes.

For example: let's say your read length is 100 bp and a gene 100 bp long is expressed 10 times. There will then be 10 identical reads covering that gene, which might be enough for it to show up among the overrepresented sequences.
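
A toy version of that example, as a sketch only: the `overrepresented` function and the 0.5% cutoff are assumptions made for illustration, not the rule any particular QC tool applies.

```python
from collections import Counter


def overrepresented(reads, threshold=0.005):
    """Return {sequence: fraction} for sequences above the given fraction."""
    counts = Counter(reads)
    total = len(reads)
    return {
        seq: count / total
        for seq, count in counts.items()
        if count / total > threshold
    }


# One 100 bp "gene" sequenced 10 times.
gene_read = "ACGT" * 25

# 990 distinct background reads: a unique 10-mer prefix plus a fixed tail.
background = [
    "".join("ACGT"[(i >> (2 * k)) & 3] for k in range(10)) + "T" * 90
    for i in range(990)
]

reads = [gene_read] * 10 + background
flagged = overrepresented(reads)
print(len(reads), flagged[gene_read])
```

Here the gene accounts for 10 of 1000 reads (1%), so it is flagged even though nothing is wrong with the library; each background read is only 0.1% and is not.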

Correct me if I am wrong.

Cheers

ADD COMMENT
