Question

Maximum Over-Represented Sequence in Bacterial Transcriptome Data

8.6 years ago

debashis.bioinfo • 0

Hi,

I have transcriptome data from two samples, sequenced with paired-end chemistry. I used Trimmomatic and NGS QC Toolkit for quality trimming, but after trimming there are still many overrepresented sequences (50 bp long) in both reads. Can anyone suggest a procedure for removing these overrepresented sequences? I have also checked that they are not adapters.

RNA-Seq • 2.1k views

ADD COMMENT • link updated 8.6 years ago by Govardhan Anande ▴ 150 • written 8.6 years ago by debashis.bioinfo • 0

Have you tried BLASTing a few representative sequences at NCBI?
As long as they come from your species of interest, you should be able to proceed with the analysis. If they appear to be contaminants, you will want to investigate the extent of the contamination and then decide whether the experiment needs to be repeated.
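One way to pick those representative sequences is to count identical reads and write the most frequent ones out as FASTA for a BLAST search. The following is only a minimal sketch: it assumes an uncompressed FASTQ file with exactly four lines per record, and the function names, the file name `sample_R1.fastq`, and the 50-base truncation are illustrative choices, not part of any tool mentioned in this thread.

```python
from collections import Counter
from itertools import islice


def count_sequences(fastq_lines, prefix_len=50):
    """Count read sequences (truncated to prefix_len bases).

    Assumes strict 4-line FASTQ records, so the sequence is always
    the second line of each record.
    """
    counts = Counter()
    # Take lines 1, 5, 9, ... (0-based), i.e. the sequence line of each record.
    for seq in islice(fastq_lines, 1, None, 4):
        counts[seq.strip()[:prefix_len]] += 1
    return counts


def top_as_fasta(counts, n=10):
    """Format the n most frequent sequences as FASTA text."""
    total = sum(counts.values())
    records = []
    for i, (seq, c) in enumerate(counts.most_common(n), start=1):
        header = f">overrep_{i} count={c} fraction={c / total:.4f}"
        records.append(f"{header}\n{seq}")
    return "\n".join(records)


# Hypothetical usage (the file name is an assumption):
# with open("sample_R1.fastq") as fh:
#     print(top_as_fasta(count_sequences(fh)))
```

The FASTA output can be pasted into a BLAST search, and the `fraction` field gives a first impression of how much of the library each sequence accounts for.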

ADD REPLY • link 8.6 years ago by GenoMax 147k

I used BLAST. Some sequences show similarity to chloroplast genome sequences, but my sample is from bacteria.

Some sequences also show similarity to distantly related bacterial species.

Should I remove these reads from the FASTQ file before moving on to assembly?

ADD REPLY • link 8.5 years ago by debashis.bioinfo • 0

Chloroplasts are considered to have originated from cyanobacteria (that were engulfed by a eukaryotic cell), so that result in itself may not be surprising. Do you know what fraction of your reads you can confidently assign to the bacterium you are working with, and what fraction goes into the "other" (chloroplast etc.) bin?
I hesitate to recommend that you throw any reads away without a full understanding of what bacterial species you are working with and what this experiment is about.

ADD REPLY • link 8.5 years ago by GenoMax 147k

Are those reads by any chance rRNA sequences?

ADD REPLY • link 8.5 years ago by WouterDeCoster 47k

score 0 · Answer 1 · 2016-05-04

Hi,

If your data is RNA-seq, you may not need to worry about it, because these sequences might simply come from highly expressed genes.

For example: let's say your read length is 100 bp and a gene of length 100 bp is expressed 10 times. There will then be 10 identical reads covering that gene, which might cause it to be reported as an overrepresented sequence.
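The arithmetic behind this example can be sketched in a few lines. The numbers below are assumptions chosen for illustration: a library of 2,000 reads in which the 10 gene reads are identical and all other reads are unique, and a cutoff of 0.1% of all reads (the level at which FastQC documents a warning for its overrepresented-sequences module). The function name `flag_overrepresented` is hypothetical.

```python
from collections import Counter


def flag_overrepresented(counts, threshold=0.001):
    """Return {sequence: fraction} for sequences above the given fraction of all reads."""
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items() if n / total > threshold}


# Stand-in for the 100 bp read covering the expressed gene (not real data).
gene_read = "A" * 100

counts = Counter({gene_read: 10})
# Assumed background: 1,990 reads that each occur exactly once.
for i in range(1990):
    counts[f"background_{i}"] = 1

flagged = flag_overrepresented(counts)
# The gene read makes up 10 / 2000 = 0.5% of the library and is flagged;
# each background read makes up 0.05% and is not.
```

Under these assumptions, a sequence that is genuinely expressed is flagged purely because of its abundance, which is why a flagged sequence is not automatically a contaminant.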

Correct me if I am wrong.

Cheers