Bacterial RefSeq to remove contaminants
3.8 years ago

Hi, community. I have been working on a transcriptome for my species of interest, which has an available genome. To expand my transcriptomic database, I decided, after assembling a genome-guided transcriptome, to run a de novo assembly using the reads that did not map (around 10% of my data). However, I suspected that I had contamination. I mapped my dataset of non-aligning reads to several reference sequences (human, fungal, viral and bacterial), and for the bacterial genome (E. coli), around 30% of the reads that did not map to my genome mapped to this bacterial genome. Now that I know my source of contamination is probably bacterial, I was wondering whether there is a database I can use to map and remove the contaminant reads.

Thank you in advance

transcriptome RNA-seq • 1.0k views
3.8 years ago
GenoMax 147k

around 30% of the reads that did not map to my genome mapped to this bacterial genome.

There is no set database for contaminants. You don't really know if those reads came from E. coli to begin with, but they seem to have similarity to and are thus mapping to that genome. You may find that reads coming from basic metabolism genes in bacteria will map to multiple bacterial genomes equally well, especially if you are allowing for errors in alignment.
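One way to check the multi-mapping point above is to align the reads against an index built from several bacterial genomes and tally how many alignments are unique versus ambiguous. This is only a minimal sketch: `count_multimappers` is a hypothetical helper name, and it assumes the aligner writes the optional `NH:i` tag (number of reported alignments for the read) on each SAM record, as HISAT2 does.

```shell
# Hypothetical helper: tally uniquely vs. multiply mapped primary alignments
# in SAM text read from stdin, based on the NH:i tag.
# Counts are per SAM record (each mate separately), not per read pair.
count_multimappers() {
  awk -F'\t' '
    /^@/ { next }                      # skip SAM header lines
    int($2 / 4) % 2 == 1 { next }      # skip unmapped records (flag 0x4)
    int($2 / 256) % 2 == 1 { next }    # skip secondary alignments (flag 0x100)
    {
      # optional tags start at column 12
      for (i = 12; i <= NF; i++)
        if ($i ~ /^NH:i:/) {
          n = substr($i, 6) + 0
          if (n > 1) multi++; else unique++
        }
    }
    END { printf "unique=%d multi=%d\n", unique, multi }
  '
}
```

Since HISAT2 writes SAM to standard output when no `-S` file is given, its output can be piped straight into `count_multimappers`. A large `multi` count against a multi-genome index would suggest the reads cannot be pinned to E. coli specifically.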

At some point you should set aside these suspected contaminant reads and go on with the transcriptome you have already put together. You probably have more interesting biology to discover there.


There is no set database for contaminants. You don't really know if those reads came from E. coli to begin with, but they seem to have similarity to and are thus mapping to that genome. You may find that reads coming from basic metabolism genes in bacteria will map to multiple bacterial genomes equally well, especially if you are allowing for errors in alignment.

So is it enough to just use one bacterial genome and relax the parameters of my aligner? I've been using HISAT2 with default parameters like so:

hisat2 -p 4 -x db/ecoli_index -1 06_data_not_aligned/illumina/${sample}_R1.not_aligned.fastq.gz -2 06_data_not_aligned/illumina/${sample}_R2.not_aligned.fastq.gz
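To actually remove the suspected contaminant reads, the same command can be extended so that pairs which fail to align concordantly to the bacterial index are written out as a cleaned read set. The sketch below is an assumption-laden illustration, not a tested pipeline: `decontam_cmd`, the output directory and the `.clean` file suffix are hypothetical names. It relies on HISAT2's `--un-conc-gz` option, where a `%` in the path is replaced with 1 or 2 for the two mates, and it only prints the command so it can be inspected before running.

```shell
# Hypothetical helper: print (do not run) a HISAT2 command that maps one
# sample's unaligned pairs to a contaminant index and keeps only the pairs
# that do NOT align concordantly to it.
decontam_cmd() {
  local sample=$1 index=$2 outdir=$3
  local indir=06_data_not_aligned/illumina   # input layout taken from the post

  # --un-conc-gz: gzip-compressed output of pairs that fail to align concordantly
  # -S /dev/null: discard the SAM alignments to the contaminant genome
  printf 'hisat2 -p 4 -x %s -1 %s -2 %s --un-conc-gz %s -S /dev/null\n' \
    "$index" \
    "$indir/${sample}_R1.not_aligned.fastq.gz" \
    "$indir/${sample}_R2.not_aligned.fastq.gz" \
    "$outdir/${sample}_R%.clean.fastq.gz"
}
```

For example, `decontam_cmd "$sample" db/ecoli_index 07_clean` prints the command for one sample (`07_clean` is a made-up directory), and piping that to `bash` runs it, provided the paths contain no spaces. Note that "fail to align concordantly" still keeps pairs where only one mate hit the bacterial genome, so this is a lenient filter.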

At some point you should set aside these suspected contaminant reads and go on with the transcriptome you have already put together. You probably have more interesting biology to discover there.

Since a lot of my reads mapped to the genome, I'm sure a lot of interesting results will come up. However, we want to build a more complete transcriptome to be used in future studies.


However, we want to build a more complete transcriptome to be used in future studies.

It is easy for me to say this, so apologies in advance, but you will be best served by making additional libraries (perhaps from different life cycle stages, organs, etc.) rather than going after this small fraction of reads that did not map to your genome in the first place.

3.7 years ago

Maybe this paper will help you :)

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. DOI: https://doi.org/10.1186/s13059-020-02023-1 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02023-1)
