Hi all!
I am sure I am not the first person with this problem and do not want to reinvent the wheel here, therefor I would like to know if someone has a solution for the following problem:
Based on the contamination report I got from the NCBI contamination screen after trying to upload my transcriptome assembly there are some sequences that need to be trimmed. One example from the report looks like this:
"sequence name" 350 1..109 Pseudomonas putida W619
Meaning that bp 1 - 109 is of suspicious origin and needs to be trimmed from this sequence that is 350 bp in total. Is there a program or script known to you that reads in both the contamination report (can be edited so it only includes lines as the example) and the fasta file and spits out a fasta with some of the secuences trimmed?
Or do I need to write one myself (not so experienced)..?
Thanks in advance!
NCBI's contamination screen report is probably done using some kind of automated pipeline. Do you have reason to believe that observation in it are true/real? It is possible that your organism may have the same sequence in its genome.
I do not have a reference genome so I can never be certain about this, only very careful. My organism is an insect so if a certain region is hitting a Pseudomonas I am quit critical and I would trim this down. Also, when I ran some of these reported sequences randomly through blast I see that region 110-350 is nicely aligned to an insect gene, but the rest Pseudomonas. And this is for all the suspicious reads. Regions that were marked as "need to be deleted completely" are aligning to something different than an insect region when blasting, can be E. coli, mouse, human. I took the safe side here and deleted these completely.
Just for clarity, this is transcriptome data (RNAseq), meaning that some of the assembled transcripts are not necessarily full assembled genes.
Now I just picked out a sequence with a region hitting Pseudomonas. Fact is that the rest of the region does hit the insect's species gene. Furthermore, I also see that especially the regions that are advised to be deleted completely are really short transcripts (~200 bp) and not close to a full gene. Probably they are hitting another species, vector or even strange (not used) adapters simply due to chance..
I always like to stay on the safe side and rather delete suspicious reads (that are apparently not picked up as insect's own anyway, hence useless for the transcriptome analyses)..
But what would you advice..? And my original question still stands, is there a program out there where I can trim a predefined region from a batch of sequences? (with the region being different for each sequence..)
Thanks!
To answer your question you can create a bed intervals file that contains information in this format (by cutting out relevant info from your report, example below from your original)
Plus
Once you have these you can use
samtools faidx
,bedtools getfasta
orpyfaidx
with the intervals and your original dataset to extract just the sequences you want to keep.Can you tell me how to solve this feedback from NCBI as shown below?
I have trimmed adapters and removed contaminants and then mapped/splited fasta file as flowing command line .
Thank you in advance.
SUBID BioProject BioSample Organism
[] We ran your sequences through our Contamination Screen. The screen found contigs that need to be trimmed and/or excluded. Please adjust the sequences appropriately and then resubmit your sequences. After you remove the contamination, trim any Ns at the ends of the sequence and remove any sequences that are shorter than 200 nt and not part of a multi-component scaffold.
Note that hits in eukaryotic genomes to mitochondrial sequences can be ignored when specific criteria are met. Those criteria are explained below.
Note that mismatches between the name of the adaptor/primer identified in the screen and the sequencing technology used to generate the sequencing data should not be used to discount the validity of the screen results as the adaptors/primers of many different sequencing platforms share sequence similarity.
[] Some of the sequences hit primers or adaptors used in Illumina or 454 or other sequencing strategies or platforms. Adaptor at the end of a sequence should be removed. However, if adaptors are present within sequences then you should strongly consider splitting the sequences at the adaptors because the adaptor sequence could have been the region of overlap, causing a misassembly.
Skipped 2 same as before; no new sequences to screen. 2 sequences with locations to mask/trim
Trim: Sequence name, length, span(s), apparent source
There is not enough information in this post for us to be able to help you. It looks like you seem to have mostly done the right thing with various
bbmap
commands but we have no idea what exactly you are doing. There is one exception.That command does not make sense. Why would you use
.fa
format for alignments, if that is what you are doing.Are you trying to submit just plain fasta data through the genome submission?
Yes, I am trying to submit plain fasta data through the genome submission to NCBI.
I submitted the genome using the following tree step yesterday and I am waiting for the feedback.
Where did this
genome
come from? i.e.r.fasta
. Did you assemble it yourself? If you did assemble the genome yourself, the trimming adapters/removing contaminants is the first step that needs to be done on raw data before one does an assembly.I received the following feedback from NCBI as shown below.
Can you tell me your suggestions ?
Thank you for your GenBank submission.
Before we continue to process the genome(s), we would like to know why this genome was marked as not circular.
Most bacterial genomes have a circular chromosome. Did you mark this as not circular because:
[1] there is a gap at the end between the last and first base
[2] the genome assembly was not circularized
[3] any possible overlap at the ends was not trimmed?
This information will help us design appropriate questions for the submission portal.
Please use
ADD COMMENT/ADD REPLY
when responding to existing posts to keep threads logically organized.SUBMIT ANSWER
is for new answers to original question.You have not answered questions I had asked in my responses. So there is not else to say.
You need to answer the questions NCBI is asking. We have no idea what exactly you did and what this data is from/about.