Hi all,
I've performed a Trinity assembly using a code similar to this one:
nohup /usr/local/bin/trinityrnaseq-v2.13.2/Trinity --seqType fq --max_memory 50G --left /usr/local/bin/trinityrnaseq-v2.13.2/sample_data/test_Trinity_Assembly/reads.left.fq.gz --right /usr/local/bin/trinityrnaseq-v2.13.2/sample_data/test_Trinity_Assembly/reads.right.fq.gz --CPU 6 &
The RNA sample is derived from blue crab tissue; however, it is enriched for RNA viruses with a special extraction protocol. In an ideal world, we would like for our resulting sample to contain only viral RNA. Nevertheless, about 25% of the sample still consists of blue crab (host) RNA. We figured this out by aligning the reads received from the sequencing company to the blue crab reference genome. Thus, I want to remove anything that resembles blue crab RNA sequences from the trinity_out_dir.Trinity.fasta file that Trinity produces (that contains all of the contigs within the assembly). Again, I have a blue crab reference genome, and I also have information about all of the reads that mapped to the blue crab reference genome using the Bowtie2 aligner (for example, I have a sorted.bam file for the alignment).
Given what I have, is there a way to remove most if not all of the RNA sequences that map back to the blue crab? I know that the Bowtie2 --un-conc argument allows you to write pairs that did not align to a reference, but I do not know how to incorporate this same concept in Trinity. Removing ALL host RNA or at least most from my assembly is the ultimate goal. Hope this makes sense- thanks in advance.
Relatively new to bioinformatics so sorry if I don't explain this too well
Hello, thank you so much for your response. I am now running a blastn of the sequences to NCBI nt. Here is an example of my code:
In an effort to reduce the tremendous amount of data that we have, I've limited the search to the #1 match. In the past, another student in my lab aligned the sequences using blastx to the nr database so I have code for that as well. She did the top 5 matches:
I have an excel sheet of the blastx results (blastn is still running) and when I sort based on percent identity, my best hits are to bacterial organisms or different crab species. Given that I don't want any host RNA in there or even any other microbes that are not viruses, I'll likely want to remove these transcripts. The thing is, I don't know how to automate the process. I know that there are plenty of non-viral sequences in my assembly, and I'd like to remove contamination by starting off with host RNA (and then bacteria, protozoans, and so on) but I do not know how to incorporate that into my code. I've never used NCBI taxonomy but can try to work with it now. Any example codes will be greatly appreciated, thank you so much.
Blastn, as well as blastx, has an option "sskingdom". It will output the superkingdom that the match belongs to. Then, you can open the produced table in Excel, sort it by superkingdoms and make a list of sequences that don't belong to Viruses. Then, you can remove these sequences from the FASTA file as suggested at How to delete seq_id + sequence from FASTA file matching specific seq_id .
Some additional notes:
1) I recommend to use "-max_hsps 1" in addition to "-max_target_seqs 1". Otherwise, if a transcript has several matches to different regions of a single sequence in NCBI nt, all these matches will be written separately.
2) You may accelerate the alignment if instead of taking all isoforms for each transcript you take only the major isoform. For example, you can find CDSs by Transdecoder and call the isoform with the longest CDS the major.
3) Considering top 5 matches instead of the best match may increase the accuracy of contamination removal. See my recent update to the first comment at How can I remove contaminants from an assembled genome? . Sometimes, though not often, I see sequences where the first match belongs to one taxon, but the other four matches to a very different taxon. This probably means incorrect taxonomic annotations of some sequences in NCBI nt.