Hi,
I have assembled a de novo transcriptome from my species of interest. The species has an internal symbiont (algae) and those sequences obviously we sequenced along with it. I have done a blast on my assembly which has ID'd a number of contigs corresponding to the symbiont. I would like to use the results of the blast to remove those contigs from the assembly into a new file so I can deal with the two organisms separately. It there an easy way to do this?
The assembly contains ~350,000 contigs.
Thanks
Thanks for the advice,
Unfortunately I'm having issues trying to install FAST on my system (issues with lib paths, etc). Is there another way to do this using shell scripts, basic perl or python?
With the fastx_toolkit, you can format your fasta-files in such a way, that you have a tab-separated format:
From this, you can easily use the standard grep and reformat it with awk/perl afterwards.
Hi,
I am having an issue with this. I finally managed to get the FAST module installed and working so I tried your script.
I recovered the IDs from the BLAST output using cut to give the following (I'm using your file names for simplicity's sake)
There are (
cat id-list.txt |wc -l
) = 17274 linesI then used this as my
id-list.txt
in the code you suggested aboveThis script runs and runs and eventually I killed it and it gave the following files as output:
As you can see the output files are considerably larger than the input file and contain many more (obviously repeated) sequences!
I'm assuming that the problem is with the IDs as contained in the
id-list.txt
file. What format do they need to be in? -I have them as a single column. They are unique for each contig in themy_contig.fa
file.Help!
To further clarify, my contigs are the output of a trinity run and so the fasta files have the following format:
(Sequence removed for the sake of clarity).
I'm sorry, the second command will duplicate the
my_contig.fa
in each step (except the entry of the ID in the loop). Thus, this will not work.The problem with your IDs might be that the '|' could be interpreted as a logical or. Searching in case of
TR31996|c0_g1_i1
forTR31996
orc0_g1_i1
. Try to replace the|
with a non-regex character, like_
.If this works, the next step would be to generate a second ID list with all the IDs, which were not part of the first ID list.
I'm not sure what you are saying here? Are you saying that the script as written:
will not work. Or that it will work if I replace the '|' in my IDs. I've already removed all the extraneous info other than the unique ID using
sed
so it should be simple to replace it with another character. I would prefer to use-
rather than_
given that_
is already being used in the name. If I do this should the script work?Looping over the id-list and using
fasgrep -v
each time will generate a very huge file. Each iteration will add the whole file but one entry. Thus, the resulting file would be 17274 times 278MB minus once the entries from the id list. Therefore:will work.
Replacing the
|
with a-
worked in your given example (with the first fasgrep).Hi,
Yep, I figured out the problem with the
-v
option. I managed to great an id list with just the host by using a combination ofcat
,sort
anduniq
commands and also replaced the|
with-
(and simplified the ID line down by removing everything except the unique ID. If nothing else this exercise has made me much better withsed
...The issue I'm finding is the length of time this process is taking. I know it is not multi-threaded but each command has been running for over 24hrs on a separate core and has only sorted 6-7,000 items each. At this rate the full file will take 27 days to run! I can split the files and run them concurrently but there must be a more efficient way to run this?