Question

Subsetting fasta files based on blast results

0

Entering edit mode

9.3 years ago

dpearton • 0

Hi,

I have assembled a de novo transcriptome from my species of interest. The species has an internal symbiont (algae) and those sequences obviously we sequenced along with it. I have done a blast on my assembly which has ID'd a number of contigs corresponding to the symbiont. I would like to use the results of the blast to remove those contigs from the assembly into a new file so I can deal with the two organisms separately. It there an easy way to do this?

The assembly contains ~350,000 contigs.

Thanks

RNA-Seq blast • 2.5k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.3 years ago by dpearton • 0

Ram · Answer 1 · 2015-09-23

0

Entering edit mode

9.3 years ago

michael.ante ★ 3.9k

Hi,

You can store the IDs of interest in a file and loop over them using fasgrep (from FAST):

for I in $(cat id-list.txt); do fasgrep $i my_contig.fa >> symbiont.fasta ; fasgrep -v $i my_contig.fa >> host.fasta ; done

Cheers,
Michael

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.3 years ago by michael.ante ★ 3.9k

0

Entering edit mode

Thanks for the advice,

Unfortunately I'm having issues trying to install FAST on my system (issues with lib paths, etc). Is there another way to do this using shell scripts, basic perl or python?

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by dpearton • 0

0

Entering edit mode

With the fastx_toolkit, you can format your fasta-files in such a way, that you have a tab-separated format:

fasta_formatter -t -i my_contig.fa -o my_contig.tsv

Input Example:
   >MY-ID
   AAAAAGGGGG
   CCCCCTTTTT
   AGCTN

Output example with tabular output [-t]:
   MY-ID        AAAAAGGGGGCCCCCTTTTAGCTN

From this, you can easily use the standard grep and reformat it with awk/perl afterwards.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by michael.ante ★ 3.9k

0

Entering edit mode

Hi,

I am having an issue with this. I finally managed to get the FAST module installed and working so I tried your script.

I recovered the IDs from the BLAST output using cut to give the following (I'm using your file names for simplicity's sake)

$ head -5 id-list.txt
TR75917|c0_g1_i3
TR31996|c0_g1_i1
TR18920|c0_g1_i2
TR90390|c0_g1_i1
TR42315|c0_g1_i1

There are (cat id-list.txt |wc -l) = 17274 lines

I then used this as my id-list.txt in the code you suggested above

for I in $(cat id-list.txt); do fasgrep $i my_contig.fa >> symbiont.fasta ; fasgrep -v $i my_contig.fa >> host.fasta ; done

This script runs and runs and eventually I killed it and it gave the following files as output:

-rw-r--r-- 1  707115619 Sep 30 15:06 host.fasta
-rw-r--r-- 1     291823 Sep 30 15:01 id-list.txt
-rw-r--r-- 1  292214587 Sep 30 15:00 my_contig.fa
-rw-r--r-- 1  161001520 Sep 30 15:05 symbiont.fasta

As you can see the output files are considerably larger than the input file and contain many more (obviously repeated) sequences!

I'm assuming that the problem is with the IDs as contained in the id-list.txt file. What format do they need to be in? -I have them as a single column. They are unique for each contig in the my_contig.fa file.

Help!

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by dpearton • 0

0

Entering edit mode

To further clarify, my contigs are the output of a trinity run and so the fasta files have the following format:

>TR1|c0_g1_i1 len=581 path=[584:0-374 585:375-398 586:399-580] [-1, 584, 585, 586, -2]
AGCTGTTTGGCCAAGGCTGCGGCCTGGTGGCAGCCTTGCGAGAGCAAGGGCAGCAAGGGC
AGCAAGGTGCTCCCCGAATCGCTTCCGGCGAGTCCTAGGAGTCCCTGCAGTCAGCGCAAG
>TR2|c0_g1_i1 len=324 path=[327:0-99 328:100-123 329:124-323] [-1, 327, 328, 329, -2]
TTTCAAAGTTTTGCGGAAAACCGCAAAGTTTCCCTAGAATTTTCGGTGCTCACTTTTATC

(Sequence removed for the sake of clarity).

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by dpearton • 0

0

Entering edit mode

I'm sorry, the second command will duplicate the my_contig.fa in each step (except the entry of the ID in the loop). Thus, this will not work.

The problem with your IDs might be that the '|' could be interpreted as a logical or. Searching in case of TR31996|c0_g1_i1 for TR31996 or c0_g1_i1 . Try to replace the | with a non-regex character, like _.

If this works, the next step would be to generate a second ID list with all the IDs, which were not part of the first ID list.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by michael.ante ★ 3.9k

0

Entering edit mode

I'm not sure what you are saying here? Are you saying that the script as written:

for I in $(cat id-list.txt); do fasgrep $i my_contig.fa >> symbiont.fasta ; fasgrep -v $i my_contig.fa >> host.fasta ; done

will not work. Or that it will work if I replace the '|' in my IDs. I've already removed all the extraneous info other than the unique ID using sed so it should be simple to replace it with another character. I would prefer to use - rather than _ given that _ is already being used in the name. If I do this should the script work?

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by dpearton • 0

0

Entering edit mode

Looping over the id-list and using fasgrep -v each time will generate a very huge file. Each iteration will add the whole file but one entry. Thus, the resulting file would be 17274 times 278MB minus once the entries from the id list. Therefore:

for I in $(cat id-list.txt); do fasgrep $i my_contig.fa >> symbiont.fasta ; done

will work.

Replacing the | with a - worked in your given example (with the first fasgrep).

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by michael.ante ★ 3.9k

0

Entering edit mode

Hi,

Yep, I figured out the problem with the -v option. I managed to great an id list with just the host by using a combination of cat, sort and uniq commands and also replaced the | with - (and simplified the ID line down by removing everything except the unique ID. If nothing else this exercise has made me much better with sed...

The issue I'm finding is the length of time this process is taking. I know it is not multi-threaded but each command has been running for over 24hrs on a separate core and has only sorted 6-7,000 items each. At this rate the full file will take 27 days to run! I can split the files and run them concurrently but there must be a more efficient way to run this?

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by dpearton • 0