Extracting sequences from a big FASTA file using the IDs of the sequences to be excluded
9.2 years ago
hasche89 • 0

I have a huge FASTA file, around 20 GB in size. I also have a text file listing some sequence IDs from that same FASTA file. I want to retrieve all the sequences whose IDs are not in the text file.

How should I proceed? I use Ubuntu 12. I am a novice and have very little knowledge of bash, shell scripting, or Perl. Any Linux, Samtools, or BioPerl command would be helpful.

Thanks.

RNA-Seq samtools faidx bioperl perl • 5.3k views
9.2 years ago
thackl ★ 3.0k

This would work:

git clone https://github.com/BioInf-Wuerzburg/SeqFilter.git
cd SeqFilter
make  # just fetches some libraries, no root or anything required

bin/SeqFilter big.fasta --ids idx.txt --ids-exclude --out big-filtered.fasta
9.2 years ago

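A simple way is to first build the list of IDs you do want to fetch from the FASTA file, i.e. every ID that is not in your exclusion list (Ids.txt). This can be done with grep. The sed step keeps only the ID (the part of the header before the first whitespace), and -F -x make the last grep match whole IDs literally, so that excluding seq1 does not also drop seq10:

grep "^>" input.fasta | sed 's/^>//; s/[[:space:]].*//' | grep -v -F -x -f Ids.txt > retreive_IDs.txt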

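Then you could extract those sequences with something like pyfaidx or samtools. Passing the IDs through xargs, rather than putting them all on one command line, avoids the "argument list too long" error with a very long ID list:

xargs samtools faidx input.fasta < retreive_IDs.txt > output.fa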
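
If you would rather not install anything extra, a single pass with awk can also do the exclusion directly. This is only a sketch: the file names (demo.fasta, demo_exclude.txt, demo_filtered.fasta) are made up for illustration, and it assumes one ID per line in the exclusion file and that the ID is the first word of each header line.

```shell
# Toy inputs so the sketch can be run as-is (file names are hypothetical).
printf '>seq1 first\nACGT\nGGCC\n>seq2 second\nTTTT\n>seq10 third\nCCCC\n' > demo.fasta
printf 'seq1\n' > demo_exclude.txt

# First file (NR == FNR): remember every ID to exclude.
# Second file: on each header line, take the first word after ">" as the ID
# and decide whether to keep the record; then print every line of kept records.
awk '
    NR == FNR { exclude[$1] = 1; next }
    /^>/      { id = substr($1, 2); keep = !(id in exclude) }
    keep
' demo_exclude.txt demo.fasta > demo_filtered.fasta

cat demo_filtered.fasta
```

This reads the FASTA file once and needs no index, but it holds all excluded IDs in memory. Note that the NR == FNR idiom misbehaves if the exclusion file is empty, so check for that case first.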

You could also use faSomeRecords (from the UCSC Kent utilities):

./faSomeRecords input.fa retreive_IDs.txt output.fa

Thanks for the commands.

I am a beginner in this field. Could you please explain what each component of your command does?

Thanks


Run each command on its own and look at its output; you will quickly see what each one is doing.
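
For anyone who wants to try that on something small first, here is a sketch that runs the stages of the ID-list pipeline one at a time on toy input. The file names and contents (toy.fasta, toy_exclude.txt, stage1.txt, and so on) are invented for illustration, and the last grep uses -F -x so that whole IDs are compared literally.

```shell
# Toy inputs (hypothetical names and contents) so each stage can be inspected.
printf '>seq1 first\nACGT\n>seq2 second\nTTTT\n>seq10 third\nCCCC\n' > toy.fasta
printf 'seq1\n' > toy_exclude.txt

# Stage 1: keep only the header lines, i.e. those starting with ">".
grep '^>' toy.fasta > stage1.txt

# Stage 2: strip the leading ">" and everything from the first whitespace on,
# leaving the bare sequence ID.
sed 's/^>//; s/[[:space:]].*//' stage1.txt > stage2.txt

# Stage 3: -f reads the patterns from toy_exclude.txt, -v inverts the match
# (print lines that do NOT match), -F treats patterns as plain strings and
# -x requires a pattern to match the whole line.
grep -v -F -x -f toy_exclude.txt stage2.txt > toy_keep_ids.txt

# Show the result of every stage in order.
cat stage1.txt stage2.txt toy_keep_ids.txt
```

The final file holds the IDs to keep, one per line, which is the form the extraction step expects.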

9.2 years ago

Boy, this really comes up a lot. Using the BBMap package:

filterbyname.sh in=file.fasta out=filtered.fasta names=names.txt include=f

Always important to keep busy ;)
