Remove sequences from a big fasta file iteratively
9.1 years ago
RH • 0

I want to remove one sequence at a time from a big FASTA file, producing one output file per removed sequence. I got the following code from an online forum. It works when I type the sequence ID in directly, but not when I run it in a bash loop with the ID in a variable.

cat myfile.fasta | awk '{if (substr($0,1) == "${line}") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > ${line}.fasta

Can anyone help in this regard?

fasta • 2.6k views

You can easily achieve it with Biopython SeqIO.


What is ${line}?

Shell variables are not expanded inside a single-quoted awk program, so you should pass them to awk with -v, e.g.:

a=1337; awk -v myvar="$a" 'BEGIN{print myvar}'
1337
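Applied to the command in the question, the same idea might look like the sketch below. It assumes the IDs are stored one per line, without the leading ">", in a file called ids.txt (a hypothetical name), and that each header line is just ">" followed by the ID. The tiny myfile.fasta is made up here only so the example can be run as-is.

```shell
#!/usr/bin/env bash
# Hypothetical sample data so the sketch is runnable; replace with your own files.
printf '>seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > myfile.fasta
printf 'seq1\nseq3\n' > ids.txt

# For each ID, write a copy of the FASTA file without that one record.
while IFS= read -r line; do
    awk -v id="$line" '
        /^>/ { censor = ($0 == (">" id)) }   # at every header, decide whether to skip this record
        !censor                              # print every line of the records that are not skipped
    ' myfile.fasta > "${line}.fasta"
done < ids.txt
```

Each output file is named after the ID that was left out, so seq1.fasta holds every record except seq1. If your headers carry a description after the ID, the exact comparison against $0 would need to be loosened, for example by comparing against $1 instead.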
9.1 years ago
venu 7.1k

If you already have the sequence IDs that should be deleted from the original file, you can do what @Ram suggested (creating a subset of the large file) with the following one-liners:

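# List the IDs in the original file (strip the leading '>' and any description after the ID)
grep '^>' original_fasta.fa | sed 's/^>//; s/[[:space:]].*//' > original_ids.txt
# Keep the IDs that are not in ids_to_delete.txt (assumed: one ID per line, without '>')
grep -F -x -v -f ids_to_delete.txt original_ids.txt > ids_to_keep.txt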

With the IDs in ids_to_keep.txt, you can extract the subset of sequences with faSomeRecords. Take a look at the following link:

A: perl code to extract sequences from multi-line fasta works on all test files but

9.1 years ago
Ram 44k

Why do you wish to delete content from a file? From what I can guess, you either wish to extract a subset from a larger file or split the larger file into smaller files.

Also, the script above looks like it extracts sequence by ID to a new file - the ${line} is the variable containing the ID of the sequence. Therefore, loops would only work if you have a set of IDs the variable can iterate over.


Yes, it is kind of subsetting the larger fasta file. Is there a better way of doing it with awk or a shell script?

I set the sequence IDs as the loop variable, but somehow it doesn't work with the script above.
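For removing a whole list of IDs in a single pass with awk alone, something like the following sketch could work. It assumes that ids_to_remove.txt holds one ID per line without ">", and that an ID is the first whitespace-delimited word of a header line. The file names and sample records are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample data so the sketch is runnable; replace with your own files.
printf '>seq1 first record\nACGT\nACGT\n>seq2\nGGCC\n>seq3 third\nTTAA\n' > big.fasta
printf 'seq2\n' > ids_to_remove.txt

awk '
    NR == FNR { remove[$1]; next }                        # first file: remember every ID to drop
    /^>/ { id = substr($1, 2); skip = (id in remove) }    # header: strip ">" and look the ID up
    !skip                                                 # print only lines of records that are kept
' ids_to_remove.txt big.fasta > smaller.fasta
```

This reads the FASTA file once regardless of how many IDs are removed, which matters for a big file. One caveat: the NR == FNR idiom assumes the ID file is not empty, otherwise the FASTA file itself is treated as the ID list.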

9.1 years ago

You can use the --invert-match option with pyfaidx and a little bash:

$ pip install pyfaidx
$ faidx -v big.fasta $(tr '\n' ' ' < ids_to_remove.txt) > smaller.fasta