Question

Remove sequences from a big fasta file iteratively

9.5 years ago

RH • 0

I want to delete one sequence at a time from a big FASTA file. I got the following code from an online forum. It works when I specify the sequence ID directly, but it does not seem to work when I put it in a bash loop.

cat myfile.fasta | awk '{if (substr($0,1) == "${line}") censor=1; else if (substr($0,1,1) == ">") censor=0; if (censor==0) print $0}' > ${line}.fasta

Can anyone help in this regard?

fasta • 2.9k views

Updated 2.7 years ago by Ram 45k • written 9.5 years ago by RH • 0

You can easily achieve it with Biopython SeqIO.

9.5 years ago by Pappu ★ 2.1k

What is ${line}?

You should pass shell variables to awk with -v, because the shell does not expand ${line} inside a single-quoted awk program, e.g.:

a=1337; awk -v myvar="$a" 'BEGIN{print myvar}'
1337
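
Applied to the loop from the question, that could look like the minimal sketch below. It is not the original poster's script: the file names demo.fasta and demo_ids.txt and the demo_without_ prefix are made up for illustration, and it assumes the ID list holds one ID per line without the leading ">". It compares only the first word of each header, so headers that carry a description after the ID still match.

```shell
#!/usr/bin/env bash
# Sketch: for each ID, write a copy of the FASTA file without that record.
# All file names here are hypothetical demo names.
set -euo pipefail

# Small made-up inputs so the example runs on its own.
printf '>seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > demo.fasta
printf 'seq1\nseq3\n' > demo_ids.txt   # IDs without the leading ">"

while IFS= read -r line; do
    # -v copies the shell variable into the awk variable "id".
    awk -v id="$line" '
        /^>/ { censor = ($1 == ">" id) }  # on a header: censor only the matching record
        !censor                           # print every line of a record that is kept
    ' demo.fasta > "demo_without_${line}.fasta"
done < demo_ids.txt
```

Each pass rereads the whole input, so for many IDs on a big file a single pass over the file is likely to be much faster than a loop like this.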

Updated 5.4 years ago by Ram 45k • written 9.5 years ago by 5heikki 11k

Ram · Answer 1 · 2015-10-26

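If you already have the sequence IDs that should be deleted from the original file, you can do what @Ram suggested (create a subset of the large file) with the following one-liners. Because grep -x matches whole lines, ids_to_delete.txt has to contain the full header lines, including the leading ">":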

grep '^>' original_fasta.fa > original_ids.txt
grep -F -x -v -f ids_to_delete.txt original_ids.txt > ids_to_keep.txt

With the IDs in ids_to_keep.txt, you can extract the subset of sequences with faSomeRecords. Take a look at the following link:

A: perl code to extract sequences from multi-line fasta works on all test files but
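
If faSomeRecords is not at hand, the extraction step can probably be done with awk alone. The following is only a sketch under stated assumptions, not the method from the linked post: the demo_ file names are made up, and it assumes the keep list contains full header lines exactly as grep '^>' printed them.

```shell
#!/usr/bin/env bash
# Sketch: keep only the records whose header line appears in the keep list.
# All demo_ file names are hypothetical.
set -euo pipefail

# Small made-up inputs so the example runs on its own.
printf '>seq1 first\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > demo_original.fa
printf '>seq2\n' > demo_ids_to_delete.txt   # full header lines, including ">"

# Same two grep steps as above, applied to the demo files.
grep '^>' demo_original.fa > demo_original_ids.txt
grep -F -x -v -f demo_ids_to_delete.txt demo_original_ids.txt > demo_ids_to_keep.txt

# First file: remember every header to keep.
# Second file: print lines only while the latest header was a remembered one.
# Caveat: the NR == FNR idiom assumes the keep list is not empty.
awk '
    NR == FNR { keep[$0] = 1; next }
    /^>/      { printing = ($0 in keep) }
    printing
' demo_ids_to_keep.txt demo_original.fa > demo_subset.fasta
```

This reads the FASTA file once regardless of how many IDs are removed, and it handles multi-line sequences because the keep/skip decision is only updated on header lines.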

Ram · Answer 2 · 2015-10-25

9.5 years ago

Ram 45k

Why do you wish to delete content from a file? From what I can guess, you either wish to extract a subset from a larger file or split the larger file into smaller files.

Also, the script above looks like it extracts a sequence by ID to a new file; ${line} is the variable containing the ID of the sequence. Therefore, a loop would only work if you have a set of IDs for the variable to iterate over.

Yes, it is a kind of subsetting of the larger FASTA file. Is there a better way of doing it with awk or a shell script?

I had set the sequence IDs as the loop variable, but somehow it doesn't work with the above script.

Updated 5.4 years ago by Ram 45k • written 9.5 years ago by RH • 0

Ram · Answer 3 · 2015-10-26

9.5 years ago

Matt Shirley 10k

You can use the --invert-match option with pyfaidx and a little bash:

$ pip install pyfaidx
$ faidx -v big.fasta $(tr '\n' ' ' < ids_to_remove.txt) > smaller.fasta

Updated 5.4 years ago by Ram 45k • written 9.5 years ago by Matt Shirley 10k