Fasta file filtering
6.5 years ago
Janey ▴ 30

Hi my friends

My question may be very simple, but I'm not familiar with programming. I use the following command to extract sequences from a FASTA file. How can I write a similar command to remove sequences from a FASTA file by their sequence IDs?

cut -c 2- ID.text | xargs -n 1 samtools faidx in.fasta > out.fasta

Thanks for your help

RNA-Seq • 10k views

Are these FASTA files flattened, i.e. is each sequence on a single line after its ID? If so, you can use: grep -A 1 -w <ID> input.fasta

Example output:

$ grep -A 1 -w 'cde' test.fa
>cde
atgcatgcNNN

input:

$ cat test.fa
>abc
agtgcNNNN
>cde
atgcatgcNNN
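
If the file is not flattened, one way to get there is to join the sequence lines of each record first. A minimal awk sketch, assuming every record starts with a > header line; the multi-line test.fa and the output name flat.fa are made up for illustration:

```shell
# Work in a throwaway directory; test.fa is an invented multi-line example.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '>abc\nagtgc\nNNNN\n>cde\natgcat\ngcNNN\n' > test.fa

# Header lines pass through unchanged; the sequence lines of each record
# are concatenated and printed as a single line.
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' test.fa > flat.fa

# Each ID is now followed by exactly one sequence line.
grep -A 1 -w 'cde' flat.fa
```

After flattening, grep -A 1 returns the header plus the complete sequence. Note that this extracts a record; it does not remove it.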

hi cpad0112

my fasta file is like this:

c50249_g1_i2
ATCAAAAATAGTTTCAGTTTGTGAAAAAATTCATTCTCTCAATCTCTTGTTTTACTTTTG
AATTATAAACTCGAGGCAAAGAAAAATGTTCATTCAAGAAGATTGATACCCAGTGTGCTC
ATAATGAAGAAGAAATTCGAAAGTAGAACATTCGGTATTTGGTCAAAGAAGAAATGTCGT
CTTCTCAAGTTACAGCTGTTCCAACTTCACCACCAAAAACTATTCGTCATACTGCTGATT
TCCATCCCACAATATGGGGAGATCAATTCCTCAAAGATACTTCTGAACTCAAGGTAATTA
CACGTACACACTCTATATGAGATTAAATTTCTAAATGCACCCACCCTATGCATGCACATC
ACAATAAACG

c50249_g1_i2
ATCAAAAATAGTTTCAGTTTGTGAAAAAATTCATTCTCTCAATCTCTTGTTTTACTTTTG
AATTATAAACTCGAGGCAAAGAAAAATGTTCATTCAAGAAGATTGATACCCAGTGTGCTC
ATAATGAAGAAGAAATTCGAAAGTAGAACATTCGGTATTTGGTCAAAGAAGAAATGTCGT
CTTCTCAAGTTACAGCTGTTCCAACTTCACCACCAAAAACTATTCGTCATACTGCTGATT
TCCATCCCACAATATGGGGAGATCAATTCCTCAAAGATACTTCTGAACTCAAGGTAATTA
CACGTACACACTCTATATGAGATTAAATTTCTAAATGCACCCACCCTATGCATGCACATC
ACAATAAACG

What is the solution to my problem??


Do you want to remove duplicate sequences? Both sequences look like duplicates to me. If you want to remove duplicates, try $ seqkit rmdup --quiet -i -n test.fa. If this is just an example file, then try $ grep -A 1 -w '> c50249_g1_i3' input.fa. This assumes that each sequence is on a single line directly after its ID.


I have a list of IDs whose sequences I want to remove, and my FASTA file is like the one above, with each sequence spanning several lines. Do you have a command for that?

$ seqkit grep -iv -f IDs.txt input.fa
6.5 years ago

Hello Janey,

you could use seqkit for this task.

$ seqkit grep -v -n -f id_list.txt in.fasta > out.fasta
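
If seqkit is not at hand, the same kind of filtering can be sketched with plain awk, which also copes with sequences that span several lines. This is only a sketch under a few assumptions: the ID is the first word of each header, id_list.txt holds one ID per line (with or without a leading >), the list is not empty, and the example records are invented:

```shell
# Work in a throwaway directory with made-up example data.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '>seq1 first\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\nTT\n' > in.fasta
printf 'seq2\n>seq3\n' > id_list.txt

# First file (NR == FNR): remember every ID to drop, ignoring a leading ">".
# Second file: on each header decide whether the record is kept, then print
# header and sequence lines only while keep is true.
awk 'NR == FNR { sub(/^>/, ""); if ($1 != "") drop[$1] = 1; next }
     /^>/ { id = substr($1, 2); keep = !(id in drop) }
     keep' id_list.txt in.fasta > out.fasta

cat out.fasta
```

Here only the seq1 record ends up in out.fasta. Removing the ! turns the same command into an extraction of the listed records.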

fin swimmer


hi finswimmer

I love your name. I downloaded seqkit and unzipped it, but it did not work. Do you have any idea how to activate it?


Does anyone have a simpler solution to this problem? Can anyone hear me?


Hello Janey,

please be more patient. We are all doing this in our free time, so don't expect a ready-to-use solution within a few minutes.

Which seqkit file did you download? What platform are you using (Windows, a Linux distribution, ...)? What do you mean by "did not work"? What exactly did you do, and what was the result?

fin swimmer


I downloaded the "seqkit_linux_amd64 (1).tar.gz" file for Linux and unpacked it.

I run this command:

./seqkit grep -v -n -f Tran_Cod.txt Totalassembly.fasta > out.fasta

my output file was empty.


If your output file is empty and you didn't get an error message, that means every ID in your FASTA file matches an ID in your Tran_Cod.txt. Could that be the case?

What happens if you remove the -v?

Could you please post the output of head Tran_Cod.txt?
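
To check this, the IDs in the FASTA file can be compared against the list directly. A rough sketch with standard tools, assuming the ID is the first word of each header; the example records are invented, and the sketch strips a leading > and Windows line endings from the list, since either can change how IDs compare:

```shell
# Work in a throwaway directory with made-up example data.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
export LC_ALL=C   # same collation for sort and comm
printf '>c1_g1_i1 len=8\nACGTACGT\n>c1_g1_i2 len=4\nACGT\n>c2_g1_i1 len=4\nGGCC\n' > Totalassembly.fasta
printf 'c1_g1_i1\nc2_g1_i1\nc9_g9_i9\n' > Tran_Cod.txt

# IDs present in the FASTA: first word of each header, without the ">".
grep '^>' Totalassembly.fasta | sed 's/^>//; s/[[:space:]].*//' | sort -u > fasta_ids.txt
# IDs in the list, normalised the same way.
tr -d '\r' < Tran_Cod.txt | sed 's/^>//; s/[[:space:]].*//' | sort -u > list_ids.txt

# comm needs sorted input: -12 keeps IDs found in both files,
# -23 keeps IDs found only in the FASTA file.
comm -12 fasta_ids.txt list_ids.txt > matched.txt
comm -23 fasta_ids.txt list_ids.txt > unmatched.txt

echo "FASTA IDs also in the list:"; cat matched.txt
echo "FASTA IDs not in the list:";  cat unmatched.txt
```

If unmatched.txt is empty, every record is on the list and an empty result after removal is expected.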


@janey: What is your OS? Which version did you download? This program does not need to be activated.

6.5 years ago
erwan.scaon ▴ 950

You can easily achieve this with seqtk.

Quoting from the manual:

Extract sequences with names in file name.lst, one sequence name per line:
seqtk subseq in.fq/fa name.lst > out.fq/fa
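
Note that subseq extracts the listed sequences. To remove them instead, one option is to first build the complementary list of IDs to keep. A sketch under the assumption that the ID is the first word of each header; in.fa, remove.lst and keep.lst are placeholder names, and the seqtk call only runs if seqtk is on the PATH:

```shell
# Work in a throwaway directory with made-up example data.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > in.fa
printf 'seq2\n' > remove.lst

# All IDs in the FASTA minus the ones listed for removal.
# grep -v -x -F -f drops every line that exactly equals (-x) one of the
# fixed strings (-F) read from remove.lst.
grep '^>' in.fa | sed 's/^>//; s/[[:space:]].*//' | grep -v -x -F -f remove.lst > keep.lst

cat keep.lst

# Extract the records that are left, if seqtk is installed.
if command -v seqtk >/dev/null 2>&1; then
    seqtk subseq in.fa keep.lst > out.fa
fi
```

With the example data, keep.lst contains seq1 and seq3, so out.fa holds every record except seq2.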
