Question

Fasta file filtering

0

6.5 years ago

Janey ▴ 30

Hi my friends

Maybe my question is very simple, but I'm not familiar with programming. I use the following command to extract sequences from a FASTA file. How can I write a similar command to remove sequences from a FASTA file by their sequence IDs?

cut -c 2- ID.text | xargs -n 1 samtools faidx in.fasta > out.fasta

Thanks for your help

RNA-Seq • 10k views

ADD COMMENT • link updated 6.5 years ago by erwan.scaon ▴ 950 • written 6.5 years ago by Janey ▴ 30

0

Are these FASTA files flattened, i.e. is the sequence on a single line after each ID? If so, you can use: grep -A 1 -w <ID> input.fasta

Example output:

$ grep -A 1 -w 'cde' test.fa
>cde
atgcatgcNNN

Input:

$ cat test.fa
>abc
agtgcNNNN
>cde
atgcatgcNNN
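
If the sequences span several lines, grep -A 1 returns only the first sequence line of each record. A minimal sketch, assuming awk is available, that flattens the file first so each record is exactly two lines (the file names multi.fa and flat.fa and the sample records are made up for illustration):

```shell
# Sketch: flatten a multi-line FASTA so every record is one header line
# followed by one sequence line. File names and records are hypothetical.
workdir=$(mktemp -d)
printf '>abc\nagtgc\nNNNN\n>cde\natgc\natgcNNN\n' > "$workdir/multi.fa"

# Header lines are printed as-is; sequence lines are concatenated and
# printed when the next header (or the end of the file) is reached.
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' "$workdir/multi.fa" > "$workdir/flat.fa"

# grep -A 1 now returns the complete record.
grep -A 1 -w 'cde' "$workdir/flat.fa"
```

The flattened file can then be searched exactly as in the example above.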

ADD REPLY • link 6.5 years ago by cpad0112 21k

0

Hi cpad0112,

My FASTA file looks like this:

c50249_g1_i2

ATCAAAAATAGTTTCAGTTTGTGAAAAAATTCATTCTCTCAATCTCTTGTTTTACTTTTG
AATTATAAACTCGAGGCAAAGAAAAATGTTCATTCAAGAAGATTGATACCCAGTGTGCTC
ATAATGAAGAAGAAATTCGAAAGTAGAACATTCGGTATTTGGTCAAAGAAGAAATGTCGT
CTTCTCAAGTTACAGCTGTTCCAACTTCACCACCAAAAACTATTCGTCATACTGCTGATT
TCCATCCCACAATATGGGGAGATCAATTCCTCAAAGATACTTCTGAACTCAAGGTAATTA
CACGTACACACTCTATATGAGATTAAATTTCTAAATGCACCCACCCTATGCATGCACATC
ACAATAAACG

c50249_g1_i2

ATCAAAAATAGTTTCAGTTTGTGAAAAAATTCATTCTCTCAATCTCTTGTTTTACTTTTG
AATTATAAACTCGAGGCAAAGAAAAATGTTCATTCAAGAAGATTGATACCCAGTGTGCTC
ATAATGAAGAAGAAATTCGAAAGTAGAACATTCGGTATTTGGTCAAAGAAGAAATGTCGT
CTTCTCAAGTTACAGCTGTTCCAACTTCACCACCAAAAACTATTCGTCATACTGCTGATT
TCCATCCCACAATATGGGGAGATCAATTCCTCAAAGATACTTCTGAACTCAAGGTAATTA
CACGTACACACTCTATATGAGATTAAATTTCTAAATGCACCCACCCTATGCATGCACATC
ACAATAAACG

What is the solution to my problem?

ADD REPLY • link 6.5 years ago by Janey ▴ 30

0

Do you want to remove the duplicate sequences? Both sequences look like duplicates to me. If you want to remove duplicates, try $ seqkit rmdup --quiet -i -n test.fa. If this is just an example file, then try $ grep -A 1 -w '>c50249_g1_i3' input.fa. This assumes that each sequence is on a single line directly after its ID.

ADD REPLY • link 6.5 years ago by cpad0112 21k

0

I have a list of IDs whose sequences I want to remove, and my FASTA file is like the one above, with each sequence spread over several lines. Do you have a command for that?

ADD REPLY • link 6.5 years ago by Janey ▴ 30

0

$ seqkit grep -iv -f IDs.txt input.fa

ADD REPLY • link 6.5 years ago by cpad0112 21k

score 1 · Answer 1 · 2018-06-06

1

6.5 years ago

finswimmer 16k

Hello Janey,

you could use seqkit for this task.

$ seqkit grep -v -n -f id_list.txt in.fasta > out.fasta

fin swimmer
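
If seqkit cannot be installed, plain awk can do the same kind of filtering. The following is only a sketch under a few assumptions: the ID list holds one ID per line without a leading ">", and a record's ID is the header text between ">" and the first whitespace. The sample records and the temporary directory are made up for illustration.

```shell
# Sketch: drop every FASTA record whose ID is listed in id_list.txt.
# Works on multi-line FASTA. Sample data below is hypothetical.
workdir=$(mktemp -d)
printf '>seq1 len=8\nACGT\nACGT\n>seq2 len=4\nGGCC\n>seq3 len=6\nTTA\nTTA\n' > "$workdir/in.fasta"
printf 'seq1\nseq3\n' > "$workdir/id_list.txt"

# First file (NR == FNR): strip a trailing carriage return, then remember
# each non-empty ID. Second file: on each header, decide whether the record
# is kept; the sequence lines that follow inherit that decision.
# Note: this idiom assumes the ID list is not an empty file.
awk 'NR == FNR { sub(/\r$/, ""); if ($1 != "") drop[$1]; next }
     /^>/      { id = substr($1, 2); keep = !(id in drop) }
     keep' "$workdir/id_list.txt" "$workdir/in.fasta" > "$workdir/out.fasta"

cat "$workdir/out.fasta"
```

With the sample data, only the seq2 record remains in out.fasta.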

ADD COMMENT • link 6.5 years ago by finswimmer 16k

0

hi finswimmer

I love your name. I downloaded seqkit and unzipped it, but it did not work. Do you have any idea how to activate it?

ADD REPLY • link 6.5 years ago by Janey ▴ 30

0

Does anyone have a simpler solution to this problem? Can anyone hear me?

ADD REPLY • link 6.5 years ago by Janey ▴ 30

1

Hello Janey,

Please be more patient. We are all doing this in our free time, so don't expect a ready-to-use solution within a few minutes.

Which file did you download from seqkit? What platform are you using (Windows, a Linux distribution, ...)? What do you mean by "did not work"? What did you do, and what was the result?

fin swimmer

ADD REPLY • link 6.5 years ago by finswimmer 16k

0

I downloaded the file "seqkit_linux_amd64 (1).tar.gz" for Linux and unzipped it.

I ran this command:

./seqkit grep -v -n -f Tran_Cod.txt Totalassembly.fasta > out.fasta

My output file was empty.

ADD REPLY • link 6.5 years ago by Janey ▴ 30

1

If your output file is empty and you didn't get an error message, it means that all IDs in your FASTA file match IDs in your Tran_Cod.txt. Could this be the case?

What happens if you remove the -v?

Could you please post the output of head Tran_Cod.txt?
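
One way to check this without seqkit is to count how many FASTA IDs also appear in the list. This is a rough sketch, assuming an ID is the header text between ">" and the first whitespace and that the list holds one ID per line. The file names are taken from the command above, but their contents here are invented sample data.

```shell
# Sketch: compare the IDs in a FASTA file with the IDs in a list.
# The sample contents are hypothetical.
workdir=$(mktemp -d)
printf '>seq1 len=8\nACGT\n>seq2 len=4\nGGCC\n>seq3 len=6\nTTA\n' > "$workdir/Totalassembly.fasta"
printf 'seq1\nseq3\nseq9\n' > "$workdir/Tran_Cod.txt"

# IDs present in the FASTA, sorted and de-duplicated.
grep '^>' "$workdir/Totalassembly.fasta" | cut -c 2- | awk '{ print $1 }' \
  | sort -u > "$workdir/fasta_ids.txt"
# IDs in the list, with Windows line endings and blank lines removed.
tr -d '\r' < "$workdir/Tran_Cod.txt" | awk 'NF { print $1 }' \
  | sort -u > "$workdir/list_ids.txt"

total=$(grep -c '' "$workdir/fasta_ids.txt")
# comm -12 prints only the lines common to both sorted files.
matched=$(comm -12 "$workdir/fasta_ids.txt" "$workdir/list_ids.txt" | grep -c '')
echo "FASTA IDs: $total, also in the list: $matched"
```

If both numbers are equal, every record is on the list, and an empty result from an inverted match is what you would expect.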

ADD REPLY • link 6.5 years ago by finswimmer 16k

0

@janey: What is your OS? Which version did you download? This program does not need to be activated.

ADD REPLY • link 6.5 years ago by cpad0112 21k

score 0 · Answer 2 · 2018-06-06

0

6.5 years ago

erwan.scaon ▴ 950

You can easily achieve this with seqtk.

Quoting from the manual:

Extract sequences with names in file name.lst, one sequence name per line:
seqtk subseq in.fq/fa name.lst > out.fq/fa
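
Note that the quoted command extracts the listed sequences rather than removing them. To remove sequences, one option is to build the complementary name list first and pass that to seqtk subseq. This is a sketch, not a tested recipe for every header format: it assumes names are the header text up to the first whitespace, and the file names remove.lst and keep.lst and the sample records are made up for illustration.

```shell
# Sketch: turn a "remove" list into a "keep" list for seqtk subseq.
# Sample data and file names are hypothetical.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$workdir/in.fa"
printf 'seq1\nseq3\n' > "$workdir/remove.lst"

# All sequence names in the FASTA, minus the names to remove.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
grep '^>' "$workdir/in.fa" | cut -c 2- | awk '{ print $1 }' \
  | grep -v -x -F -f "$workdir/remove.lst" > "$workdir/keep.lst"

# Run the extraction only if seqtk is installed.
if command -v seqtk >/dev/null 2>&1; then
  seqtk subseq "$workdir/in.fa" "$workdir/keep.lst" > "$workdir/out.fa"
fi
```

With the sample data, keep.lst contains only seq2, so only that record would be extracted.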

ADD COMMENT • link 6.5 years ago by erwan.scaon ▴ 950