Question

How to extract contigs that contain a specific sequence from a FASTA file

3

9.8 years ago

Paul ★ 1.5k

Dear all,

Do you have any idea how to easily extract the contigs that contain a specific sequence from a FASTA file?

For example:

My sequence:

ACCGTACCC

My FASTA:

>c1042
ACCGTACCC
>c1043
GCTACAGTTGAAAGGGGACCGTACCC
>c1044
ATGAATAAAATAATTTTGTATCATAAATCGAGCTGTTAATTATT
>c1044
TTCATATTTGTAGCTAAGCAGAGGCGAAGCGTTCTTGTATCG

My output:

>c1042
ACCGTACCC
>c1043
GCTACAGTTGAAAGGGGACCGTACCC

Thank you so much for any ideas and help.

fasta find extraction contig • 6.7k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Paul ★ 1.5k

0

Hello. Is there some way to do the same with biopython? Thanks

ADD REPLY • link 7.2 years ago by joselu ▴ 110

0

Please see @Devon Ryan's answer for suggestions with Biopython, or open a new question with examples and a description of what you have tried.

ADD REPLY • link 7.2 years ago by st.ph.n ★ 2.7k

0

This is not an answer. This should be a comment or a new post. If you're creating a new post, you should reference this post in addition to elaborating on what you've tried.

ADD REPLY • link 7.2 years ago by Ram 44k

1

7.2 years ago

Brian Bushnell 20k

Another option, using BBMap:

bbduk.sh in=file.fa out=unmatched.fa outm=matched.fa literal=ACCGTACCC mm=f k=9 rcomp=f

This optionally allows some number of mismatches, and matching reverse-complements (if rcomp=t), which are often helpful.
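To make the two options concrete, here is a minimal Python sketch of what "allowing mismatches" and "matching reverse complements" mean for a single query. It is only an illustration of the idea, not how BBDuk works internally (BBDuk is k-mer based), and the function names `reverse_complement`, `hamming_hit`, and `contains` are made up for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T/N only)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def hamming_hit(seq, query, max_mismatches=0):
    """True if any window of seq matches query with at most max_mismatches."""
    k = len(query)
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        mismatches = sum(1 for a, b in zip(window, query) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


def contains(seq, query, max_mismatches=0, rcomp=False):
    """Check the forward query and, optionally, its reverse complement."""
    if hamming_hit(seq, query, max_mismatches):
        return True
    return rcomp and hamming_hit(seq, reverse_complement(query), max_mismatches)
```

With the defaults this behaves like an exact substring test; setting `rcomp=True` also catches contigs that carry the query on the opposite strand.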

ADD COMMENT • link 7.2 years ago by Brian Bushnell 20k

score 4 · Accepted Answer · 2015-01-05

4

9.8 years ago

Ram 44k

You can use sed + grep (as suggested by NicoBxl and Devon) or BioPerl/BioPython (as suggested by Devon) or Heng Li's bioawk:

bioawk -c fastx '$seq ~ /ACCGTACCC/ { print ">"$name"\n"$seq; }' fasta.fa > out.fa  # might need a bit of tweaking

ADD COMMENT • link 2.6 years ago by Ram 44k

Ram · Accepted Answer · 2015-01-05

1

9.8 years ago

Nicolas Rosewick 11k

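grep -B1 --no-group-separator "ACCGTACCC" fasta.fa > out.fa  # --no-group-separator (GNU grep) drops the "--" lines printed between non-adjacent matches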

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Nicolas Rosewick 11k

1

It should be noted that this won't work if there are multi-line entries (though one could use sed to reformat things to get around that).
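As an alternative to reformatting with sed, here is a rough standard-library Python sketch that joins wrapped sequence lines before searching, so a match that spans a line break is still found. The helper names `read_fasta` and `matching_records` are made up for this example, and the wrapped `c1042` record is a hypothetical variant of the one in the question.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def matching_records(lines, query):
    """Return the records whose full sequence contains query."""
    return [(name, seq) for name, seq in read_fasta(lines) if query in seq]


# Hypothetical input: c1042 from the question, wrapped across two lines.
example = """>c1042
ACCGT
ACCC
>c1044
ATGAATAAAATAATTTTG
""".splitlines()

for name, seq in matching_records(example, "ACCGTACCC"):
    print(f">{name}\n{seq}")
```

For a real file, passing an open file handle (for example `open("fasta.fa")`) in place of `example` should work the same way, since the parser only iterates over lines.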

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Devon Ryan 104k

0

Of course. For that you could use fasta_formatter from the FASTX-Toolkit.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Nicolas Rosewick 11k

1

9.8 years ago

Devon Ryan 104k

Use Biopython or BioPerl. With Biopython, you could either use the re module or simply call find() on the str() representation of each sequence. Either approach should be simple enough if you're familiar with Perl or Python.
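A small sketch of the two approaches on plain strings, using only the standard library so it runs without Biopython installed. The `records` list is a hypothetical stand-in: with Biopython, the same pairs would come from `(record.id, str(record.seq))` for each record returned by `SeqIO.parse(path, "fasta")`. The looser regular expression is an assumed example, not something the question asked for.

```python
import re

# Hypothetical in-memory records taken from the question's example FASTA.
records = [
    ("c1042", "ACCGTACCC"),
    ("c1043", "GCTACAGTTGAAAGGGGACCGTACCC"),
    ("c1044", "ATGAATAAAATAATTTTGTATCATAAATCGAGCTGTTAATTATT"),
]

query = "ACCGTACCC"

# str.find() returns the 0-based offset of the first hit, or -1 if absent.
hits_find = {
    name: seq.find(query) for name, seq in records if seq.find(query) != -1
}

# re allows looser patterns, e.g. any base at the fifth position.
pattern = re.compile("ACCG[ACGT]ACCC")
hits_re = [name for name, seq in records if pattern.search(seq)]

print(hits_find)
print(hits_re)
```

`find()` is enough for an exact motif and also tells you where the hit starts; `re` is worth the extra step only when the motif has ambiguous positions or you need every match.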

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Devon Ryan 104k

1

9.8 years ago

Matt Shirley 10k

In this case I've changed the duplicate defline since pyfaidx requires unique sequence ids. You could mangle all the key names by passing your own key_function if you like.

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Matt Shirley 10k