How to extract contigs from FASTA file which contains specific sequence
5
3
Entering edit mode
10.0 years ago
Paul ★ 1.5k

Dear all,

Do you have any idea how to easy extract contigs from fasta file wich contains specific sequence?

For example:

My sequence:

ACCGTACCC

My FASTA:

>c1042
ACCGTACCC
>c1043
GCTACAGTTGAAAGGGGACCGTACCC
>c1044
ATGAATAAAATAATTTTGTATCATAAATCGAGCTGTTAATTATT
>c1044
TTCATATTTGTAGCTAAGCAGAGGCGAAGCGTTCTTGTATCG

My output:

>c1042
ACCGTACCC
>c1043
GCTACAGTTGAAAGGGGACCGTACCC

Thank you so much for any ideas and help.

fasta find extraction contig • 6.8k views
ADD COMMENT
0
Entering edit mode

Hello. Is there some way to do the same with biopython? Thanks

ADD REPLY
0
Entering edit mode

Please see @Devon Ryan answer for suggestions with biopython, or open a new question, with examples, and what you have tried.

ADD REPLY
0
Entering edit mode

This is not an answer. This should be a comment or a new post. If you're creating a new post, you should reference this post in addition to elaborating on what you've tried.

ADD REPLY
4
Entering edit mode
10.0 years ago
Ram 44k

You can use sed + grep (as suggested by NicoBxl and Devon) or BioPerl/BioPython (as suggested by Devon) or Heng Li's bioawk:

bioawk -c fastx '$seq ~ /ACCGTACCC/ { print ">"$name"\n"$seq; }' #might need a bit of tweaking
ADD COMMENT
1
Entering edit mode
10.0 years ago
cat fasta.fa | grep -B1 "ACCGTACCC" > out.fa
ADD COMMENT
1
Entering edit mode

It should be noted that this won't work if there are multi-line entries (though one could use sed to reformat things to get around that).

ADD REPLY
0
Entering edit mode

of course. for that you could use fastx (fasta-formatter)

ADD REPLY
1
Entering edit mode
10.0 years ago

Use biopython or bioperl. With biopython, you could either use the re module or even just find() on the str() representation of each sequence. Either of these should be simple enough if you're familiar with either perl or python.

ADD COMMENT
1
Entering edit mode
10.0 years ago

In this case I've changed the duplicate defline since pyfaidx requires unique sequence ids. You could mangle all the key names by passing your own key_function if you like.

ADD COMMENT
1
Entering edit mode
7.4 years ago

Another option, using BBMap:

bbduk.sh in=file.fa out=unmatched.fa outm=matched.fa literal=ACCGTACCC mm=f k=9 rcomp=f

This optionally allows some number of mismatches, and matching reverse-complements (if rcomp=t), which are often helpful.

ADD COMMENT

Login before adding your answer.

Traffic: 2055 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6