Question

How to extract sequence from fasta by sequence similarity

8.1 years ago

Janey ▴ 30

Hello, I have two FASTA files with different IDs, which belong to two genotypes. The first FASTA file consists of 100 contigs, while the second file includes 100,000 contigs. I want to extract from the second file the contigs that match those in the first file. Thank you for your suggestions.

Thank you

alignment • 2.7k views

ADD COMMENT • link updated 8.1 years ago by nuketbilgen ▴ 40 • written 8.1 years ago by Janey ▴ 30

Hi,

You can convert your 100,000-contig FASTA file to TSV using fasta_formatter from the FASTX-Toolkit.

Then use grep with the --file option to supply your text file (list of 100 IDs) of patterns.

You can use Galaxy, too.
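A minimal sketch of the tabular route described above, for the case where the IDs in the list match the headers of the large file. To keep it free of extra installs, awk stands in here for fasta_formatter; the function names `fasta_to_tsv` and `extract_by_id` and the file names in the usage comment are made up for illustration.

```shell
# Hypothetical helper: print each FASTA record as "ID<TAB>sequence",
# joining wrapped sequence lines. The ID is the header text up to the
# first whitespace, without the leading ">".
fasta_to_tsv() {
    awk '
        /^>/ {
            if (id != "") print id "\t" seq
            id = substr($1, 2); seq = ""
            next
        }
        { seq = seq $0 }
        END { if (id != "") print id "\t" seq }
    ' "$1"
}

# Hypothetical helper: keep the TSV rows that contain one of the listed IDs
# as a whole word (grep --file, fixed strings), then print them as FASTA.
#   $1 = FASTA file, $2 = text file with one ID per line
extract_by_id() {
    fasta_to_tsv "$1" |
        grep -w -F -f "$2" |
        awk -F '\t' '{ print ">" $1; print $2 }'
}

# Usage (placeholder file names):
#   extract_by_id contigs_100k.fa ids.txt > selected.fa
```

Two caveats with grep here: an empty line in the ID file matches every row, and `-w` only treats letters, digits and underscores as word characters, so an ID such as `c1` would also hit `c1.2`. Matching on the first column only (for example with awk) avoids both.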

ADD REPLY • link 8.1 years ago by Farbod ★ 3.4k

score 1 · Answer 1 · 2016-11-10

Assuming each sequence is on a single line (otherwise, linearize both FASTA files first):

sed '/^>/d' file_1.fa | while read -r line; do grep -F -B 1 "$line" file_2.fa >> foo.res.txt; done

With this approach, you don't need to worry if the headers differ for the same contig in the two files. If the headers are the same in both files, you can proceed with faSomeRecords as mentioned in the other answers.
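The linearization step that this answer assumes could look something like the sketch below. The function name `linearize_fasta` and the output file names are hypothetical; only standard awk is used.

```shell
# Hypothetical helper: rewrite a FASTA file so that every record is exactly
# two lines (the header, then the whole sequence on a single line).
linearize_fasta() {
    awk '
        /^>/ { if (NR > 1) printf "\n"; print; next }   # close previous sequence, print header
        { printf "%s", $0 }                             # append sequence chunk without newline
        END { if (NR > 0) printf "\n" }                 # close the last sequence
    ' "$1"
}

# Usage (placeholder output names):
#   linearize_fasta file_1.fa > file_1.lin.fa
#   linearize_fasta file_2.fa > file_2.lin.fa
```

After this, each sequence line of the first file can be searched as a single string in the second file. Note that such a search also reports contigs that merely contain the query sequence as a substring.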

score 0 · Answer 2 · 2016-11-10

8.1 years ago

Sej Modha 5.3k

You can either BLAST them against each other or use a clustering program like CD-HIT to cluster identical sequences together.
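As a rough sketch of the BLAST route: the commented commands assume BLAST+ is installed, and the database name, file names and the 98% cutoff are placeholders rather than anything fixed by this answer. The helper `report_hits` is hypothetical; it only reformats BLAST tabular output (`-outfmt 6`) into one readable line per hit.

```shell
# Assumed BLAST+ commands (not run here; names are placeholders):
#   makeblastdb -in file_2.fa -dbtype nucl -out file_2_db
#   blastn -query file_1.fa -db file_2_db -outfmt 6 -perc_identity 98 -out hits.tsv

# Hypothetical helper: report hits at or above a percent-identity cutoff.
#   $1 = BLAST tabular file (-outfmt 6), $2 = cutoff (default 98)
# Columns used: 1 = query ID, 2 = subject ID, 3 = percent identity,
#               4 = alignment length.
report_hits() {
    awk -F '\t' -v min="${2:-98}" '
        $3 + 0 >= min + 0 {
            printf "ID %s from file 1 is similar to ID %s from file 2 (%s%% identity over %s bp)\n", $1, $2, $3, $4
        }
    ' "$1"
}

# Usage:
#   report_hits hits.tsv 98
```

Percent identity refers to the aligned region only, so it is worth comparing the alignment length with the contig length before treating two contigs as the same.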

ADD COMMENT • link 8.1 years ago by Sej Modha 5.3k

score 0 · Answer 3 · 2016-11-10

Dear Janey, Hi

You can create a list of your 100 IDs (a text file with one ID per line; this is your listFile) and then use a script or tool such as faSomeRecords to extract the sequences for those IDs from the 100,000-contig file (here, in.fa):

./faSomeRecords in.fa listFile out.fa

I hope I understood your point correctly.

~ Best

score 0 · Answer 4 · 2016-11-10

8.1 years ago

Janey ▴ 30

Thanks to my friends for the answers, but I need a tool or software that finally tells me: ID 23 from file 1 has a similar sequence to ID 666 from file 2.
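If identical sequences are enough, a report of this kind can be sketched with awk alone. This is only an illustration under that assumption: the function name `match_identical` is made up, matching is case-insensitive but exact over the full length, reverse complements are not considered, and near-identical sequences need an alignment tool instead.

```shell
# Hypothetical helper: report ID pairs whose sequences are identical.
#   $1 = first (small) FASTA file, $2 = second (large) FASTA file
# Wrapped sequence lines are joined, so the files need not be linearized.
match_identical() {
    awk '
        # Store a finished record from file 1, or look up one from file 2.
        function emit() {
            if (id == "") return
            seq = toupper(seq)
            if (infirst) first[seq] = id
            else if (seq in first)
                printf "ID %s from file 1 has the same sequence as ID %s from file 2\n", first[seq], id
        }
        FNR == 1 { emit(); id = ""; seq = ""; infirst = (NR == FNR) }  # new input file
        /^>/     { emit(); id = substr($1, 2); seq = ""; next }        # new record
                 { seq = seq $0 }                                      # sequence chunk
        END      { emit() }
    ' "$1" "$2"
}

# Usage (placeholder output name):
#   match_identical file_1.fa file_2.fa > pairs.txt
```

If two contigs in the first file share the same sequence, only the later ID is kept in this sketch.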

ADD COMMENT • link 8.1 years ago by Janey ▴ 30

I think your title was not very clear ;-)

And the threshold of "similarity" is a problem here.

Are you searching for exact matches?

ADD REPLY • link 8.1 years ago by Farbod ★ 3.4k

Hi Farbod, yes, about 98-100% similarity.

ADD REPLY • link 8.1 years ago by Janey ▴ 30

Then just search the second file (think of it as the "reference") with the first, using any NGS aligner (and look for 100% matches?). bowtie v.1 may be the best tool if these are raw Illumina sequences.

ADD REPLY • link 8.1 years ago by GenoMax 147k

score 0 · Answer 5 · 2016-11-10

8.1 years ago

nuketbilgen ▴ 40

Hi, how about zipped FASTQ files? The zgrep command is not working. :/

ADD COMMENT • link 8.1 years ago by nuketbilgen ▴ 40