Question

How To Extract Sequences/Accession Numbers That Are Shared By A Number Of Fasta Files

12.1 years ago

samlambrechts299 ▴ 170

Hi everybody,

I have 8 FASTA files, each containing 100 sequences, and I want to extract the sequences that are present in all 8 files, i.e. eliminate those that are present in only a subset of the files. Sequences are identified by their GenBank accession numbers, so I'm guessing it should be possible by extracting the accession numbers that are shared.

I was wondering whether there is an existing Perl script to do this?

Kind regards,

Sam

perl fasta script genbank • 4.9k views

ADD COMMENT • link updated 12.1 years ago by cts ★ 1.7k • written 12.1 years ago by samlambrechts299 ▴ 170

It may help if you could post a small sample of your input and what you expect as output.

ADD REPLY • link 12.1 years ago by SES 8.6k

Are you confident that the sequences sharing the same identifiers are the same sequence? If you aren't, you may need to calculate a checksum for each sequence (using something like md5) to be sure...

ADD REPLY • link 12.1 years ago by sarahhunter ▴ 600
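The checksum idea above could be sketched roughly as follows. This is only an illustration: the function name seq_checksums is made up here, the ID is assumed to be the first whitespace-delimited token after ">", and md5sum is assumed to be available (GNU coreutils; other systems may ship a differently named tool).

```shell
# Hypothetical helper: print "ID<TAB>md5" for every record in one FASTA file.
# Sequence lines are joined and upper-cased first, so differences in line
# wrapping or letter case do not change the checksum.
seq_checksums() {
    awk '/^>/ { if (id != "") print id "\t" seq; id = substr($1, 2); seq = ""; next }
         { seq = seq toupper($0) }
         END { if (id != "") print id "\t" seq }' "$1" |
    while IFS="$(printf '\t')" read -r id seq; do
        # md5sum prints "<hash>  -"; keep only the hash
        printf '%s\t%s\n' "$id" "$(printf '%s' "$seq" | md5sum | cut -d' ' -f1)"
    done
}

# Possible use: list IDs that have more than one distinct checksum across files.
# for f in *.fa; do seq_checksums "$f"; done | LC_ALL=C sort -u | cut -f1 | uniq -d
```

If that last pipeline prints nothing, every shared identifier maps to a single sequence and comparing identifiers alone should be safe.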

score 1 · Answer 1 · 2013-07-03

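You can determine the set of shared sequence identifiers using Unix tools (GNU grep is needed for -P; the -h flag stops grep from prefixing each match with its filename, which would otherwise make every line unique so that no count ever reaches 8):

grep -ohP '(?<=^>)\S+' *fa | sort | uniq -c | awk '$1 == 8 {print $2}'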

You could then extract these sequences from any one of the files using a variety of scripts. I use one called contig_extractor.pl, which in its simplest form takes a list of identifiers and a FASTA/FASTQ database and returns the matching subset of sequences.
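If no such script is at hand, both steps can be approximated with awk alone. The following is a minimal sketch, not a replacement for contig_extractor.pl: the function names shared_ids and extract_records are invented for this example, and the ID is assumed to be the first whitespace-delimited token after ">".

```shell
# Hypothetical helper: print the IDs that occur in every FASTA file given.
shared_ids() {
    n=$#
    for f in "$@"; do
        # sort -u so an ID repeated within one file is counted only once
        awk '/^>/ { print substr($1, 2) }' "$f" | sort -u
    done | sort | uniq -c | awk -v n="$n" '$1 == n { print $2 }'
}

# Hypothetical helper: print the records of FASTA file $2 whose IDs are
# listed (one per line) in file $1.
extract_records() {
    awk -v idfile="$1" '
        BEGIN { while ((getline line < idfile) > 0) keep[line] = 1 }
        /^>/  { id = substr($1, 2); printing = (id in keep) }
        printing' "$2"
}

# Possible use:
# shared_ids *.fa > shared_ids.txt
# extract_records shared_ids.txt first_file.fa > shared.fa
```

Counting against the number of files passed in, rather than a hard-coded 8, lets the same sketch work for any number of inputs.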