How To Extract Sequences/Accession Numbers That Are Shared By A Number Of Fasta Files
11.4 years ago

Hi everybody,

I have 8 FASTA files, each containing 100 sequences, and I want to extract the sequences that are present in all 8 files, discarding those that occur in only a subset of them. Sequences are identified by their GenBank accession numbers, so I'm guessing this could be done by finding the accession numbers that all 8 files share.

I was wondering whether there is an existing Perl script to do this?

Kind regards,

Sam

perl fasta script genbank • 4.6k views

It may help if you could post a small sample of your input and what you expect as output.

Are you confident that the sequences sharing the same identifiers are the same sequence? If you aren't, you may need to calculate a checksum for each sequence (using something like md5) to be sure...
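
A minimal sketch of that checksum idea, in case it helps. It assumes the accession is the first whitespace-delimited token of each header line and that GNU coreutils `md5sum` is installed (on macOS, `md5 -q` is the rough equivalent). The function names `fasta_md5` and `conflicting_ids` are made up for illustration.

```shell
# Print "ID<TAB>md5" for every record in the FASTA files given as arguments.
# Sequence lines are joined and upper-cased first, so line wrapping and
# letter case do not change the checksum.
fasta_md5() {
    awk '
        /^>/ { if (id != "") print id "\t" seq
               id = substr($1, 2); seq = ""; next }
        { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
        END { if (id != "") print id "\t" seq }
    ' "$@" |
    while IFS="$(printf '\t')" read -r id seq; do
        sum=$(printf '%s' "$seq" | md5sum | cut -d' ' -f1)
        printf '%s\t%s\n' "$id" "$sum"
    done
}

# List IDs that map to more than one distinct checksum across the files,
# i.e. same accession but different sequence.
conflicting_ids() {
    fasta_md5 "$@" | LC_ALL=C sort -u | cut -f1 | uniq -d
}

# Example (file names are placeholders):
#   conflicting_ids *.fa
```

If `conflicting_ids` prints nothing, every shared identifier carries the same sequence in all files it appears in, and matching on accession alone should be safe.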

11.4 years ago
cts ★ 1.7k

You can determine the set of identifiers shared by all 8 files using standard Unix tools:

grep -ohP '(?<=^>)\S+' *fa | sort | uniq -c | awk '$1 == 8 {print $2}'

You could then extract these sequences from any one of the files using a variety of scripts. I use one called contig_extractor.pl, which in its simplest form takes a list of identifiers and a FASTA/FASTQ database and returns the matching subset of sequences.
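
If you would rather not depend on a separate script, both steps can be sketched with plain awk. This is only an illustration under stated assumptions: the accession is taken to be the first whitespace-delimited token after `>`, each file is passed only once, and the function names `shared_ids` and `extract_by_id` are hypothetical. Unlike a plain `uniq -c` count, this version counts an accession once per file, so a duplicate record inside one file cannot inflate the total.

```shell
# Print the accessions that occur in every FASTA file given as an argument.
shared_ids() {
    awk -v n="$#" '
        /^>/ {
            id = substr($1, 2)
            # count each accession at most once per file
            if (!((FILENAME, id) in seen)) { seen[FILENAME, id] = 1; count[id]++ }
        }
        END { for (id in count) if (count[id] == n) print id }
    ' "$@" | LC_ALL=C sort
}

# Print the records of FASTA file $2 whose accession is listed in file $1
# (one accession per line).
extract_by_id() {
    awk '
        NR == FNR { if ($1 != "") want[$1] = 1; next }   # first file: the ID list
        /^>/      { keep = (substr($1, 2) in want) }     # header decides the record
        keep                                             # print header and sequence lines
    ' "$1" "$2"
}

# Example (file names are placeholders):
#   shared_ids *.fa > shared_ids.txt
#   extract_by_id shared_ids.txt first.fa > shared.fa
```

Since the shared sequences are present in all 8 files, extracting from any single file is enough.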
