Hi everybody,
I have 8 FASTA files, each containing 100 sequences, and I want to extract the sequences that are present in all 8 files, i.e. eliminate those that occur in only a subset of the files. Sequences are identified by their GenBank accession number, so I'm guessing it should be possible by extracting the accession numbers that are shared.
I was wondering whether there is an existing Perl script to do this?
Kind regards,
Sam
It may help if you could post a small sample of your input and what you expect as output.
Are you confident that sequences sharing the same identifier are actually the same sequence? If you aren't, you may need to calculate a checksum for each sequence (using something like MD5) to be sure...
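In case it helps, here is a rough sketch of both ideas together: intersecting the identifiers across files, then comparing an MD5 checksum per sequence. It is in Python rather than Perl, but the same logic ports to a Perl hash of counts. The function names (read_fasta, accession, shared_sequences) are made up for illustration, and it assumes the accession is the first whitespace-delimited token of each header line, which may not match your files.

```python
import hashlib


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession(header):
    # ASSUMPTION: the accession is the first whitespace-delimited token
    # of the header; adjust this if your headers are laid out differently.
    parts = header.split()
    return parts[0] if parts else ""


def shared_sequences(paths):
    """Return (shared, conflicts).

    shared    -- {accession: sequence} for accessions found in every file,
                 with the sequence taken from the first file
    conflicts -- accessions found in every file whose sequences differ
    """
    per_file = [
        {accession(h): s for h, s in read_fasta(path)} for path in paths
    ]
    if not per_file:
        return {}, set()

    # Keep only the identifiers that occur in every file.
    common = set(per_file[0])
    for records in per_file[1:]:
        common &= set(records)

    # Compare case-insensitive MD5 digests to catch identifiers that are
    # shared but point at different sequences.
    conflicts = set()
    for acc in common:
        digests = {
            hashlib.md5(records[acc].upper().encode()).hexdigest()
            for records in per_file
        }
        if len(digests) > 1:
            conflicts.add(acc)

    shared = {acc: seq for acc, seq in per_file[0].items() if acc in common}
    return shared, conflicts
```

You would call it with the list of your 8 file names, e.g. shared, conflicts = shared_sequences(list_of_paths), and then write out the entries in shared that are not in conflicts. If the same accession appears twice within one file, only the last copy is kept, so check for that separately if it matters.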