You asked how to merge fasta files. You did not specify that they need to be uncompressed first. The link in your question describes the use of gunzip for uncompression, so I assumed that you had read the information and were comfortable with that part. Anyway, the answer is to first run "gunzip *.fa.gz".
But http://www.mail-archive.com/genome@soe.ucsc.edu/msg02192.html uses a different command. So which one is correct?
So can we say the code in the link http://www.mail-archive.com/genome@soe.ucsc.edu/msg02192.html is wrong?
I use something like this shell script to get a single fasta (which, as @Neil says, is the same as .fa):
URL=http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/
rm -f hg18.fa
for chrom in `seq 1 22` X Y
do
wget -O - $URL/chr${chrom}.fa.gz | zcat -c >> hg18.fa
done
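If you already have the .fa.gz files on disk, the same "decompress and append" idea can be sketched in Python using only the standard library. This is just an illustration, not what the script above does internally; the function name merge_gzipped_fasta and the file names in the usage note are made up for the example.

```python
import gzip
import shutil
from pathlib import Path


def merge_gzipped_fasta(gz_paths, out_path):
    """Decompress each .fa.gz file, in the order given, and append it to out_path."""
    with open(out_path, "wb") as out:
        for gz_path in gz_paths:
            # gzip.open streams the decompressed bytes, so a whole
            # chromosome is never held in memory at once
            with gzip.open(gz_path, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(out_path)
```

For example, merge_gzipped_fasta(["chr1.fa.gz", "chr2.fa.gz"], "hg18.fa") would write both chromosomes, uncompressed, into hg18.fa (assuming those two files exist in the current directory).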
-2- fa and fasta are the same, but if you mean extracting individual sequences from a multifasta file, you can use biopython. For example, I made this script:
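import sys
from Bio import SeqIO

def IDfinder(fasta, ID):
    # Print, in FASTA format, every record whose ID matches
    with open(fasta) as f:
        for seq_record in SeqIO.parse(f, "fasta"):
            if seq_record.id == ID:
                # str() is needed because seq_record.seq is a Seq object;
                # print(...) with one argument works on Python 2 and 3
                print(">" + seq_record.id + "\n" + str(seq_record.seq))

if __name__ == '__main__':
    IDfinder(sys.argv[1], sys.argv[2])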
To use it, copy and paste it into a text file and save it as IDextractor.py (or any name you want). You can then use it to pull a single sequence out of a multifasta file by its ID. For example:
get a multifasta file:
$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/refMrna.fa.gz
$ gzip -d refMrna.fa.gz
then use the script
$ python IDextractor.py refMrna.fa NM_018864
>NM_018864
ggggttttaatggggcgggacttcctgtcggagcaatccccgttacctccggaagagccgaagaaccgagccctcggacgccggcggttgagcatcgatcgcggtgcgctcgcgcgagataatggcagacccttggcaggagtgcatggactatgcagtaatcctcgcgaggcaagctggagagatgattcgtgaagctttaaaaaatgagatggatgtcatgattaaaagttctccagccgacttggtaacagttactgaccaaaaagttgaaaaaatgctcatgtcttctataaaggaaaagtatccatgtcacagcttcattggtgaagagtctgtggcagctggggagaagacggtcttcacagagcagcccacgtgggtcattgaccccattgatggaacgactaacttcgtgcatcggtttccctttgtagctgtttcaattggcttccttgtgaataaagagatggagtttggaattgtgtacagctgtgtggaagataagatgtacaccggcaggaaagggaaaggtgccttttgtaacggtcagaagcttcaggtgtcccagcaggaagacattaccaagtcactcttggtgaccgagttgggctcgtccagaaagcccgagactttacggatcgttctctccaacatggaaaagctgtgttccatccccatccatggaatccggagtgttggaacagctgctgttaatatgtgccttgtggcaacgggaggagcagatgcctattatgagatgggaatccactgctgggacatggcgggagctggcatcattgtcaccgaggcaggcggagtgctcatggatgtcacgggtggaccgttcgatctgatgtctcggagaataattgccgcaaatagtataacattagccaaaagaatagccaaagaaattgagataatacctttgcaaagagacgacgaaagctagtcacagagaacagtgtccagctccagtgtcatccttgctgtccctggggtgtttcagatggatggtgtcactgatttagactgaactttgaggtcctgattttaaaatggaaactttttttttacagatgacatattcaaaattagatggaatatttgattattgaaagaaaatttgcatgtagtaatattcttggggaaaatatacaaaaagtatacttaatgaactagccattgaaattgtccctagtccttatgatccccttcaacttaatgtactgtttatatgcataattctcaattacaaagtttctttttgtaagtggctttctctatgttccagaagccatatttgattaagtctaaaggctgtaacaagctggctctccctgtgcagagggcctttgtgttttattaatcactgtaagatagtgcctggcccagtgcctgtcagacagtaggcagtctgaagtccacacctgacaatgcgtgctcgaagctgcagctgctgcctctaatgcgtcacagtaagataaccaccctcctgttgcgaggtagaagttacttcactgtcctttttatatttcttattgctatgccatttcacaggatcgtgctgccagagacgactgcttctagtggacatttctgcagttagtacactgctgtatgttgtaggttctgcttaaagctgccgtgctaaagagattttcacagacatcttccaggtacctggtctagttagtggcagggatatgttttacaaaaggcagctttctcattcagatccgtaccctggtgctgacctgtgtactgtggtgtaatggtgaactttttgatttctttccagacttgctgaatttcatcactgctaactctagatgctctctctataaggtcttgggcctctcaaactcaagaaaatttaatggctcctattcctttgttaaagggttaattcattgtctagccttggcccttggcatatgaacagatgttttgctcttagtatgtttgaaccttgcatttgatacaatgaagtgttt
ttgtaagtttcaaggcagttatcttgattttggggggatttaatatattaaagctatataatactcagatttgggcactgtaatgactatatctgtgctgttaattacatgtatttaaaacgtcacgtaccatgtaaattctattacaagacaggttgctttgcaattaaatttattttagttaagacttaggaataccattttctttcattgtattcatttgcgtatcccaggctgccctcagaattgttgcatacccgaggatgaacttgaacttgtgacggctctgcttttctctcttaagttctgggatgcagagaagatggccacaggccaccacacacagtttctgtggtgctggagactgcacagggccacacgtgtacttagcgtaagcactctgctgcccaagctgcgctccagcccatgaacacacgtggaattaaaggagtaattaatgatatcttatcaaagttaatagcctcagccctttttaggggttttgagtttagttacagatatttgaagctaatattggttatgaatattcactttttgcatatagattttcccactatagataaacacttaatactttccc
As you can see, you call python, then write the name of the script, the name of your multifasta file and the ID of the sequence you are looking for.
$ python IDextractor.py file.fa ID
And you MUST have biopython installed to run this script
$ sudo apt-get install python-biopython
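If you cannot install biopython, the same lookup can be approximated with plain Python. This is only a minimal sketch: it assumes the ID is the first whitespace-delimited token after ">", and the function name find_fasta_record is made up for this example. It stops at the first match instead of scanning the whole file.

```python
def find_fasta_record(path, wanted_id):
    """Return (header, sequence) for the first record whose ID matches, else None."""
    header = None   # header of the matching record once we have found it
    chunks = []     # sequence lines collected for that record
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    # The next record starts here, so the match is complete
                    return header, "".join(chunks)
                fields = line[1:].split()
                record_id = fields[0] if fields else ""
                header = line[1:] if record_id == wanted_id else None
                chunks = []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        # The match was the last record in the file
        return header, "".join(chunks)
    return None
```

Calling find_fasta_record("refMrna.fa", "NM_018864") should give back the header and the sequence joined into one string, or None if the ID is not in the file.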
-1-
To concatenate multiple fastas, as neilfws said, you can use cat. For example:
Get the fastas:
$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*
Unzip:
$ gzip -d chr*.fa.gz
Concatenate into one fasta:
$ cat chr*.fa > allgenome.fa
Note that > redirects the screen output into a file.
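One thing to keep in mind: the shell expands chr*.fa in alphabetical order, so chr10 ends up before chr2 in the merged file. Whether that matters depends on what you do with the file afterwards. If you want chromosome order, a sort key along these lines could help; it is only a sketch, the name chrom_sort_key is made up, and it assumes file names of the form chr<name>.fa.

```python
import re


def chrom_sort_key(filename):
    """Sort numbered chromosomes numerically, then X, Y, M, then everything else."""
    match = re.match(r"chr([^.]+)", filename)
    name = match.group(1) if match else filename
    if name.isdigit():
        return (0, int(name), "")
    special = {"X": 0, "Y": 1, "M": 2}
    if name in special:
        return (1, special[name], "")
    # Anything unexpected goes last, in alphabetical order
    return (2, 0, name)


files = ["chr10.fa", "chr2.fa", "chrX.fa", "chr1.fa", "chrY.fa"]
ordered = sorted(files, key=chrom_sort_key)
print(ordered)
```

You could then read the files in that order and append them to allgenome.fa one by one.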
I hope it helps. Cheers!
As noted below .fasta = .fa = .fsa, see http://en.wikipedia.org/wiki/FASTA_format#File_extension