Hello,
I want to filter a big FASTA file and keep only the sequences located on chromosomes 1-22, X and Y. Here are some example FASTA records:
>ENSG00000119314|ENST00000210227|PTBP3|9|-1|115024785
GGGTGGCAGGTGCCTGTAATCCCAGCTACTCCAGAGGCTGAGGCAGGGGAATTGCTTGAG
CCTGGGAGGCAGAGGTTGCAGTGAGCCGAGATTGTGCCACTGCACTCCAGCCTGGAGTCT
CACTTTGTCACACAGGGTGGAGTGCAGTGGTGTGATCTCGGCTCACTGCAACCTCTGCTT
ACCGGGTTGAGATTCTCCTGTCTCAACCTCCTGAGTAGCTGGGATTACAGGCGTGCACCA
CCAAGCCAGACTAATTTTCCTATTTTTAGTAGAGATGGGGGGTTTCACCATGTTGGCCAG...
>ENSG00000236011|ENST00000211377|GPANK1|HSCHR6_MHC_COX|-1|31616421
CCCTATTCCTACCTAACCTCCCCTCAGGACTCAGGCTCCAATGTGTTGAGCCCCAACTCC
TTCCCATAAGACTGCCACACGGTGCTTTCCTTTCCCTTCTTCAACACTCACCAATGGGAA
GCATTGGCTGGTTCTCACAGTACACACGAGGACAGTAACCAAAGTCTCCTTGCTGGTACT
TTTCCAACTGAGGTGAATACAATGGAAGGGGTTGGCAGGTAGATGTAAAGAAGAGGCAAC
TCCCTTCGCAGCCCAACCCATACCACTCTGTCCCCCACTCCTCCCACCTCTGTCCAGAGG
CCCCTTCTCTGGACTAGACGGGCTCTCAAACTTCTGTGTTGCCTTTCTTCCAATTAGGCA
GGCTACAAACCATCAGAGCCATTTGTTGTTTGTTCCTTGAGGAAGAGGCAGTCTATCACA
ACTCTCTGATTCAAGGTCTGTCTCCCTCCCTGAAAACAATCCCTTCAGGATGACCCCCAA...
>ENSG00000087494|ENST00000201015|PTHLH|12|-1|28115255
TCCGCTCACGGGCCCCGAGACCCCCGAAGTTCCCATGGAGCCTAAGATCCCCAGGAGCCA
AGCCTGCCCCGTCCCTGCGGATCAGCTTCCTAATGGGCGACCCAAGTCTATCGCAGGCGG
TGGGGATGAGGACGCTGGGTGGGAGGAGGGGAGGGGAGGCTGAAAAAGATCATCCCCCTT
GCCCTAAGGCCTCTCCCAAGACCCTGGACCCCTGCCCTAAGAGACTCAGGCCTCCCTTGC
TGCAGTGGGAGCGCAAACACCAGGGCAGGAGACTCCAGAGAAGGAGCGCATAACTCAACG
TTTGCTCTCCTGAAGCCTTATTTCTGATAAAAATTACAGAAAAGTTAGGCAGGATCCAAA
GACACCGTAATGACCAGCTCAAAGCCAAACAGACAGGACATCCAGTGCGGGTGTCTGGAT...
As you can see, the fourth "|"-separated field of each header holds the chromosome name: 9 in the first record, HSCHR6_MHC_COX in the second and 12 in the third.
I want to create a new FASTA file containing only the sequences on 9 and 12 here, and more generally only those on 1-22, X and Y.
What is the fastest way to do it?
Thanks,
Tom.
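One possible approach, as a minimal sketch in plain Python rather than a definitive solution: read the file line by line, look at the fourth "|"-separated field of each header, and only write out records whose chromosome is in the allowed set. The function name filter_fasta and the KEEP set are made up for illustration; the sketch assumes every header follows the layout shown in the examples above.

```python
# Chromosome names to keep: 1-22 plus X and Y, as they appear in the headers.
KEEP = {str(n) for n in range(1, 23)} | {"X", "Y"}


def filter_fasta(lines, keep=KEEP, field=3):
    """Yield the lines of records whose header carries an allowed chromosome.

    `field` is the 0-based index of the chromosome name among the
    '|'-separated header fields (3 means the fourth field).
    """
    writing = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].rstrip("\r\n").split("|")
            # Headers with too few fields are dropped rather than guessed at.
            writing = len(parts) > field and parts[field] in keep
        if writing:
            yield line
```

To run it on real files, something like `with open("input.fa") as src, open("filtered.fa", "w") as dst: dst.writelines(filter_fasta(src))` should work (both file names are placeholders). Because it streams the input, memory use stays small even for a very large file.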
I got the following error:
[fai_build_core] different line length in sequence 'ENSG00000006282|ENST00000006658|SPATA20|17|1|48624450'. Segmentation fault
what could be the problem?
thanks.
The error is the result of wrong formatting in your FASTA file: within each sequence, every line except the last must have the same length, e.g. 60 or 70. Anyway, you should index the genome, not your multi-FASTA file.
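If the file does need to be indexed, one way to get rid of the line-length problem is to re-wrap every sequence to a fixed width first. The following is only a rough sketch in plain Python; the names rewrap_fasta and _wrap and the default width of 60 are assumptions for illustration, not part of any tool.

```python
def _wrap(seq, width):
    """Yield `seq` cut into lines of at most `width` characters."""
    for start in range(0, len(seq), width):
        yield seq[start:start + width] + "\n"


def rewrap_fasta(lines, width=60):
    """Yield FASTA lines with each sequence re-wrapped to `width` columns.

    Headers pass through unchanged; blank lines are dropped.
    """
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Flush the previous record's sequence before the new header.
            yield from _wrap("".join(chunks), width)
            chunks = []
            yield line + "\n"
        elif line:
            chunks.append(line)
    # Flush the sequence of the last record.
    yield from _wrap("".join(chunks), width)
```

Note that this holds one whole sequence in memory at a time, which is fine for transcript-sized records but worth keeping in mind for whole chromosomes.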
just one quick observation, the correct way to write the segment of the chromosome is chr2:1-2000, not chr2:1:2000
Thank you, leszek. It is the most convenient and the fastest way I have found to remove the scaffolds from the reference genome.