Filtering Fasta Sequences By Chromosomes Names From A Big Fasta File
5
7
Entering edit mode
12.4 years ago
sthait ▴ 130

Hello,

I wish to filter from a big FASTA file only sequences related to chrs 1-22, X and Y. An example of FASTA sequence is:

>ENSG00000119314|ENST00000210227|PTBP3|9|-1|115024785
GGGTGGCAGGTGCCTGTAATCCCAGCTACTCCAGAGGCTGAGGCAGGGGAATTGCTTGAG
CCTGGGAGGCAGAGGTTGCAGTGAGCCGAGATTGTGCCACTGCACTCCAGCCTGGAGTCT
CACTTTGTCACACAGGGTGGAGTGCAGTGGTGTGATCTCGGCTCACTGCAACCTCTGCTT
ACCGGGTTGAGATTCTCCTGTCTCAACCTCCTGAGTAGCTGGGATTACAGGCGTGCACCA
CCAAGCCAGACTAATTTTCCTATTTTTAGTAGAGATGGGGGGTTTCACCATGTTGGCCAG...

>ENSG00000236011|ENST00000211377|GPANK1|HSCHR6_MHC_COX|-1|31616421
CCCTATTCCTACCTAACCTCCCCTCAGGACTCAGGCTCCAATGTGTTGAGCCCCAACTCC
TTCCCATAAGACTGCCACACGGTGCTTTCCTTTCCCTTCTTCAACACTCACCAATGGGAA
GCATTGGCTGGTTCTCACAGTACACACGAGGACAGTAACCAAAGTCTCCTTGCTGGTACT
TTTCCAACTGAGGTGAATACAATGGAAGGGGTTGGCAGGTAGATGTAAAGAAGAGGCAAC
TCCCTTCGCAGCCCAACCCATACCACTCTGTCCCCCACTCCTCCCACCTCTGTCCAGAGG
CCCCTTCTCTGGACTAGACGGGCTCTCAAACTTCTGTGTTGCCTTTCTTCCAATTAGGCA
GGCTACAAACCATCAGAGCCATTTGTTGTTTGTTCCTTGAGGAAGAGGCAGTCTATCACA
ACTCTCTGATTCAAGGTCTGTCTCCCTCCCTGAAAACAATCCCTTCAGGATGACCCCCAA...

>ENSG00000087494|ENST00000201015|PTHLH|12|-1|28115255
TCCGCTCACGGGCCCCGAGACCCCCGAAGTTCCCATGGAGCCTAAGATCCCCAGGAGCCA
AGCCTGCCCCGTCCCTGCGGATCAGCTTCCTAATGGGCGACCCAAGTCTATCGCAGGCGG
TGGGGATGAGGACGCTGGGTGGGAGGAGGGGAGGGGAGGCTGAAAAAGATCATCCCCCTT
GCCCTAAGGCCTCTCCCAAGACCCTGGACCCCTGCCCTAAGAGACTCAGGCCTCCCTTGC
TGCAGTGGGAGCGCAAACACCAGGGCAGGAGACTCCAGAGAAGGAGCGCATAACTCAACG
TTTGCTCTCCTGAAGCCTTATTTCTGATAAAAATTACAGAAAAGTTAGGCAGGATCCAAA
GACACCGTAATGACCAGCTCAAAGCCAAACAGACAGGACATCCAGTGCGGGTGTCTGGAT...

As you can see in the forth place after "|" there is the chromosome name: in the first one is: 9 the second one is: HSCHR6MHCCOX and the third one is: 12

I want to create a new FASTA file containing only 9, 12 sequences and in more generally 1-22, X and Y sequences.

What is fastest way to do it?

Thanks,

Tom.

fasta filter chromosome • 31k views
ADD COMMENT
24
Entering edit mode
12.4 years ago
Leszek 4.2k

You can use samtools for that:

#index a genome
samtools faidx human_genome.fa
#select chromosomes or regions
samtools faidx human_genome.fa chr1 chr2:1:2000 [...] > human_selected.fa
ADD COMMENT
1
Entering edit mode

I got the following error:

[faibuildcore] different line length in sequence 'ENSG00000006282|ENST00000006658|SPATA20|17|1|48624450'. Segmentation fault

what could be the problem?

thanks.

ADD REPLY
0
Entering edit mode

error is the result of wrong formatting of your fasta file... all sequence lines need to have the same max lenght, ie 60 or 70. anyway, you should index the genome, not your multifasta file.

ADD REPLY
0
Entering edit mode

just one quick observation, the correct way to write the segment of the chromosome is chr2:1-2000, not chr2:1:2000

ADD REPLY
0
Entering edit mode

THank you, leszek. It is the most convenient and comparatively fastest way I have found to remove the scaffolds in the reference genome.

ADD REPLY
7
Entering edit mode
12.4 years ago

You can do the following:

chr=$(seq 1 1 22)" X Y"
for i in $chr ; do 
    grep ">" input.fa | grep "|$i|" | awk 'BEGIN {FS=">"} {print $2}' > selection_$i
    samtools faidx input.fa `cat selection_$i` > output_$i.fa
done
ADD COMMENT
1
Entering edit mode

It did not work for me - nothing was printed to file. However this modification did:

for i in $chr ; do
    samtools faidx $fasta $i  >> chr_sequences.fa
done

It should be noted that this puts all sequences into a single file.

ADD REPLY
0
Entering edit mode

What is $fasta here? You did not make this clear...

ADD REPLY
0
Entering edit mode

$fasta is a variable that represents the input fasta. It should be defined before the loop as fasta=my_fancy.fasta

ADD REPLY
0
Entering edit mode

thanks, i'm pretty new in this. the code you gave me - in which program should I run it - terminal/perl/python?

tom.

ADD REPLY
0
Entering edit mode

This is a bash script, so you can just run it as is in a terminal.

ADD REPLY
0
Entering edit mode
12.4 years ago
JC 13k

if you want Ensemble gene models, you can use BioMart to obtain your sequence filtered: http://uswest.ensembl.org/biomart/martview

ADD COMMENT
0
Entering edit mode
9.4 years ago

You can do some pattern matching using pyfaidx:

faidx --regex "^.+\|.+\|.+\|[[1-9][0-2]?|[XY]]" input.fa > output.fa
ADD COMMENT

Login before adding your answer.

Traffic: 1849 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6