Question

Filtering Fasta Sequences By Chromosomes Names From A Big Fasta File

7

Entering edit mode

12.4 years ago

sthait ▴ 130

Hello,

I wish to filter from a big FASTA file only sequences related to chrs 1-22, X and Y. An example of FASTA sequence is:

>ENSG00000119314|ENST00000210227|PTBP3|9|-1|115024785
GGGTGGCAGGTGCCTGTAATCCCAGCTACTCCAGAGGCTGAGGCAGGGGAATTGCTTGAG
CCTGGGAGGCAGAGGTTGCAGTGAGCCGAGATTGTGCCACTGCACTCCAGCCTGGAGTCT
CACTTTGTCACACAGGGTGGAGTGCAGTGGTGTGATCTCGGCTCACTGCAACCTCTGCTT
ACCGGGTTGAGATTCTCCTGTCTCAACCTCCTGAGTAGCTGGGATTACAGGCGTGCACCA
CCAAGCCAGACTAATTTTCCTATTTTTAGTAGAGATGGGGGGTTTCACCATGTTGGCCAG...

>ENSG00000236011|ENST00000211377|GPANK1|HSCHR6_MHC_COX|-1|31616421
CCCTATTCCTACCTAACCTCCCCTCAGGACTCAGGCTCCAATGTGTTGAGCCCCAACTCC
TTCCCATAAGACTGCCACACGGTGCTTTCCTTTCCCTTCTTCAACACTCACCAATGGGAA
GCATTGGCTGGTTCTCACAGTACACACGAGGACAGTAACCAAAGTCTCCTTGCTGGTACT
TTTCCAACTGAGGTGAATACAATGGAAGGGGTTGGCAGGTAGATGTAAAGAAGAGGCAAC
TCCCTTCGCAGCCCAACCCATACCACTCTGTCCCCCACTCCTCCCACCTCTGTCCAGAGG
CCCCTTCTCTGGACTAGACGGGCTCTCAAACTTCTGTGTTGCCTTTCTTCCAATTAGGCA
GGCTACAAACCATCAGAGCCATTTGTTGTTTGTTCCTTGAGGAAGAGGCAGTCTATCACA
ACTCTCTGATTCAAGGTCTGTCTCCCTCCCTGAAAACAATCCCTTCAGGATGACCCCCAA...

>ENSG00000087494|ENST00000201015|PTHLH|12|-1|28115255
TCCGCTCACGGGCCCCGAGACCCCCGAAGTTCCCATGGAGCCTAAGATCCCCAGGAGCCA
AGCCTGCCCCGTCCCTGCGGATCAGCTTCCTAATGGGCGACCCAAGTCTATCGCAGGCGG
TGGGGATGAGGACGCTGGGTGGGAGGAGGGGAGGGGAGGCTGAAAAAGATCATCCCCCTT
GCCCTAAGGCCTCTCCCAAGACCCTGGACCCCTGCCCTAAGAGACTCAGGCCTCCCTTGC
TGCAGTGGGAGCGCAAACACCAGGGCAGGAGACTCCAGAGAAGGAGCGCATAACTCAACG
TTTGCTCTCCTGAAGCCTTATTTCTGATAAAAATTACAGAAAAGTTAGGCAGGATCCAAA
GACACCGTAATGACCAGCTCAAAGCCAAACAGACAGGACATCCAGTGCGGGTGTCTGGAT...

As you can see in the forth place after "|" there is the chromosome name: in the first one is: 9 the second one is: HSCHR6MHCCOX and the third one is: 12

I want to create a new FASTA file containing only 9, 12 sequences and in more generally 1-22, X and Y sequences.

What is fastest way to do it?

Thanks,

Tom.

fasta filter chromosome • 31k views

ADD COMMENT • link updated 22 months ago by jixiangj ▴ 30 • written 12.4 years ago by sthait ▴ 130

7

Entering edit mode

12.4 years ago

Leonor Palmeira 3.9k

You can do the following:

chr=$(seq 1 1 22)" X Y"
for i in $chr ; do 
    grep ">" input.fa | grep "|$i|" | awk 'BEGIN {FS=">"} {print $2}' > selection_$i
    samtools faidx input.fa `cat selection_$i` > output_$i.fa
done

ADD COMMENT • link 12.4 years ago by Leonor Palmeira 3.9k

1

Entering edit mode

It did not work for me - nothing was printed to file. However this modification did:

for i in $chr ; do
    samtools faidx $fasta $i  >> chr_sequences.fa
done

It should be noted that this puts all sequences into a single file.

ADD REPLY • link 9.9 years ago by A. Domingues ★ 2.7k

0

Entering edit mode

What is $fasta here? You did not make this clear...

ADD REPLY • link 7.5 years ago by ashley_conard • 0

0

Entering edit mode

$fasta is a variable that represents the input fasta. It should be defined before the loop as fasta=my_fancy.fasta

ADD REPLY • link 7.5 years ago by A. Domingues ★ 2.7k

0

Entering edit mode

thanks, i'm pretty new in this. the code you gave me - in which program should I run it - terminal/perl/python?

tom.

ADD REPLY • link 12.4 years ago by sthait ▴ 130

0

Entering edit mode

This is a bash script, so you can just run it as is in a terminal.

ADD REPLY • link 12.4 years ago by Leonor Palmeira 3.9k

0

Entering edit mode

12.4 years ago

JC 13k

if you want Ensemble gene models, you can use BioMart to obtain your sequence filtered: http://uswest.ensembl.org/biomart/martview

ADD COMMENT • link 12.4 years ago by JC 13k

0

Entering edit mode

9.4 years ago

Matt Shirley 10k

You can do some pattern matching using pyfaidx:

faidx --regex "^.+\|.+\|.+\|[[1-9][0-2]?|[XY]]" input.fa > output.fa

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 9.4 years ago by Matt Shirley 10k

score 24 · Accepted Answer · 2012-07-25

24

Entering edit mode

12.4 years ago

Leszek 4.2k

You can use samtools for that:

#index a genome
samtools faidx human_genome.fa
#select chromosomes or regions
samtools faidx human_genome.fa chr1 chr2:1:2000 [...] > human_selected.fa

ADD COMMENT • link 12.4 years ago by Leszek 4.2k

1

Entering edit mode

I got the following error:

what could be the problem?

thanks.

ADD REPLY • link 12.4 years ago by sthait ▴ 130

0

Entering edit mode

error is the result of wrong formatting of your fasta file... all sequence lines need to have the same max lenght, ie 60 or 70. anyway, you should index the genome, not your multifasta file.

ADD REPLY • link 12.4 years ago by Leszek 4.2k

0

Entering edit mode

just one quick observation, the correct way to write the segment of the chromosome is chr2:1-2000, not chr2:1:2000

ADD REPLY • link 3.4 years ago by shinken123 ▴ 150

0

Entering edit mode

THank you, leszek. It is the most convenient and comparatively fastest way I have found to remove the scaffolds in the reference genome.

ADD REPLY • link 22 months ago by jixiangj ▴ 30