Question

Extract chromosome 1 - 22 fasta file

0

3.6 years ago

shubhamkumbhar420 ▴ 40

Hello, I have a FASTA file called hg19.fa.gz. Its headers include the main chromosomes as well as unplaced and random contigs, for example:

zcat hg19.fa.gz | grep ">"

chr1

chr2

chr3

chrUn_gl000211

chrUn_gl000221

chrUn_gl000214

chrUn_gl000228

chrUn_gl000227

chr1_gl000191_random

chr19_gl000208

and many more

I want to keep only chromosomes 1 to 22, plus X and Y.

Thank you

fasta • 4.5k views

ADD COMMENT • link updated 2.4 years ago by arsala521 ▴ 50 • written 3.6 years ago by shubhamkumbhar420 ▴ 40

score 3 · Answer 1 · 2021-04-16

3

3.6 years ago

shenwei356 8.7k

seqkit grep -i -r -p '^chr[\dXY]+$' hg19.fa.gz -o result.fa.gz

ADD COMMENT • link 3.6 years ago by shenwei356 8.7k

0

Thank you for your response! It worked!

ADD REPLY • link 3.6 years ago by shubhamkumbhar420 ▴ 40

0

@shenwei356 This command is very useful. I wanted to extract only the canonical chromosomes from a female gorilla genome FASTA file, i.e. chr1, chr2A, chr2B, chr3 ... chr22, chrX. I edited the command to: seqkit grep -i -r -p '^chr[\dX'2A''2B']+$' gorGor6.fa > output.fa and it worked. I am trying to understand how it works. I looked into the seqkit grep options -i, -r, and -p. Could you explain how the pattern '^chr[\dXY]+$' works? What does +$ indicate? Thank you.

ADD REPLY • link 2.5 years ago by arsala521 ▴ 50

1

It's a regular expression. Learn more:

And it should be '^chr\w+$' for your case.

ADD REPLY • link 2.5 years ago by shenwei356 8.7k
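
For readers with the same question, the pattern can be read piece by piece: `^` anchors the match at the start of the sequence ID, `chr` is literal text, `[\dXY]` matches a single character that is a digit, X, or Y, `+` means "one or more of the preceding item", and `$` anchors the match at the end of the ID. In seqkit grep, -r treats the pattern as a regular expression, -p supplies the pattern, and -i ignores case. The sketch below checks this reading against names from the question using plain `grep -E` rather than seqkit itself; `[0-9]` stands in for `\d` because POSIX extended regular expressions do not define `\d`.

```shell
# Sample sequence IDs taken from the question (without the leading ">").
names='chr1
chr2
chr22
chrX
chrY
chrUn_gl000211
chr1_gl000191_random
chr19_gl000208'

# ^ and $ force the whole ID to match, so IDs with an "_suffix" are rejected.
matched=$(printf '%s\n' "$names" | grep -E -i '^chr[0-9XY]+$')
printf '%s\n' "$matched"
```

Only chr1, chr2, chr22, chrX, and chrY are printed. Without the trailing `$`, an ID such as chr19_gl000208 would also pass, because `chr19` alone already satisfies the rest of the pattern.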

0

@shenwei356 Hi, thanks for the help. The expression '^chr\w+$' is actually not working. My FASTA file for the female gorilla genome has several non-canonical entries. I only want the regular chromosomes, i.e. chr1, chr2A, chr2B, chr3, chr4 ... chr22, chrX. What expression should I use? TIA

ADD REPLY • link 2.4 years ago by arsala521 ▴ 50
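
One likely reason '^chr\w+$' keeps the extra entries: `\w` matches letters, digits, and the underscore, so any ID that starts with chr and contains only those characters passes. A stricter pattern lists the allowed suffixes explicitly. The sketch below illustrates this with `grep -E`, not seqkit; the non-canonical names are hg19-style placeholders rather than real gorGor6 IDs, and `[[:alnum:]_]` stands in for `\w`.

```shell
# Placeholder IDs: canonical gorilla-style names plus hg19-style extras.
names='chr1
chr2A
chr2B
chr22
chrX
chrUn_gl000211
chr1_gl000191_random'

# \w is equivalent to [A-Za-z0-9_], so every ID above matches the loose pattern.
loose=$(printf '%s\n' "$names" | grep -E -c '^chr[[:alnum:]_]+$')

# Stricter: a number, 2A/2B, or X, and nothing else before the end of the ID.
strict=$(printf '%s\n' "$names" | grep -E '^chr([0-9]+|2[AB]|X)$')

echo "loose pattern matched $loose of 7 IDs"
printf '%s\n' "$strict"
```

If the stricter pattern fits the names in your file, the same expression should carry over to seqkit, for example: seqkit grep -r -p '^chr([0-9]+|2[AB]|X)$' gorGor6.fa > output.fa. Check the result with seqkit seq -n output.fa before relying on it.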

score 1 · Answer 2 · 2021-04-16

1

3.6 years ago

lmlukoseviciute ▴ 60

Hello, I am not sure whether you want only the chromosome names or the sequences themselves. Going by your example, you can try:

zcat hg19.fa.gz | grep -E "^>chr([1-9]|1[0-9]|2[0-2]|X|Y)$"

ADD COMMENT • link 3.6 years ago by lmlukoseviciute ▴ 60
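
Note that grep on header lines prints only the names. To keep the sequences as well, each record has to be kept or skipped as a whole. Below is a minimal awk sketch of that idea, assuming UCSC-style headers with nothing after the chromosome name; the tiny FASTA is made up for illustration and stands in for the output of zcat hg19.fa.gz.

```shell
# Made-up miniature FASTA standing in for `zcat hg19.fa.gz`.
fasta='>chr1
ACGT
ACGT
>chrUn_gl000211
TTTT
>chrX
GGCC
>chr19_gl000208
AAAA'

# On each header line, decide whether to keep the record;
# every line (header or sequence) is printed while keep is true.
filtered=$(printf '%s\n' "$fasta" |
  awk '/^>/ { keep = ($0 ~ /^>chr([1-9]|1[0-9]|2[0-2]|X|Y)$/) } keep')
printf '%s\n' "$filtered"
```

This prints the chr1 and chrX records with their sequence lines and drops the two contigs. On the real file, the same awk program could be placed between zcat hg19.fa.gz and gzip, writing to an output name of your choice.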