Extract chromosomes 1-22, X and Y from a FASTA file
3.6 years ago

Hello guys, I have a FASTA file called hg19.fa.gz that contains sequences such as:

zcat hg19.fa.gz | grep ">"

chr1

chr2

chr3

chrUn_gl000211

chrUn_gl000221

chrUn_gl000214

chrUn_gl000228

chrUn_gl000227

chr1_gl000191_random

chr19_gl000208

and many more

Now I just want chromosomes 1 to 22, plus X and Y.

Thank you

3.6 years ago
seqkit grep -i -r -p '^chr[\dXY]+$' hg19.fa.gz -o result.fa.gz

Thank you for your response! It worked!


@shenwei356 This command is very useful. I wanted only the canonical chromosomes from a female gorilla genome FASTA file, i.e. chr1, chr2A, chr2B, chr3, ..., chr22, chrX. I edited the command to: seqkit grep -i -r -p '^chr[\dX'2A''2B']+$' gorGor6.fa > output.fa and it worked. I am trying to understand how it works. I looked into the seqkit grep options -i, -r, and -p. Could you explain exactly how the pattern '^chr[\dXY]+$' works? What does +$ indicate? Thank you
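For what it is worth, here is how that pattern reads, as far as I understand it: ^chr anchors the match at the start of the sequence ID, [\dXY] is a character class (any single digit, X or Y), + means "one or more characters from that class", and $ anchors the match at the end of the ID, so nothing else may follow. The inner single quotes in the edited gorilla command are removed by the shell before the program sees them, so the pattern that actually arrives is ^chr[\dX2A2B]+$, i.e. a class of digits plus X, A and B. The sketch below uses grep -E with [0-9] in place of \d (an assumption for portability, since seqkit uses its own regex engine), and the names chrAB and chrXX are made up to show that the class is looser than it looks.

```shell
# Show what the shell hands to the program after quote removal.
pattern='^chr[\dX'2A''2B']+$'
printf '%s\n' "$pattern"

# The same character class, written with [0-9] so that grep -E understands it.
class_pattern='^chr[0-9XAB]+$'

# Illustrative sequence names; chrAB and chrXX are hypothetical, not real IDs.
names='chr1
chr2A
chr2B
chr22
chrX
chrAB
chrXX
chrUn_gl000211
chr1_gl000191_random'

# Everything made up only of digits, X, A and B after "chr" gets through.
matched=$(printf '%s\n' "$names" | grep -E "$class_pattern")
printf '%s\n' "$matched"
```

So the edited command probably worked because the gorilla file has no IDs like chrAB, not because the pattern singles out 2A and 2B.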


@shenwei356 Hi, thanks for the help. The expression '^chr\w+$' is actually not working. I have a FASTA file for a female gorilla genome with several non-canonical entries. I only want the regular chromosomes, i.e. chr1, chr2A, chr2B, chr3, chr4, ..., chr22, chrX. What expression should I use? TIA
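A likely reason '^chr\w+$' lets the non-canonical entries through is that \w matches letters, digits and the underscore, so any ID that starts with chr and contains only those characters still matches. One possible alternative is an alternation rather than a character class: digits only, or 2A/2B, or X. The sketch below checks that idea with grep -E on made-up names (chrUn_example and chr5_random are hypothetical, and I have not run this on gorGor6 itself). Note that [0-9]+ accepts any number, which is fine as long as the file has no numbered IDs beyond the real chromosomes.

```shell
# Alternation instead of a character class: digits only, or 2A/2B, or X.
canonical='^chr([0-9]+|2[AB]|X)$'

# Illustrative names; chrAB, chrUn_example and chr5_random are hypothetical.
names='chr1
chr2A
chr2B
chr22
chrX
chrAB
chrUn_example
chr5_random'

kept=$(printf '%s\n' "$names" | grep -E "$canonical")
printf '%s\n' "$kept"

# With seqkit the same idea would look like this (untested on gorGor6;
# \d is used as in the answer above):
# seqkit grep -r -p '^chr(\d+|2[AB]|X)$' gorGor6.fa -o output.fa
```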

3.6 years ago

Hello, I am not sure whether you want only the chromosome names or the sequences themselves. If you only need the names, you can try:

zcat hg19.fa.gz | grep -E '^>chr([1-9]|1[0-9]|2[0-2]|X|Y)$'
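The grep above prints header lines only. If the sequences are wanted as well and seqkit is not available, a small awk filter is one way to do it. This is a sketch that assumes each header line is exactly the sequence name (e.g. >chr1 with no description after it); the function name filter_canonical and the demo sequences are made up for illustration.

```shell
# Keep a FASTA record only when its header is exactly >chr1..>chr22, >chrX or >chrY.
# On each header line the keep flag is recomputed; every line is printed while it is set.
filter_canonical() {
  awk '/^>/ { keep = ($0 ~ /^>chr([1-9]|1[0-9]|2[0-2]|X|Y)$/) } keep'
}

# Tiny made-up FASTA for demonstration (the sequences are not real).
demo=$(printf '>chr1\nACGT\nGGCC\n>chr1_gl000191_random\nTTTT\n>chrUn_gl000211\nAAAA\n>chr22\nCCCC\n>chrX\nGGGG\n' | filter_canonical)
printf '%s\n' "$demo"

# On the real file this would be something like:
# zcat hg19.fa.gz | filter_canonical | gzip > result.fa.gz
```

Multi-line sequences are handled because the flag stays set until the next header line.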
