3.6 years ago
shubhamkumbhar420
Hello guys, I have a FASTA file called hg19.fa.gz, and it contains chromosomes like:
zcat hg19.fa.gz | grep ">"
chr1
chr2
chr3
chrUn_gl000211
chrUn_gl000221
chrUn_gl000214
chrUn_gl000228
chrUn_gl000227
chr1_gl000191_random
chr19_gl000208
and many more.
Now I want to keep only chromosomes 1 to 22, plus X and Y.
Thank you
Thank you for your response! It worked!
@shenwei356 This command line is very useful. I wanted to get only the canonical chromosomes from a female gorilla genome FASTA file, i.e. chr1, chr2A, chr2B, chr3, ..., chr22, chrX. I edited the command to: seqkit grep -i -r -p '^chr[\dX'2A''2B']+$' gorGor6.fa > output.fa and it worked. I am trying to understand how this works. I looked into the seqkit grep options -i, -r, and -p. Could you please explain exactly how the pattern part works, e.g. '^chr[\dXY]+$'? What does +$ indicate? Thank you
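To illustrate how the pattern '^chr[\dXY]+$' reads, here is a minimal sketch that applies an equivalent pattern to some of the header names from the original question. It uses grep -E rather than seqkit, and grep -E has no \d shorthand, so [0-9] stands in for it; that substitution is an assumption of this sketch, not part of the original command. The last two lines show what the shell actually passes to seqkit for the edited pattern: adjacent quoted and unquoted pieces are simply joined.

```shell
# Sample header names taken from the question (without the leading ">").
names='chr1
chr2
chrX
chrY
chrUn_gl000211
chr1_gl000191_random
chr19_gl000208'

# ^chr      the name must start with "chr"
# [0-9XY]   a character class: one digit, or X, or Y
# +         one or more characters from that class
# $         the name must end right there (so "_" or letters such as "Un" are rejected)
kept=$(printf '%s\n' "$names" | grep -E '^chr[0-9XY]+$')
printf '%s\n' "$kept"

# The edited pattern: the shell concatenates '^chr[\dX' + 2A + '' + 2B + ']+$'.
pattern='^chr[\dX'2A''2B']+$'
printf '%s\n' "$pattern"   # prints ^chr[\dX2A2B]+$
```

So the edited command works because the character class ends up containing digits, X, A, and B, which is enough to match chr2A and chr2B. Note that a class like this is loose: it would also accept names such as chrXY or chr99 if they existed in the file.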
It's a regular expression. Learn more:
And it should be
'^chr\w+$'
for your case.
@shenwei356 Hi, thanks for the help. This expression, '^chr\w+$', is actually not working. I have a FASTA file for a female gorilla genome with several non-canonical entries. I only want the regular chromosomes, i.e. chr1, chr2A, chr2B, chr3, chr4, ..., chr22, chrX. What expression should I use? TIA
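One likely reason '^chr\w+$' keeps the non-canonical entries is that \w matches letters, digits, and the underscore, so any name made only of those characters after "chr" still passes. The sketch below compares that loose pattern with a stricter one using grep -E. The non-canonical names are hypothetical placeholders (the real gorGor6 headers may look different), [[:alnum:]_] stands in for \w, and the stricter pattern is only one possible choice.

```shell
# Hypothetical header names for illustration; the last two stand in for
# non-canonical entries and are NOT copied from gorGor6.fa.
names='chr1
chr2A
chr2B
chr22
chrX
chrUn_example1
chr1_example2_random'

# Loose: "chr" followed by any letters, digits, or underscores.
# -c counts matching lines; every name above matches.
loose=$(printf '%s\n' "$names" | grep -E -c '^chr[[:alnum:]_]+$')
echo "loose pattern keeps $loose of 7 names"

# Stricter: "chr", then either digits with an optional A or B suffix, or X.
strict=$(printf '%s\n' "$names" | grep -E '^chr([0-9]+[AB]?|X)$')
printf '%s\n' "$strict"
```

If this matches the names in your file, the same stricter pattern could be tried with the seqkit command from this thread, e.g. seqkit grep -r -p '^chr([0-9]+[AB]?|X)$' gorGor6.fa > output.fa. This is untested against the real file, so check the result with grep ">" output.fa.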