6.6 years ago
t86dan
Hello,
I have a FASTA file for the human genome GRCh37 reference assembly, but for some reason chromosome 20 appears twice and I would like to remove the duplicate. I know it's repeated because when I use grep to look for '>', it shows the following list of all the chromosomes (chr20 appears twice):
chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr20 chr21 chr22 chrX chrY chrM
Any help on removing the repeated chromosome 20 would be appreciated.
Thanks in advance!
I would also double-check that the two chr20 sequences are identical before removing one of them.
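One way to run that check is to hash each record and compare the digests per ID. This is only a rough sketch, not a tested pipeline: the function name `record_digests` is made up for illustration, the ID is assumed to be the first whitespace-separated word of the header, and sequences are upper-cased before hashing so that soft-masked (lower-case) copies still count as the same.

```python
import hashlib
from collections import defaultdict


def record_digests(lines):
    """Map each FASTA ID to a list of MD5 digests, one per record with that ID.

    `lines` is any iterable of text lines (for example an open file handle).
    Assumption: the ID is the first word after '>' in the header line.
    Assumption: case is ignored, so soft-masking does not affect the result.
    """
    digests = defaultdict(list)
    current_id = None
    md5 = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current_id is not None:
                digests[current_id].append(md5.hexdigest())
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            md5 = hashlib.md5()
        elif current_id is not None and line:
            # Hash line by line so a whole chromosome never sits in memory.
            md5.update(line.upper().encode())
    # Close the last record in the file.
    if current_id is not None:
        digests[current_id].append(md5.hexdigest())
    return dict(digests)
```

Calling `record_digests(open("genome.fa"))` (the filename is a placeholder) and looking at the `"chr20"` entry should give two digests; if they are equal, the two copies are the same sequence apart from case and line wrapping.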
Have you tried any tools/commands? There are multiple threads on this forum about modifying FASTA files:
Remove Fasta Sequences with Duplicate IDs (but with different Descriptions) & Append Different Descriptions
Remove duplicates in fasta file based on ID
Remove duplicate sequences with same id from a fasta file
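For what it's worth, the approach those threads describe (keep the first record for each ID, drop later ones) can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not a method taken from the linked threads: the names `drop_duplicate_records` and `dedupe_file` are hypothetical, the ID is assumed to be the first word of the header, and it keeps the first chr20 without checking that the second copy is identical.

```python
def drop_duplicate_records(lines):
    """Yield FASTA lines, skipping every record whose ID was already seen.

    Assumption: the ID is the first whitespace-separated word after '>'.
    Any lines before the first header are dropped.
    """
    seen = set()
    keep = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            # Keep the record only the first time this ID shows up.
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            yield line


def dedupe_file(src_path, dst_path):
    """Stream src_path to dst_path, keeping the first record per ID."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(drop_duplicate_records(src))
```

Something like `dedupe_file("GRCh37.fa", "GRCh37.dedup.fa")` (both filenames are placeholders) writes a new file rather than editing in place, so the original stays available for comparison. Any existing index (for example a `.fai`) would need to be rebuilt for the new file.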
Is there a reason why this happened? Duplicated sequences are understandable, but a whole chromosome duplicated? Are you working with cancer data?