Hi, I have a FASTA file for a genome assembly with 646 sequences. The first 7 are pseudochromosomes and the rest are unassigned scaffolds. The headers look something like this:
>GWHABKY00000001 Chromosome 1 Complete=F Circular=F OriSeqID=chr1 Len=553355525
>GWHABKY00000002 Chromosome 2 Complete=F Circular=F OriSeqID=chr2 Len=740519526
>GWHABKY00000003 Chromosome 3 Complete=F Circular=F OriSeqID=chr3 Len=676969686
>GWHABKY00000004 Chromosome 4 Complete=F Circular=F OriSeqID=chr4 Len=612967577
>GWHABKY00000005 Chromosome 5 Complete=F Circular=F OriSeqID=chr5 Len=625473173
>GWHABKY00000006 Chromosome 6 Complete=F Circular=F OriSeqID=chr6 Len=584270320
>GWHABKY00000007 Chromosome 7 Complete=F Circular=F OriSeqID=chr7 Len=744096988
>GWHABKY00000008 OriSeqID=scaffold1 Len=1816015
>GWHABKY00000009 OriSeqID=scaffold10 Len=942477
>GWHABKY00000010 OriSeqID=scaffold100 Len=268586
>GWHABKY00000011 OriSeqID=scaffold101 Len=265196
>GWHABKY00000012 OriSeqID=scaffold102 Len=259718
>GWHABKY00000013 OriSeqID=scaffold103 Len=258511
>GWHABKY00000014 OriSeqID=scaffold104 Len=258489
>GWHABKY00000015 OriSeqID=scaffold105 Len=257418
>GWHABKY00000016 OriSeqID=scaffold106 Len=256425
...
I want to edit the headers so that the first 7 just say:
>Chr1E
>Chr2E
...
And for the rest I just want the scaffold ID:
>scaffold1
>scaffold10
...
What's the best way to do this using sed/awk?
This question has been addressed multiple times on the forum. Please use the search bar. Sample search: https://www.biostars.org/post/search/?query=fasta+header
Assuming each sequence is on a single line:
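One way this could be done with awk, as a rough sketch rather than the definitive answer: take the value of `OriSeqID=` from each header, turn `chrN` into `ChrNE`, and leave scaffold IDs as they are. The function name `rename_headers` and the file names in the usage line are made up for illustration. Since only lines starting with `>` are touched, this version should also cope with sequences wrapped over several lines.

```shell
# Hypothetical helper: reads FASTA on stdin, writes renamed FASTA to stdout.
rename_headers() {
    awk '
    /^>/ {
        # Find the OriSeqID=... field anywhere in the header line.
        if (match($0, /OriSeqID=[^ ]+/)) {
            # Skip the 9 characters of "OriSeqID=" to keep only the value.
            id = substr($0, RSTART + 9, RLENGTH - 9)
            # chr1 -> Chr1E, chr2 -> Chr2E, ...; scaffold IDs stay unchanged.
            if (id ~ /^chr[0-9]+$/) id = "Chr" substr(id, 4) "E"
            print ">" id
            next
        }
    }
    # Sequence lines, and headers without OriSeqID, pass through untouched.
    { print }
    '
}
```

Usage would be something like `rename_headers < genome.fa > genome.renamed.fa`. Writing to a new file keeps the original assembly intact in case the pattern misses a header.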
Or using Perl: