I have a reference genome in which some large chromosomes contain long poly-N regions.
Unfortunately, these long poly-N regions (i.e. size-estimated gaps in the reference genome) make the chromosomes longer than some downstream bioinformatics tools accept.
Collapsing these long poly-N regions to a maximum of e.g. 200bp would likely bring the chromosomes within the maximum chromosome size accepted by the downstream tools. And these size-estimated gaps are of course not used for read mapping or SNP calling anyway.
I thought about using the Linux tr -s 'N' command:
https://www.gnu.org/software/coreutils/manual/html_node/Squeezing-and-deleting.html
But that would reduce every poly-N run in the reference genome to a single N.
And I would like to reduce only poly-N regions longer than e.g. 200bp down to e.g. 200bp.
Is there a good way to do this?
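One direct approach is a short Python script: read each FASTA record, join its sequence lines, and use a regex to cap every run of more than 200 N's at exactly 200. This is a minimal sketch, assuming each chromosome's sequence fits in memory and that N's are uppercase (add .upper() or an IGNORECASE flag if the reference contains lowercase soft-masked n's); the function and parameter names are illustrative, not from any existing tool.

```python
import re

def collapse_poly_n(seq, max_n=200):
    """Collapse runs of N longer than max_n down to exactly max_n."""
    return re.sub("N{%d,}" % (max_n + 1), "N" * max_n, seq)

def rewrite_fasta(in_path, out_path, max_n=200, width=60):
    """Rewrite a FASTA file record by record, capping long poly-N runs
    and re-wrapping the sequence to fixed-width lines."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        header, chunks = None, []

        def flush():
            # Write out the record accumulated so far, if any.
            if header is None:
                return
            seq = collapse_poly_n("".join(chunks), max_n)
            fout.write(header + "\n")
            for i in range(0, len(seq), width):
                fout.write(seq[i:i + width] + "\n")

        for line in fin:
            line = line.rstrip("\n")
            if line.startswith(">"):
                flush()
                header, chunks = line, []
            else:
                chunks.append(line)
        flush()
```

Unlike sed or tr, this never builds one giant regex match per line, only per chromosome, and it preserves record headers and fixed line width.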
Something like this looks like it should work. Unfortunately, sed also does not like the long chromosomes.
So I would probably need to change this to work on a multi-line FASTA text stream with long lines.
An even easier solution might be to create a multi-line FASTA with e.g. a sequence line length of 1000bp, replace every 1000bp poly-N line with a 100bp poly-N line, reformat back to fixed-width FASTA, and perhaps run this iteratively.
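The iteration and reflow steps can be avoided by streaming the file line by line while tracking the current poly-N run across line boundaries, emitting at most 200 N's per run. A sketch under the same assumption of uppercase/lowercase N handling as noted; output line widths become ragged where N's were removed (still valid FASTA, and a reformat pass can restore fixed width):

```python
def collapse_stream(fin, fout, max_n=200):
    """Stream FASTA text from fin to fout, capping each poly-N run at
    max_n bases. Runs spanning line breaks are handled by carrying the
    run length across lines; memory use is one line at a time."""
    run = 0  # length of the poly-N run seen so far
    for line in fin:
        line = line.rstrip("\n")
        if line.startswith(">"):
            run = 0  # runs never span records
            fout.write(line + "\n")
            continue
        out = []
        for ch in line:
            if ch in "Nn":
                run += 1
                if run <= max_n:  # keep only the first max_n N's
                    out.append(ch)
            else:
                run = 0
                out.append(ch)
        if out:  # drop lines that became empty after trimming
            fout.write("".join(out) + "\n")
```

This does the same job as the reflow-and-replace idea but in a single pass with constant memory, so chromosome length no longer matters.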