Question

Reduce poly-N regions longer than 200bp back to 200bp in a reference genome FASTA

0

Entering edit mode

20 months ago

William ★ 5.3k

I have a reference genome with some large chromosomes with long poly-N regions.

Unfortunately, these long poly-N regions (i.e. size estimated gaps in the reference genome) cause the chromosomes to be longer than some downstream bio-informatics tools accept.

Collapsing these long poly-N regions to be max e.g. 200bp would likely bring the chromosomes within the accepted maximum chromosome size of the downstream tools. And these size estimated gaps are of course not used for read mapping or SNP calling.

I thought about using linux tr -s 'N'

https://www.gnu.org/software/coreutils/manual/html_node/Squeezing-and-deleting.html

But that would reduce all poly-N sequences in the reference genome to just 1 N.

And I would like to reduce only poly-N regions longer than e.g. 200bp back to e.g 200bp.

Is there a good way to do this?

FASTA poly-N • 811 views

ADD COMMENT • link 20 months ago by William ★ 5.3k

score 2 · Accepted Answer · 2023-03-23

2

Entering edit mode

20 months ago

Pierre Lindenbaum 164k

eg for chr22, using sed ,j replace 1000 N with a few N :

samtools faidx ref.fasta "chr22" |\
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  |\
sed -r 's/N{1000,}/NNNNNNNNNNNNNNNN/g' |\
tr "\t" "\n" |\
fold -w 100

ADD COMMENT • link 20 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

Something like this looks like it should work. Unfortunately also SED does not like the long chomosomes.

sed: regex input buffer length larger than INT_MAX

So probably need to change this to work on a multi-line FASTA text steam with long lines.

ADD REPLY • link 20 months ago by William ★ 5.3k

1

Entering edit mode

An easier solution is maybe even to create a multi-line fasta , with e.g. line sequence length of 1000bp. And just replace all 1000bp poly-N lines with 100bp poly-N lines. Then reformat fixed width fasta, and maybe run this iteratively.

ADD REPLY • link 20 months ago by William ★ 5.3k