Filter fasta files by number of N's

3.5 years ago

gubrins ▴ 350

Hey,

Once again I need your programming help. I have a lot of fasta files made from 1 Mb sliding windows along a reference genome. Because some regions of the genome are poorly sequenced, or a sample has little data there, I would like to remove the files in which at least one sample has half or more of its bases as N. How could I do that?

Thanks a lot in advance!

bash programming fasta • 937 views

updated 3.5 years ago by GenoMax 151k • written 3.5 years ago by gubrins ▴ 350

where at least one sample has half of the information as N

I don't understand.

3.5 years ago by Pierre Lindenbaum 166k

Sorry, Pierre. Each of my fasta files holds 1 Mb of sequence. I would like to know whether any sample within a fasta file has 50% or more of its bases as N.

3.5 years ago by gubrins ▴ 350

Counting N's Within Fasta

You can use the stats.sh program from the BBMap suite to report the base composition of a file (only the relevant part of the output is shown here). Files where the N content is > 50% are then easy to spot in the N column.

$ stats.sh in=t2.fa
A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.2394  0.2810  0.1935  0.2861  0.0096  0.0000  0.0000  0.4745  0.0000
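The output above is a single composition line for the file, whereas the question asks about each sample inside a file. A minimal sketch of that per-sample check in Python, assuming every FASTA record (one `>` header plus its sequence lines) is one sample and that record names are unique within a file. The function names `n_fractions` and `has_sparse_sample` are made up for this illustration and are not part of BBMap:

```python
import sys


def n_fractions(lines):
    """Return {record_id: fraction of N bases} for FASTA text given as lines."""
    counts = {}  # record id -> [number of N, total bases]
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record id.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            counts[current] = [0, 0]
        elif current is not None:
            # Count both upper- and lower-case N (soft-masked input).
            counts[current][0] += line.upper().count("N")
            counts[current][1] += len(line)
    return {
        name: (n / total if total else 0.0)
        for name, (n, total) in counts.items()
    }


def has_sparse_sample(lines, threshold=0.5):
    """True if any record has an N fraction at or above the threshold."""
    return any(frac >= threshold for frac in n_fractions(lines).values())


if __name__ == "__main__":
    # Print the path of every file that contains at least one sparse sample.
    for path in sys.argv[1:]:
        with open(path) as handle:
            if has_sparse_sample(handle):
                print(path)
```

Saved under a name of your choice (say `n_filter.py`, a hypothetical filename), it could be run as `python n_filter.py *.fa` to list the files that would be candidates for removal. Review that list before deleting anything.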

3.5 years ago by GenoMax 151k
