How to remove sequences that contain N from a multi-FASTA file?

5.0 years ago

k.kathirvel93 ▴ 310

Hi everyone,

I have a multi-FASTA file that contains 11000 genomes (30 kb each). I want to remove every sequence (the whole genome) that contains at least one N. How can I do this with sed or awk? Thanks in advance.

I have input like this:

Genome1 ATCGTCGTACAGATACAGATACANNNcGATAGACATAGACA

Genome2 AGTCGATCAGTACAGATACAGATACAGATACAGATAC

I want output like this:

Genome2 AGTCGATCAGTACAGATACAGATACAGATACAGATAC

genome sequencing sequence alignment • 1.8k views

Hello k.kathirvel93!

Questions similar to yours can already be found at:

removing fasta sequences that have Ns in it in a fasta file

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.

Cheers!

5.0 years ago by Pierre Lindenbaum 166k

Thanks @Pierre Lindenbaum. I went through the thread you mentioned, but it did not work on my large dataset: after running that code, the genomes still contain Ns. Since that thread is 4 years old, I created my own thread. Can you help with this? Thanks.
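
For what it's worth, here is a minimal awk sketch of one way to do this. It assumes the file is standard FASTA: header lines start with ">" and each sequence may be wrapped over several lines. That wrapping is only a guess at why a line-by-line one-liner could leave Ns behind; the thread does not confirm it. The function name `filter_n` and the file names in the usage comment are made up for illustration. The match is case-insensitive (N or n) and only looks at sequence lines, so an N in a header does not drop a record.

```shell
#!/bin/sh
# filter_n (hypothetical name): print only the FASTA records whose sequence
# contains no N or n. Assumes ">" header lines and possibly wrapped sequences.
filter_n() {
    awk '
        # Print the buffered record unless an N was seen in its sequence.
        function emit() { if (header != "" && !has_n) printf "%s", header seq }

        # A header line closes the previous record and starts a new one.
        /^>/ { emit(); header = $0 "\n"; seq = ""; has_n = 0; next }

        # Sequence line: remember whether any N/n occurs, then buffer the line.
        {
            if ($0 ~ /[Nn]/) has_n = 1
            seq = seq $0 "\n"
        }

        # Do not forget the last record in the file.
        END { emit() }
    ' "$@"
}

# Usage (file names are placeholders):
#   filter_n genomes.fasta > genomes.noN.fasta
```

Each record is buffered whole before it is printed, so memory use stays at roughly one genome (about 30 kb here) no matter how many records the file holds.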