Manipulating a fasta file to only have specific characters
4.0 years ago
zizigolu ★ 4.3k

Hi

I have a FASTA file that starts with

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I want a FASTA file whose sequence is a single-line character string that keeps only the nucleotide characters.

Basically, the result should contain nothing other than T, C, G, A or N, with any other character replaced by "N".

I have tried this, but it gives an empty file:

cat input_fasta.fa | sed -r 's/[RYKMSWBVHD]/N/g' > output_fasta.fa

Can you help me?

Thank you so much


input:

$ cat test.fa 

>chr
RYKMSWBVHD
aTGC
ATGkK
ATGC
wWVhDd

output:

$ seqkit replace -sip '[^ATGCN]' -r "N" -w 0 test.fa | seqkit seq -us

NNNNNNNNNNATGCATGNNATGCNNNNNN
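
If seqkit is not installed, roughly the same result can be sketched with standard tools. This is a minimal sketch, not the method used above: the function name fasta_to_acgtn and the temporary file are hypothetical, and it concatenates all records in the file into one string because the headers are dropped.

```shell
# Hypothetical helper: print the sequence of a FASTA file as a single
# line containing only A, C, G, T and N.
fasta_to_acgtn() {
    grep -v '^>' "$1" |                # drop header lines
        tr -d '\r\n' |                 # join sequence lines (also strips CR)
        tr '[:lower:]' '[:upper:]' |   # uppercase soft-masked bases
        tr -c 'ACGTN' 'N'              # every other character becomes N
    echo                               # end the single line with a newline
}

# Small example input mirroring test.fa above
tmp=$(mktemp)
printf '>chr\nRYKMSWBVHD\naTGC\nATGkK\nATGC\nwWVhDd\n' > "$tmp"
result=$(fasta_to_acgtn "$tmp")
echo "$result"
rm -f "$tmp"
```

On the example input this prints the same 29-character string as the seqkit command.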

Linearize your FASTA file using @Pierre's code (which you can easily find by searching for "linearize fasta"; it should be the first hit). Then remove the first column to leave just the sequence.
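
For illustration, linearizing can be sketched with awk as below. This is not necessarily the code referred to above; the function name linearize_fasta and the sample file are hypothetical. It assumes headers contain no tab characters, since the tab is used to separate the header column from the sequence column.

```shell
# Hypothetical helper: print each record as "header<TAB>sequence" on one line.
linearize_fasta() {
    awk '/^>/ { if (n++) printf "\n"; printf "%s\t", $0; next }  # start a new record
         { printf "%s", $0 }                                     # append sequence line
         END { if (n) printf "\n" }' "$1"
}

# Small example with two records
tmp=$(mktemp)
printf '>chr1\nNNAC\nGTRY\n>chr2\nacgt\n' > "$tmp"
linear=$(linearize_fasta "$tmp")
seqs=$(linearize_fasta "$tmp" | cut -f2)   # drop the first (header) column
echo "$seqs"
rm -f "$tmp"
```

Each sequence is now on a single line, so a per-line substitution such as sed can be applied afterwards.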

4.0 years ago
zizigolu ★ 4.3k
wget http://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/chromosomes/chr2.fa.gz
zgrep -v ">chr" chr2.fa.gz | tr -d '\n' | sed -e '$a\'  > chr2.fa

4.0 years ago
Qiongyi ▴ 180

Do you want to remove the header line from your input file? If so, your output file will not be in FASTA format. If your sequences are already on one line, the command below can do the trick.

grep -v ">" input_fasta.fa | sed 's/[RYKMSWBVHD]/N/g'  > output_fasta.fa
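
If the output should stay in FASTA format, a variant that keeps the header lines and writes each sequence on one line might look like the sketch below. The function name clean_fasta and the sample file are hypothetical; unlike the sed command above, it also uppercases the sequence and replaces every character other than A, C, G, T and N.

```shell
# Hypothetical helper: keep headers, join each record's sequence onto one
# line, uppercase it, and replace anything that is not A/C/G/T/N with N.
clean_fasta() {
    awk '/^>/ { if (open) printf "\n"; open = 0; print; next }
         {
             line = toupper($0)
             sub(/\r$/, "", line)             # tolerate Windows line endings
             gsub(/[^ACGTN]/, "N", line)      # ambiguity codes etc. become N
             printf "%s", line
             open = 1
         }
         END { if (open) printf "\n" }' "$1"
}

# Small example with two records
tmp=$(mktemp)
printf '>chr1\nNNac\nGTRY\n>chr2\nwWkT\n' > "$tmp"
cleaned=$(clean_fasta "$tmp")
echo "$cleaned"
rm -f "$tmp"
```

Printing each line as it is read, rather than building the whole sequence in a variable, keeps memory use low for chromosome-sized records.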

No, unfortunately the sequence, as I have shown, is not on one line. I want to convert it to a one-line sequence that contains only A, T, C, G and N.


Do you know how to run a Perl script? If so, you can use my script to convert your FASTA file to a one-line format, then use the grep and sed command above to do the rest. You can download the script at https://github.com/Qiongyi/custom_PERL_scripts/

For your task, the following command should work:

perl linker.pl input_fasta.fa input_oneline.fa
grep -v ">" input_oneline.fa | sed 's/[RYKMSWBVHD]/N/g'  > output_fasta.fa

Thank you

The link says "not found".


Go to https://github.com/Qiongyi/custom_PERL_scripts/ and then find linker.pl.
