Question

Extraction of nt bases from sequence

0

Entering edit mode

10.3 years ago

GP ▴ 10

Hi All,

I have a fasta file containing approx 600bp long reads. I want to extract first 12 bases from each read along with fasta header and write into new file, is there any one liner or easy way to do this job? Any help is appreciated. Thanks!

sequence • 2.9k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by GP ▴ 10

Ram · Answer 1 · 2015-04-29

2

Entering edit mode

10.3 years ago

iraun 6.2k

Assuming that your fasta file has only one line per sequence, you can use this awk command:

awk '{if ($1 ~ />/) {print}else{print substr ($0, 0, 12)}}' file.fa

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by iraun 6.2k

1

Entering edit mode

Or if the headers are ≤ 12 characters then simply

cut -c1-12 file > out

And if it's multiline (same header restriction):

cut -c1-12 file | grep -A1 "^>" | grep -v "^-" > out

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by 5heikki 11k

0

Entering edit mode

Header is > 12 characters, thanks though, works as well!

ADD REPLY • link 10.3 years ago by GP ▴ 10

0

Entering edit mode

Thanks very much airan, its working. Actually I had multline fasta seqs that I formatted to single line, is there any tweaking in your one liner awk command to use it for multiline fasta file. Best, D

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by GP ▴ 10

2

Entering edit mode

Glad to help :). If you have a multiline fasta file, I guess that you could use this one:

awk '/^>/{print;getline;print substr ($0, 0, 12)}' file.fa

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by iraun 6.2k

0

Entering edit mode

Perfect!! again, thanks :-)

ADD REPLY • link 10.3 years ago by GP ▴ 10

0

Entering edit mode

I am trying this code for a big file. Just as an example first 4 sequences, each 30000 nucleotide

When I try the following code, it works.

awk '/^>/{print;getline;print substr ($0, 20, 25)}' NCBIaligned.fasta > out.fasta

>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ccaggtaacaaaccaaccaactttc
>MW476548.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101350191/2020, complete genome
ccaggtaacaaaccaaccaactttc
>MW476562.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101876696/2020, complete genome
ccaggtaacaaaccaaccaactttc
>MW476590.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101037291/2020, complete genome
ccaggtaacaaaccaaccaactttc

However, for the position of more than 1000, it does not work and there is not sequence.

awk '/^>/{print;getline;print substr ($0, 28888, 25)}' NCBIaligned.fasta > out.fasta

>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome

>MW476548.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101350191/2020, complete genome

>MW476562.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101876696/2020, complete genome

>MW476590.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101037291/2020, complete genome

ADD REPLY • link 4.6 years ago by Explorer ▴ 10