Extraction of nt bases from sequence
1
0
Entering edit mode
9.6 years ago
GP ▴ 10

Hi All,

I have a fasta file containing approx 600bp long reads. I want to extract first 12 bases from each read along with fasta header and write into new file, is there any one liner or easy way to do this job? Any help is appreciated. Thanks!

sequence • 2.6k views
ADD COMMENT
2
Entering edit mode
9.6 years ago
iraun 6.2k

Assuming that your fasta file has only one line per sequence, you can use this awk command:

awk '{if ($1 ~ />/) {print}else{print substr ($0, 0, 12)}}' file.fa
ADD COMMENT
1
Entering edit mode

Or if the headers are ≤ 12 characters then simply

cut -c1-12 file > out

And if it's multiline (same header restriction):

cut -c1-12 file | grep -A1 "^>" | grep -v "^-" > out
ADD REPLY
0
Entering edit mode

Header is > 12 characters, thanks though, works as well!

ADD REPLY
0
Entering edit mode

Thanks very much airan, its working. Actually I had multline fasta seqs that I formatted to single line, is there any tweaking in your one liner awk command to use it for multiline fasta file. Best, D

ADD REPLY
2
Entering edit mode

Glad to help :). If you have a multiline fasta file, I guess that you could use this one:

awk '/^>/{print;getline;print substr ($0, 0, 12)}' file.fa
ADD REPLY
0
Entering edit mode

Perfect!! again, thanks :-)

ADD REPLY
0
Entering edit mode

I am trying this code for a big file. Just as an example first 4 sequences, each 30000 nucleotide

When I try the following code, it works.

awk '/^>/{print;getline;print substr ($0, 20, 25)}' NCBIaligned.fasta > out.fasta

>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ccaggtaacaaaccaaccaactttc
>MW476548.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101350191/2020, complete genome
ccaggtaacaaaccaaccaactttc
>MW476562.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101876696/2020, complete genome
ccaggtaacaaaccaaccaactttc
>MW476590.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101037291/2020, complete genome
ccaggtaacaaaccaaccaactttc

However, for the position of more than 1000, it does not work and there is not sequence.

awk '/^>/{print;getline;print substr ($0, 28888, 25)}' NCBIaligned.fasta > out.fasta

>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome

>MW476548.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101350191/2020, complete genome

>MW476562.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101876696/2020, complete genome

>MW476590.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/UT-UPHL-2101037291/2020, complete genome
ADD REPLY

Login before adding your answer.

Traffic: 1802 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6