processing FASTA file
3
0
Entering edit mode
7.4 years ago
s_bio ▴ 10

Dear All,

I have a file with lots of sequences in FASTA format. All these sequences are around 7.2-7.5 kb long. However, I want to retain first 1000 nts and last 1200 nts and want to delete all the remaining middle nts. I would appreciate if anybody can guide me how its done.

Thx :-)

genome R • 3.1k views
ADD COMMENT
1
Entering edit mode

Why did you tag R? Others have given some viable options, but you didn't post anything that you had tried using R (i.e. code snippet), or that you need to use R. This would be easy with Biopython.

ADD REPLY
1
Entering edit mode

Example code to retain first 2 and last two bases of fasta sequences in a single file:

my fasta file:

>test
ACCTGATGT
>test2
TGATAGCTACTAGGGTGTCTATCG

code:

paste <(seqkit subseq -r  1:2 test1.fa | paste - -) <(seqkit subseq -r -2:-1 test1.fa | paste -  -  | cut -f 2) | awk '{print $1"\n"$2 $3}'

output:

>test
ACGT
>test2
TGCG

Download seqkit from here

ADD REPLY
0
Entering edit mode

Another solution:

seqkit fx2tab testnt.fa |awk  '{print $1, "\t", substr($2,1,2), x = substr($2,length($2)-1,length ($2))}' OFS= | seqkit tab2fx
ADD REPLY
0
Entering edit mode

Dear Vinayjrao and Pierre,

Thank you so much guys for sharing the code lines. I was able to make the desired files with the help of your code lines.

Best!

s_bio

ADD REPLY
1
Entering edit mode

If those answers have helped you then consider "upvoting" and "accepting" (use green check mark) to provide closure for this thread.

ADD REPLY
0
Entering edit mode

Thanks for the useful answer. I have an extra problem since, since some sequences are missing on my files. Departing from:

File1.fas
>seq1    acacaca
>seq2    acacacg
>seq3    acacact
File 2.fas
>seq1    GCGCGCG
>seq3    GCGCGCT

Is it possible to get the following using seqkit concat

File3.fas
>seq1    acacacaGCGCGC
>seq2    acacacgNNNNNN
>seq3    acacactGCGCGCT

I am using: $seqkit concat File*.fas -o File3.fas, and I get file 3 with different lengths according to the missing data. (of course i have more than two fasta files, and there are missing sequences on all of them)

Thanks!!!

ADD REPLY
0
Entering edit mode

source of Ns in seq2? padding ? @OP

ADD REPLY
0
Entering edit mode

I forgot to specify that in my study each file correspond to a different pcr amplify genetic marker, and that seq1-3 correspond to different species. The source of N's in seq2 is the missing marker included in File 2 (e.g. we couldn't amplify it or it is missing in the reference database).

ADD REPLY
0
Entering edit mode

amartinez.ull : Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized. This comment/question should have gone under @shenwei's answer.

ADD REPLY
0
Entering edit mode

I will do next time. It is my first post here, and I am still not very familiar with the rule. I apologize.

ADD REPLY
0
Entering edit mode

@ amartinez.ull : If length of the sequence is fixed, you can follow this post What is the fastest way to add 'Ns' to variable length sequences in a .fasta such that they have the same length. See the post by Petr Ponomarenko and upvote the OP.

ADD REPLY
0
Entering edit mode

Thanks! Seems like a potential solution, but as far as I see, this still requieres that I manually include the name of the missing markers in "File 2" (following the names in my original example) and use that function. It must be a quicker way to do it...

ADD REPLY
0
Entering edit mode

You don't have to fill in. Concat the files using seqkit and then use this script. Input would be stdin. Concern i have is if the lengths are not fixed, then take the maximum length of the sequence, store that in a variable, use it to pad the sequence, for each id

$ seqkit concat a.fa b.fa --quiet | awk '$1~">"{print $0}$1!~">"{tmp="";for(i=1;i<15-length($0);i++){tmp=tmp"N"};print $0""tmp}' 

>seq1
acacacaGCGCGCG
>seq2
acacacgNNNNNNN
>seq3
acacactGCGCGCT

In this example (posted in oP), length of each sequence is fixed 15bp. Then it is easier to pad. If the padding is done as per length of largest sequence, then you need to find largest sequence, it's length, store it in a variable, then apply above code.

ADD REPLY
1
Entering edit mode
7.4 years ago

linearize and extract the 5' and 3' part

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  |\
awk -F '\t' '{L=length($2);if(L<2200) {printf("%s\n%s\n",$1,$2);} else {printf("%s\n%s%s\n",$1,substr($2,1,1000),substr($2,L-1200,1200));}}'
ADD COMMENT
1
Entering edit mode
7.4 years ago
vinayjrao ▴ 250

If I understand your question correctly, you could try grep -v ">" file.fa | head -c 1000 > 1000.txt tail -c 1200 file.fa > 1200.txt

ADD COMMENT
1
Entering edit mode

I assume that header information is important:)

ADD REPLY
0
Entering edit mode

Grant is right. file.fa | head -c 1000 > 1000.txt

tail -c 1200 file.fa > 1200.txt would work better

ADD REPLY
0
Entering edit mode
7.3 years ago

A simple solution using seqkit concate (concatenate sequences with same ID from multiple files) newly added in v0.7.0:

$ seqkit concate <(seqkit subseq -r 1:1000 seqs.fa) <(seqkit subseq -r -1200:-1 seqs.fa)
ADD COMMENT
1
Entering edit mode

@shenwei356: Consider renaming this option concat.

ADD REPLY
0
Entering edit mode

I'll fix it. thank you dear @genomax !

ADD REPLY

Login before adding your answer.

Traffic: 1234 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6