Subseq a FASTA file
1
0
Entering edit mode
3.9 years ago
Explorer ▴ 10

I am trying to subsetting a FASTA file at a specific nucleotide positions. For example

 >random sequence 1
    tatgtgcgag
    >random sequence 2
    agggtgttat
    >random sequence 3
    tatgtgcgag
    >random sequence 4
    gactcgcggt
    >random sequence 5
    tatgtgcgag
    >random sequence 6
    gcagccatcg
    >random sequence 7
    gactcgcggt
    >random sequence 8
    tatgtgcgag
    >random sequence 9
    tatgtgcgag
    >random sequence 10
    tatgtgcgag

I am able to cut the sequence from position 3 to 6 but ID is missing. I want to same IDs as the original file. Can anyone help to modify my code, please? Thanks

cat random.fasta |sed -n 2~2p |cut -c3-6 >out.fasta

    tgtg
    ggtg
    tgtg
    ctcg
    tgtg
    agcc
    ctcg
    tgtg
    tgtg
    tgtg
awk sed • 1.2k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Thanks for sharing the link. I tried the command for multiline FASTA and it partially worked. I am extending the thread there.

ADD REPLY
0
Entering edit mode

or you sure you are using a multiline fasta?

With 'multiline' we mean that the sequence is block-formatted and is thus present on several lines under 1 header. It does not refer to the fact you have several entries in a single fasta file.

ADD REPLY
0
Entering edit mode

I am not sure about that. When I open it in a text editor, there are blocks of 60nt each while in SnapGene, ApE it is as per software settings. I got confused because the code suggested for multiline sort-of worked.

In the original thread shared by Pierre Lindenbaum, the following code was suggested.

For single line FASTA file

awk '{if ($1 ~ />/) {print}else{print substr ($0, 0, 12)}}' file.fa

For multiline FASTA file

awk '/^>/{print;getline;print substr ($0, 0, 12)}' file.fa

When I tried the first code, there was no subsetting, while for 2nd code it worked up to 1000 position.

ADD REPLY
0
Entering edit mode
3.9 years ago

if it does not have to be in plain bash scripting you might be better of using specific tools to achieve this.

For instance use SeqKit, more precise the subseq command from it. Have a look here https://bioinf.shenwei.me/seqkit/usage/#subseq on how to use is.

ADD COMMENT
0
Entering edit mode

It seems SeqKit is more memory efficient. I will try it if bash does not work.

ADD REPLY

Login before adding your answer.

Traffic: 1880 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6