Question

Subseq a FASTA file

0

Entering edit mode

4.6 years ago

Explorer ▴ 10

I am trying to subsetting a FASTA file at a specific nucleotide positions. For example

 >random sequence 1
    tatgtgcgag
    >random sequence 2
    agggtgttat
    >random sequence 3
    tatgtgcgag
    >random sequence 4
    gactcgcggt
    >random sequence 5
    tatgtgcgag
    >random sequence 6
    gcagccatcg
    >random sequence 7
    gactcgcggt
    >random sequence 8
    tatgtgcgag
    >random sequence 9
    tatgtgcgag
    >random sequence 10
    tatgtgcgag

I am able to cut the sequence from position 3 to 6 but ID is missing. I want to same IDs as the original file. Can anyone help to modify my code, please? Thanks

cat random.fasta |sed -n 2~2p |cut -c3-6 >out.fasta

    tgtg
    ggtg
    tgtg
    ctcg
    tgtg
    agcc
    ctcg
    tgtg
    tgtg
    tgtg

awk sed • 1.7k views

ADD COMMENT • link 4.6 years ago by Explorer ▴ 10

1

Entering edit mode

Extraction of nt bases from sequence

ADD REPLY • link 4.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thanks for sharing the link. I tried the command for multiline FASTA and it partially worked. I am extending the thread there.

ADD REPLY • link 4.6 years ago by Explorer ▴ 10

0

Entering edit mode

or you sure you are using a multiline fasta?

With 'multiline' we mean that the sequence is block-formatted and is thus present on several lines under 1 header. It does not refer to the fact you have several entries in a single fasta file.

ADD REPLY • link 4.6 years ago by lieven.sterck 15k

0

Entering edit mode

I am not sure about that. When I open it in a text editor, there are blocks of 60nt each while in SnapGene, ApE it is as per software settings. I got confused because the code suggested for multiline sort-of worked.

In the original thread shared by Pierre Lindenbaum, the following code was suggested.

For single line FASTA file

awk '{if ($1 ~ />/) {print}else{print substr ($0, 0, 12)}}' file.fa

For multiline FASTA file

awk '/^>/{print;getline;print substr ($0, 0, 12)}' file.fa

When I tried the first code, there was no subsetting, while for 2nd code it worked up to 1000 position.

ADD REPLY • link 4.6 years ago by Explorer ▴ 10

score 0 · Answer 1 · 2021-01-14

0

Entering edit mode

4.6 years ago

lieven.sterck 15k

if it does not have to be in plain bash scripting you might be better of using specific tools to achieve this.

For instance use SeqKit, more precise the subseq command from it. Have a look here https://bioinf.shenwei.me/seqkit/usage/#subseq on how to use is.