Question

Extraction of first sequences from a big fasta file

4

9.7 years ago

vahapel ▴ 210

Dear All,

I would like to ask a question about extracting 100000 sequences from a big fasta file. The forum has plenty of scripts that extract sequences by ID, but I could not find one for this purpose. Is there a script or bash command for extracting the first and/or last 100000 sequences of a fasta file?

Many thanks in advance for all your help!

sequence • 17k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by vahapel ▴ 210

0

If not, you could write one easily enough with biopython or bioperl...

Well, the first X records is easier than the last X records, but still.
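
As a rough illustration of why the first X records are easier, here is a minimal sketch in plain Python (no Biopython). The function names `first_records` and `last_records` are made up for this example. It assumes every record starts with a ">" header line. Taking the first records can stop as soon as record X+1 begins, while taking the last records has to read the whole input and keep a sliding window of X records:

```python
from collections import deque


def first_records(lines, n):
    """Yield the lines of the first n FASTA records, stopping early."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                return  # record n+1 starts here, nothing more to read
        if seen:  # skip anything before the first header
            yield line


def last_records(lines, n):
    """Return the lines of the last n FASTA records (reads all input)."""
    window = deque(maxlen=n)  # keeps only the n most recent records
    for line in lines:
        if line.startswith(">"):
            window.append([line])
        elif window:
            window[-1].append(line)
    return [line for record in window for line in record]
```

Both functions accept any iterable of lines, so an open file handle can be passed directly and the result written out with `writelines`.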

ADD REPLY • link 9.7 years ago by Devon Ryan 104k

0

You can have a look at this post; it may be helpful for you:

Extract sequence with header from a fasta file with specific ID given in another file

ADD REPLY • link 8.5 years ago by Tanvir Ahamed ▴ 350

0

That is not an answer to the question that was originally asked.

ADD REPLY • link 8.5 years ago by GenoMax 146k

Ram · Answer 1 · 2015-02-20

6

9.7 years ago

dariober 15k

This strategy is based on standard *nix tools. Get the first two sequences:

awk -v RS='>' 'NR>1 { gsub("\n", ";", $0); sub(";$", "", $0); print ">"$0 }' seq.fa \
    | head -n 2 \
    | tr ';' '\n'

Each record is temporarily joined onto a single line with semicolons, so head counts records rather than lines. It assumes the semicolon ; doesn't occur in the sequence names.

Replace head with tail to get the last sequences.

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 9.7 years ago by dariober 15k

0

Hi dariober, thank you so much for this script. It works well.

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 9.7 years ago by vahapel ▴ 210

Ram · Answer 2 · 2015-02-20

4

9.7 years ago

Brian Bushnell 20k

For the first 100k sequences:

reformat.sh in=data.fasta out=100k.fasta reads=100000

You can also get a random 100k sequences with Reformat, but not the last 100k.

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 9.7 years ago by Brian Bushnell 20k

0

Hi Brian, thank you for your help and for introducing me to the "bbmap" tools; they make my project much easier.

ADD REPLY • link 9.7 years ago by vahapel ▴ 210

0

You're welcome!

ADD REPLY • link 9.7 years ago by Brian Bushnell 20k

Ram · Answer 3 · 2015-02-20

2

9.7 years ago

5heikki 11k

Assuming no line breaks within sequences, i.e. every record is exactly two lines, the first 100000 records are the first 200000 lines (for fastq files I believe the multiplier would be 4):

head -n 200000 input > output
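
If the sequences are wrapped over several lines, the two-line assumption fails and head would cut in the wrong place. One way around this, sketched below in Python, is to unwrap the file first so that every record becomes exactly two lines. The function name `linearize` is hypothetical, and the sketch assumes each record starts with a ">" header line:

```python
def linearize(lines):
    """Yield header and sequence lines, with each sequence joined onto one line."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                # Emit the previous record as exactly two lines
                yield header + "\n"
                yield "".join(chunks) + "\n"
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # Emit the final record
        yield header + "\n"
        yield "".join(chunks) + "\n"
```

Writing the output of `linearize` to a new file gives a fasta where the line-count arithmetic above holds for both head and tail.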

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 9.7 years ago by 5heikki 11k

0

Thank you, 5heikki, for your help. It is a very simple and useful command.

ADD REPLY • link 9.7 years ago by vahapel ▴ 210

0

Likewise, use tail for the last N sequences.

ADD REPLY • link 7.0 years ago by st.ph.n ★ 2.7k

score 2 · Answer 4 · 2019-06-10

2

5.4 years ago

AK ★ 2.2k

Alternatives: seqkit head and seqkit range:

(1) Leading 100000 records:

seqkit head -n 100000 input.fa
seqkit range -r 1:100000 input.fa

(2) Last 100000 records:

seqkit range -r -100000:-1 input.fa

(3) Other ranges:

seqkit range -r 100001:200000 input.fa
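
For readers without seqkit, the range idea in the examples above (1-based, inclusive positions, with negative values counted from the end) can be approximated in plain Python. This is only a sketch of the concept, not how seqkit implements it. The names `count_records` and `record_range` are made up for this example, and the file is read twice so that negative positions can be resolved:

```python
def count_records(path):
    """Count FASTA records by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def record_range(path, start, end):
    """Yield the lines of records start..end (1-based, inclusive).

    Negative positions count from the end, so -1 is the last record.
    """
    total = count_records(path)
    if start < 0:
        start = total + start + 1
    if end < 0:
        end = total + end + 1
    start = max(start, 1)
    index = 0  # number of headers seen so far
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                index += 1
                if index > end:
                    break  # past the requested range
            if index >= start:
                yield line
```

For example, `record_range("input.fa", -100000, -1)` would yield the lines of the last 100000 records, assuming a file of that name exists.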

ADD COMMENT • link 5.4 years ago by AK ★ 2.2k

Ram · Answer 5 · 2017-10-18

0

7.0 years ago

Simply Bioinformatics ▴ 200

# Number of sequences (records) to keep, counted by ">" header lines
REQUESTED_SEQS=10
awk "/^>/ {n++} n>$REQUESTED_SEQS {exit} {print}" input.fasta > output.fasta

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 7.0 years ago by Simply Bioinformatics ▴ 200

score 0 · Answer 6 · 2019-06-09

0

5.4 years ago

psschlogl ▴ 50

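import gzip
from Bio import SeqIO

# `file` is the path to a gzipped fastq file; print the ID of its first read
hdl = gzip.open(file, 'rt')
records = SeqIO.parse(hdl, 'fastq')
first_read = next(records)
print(first_read.id)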

ADD COMMENT • link 5.4 years ago by psschlogl ▴ 50

1

Hi psschlogl

Though your answer shows some potential, it does not really answer the question posted. If you can provide a more complete/correct answer, we could keep it in the answers thread. If not, we will consider removing it.

ADD REPLY • link 5.4 years ago by lieven.sterck 15k