Separate the Longest Strain in Bacteria
2
0
Entering edit mode
6.4 years ago
Shelle ▴ 30

I have a FASTA file of bacteria which has multiple strains like below. I would like to go through the FASTA file and extract/separate the largest strain from the FASTA file. Does anyone help me how i can do that?

>NZ_PKKG01000001.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.1, whole genome shotgun sequence
TCTAATTTCACGGCTGATAGTTGATTTATGACAGCCTACTTCTTGAGCGATAGCAGTGCAGGAGGTTATGCCAGAATCAA
GCAGAGCTTGAATTTGTCCACGTTGTTCGCTGTTCAATTGATGATAATGCTTGGAAATGCTAGAATTTGAGTTGGTCATG
AAGATCTTCCTTTCTTGATTTTTGGTCACTTCAAGTTTAGGTCTTCATGGCCTTTTTGTTTAACAATTAGTGTTGCACTT

>NZ_PKKG01000101.1 Lactobacillus crispatus strain UMB1398 .21837_8_80.101, whole genome shotgun sequence
AAGGGCAAGAGCATGAAAGACAAGTCAAAAGCTTATGTTCAACAGGCAACTGATGCCATTAACCACAAATATCGCCGAAT
CCTCCAATATCACACAGCAGAGGAACTCTTCAAGCAATATATCTCTTCATAACCTAACTGTTGCACTTAATTTGACAATT
CAGGCAACTTGTAAATTGAACAAAAAATAGGGAAGATGAAACTAATCATTTCCCTATTTGTTTTTAAGAACAGTATTCAA
...
genome sequence • 1.6k views
ADD COMMENT
0
Entering edit mode
seqkit sort --quiet -2lr input.fa

Outputs the sequences sorted by length and largest one on the top. You can use seqkit sort --quiet -2lr sequence.fa |seqkit head -n 1 for the largest sequence (provided there are no two equal length largest sequences) . Download seqkit from here. If you are a programmer, you can also use index of your fasta file (.fai from samtools faidx) to getting largest sequence.

ADD REPLY
1
Entering edit mode
6.4 years ago

Get the length of each sequence using the following command -


cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' reference

ADD COMMENT
0
Entering edit mode

@doctor.dee005 thanks for your response. This command only lists the length of sequences. How can i actually extract the longest strain and put it in a separate fasta file?

ADD REPLY
0
Entering edit mode

Try this

longest=$(seqkit sort --quiet -lr your.fasta.file.fa | seqkit fx2tab -nl | head -1  | cut -f1); seqkit grep -n -r -p "$longest" your.fasta.file.fa > longest.fa

Explanation

A. Fetching the header of the longest sequence

longest=$(seqkit sort --quiet -lr your.fasta.file.fa | seqkit fx2tab -nl | head -1  | cut -f1);

seqkit sort --quiet -lr your.fasta.file.f : sort by length

seqkit fx2tab -nl: take the lengths of sorted fasta file

head -1 | cut -f1 : first line would have the header and length of the longest sequence. head -1 is for taking first line and cut -f1 is to take just the header. This is saved in the variable called longest.

B. Fetching the sequence

 seqkit grep -n -r -p "$longest" your.fasta.file.fa > longest.fa

grep : fetch header stored in longest variable

-n: by full name

-r: using regex (regular expression)

-p: pattern

ADD REPLY
0
Entering edit mode

@Vijay Please update your variables("$largest" is not defined). One can further shorten :

> longest=$(seqkit sort --quiet -lr input.fa| seqkit seq -n |head -1); seqkit grep -np "$longest" input.fa
ADD REPLY
0
Entering edit mode

@cpad : did that, longest is more appropriate. I liked the shortened version ;)

ADD REPLY
0
Entering edit mode
6.4 years ago

Hello shelle,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

This question has been asked many times. A solution using seqkit

seqkit fx2tab -nl your.fasta.file.fa | sort -nr -k2 | head -1

Thank you!

ADD COMMENT

Login before adding your answer.

Traffic: 2232 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6