Question

Remove contigs that are lower than 200

4.7 years ago

Bioinfo ▴ 20

Hello, I have a file containing 2805 contigs. The shortest one is 37 nucleotides long, and I want to delete all contigs shorter than 200 nucleotides. Can anyone tell me which Linux command line I can use? Thank you.

assembly sequencing genome next-gen • 5.3k views

updated 8 months ago by GenoMax 147k • written 4.7 years ago by Bioinfo ▴ 20

Hi, how did you calculate the length of your shortest contig? What command or program did you use?

8 months ago by carlosgonzalezcruz327 ▴ 20

Use one of the solutions here: How to find shortest lenth or longest length from fasta file

8 months ago by GenoMax 147k
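
If you only need the lengths and would rather not install anything, a small awk wrapper is one way to get them. This is a minimal sketch, not one of the linked solutions: fasta_lengths is a hypothetical helper name, and it assumes a plain (uncompressed) FASTA file with Unix line endings.

```shell
# fasta_lengths FILE : print "name<TAB>length" for every record in a FASTA file.
# Hypothetical helper name; not part of any package.
fasta_lengths() {
    awk '
        /^>/ {
            if (seen) print name "\t" len   # flush the previous record
            seen = 1
            name = substr($1, 2)            # header up to the first whitespace, without ">"
            len = 0
            next
        }
        { len += length($0) }               # a sequence may span several lines
        END { if (seen) print name "\t" len }
    ' "$1"
}

# Shortest and longest contig, sorted numerically on the length column:
# fasta_lengths contigs.fasta | sort -k2,2n | head -n 1
# fasta_lengths contigs.fasta | sort -k2,2n | tail -n 1
```

Summing line lengths per record, rather than measuring a single line, keeps the result correct for wrapped FASTA files.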

score 5 · Accepted Answer · 2020-03-02

4.7 years ago

andres.firrincieli 3.8k

One option is to use reformat.sh from the BBMap package:

reformat.sh in=contigs.fasta out=filtered.fasta minlength=200

4.7 years ago by andres.firrincieli 3.8k
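
If BBMap is not available, the same filtering can be sketched with awk alone. This is a minimal sketch under assumptions, not part of BBMap: filter_fasta_min_len is a hypothetical helper name, the input is assumed to be plain FASTA with Unix line endings, and contigs.fasta and filtered.fasta are simply the file names used in the command above.

```shell
# filter_fasta_min_len FILE MINLEN : print only records whose sequence
# length is >= MINLEN. Hypothetical helper name; not a BBMap command.
filter_fasta_min_len() {
    awk -v min="$2" '
        # Print the buffered record if it is long enough
        function emit() {
            if (header != "" && len >= min) printf "%s\n%s", header, seq
        }
        /^>/ { emit(); header = $0; seq = ""; len = 0; next }
        { seq = seq $0 "\n"; len += length($0) }   # keep original line wrapping
        END { emit() }
    ' "$1"
}

# Usage, keeping contigs of 200 nt or longer:
# filter_fasta_min_len contigs.fasta 200 > filtered.fasta
```

The record is buffered until the next header so that wrapped sequences are measured as a whole before deciding whether to keep them.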

It's working, thank you so much.

4.7 years ago by Bioinfo ▴ 20

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.

4.7 years ago by Ram 44k

score 4 · Accepted Answer · 2020-03-02

You can do this with the BioPerl module Bio::SeqIO installed. Save the script below as filter_contigs.pl in the same directory as the file with the contigs and run it with perl filter_contigs.pl. It reads the input file contigs.fasta, skips contigs shorter than 200 bp, and writes the remaining contigs to contigs_filt_200.fasta. The input file itself is not modified.

use strict;
use warnings;
use Bio::SeqIO;

# Setting minimum length to 200
my $min_len = 200;

# Reading the input fasta file
my $seqio_in = Bio::SeqIO->new(-file => "contigs.fasta", 
                             -format => "fasta" );

# Creating the output fasta file                             
my $seqio_out = Bio::SeqIO->new(-file => ">contigs_filt_200.fasta", 
                             -format => "fasta" );

# Saving sequences to the output if length >= min_len     
while ( my $seq = $seqio_in->next_seq ) {
    if ( $seq->length  >=  $min_len ) {
        $seqio_out->write_seq($seq);
    }
}
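
A quick sanity check after running the script is to compare the record counts of the input and output files. This is a minimal sketch, assuming both files are plain FASTA; count_fasta_records is a hypothetical helper name.

```shell
# count_fasta_records FILE : count header lines (one per record) in a FASTA file.
# Hypothetical helper name, for illustration only.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Compare the number of contigs before and after filtering:
# count_fasta_records contigs.fasta
# count_fasta_records contigs_filt_200.fasta
```

The difference between the two counts is the number of contigs that were dropped.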