Question

How To Discard Sequences Longer Than N Nucleotides ?

1

Entering edit mode

12.2 years ago

2011101101 ▴ 110

Hi Everyone,I have some small RNAs data.The length of reads are between 18 to 36. Now,I want to discard sequences longer than 27 nucleotides. How can I do?
There is an example. I have a file named mss.fasta.The contents of the following is the file:

>hb3_563308_x1 
GGGATCGACGGTTCGGTCGGTGC 
>hb8_30401_x1
TCTTTGCACGCCCGTTGCGCCCGCGGCCC
>hb7_268329_x1
GAAGGAGAAATCGTGTCTGAACCA
>hb6_975200_x1
TGATGCTTGCTGTGCCTGCACAGATT
>hb1_2324034_x16
TGTTTTCGGATACTGCCA
>hb2_216320_x1
TGCTAACTAGCTATGCGGAGCCATCCCTCCGCATCT

Now I need the retain the sequence length short than 28.This result of example is that.

>hb3_563308_x1
GGGATCGACGGTTCGGTCGGTGC
>hb7_268329_x1
GAAGGAGAAATCGTGTCTGAACCA
>hb6_975200_x1
TGATGCTTGCTGTGCCTGCACAGATT
>hb1_2324034_x16
TGTTTTCGGATACTGCCA

How can I do ? Thank you very much !

small rna data • 11k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 12.2 years ago by 2011101101 ▴ 110

0

Entering edit mode

Assuming reads are fastq format?

ADD REPLY • link 12.2 years ago by Neilfws 49k

0

Entering edit mode

clearly not, after edit :)

ADD REPLY • link 12.2 years ago by Neilfws 49k

Ram · Answer 1 · 2013-02-03

13

Entering edit mode

12.2 years ago

Pierre Lindenbaum 166k

Using awk for a fastq file:

gunzip -c file.fastq.gz| \
awk '{y= i++ % 4 ; L[y]=$0; if(y==3 && length(L[1])<=27) {printf("%s\n%s\n%s\n%s\n",L[0],L[1],L[2],L[3]);}}'

update: from your example ( a fasta input):

cat input.fa| \
awk '{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])<=27) {printf("%s\n%s\n",L[0],L[1]);}}'

ADD COMMENT • link 12.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi

Thanks for sharing this awk one liner. Could you please explain it abit in detail, I am leaning awk one liners. This command is really awesome!!

ADD REPLY • link 12.2 years ago by Rahul Sharma ▴ 660

2

Entering edit mode

'i' is the current number of lines,'y' is the index of the line for the fastq record and is just the modulo of i/4. 'y' can be 0,1,2 or 3. we put all the lines '$0' in an associative array 'L' with 'y' as the key. if y==3, we have reached the end of the record: we print the content 'L' if needed.

ADD REPLY • link 12.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

excellent job!! thnks

ADD REPLY • link 12.2 years ago by Rahul Sharma ▴ 660

0

Entering edit mode

I enter the following command.

cat mss.fasta | awk '{y= i++ % 4 ; L[y]=$0; if(y==3 && length(L[1])<=27) {printf("%s\n%s\n%s\n%s\n",L[0],L[1],L[2],L[3]);}}'

,I can't get the result I needed.

ADD REPLY • link 12.2 years ago by 2011101101 ▴ 110

0

Entering edit mode

because for your example, it's not a FASTQ input but a FASTA, change 4 to 2.

ADD REPLY • link 12.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thank you very much ! I get that.

ADD REPLY • link 12.2 years ago by 2011101101 ▴ 110

0

Entering edit mode

I have another question .If I want to retain the length between 30 to 33 ,what can I do?

cat input.fa | awk '{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])>=30 && length(L[1])<=33) {printf("%s\n%s\n",L[0],L[1]);}}'

ADD REPLY • link 12.2 years ago by 2011101101 ▴ 110

1

Entering edit mode

Best to post as new question rather than ask in the comments. In this case though, it would probably be closed for being too similar to the original question. I think you can come up with a solution yourself, based on what people have contributed here.

ADD REPLY • link 12.2 years ago by Neilfws 49k

0

Entering edit mode

This is great. I had a need for this function and I wrapped the awk command here into a bash script (Linux). Two scripts, actually. One for fastq, one for fasta. Fastq script handles singe read, paired ends, and keeps 1 or 2 indexes in phase with your filtered output. To help you decide what to filter on, I also wrote two accompanying scripts to generate length histograms of your fastq or fasta file(s). Clone my github repo and add the cloned directory to your path to gain functionality.

The commands you would use are: filter_fastq_by_length.sh, filter_fasta_by_length.sh, fastq_length_histogram.sh, and fasta_length_histogram.sh.

They work great with miseq data, but I might need to change the grep strings to work with hiseq data. Shoot me a note here or github if you run into any issues.

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 9.6 years ago by alk224 ▴ 10

Ram · Answer 2 · 2013-02-03

6

Entering edit mode

12.2 years ago

Vince Buffalo ▴ 470

Or, use bioawk, which deals with gzipped files and FASTQ natively. I recommend it more than rolling your own FASTQ parser each time.

For example:

bioawk -cfastx '{if (length($seq) <= 100) {print "@"$name"\n"$seq"\n+\n"$qual}} ' file.fq.gz

Note: this only uses FASTQ names (before the first space); it will discard everything after.

ADD COMMENT • link 12.2 years ago by Vince Buffalo ▴ 470

1

Entering edit mode

The example above is in fasta format, not fastq.

ADD REPLY • link 12.2 years ago by Madelaine Gogol 5.3k

0

Entering edit mode

Intentional; showing readers how to do it in FASTQ allows them to extend it to FASTA. Anyone that can't adapt this to FASTA probably shouldn't be copying and pasting commands from the internet into their shells.

ADD REPLY • link 12.2 years ago by Vince Buffalo ▴ 470

0

Entering edit mode

Okay, I just thought you misread it...

ADD REPLY • link 12.2 years ago by Madelaine Gogol 5.3k

0

Entering edit mode

I apologize if this is a really basic awk/bioawk question, but if I use this command, do I need to provide a new output name for the file that contains only the <= 100 bp reads, e.g. by adding a > newfile.fq.gz, or, does this command overwrite the file.fq.gz with only the appropriately sized reads? Thanks!

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by aeberardi • 0

Ram · Answer 3 · 2013-02-04

3

Entering edit mode

12.2 years ago

Daniel ★ 4.0k

In true "there's more than one way to do it" form, I present perl:

#!/usr/bin/perl

$head;
while (<>){
        if($_ =~ />/){
                $head = $_;
        }elsif(length($_) < 27){
                print "$head" . "$_";
        }
}

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 12.2 years ago by Daniel ★ 4.0k

0

Entering edit mode

I dont really dig oneliners so not sure how to convert it but I'm sure it's pretty easy.

ADD REPLY • link 12.2 years ago by Daniel ★ 4.0k

Ram · Answer 4 · 2013-02-04

2

Entering edit mode

12.2 years ago

AGS ▴ 250

faFilter from the Genome Browser utilities

faFilter -maxSize=27

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 12.2 years ago by AGS ▴ 250

Ram · Answer 5 · 2013-02-03

1

Entering edit mode

12.2 years ago

14134125465346445 ★ 3.6k

Do you want to discard them altogether or trim them to 27bp?

git clone https://github.com/lh3/seqtk.git
cd seqtk/
make
./seqtk trimfq -l 27 file.fq > file.trim.fq

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 12.2 years ago by 14134125465346445 ★ 3.6k

0

Entering edit mode

Thank you .I need to discard them altogether nor trim them to 27bp.

ADD REPLY • link 12.2 years ago by 2011101101 ▴ 110

Ram · Answer 6 · 2013-02-04

1

Entering edit mode

12.2 years ago

Martin A Hansen 3.0k

Using Biopieces:

read_fasta -i in.fna | grab -e 'SEQ_LEN > 30' | write_fasta -o out.fna -x

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 12.2 years ago by Martin A Hansen 3.0k

Ram · Answer 7 · 2013-02-04

0

Entering edit mode

12.2 years ago

yangzhenzhen1988 ▴ 30

>hb3_563308_x1 
GGGATCGACGGTTCGGTCGGTGC 
>hb8_30401_x1
TCTTTGCACGCCCGTTGCGCCCGCGGCCC
>hb7_268329_x1
GAAGGAGAAATCGTGTCTGAACCA
>hb6_975200_x1
TGATGCTTGCTGTGCCTGCACAGATT
>hb1_2324034_x16
TGTTTTCGGATACTGCCA
>hb2_216320_x1
TGCTAACTAGCTATGCGGAGCCATCCCTCCGCATCT

#put the above read fasta text into a file named text.txt
#use the following perl script, name it test.pl

#!/usr/bin/perl -w
use strict;
my $id ="";
my %seq;
open IN,"<test.txt";
while(<IN>) {
  chomp;
  if(/^>/) {
    $id = $_;
}
else {
    $seq{$id} = $_;
    }
}
close IN;

foreach my $key (keys %seq) {
    if(length($seq{$key}) <=27) {
        print "$key\n$seq{$key}\n";
    }
}

#run this script by typing in terminal: perl test.pl

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 12.2 years ago by yangzhenzhen1988 ▴ 30

2

Entering edit mode

This hurts my eyes ;oP: a Perl three liner.

ADD REPLY • link 12.2 years ago by Martin A Hansen 3.0k

0

Entering edit mode

It hurt my eyes because the formatting was really bad. Please, make an effort to format correctly (e.g. indent lines of code with 4 spaces). Answers are auto-previewed as you type, so you can see how it will look.

ADD REPLY • link 12.2 years ago by Neilfws 49k