How To Discard Sequences Longer Than N Nucleotides ?
7
1
Entering edit mode
11.9 years ago
2011101101 ▴ 110

Hi Everyone,I have some small RNAs data.The length of reads are between 18 to 36. Now,I want to discard sequences longer than 27 nucleotides. How can I do?
There is an example. I have a file named mss.fasta.The contents of the following is the file:

>hb3_563308_x1 
GGGATCGACGGTTCGGTCGGTGC 
>hb8_30401_x1
TCTTTGCACGCCCGTTGCGCCCGCGGCCC
>hb7_268329_x1
GAAGGAGAAATCGTGTCTGAACCA
>hb6_975200_x1
TGATGCTTGCTGTGCCTGCACAGATT
>hb1_2324034_x16
TGTTTTCGGATACTGCCA
>hb2_216320_x1
TGCTAACTAGCTATGCGGAGCCATCCCTCCGCATCT

Now I need the retain the sequence length short than 28.This result of example is that.

>hb3_563308_x1
GGGATCGACGGTTCGGTCGGTGC
>hb7_268329_x1
GAAGGAGAAATCGTGTCTGAACCA
>hb6_975200_x1
TGATGCTTGCTGTGCCTGCACAGATT
>hb1_2324034_x16
TGTTTTCGGATACTGCCA

How can I do ? Thank you very much !

small rna data • 10k views
ADD COMMENT
0
Entering edit mode

Assuming reads are fastq format?

ADD REPLY
0
Entering edit mode

clearly not, after edit :)

ADD REPLY
13
Entering edit mode
11.9 years ago

Using awk for a fastq file:

gunzip -c file.fastq.gz| \
awk '{y= i++ % 4 ; L[y]=$0; if(y==3 && length(L[1])<=27) {printf("%s\n%s\n%s\n%s\n",L[0],L[1],L[2],L[3]);}}'

update: from your example ( a fasta input):

cat input.fa| \
awk '{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])<=27) {printf("%s\n%s\n",L[0],L[1]);}}'
ADD COMMENT
0
Entering edit mode

Hi

Thanks for sharing this awk one liner. Could you please explain it abit in detail, I am leaning awk one liners. This command is really awesome!!

ADD REPLY
2
Entering edit mode

'i' is the current number of lines,'y' is the index of the line for the fastq record and is just the modulo of i/4. 'y' can be 0,1,2 or 3. we put all the lines '$0' in an associative array 'L' with 'y' as the key. if y==3, we have reached the end of the record: we print the content 'L' if needed.

ADD REPLY
0
Entering edit mode

excellent job!! thnks

ADD REPLY
0
Entering edit mode

I enter the following command.

cat mss.fasta | awk '{y= i++ % 4 ; L[y]=$0; if(y==3 && length(L[1])<=27) {printf("%s\n%s\n%s\n%s\n",L[0],L[1],L[2],L[3]);}}'

,I can't get the result I needed.

ADD REPLY
0
Entering edit mode

because for your example, it's not a FASTQ input but a FASTA, change 4 to 2.

ADD REPLY
0
Entering edit mode

Thank you very much ! I get that.

ADD REPLY
0
Entering edit mode

I have another question .If I want to retain the length between 30 to 33 ,what can I do?

cat input.fa | awk '{y= i++ % 2 ; L[y]=$0; if(y==1 && length(L[1])>=30 && length(L[1])<=33) {printf("%s\n%s\n",L[0],L[1]);}}'
ADD REPLY
1
Entering edit mode

Best to post as new question rather than ask in the comments. In this case though, it would probably be closed for being too similar to the original question. I think you can come up with a solution yourself, based on what people have contributed here.

ADD REPLY
0
Entering edit mode

This is great. I had a need for this function and I wrapped the awk command here into a bash script (Linux). Two scripts, actually. One for fastq, one for fasta. Fastq script handles singe read, paired ends, and keeps 1 or 2 indexes in phase with your filtered output. To help you decide what to filter on, I also wrote two accompanying scripts to generate length histograms of your fastq or fasta file(s). Clone my github repo and add the cloned directory to your path to gain functionality.

The commands you would use are: filter_fastq_by_length.sh, filter_fasta_by_length.sh, fastq_length_histogram.sh, and fasta_length_histogram.sh.

They work great with miseq data, but I might need to change the grep strings to work with hiseq data. Shoot me a note here or github if you run into any issues.

ADD REPLY
6
Entering edit mode
11.9 years ago
Vince Buffalo ▴ 470

Or, use bioawk, which deals with gzipped files and FASTQ natively. I recommend it more than rolling your own FASTQ parser each time.

For example:

bioawk -cfastx '{if (length($seq) <= 100) {print "@"$name"\n"$seq"\n+\n"$qual}} ' file.fq.gz

Note: this only uses FASTQ names (before the first space); it will discard everything after.

ADD COMMENT
1
Entering edit mode

The example above is in fasta format, not fastq.

ADD REPLY
0
Entering edit mode

Intentional; showing readers how to do it in FASTQ allows them to extend it to FASTA. Anyone that can't adapt this to FASTA probably shouldn't be copying and pasting commands from the internet into their shells.

ADD REPLY
0
Entering edit mode

Okay, I just thought you misread it...

ADD REPLY
0
Entering edit mode

I apologize if this is a really basic awk/bioawk question, but if I use this command, do I need to provide a new output name for the file that contains only the <= 100 bp reads, e.g. by adding a > newfile.fq.gz, or, does this command overwrite the file.fq.gz with only the appropriately sized reads? Thanks!

ADD REPLY
3
Entering edit mode
11.9 years ago
Daniel ★ 4.0k

In true "there's more than one way to do it" form, I present perl:

#!/usr/bin/perl

$head;
while (<>){
        if($_ =~ />/){
                $head = $_;
        }elsif(length($_) < 27){
                print "$head" . "$_";
        }
}
ADD COMMENT
0
Entering edit mode

I dont really dig oneliners so not sure how to convert it but I'm sure it's pretty easy.

ADD REPLY
2
Entering edit mode
11.9 years ago
AGS ▴ 250

faFilter from the Genome Browser utilities

faFilter -maxSize=27
ADD COMMENT
1
Entering edit mode
11.9 years ago

Do you want to discard them altogether or trim them to 27bp?

git clone https://github.com/lh3/seqtk.git
cd seqtk/
make
./seqtk trimfq -l 27 file.fq > file.trim.fq
ADD COMMENT
0
Entering edit mode

Thank you .I need to discard them altogether nor trim them to 27bp.

ADD REPLY
1
Entering edit mode
11.9 years ago

Using Biopieces:

read_fasta -i in.fna | grab -e 'SEQ_LEN > 30' | write_fasta -o out.fna -x
ADD COMMENT
0
Entering edit mode
11.9 years ago
>hb3_563308_x1 
GGGATCGACGGTTCGGTCGGTGC 
>hb8_30401_x1
TCTTTGCACGCCCGTTGCGCCCGCGGCCC
>hb7_268329_x1
GAAGGAGAAATCGTGTCTGAACCA
>hb6_975200_x1
TGATGCTTGCTGTGCCTGCACAGATT
>hb1_2324034_x16
TGTTTTCGGATACTGCCA
>hb2_216320_x1
TGCTAACTAGCTATGCGGAGCCATCCCTCCGCATCT

#put the above read fasta text into a file named text.txt
#use the following perl script, name it test.pl

#!/usr/bin/perl -w
use strict;
my $id ="";
my %seq;
open IN,"<test.txt";
while(<IN>) {
  chomp;
  if(/^>/) {
    $id = $_;
}
else {
    $seq{$id} = $_;
    }
}
close IN;

foreach my $key (keys %seq) {
    if(length($seq{$key}) <=27) {
        print "$key\n$seq{$key}\n";
    }
}

#run this script by typing in terminal: perl test.pl
ADD COMMENT
2
Entering edit mode

This hurts my eyes ;oP: a Perl three liner.

ADD REPLY
0
Entering edit mode

It hurt my eyes because the formatting was really bad. Please, make an effort to format correctly (e.g. indent lines of code with 4 spaces). Answers are auto-previewed as you type, so you can see how it will look.

ADD REPLY

Login before adding your answer.

Traffic: 1831 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6