Question

Extracting last N bases of reads in FASTQ file

1

Entering edit mode

8.7 years ago

lebrigand ▴ 10

Hello, I can't find an easy solution to extract the last N bases of all reads from a FASTQ file, is there an easy solution with prinseq-lite, cutadapt or Fastx-toolkit to do it ? Or an other tool. I know i can do it on my own with either a java or a perl script, but well it seems so obvious that one of those program can do it in a few seconds, but i do not find the way to do it. thanks a lot to save me time, and thanks for this so resourcefull forum. kevin

rna-seq sequence • 8.7k views

ADD COMMENT • link updated 4.8 years ago by brianpenghe ▴ 80 • written 8.7 years ago by lebrigand ▴ 10

1

Entering edit mode

You mean remove them?

ADD REPLY • link 8.7 years ago by Gabriel R. ★ 2.9k

1

Entering edit mode

Do you want to remove last N bases from a fastq file or want to extract them into new file?

ADD REPLY • link 8.7 years ago by venu 7.1k

2

Entering edit mode

Answers below cover both possibilities.

ADD REPLY • link 8.7 years ago by GenoMax 148k

1

Entering edit mode

I was puzzled by the same thing.

ADD REPLY • link 8.7 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

Hi all, thanks for the answers but it seems not working as i want. First all the sed command does not output fastq file so it is useless for me. Concerning the bbduk options, i did not try bbduk yet but it really seems interesting and maybe it can do the stuff, but this command line does not work. Let me explain better what i want, here is the first reads of my input.fastq file:

@NB500XXXXXX

CGCTCNAGAACAGGGTTTGTTAAGAGTAC

+

AAAAA#EEEEEEEEEEEEEEE/EEEEEEE

What i want as output is the last 4 bases in a fastq format file:

@NB500XXXXXX

GTAC

+

EEEE

NB: cat input.fastq | /share/apps/local/bbmap/bbduk.sh in=stdin.fastq forcetrimright=3 out=output.fq minlength=3

--> output.fastq contains the first 4 bases not the last ones

Note that reads can have differents length, what i want is really the last 4 bases of all reads in my fastq file in a fastq format (including QV) thanks for helping me ! kevin.

ADD REPLY • link 8.7 years ago by lebrigand ▴ 10

1

Entering edit mode

If future use the "Add comment"/"Add reply" options against the respective answers/comments to provide additional information. Do not add a "new answer".

As for the BBMap solution I suggest that you use reformat.sh like so

reformat.sh in=input.fastq forcetrimleft=[seq_length - bases you want to keep] out=right.fastq

Replace [seq_length - bases you want to keep] this part with a real number.

Edit: I see that reads can have different length so this solution would not work.

ADD REPLY • link 8.7 years ago by GenoMax 148k

1

Entering edit mode

It would help if the original question gave the exact details of what you want. The way it is worded makes it seem like all you wanted was the last four bases of each read simply printed out. Truncating fastq entries down to the last 4bp is a different problem.

ADD REPLY • link 8.7 years ago by pld 5.1k

score 4 · Answer 1 · 2016-04-27

4

Entering edit mode

8.7 years ago

pld 5.1k

sed -n '2~4p' myfile.fq | grep -o '.\{5\}$'

This will print the last 5 characters. The sed command prints the sequences (every 4th line of the fastq, starting with the second), the grep grabs the last n characters.

ADD COMMENT • link 8.7 years ago by pld 5.1k

3

Entering edit mode

+1

You can also do it all in sed: sed -rn 's/.*(.{3})/\1/; 2~4p' myfile.fq for the last 3 characters, for example

ADD REPLY • link 8.7 years ago by george.ry ★ 1.2k

1

Entering edit mode

Note. This solution does not keep the output reads in fastq format.

ADD REPLY • link 8.7 years ago by GenoMax 148k

1

Entering edit mode

Note. This solution in current form does not keep the output reads in fastq format.

ADD REPLY • link 8.7 years ago by GenoMax 148k

1

Entering edit mode

OP didn't ask for a way to truncate fastq entries, just to pull the last n bp of each read out of a fastq file.

ADD REPLY • link 8.7 years ago by pld 5.1k

0

Entering edit mode

Hi @pld, thanks for sharing this. What would be used if I wanted to exrtact the first and last 5 bases of a fastq file? What should be typed in differently for the second line after the pipeline?

ADD REPLY • link 5.7 years ago by lovegenetics ▴ 70

1

Entering edit mode

You should generally not ask new questions in answers that are not directly related.

As for your question: you could simply do zcat file.fastq.gz | head -20 to get first five reads. If your file is not compressed then just head -20 file.fastq will suffice.

ADD REPLY • link 5.7 years ago by GenoMax 148k

0

Entering edit mode

Hi @genomax, thank you for your reply. but I have just modified my question, I meant how I could extract the first and last N bases from a read in a fastq file?

I have used the following command to extract the last 1000 bases of a read from a fastq file but I'd like to incorportate the first 1000 bases to the command as well:

Blockquote

$$ grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | grep -o '.{1000}$'

ADD REPLY • link 5.7 years ago by lovegenetics ▴ 70

score 4 · Answer 2 · 2016-04-27

4

Entering edit mode

8.7 years ago

John 13k

I don't have time to check it on a FASTQ file, but in python:

with open('input.fastq','r') as f:
    toggle = False
    for line in f:
        if toggle: print line[-5:-1]
        else:      print line[ 0:-1]
        toggle = not toggle

Change the path of input.fastq, save the code as dangerous.py, and run python dangerous.py > new.fastq

Works on reads of different lengths, and outputs the format you were looking for.

ADD COMMENT • link 8.7 years ago by John 13k

2

Entering edit mode

Confirmed to work as advertised :-)

ADD REPLY • link 8.7 years ago by GenoMax 148k

score 3 · Answer 3 · 2017-03-06

3

Entering edit mode

7.8 years ago

shenwei356 8.7k

I'm wondering why the No.1 user Biostar modified some threads and brought them to homepage these days.

But seqkit can do this easily.

N=100
gzip -d -c read_1.fq.gz | seqkit subseq -r -$N:-1 | gzip -c > read_1.t.fq.gz

The definition of region is 1-based and with some custom design.

 1-based index    1 2 3 4 5 6 7 8 9 10
negative index    0-9-8-7-6-5-4-3-2-1
           seq    A C G T N a c g t n
           1:1    A
           2:4      C G T
         -4:-2                c g t
         -4:-1                c g t n
         -1:-1                      n
          2:-2      C G T N a c g t
          1:-1    A C G T N a c g t n
          1:12    A C G T N a c g t n
        -12:-1    A C G T N a c g t n

Example:

$ echo -e "@seq\nACGTNacgtn\n+\nGGGGGCCCCC"                        
@seq
ACGTNacgtn
+
GGGGGCCCCC

first 6 bases:

$ echo -e "@seq\nACGTNacgtn\n+\nGGGGGCCCCC" | seqkit subseq -r 1:6  
@seq
ACGTNa
+
GGGGGC

last 6 bases:

$ echo -e "@seq\nACGTNacgtn\n+\nGGGGGCCCCC" | seqkit subseq -r -6:-1
@seq
Nacgtn
+
GCCCCC

ADD COMMENT • link 7.8 years ago by shenwei356 8.7k

1

Entering edit mode

I'm wondering why the No.1 user Biostar modified some threads and brought them to homepage these days.

That is a programmed feature in Biostars that "bumps" old threads (I am sure there is some logic that @Istvan can comment on) to keep the main page periodically populated.It provides a chance for (relatively) new users like you to offer fresh takes on answers/comments.

ADD REPLY • link 7.8 years ago by GenoMax 148k

2

Entering edit mode

Since it's hard to find out what the robot modified. I tend to accept your guessing.

Yes, although I registered 4 years ago, I just started to comment few months ago :P

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

score 2 · Answer 4 · 2016-04-27

2

Entering edit mode

8.7 years ago

michael.ante ★ 3.9k

Hi Kevin,

give bbduk.sh from the bbmap suite a try. It will be something like:

cat myfile.fq | bbduk.sh in=stdin.fq forcetrimright=$N  out=stdout.fq minlength=$N

AFAIK, bbduk is 0-based; so if you want to have the last 5 nt, your N has to be 4.

Cheers,

Michael

ADD COMMENT • link 8.7 years ago by michael.ante ★ 3.9k

0

Entering edit mode

I'm sorry. In my toy example it seemed to work.

If you want to make it complicated: Use reformat.sh to make the reverse complement, get the first N reads, and change it back with another reverse complement:

cat myfile.fq | reformat.sh in=stdin.fq  out=stdout.fq rcomp=t | bbduk.sh in=stdin.fq forcetrimright=$N  out=stdout.fq minlength=$N | reformat.sh in=stdin.fq  out=stdout.fq rcomp=t

Anyway, I'd go for John's Python solution.

ADD REPLY • link 8.7 years ago by michael.ante ★ 3.9k

score 0 · Answer 5 · 2020-03-01

0

Entering edit mode

4.8 years ago

brianpenghe ▴ 80

To get an output in fastq format (say last 20 basepair) you have to do this:

paste <(cat my.fastq | paste - - | cut -f1 | tr "\t" "\n") <(cat my.fastq | paste - - | cut -f2 | tr "\t" "\n" | grep -o '.{20}$' ) | tr "\t" "\n"

ADD COMMENT • link 4.8 years ago by brianpenghe ▴ 80

0

Entering edit mode

This command does not produce a meaningful output, at least not on my Macbook. I guess you would need to use egrep, then it should be fine I guess. Correct me if I am wrong. Is this different on Linux?

ADD REPLY • link 4.8 years ago by ATpoint 86k