Question

Manipulate Sequences In Fastq Files

12.2 years ago

lsvijfhuizen ▴ 90

Dear All,

I have 20x Illumina sequence data in large FASTQ files. The reads in each file are 21 nucleotides long, and I would like to remove the first 4 nucleotides from every read in the files.

For example, from:

@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:
CATGATTTGATATTTAGGGCTT
+
HIFHIEGHIIFHGIIGHIIIDH
@D5N3XBQ1:129:C0T9LACXX:1:1101:1150:2163 1:N:0:
CATGATGACATAGAAATAATTT
+
IIFIIIIIIIIIIIFIFIIIFI
@D5N3XBQ1:129:C0T9LACXX:1:1101:1155:2248 1:N:0:
CATGAAGACAAAGCCTCTATGA

to

@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:
ATTTGATATTTAGGGCTT
+
HIFHIEGHIIFHGIIGHIIIDH
@D5N3XBQ1:129:C0T9LACXX:1:1101:1150:2163 1:N:0:
ATGACATAGAAATAATTT
+
IIFIIIIIIIIIIIFIFIIIFI
@D5N3XBQ1:129:C0T9LACXX:1:1101:1155:2248 1:N:0:
AAGACAAAGCCTCTATGA

I am new to bioinformatics and would appreciate a few pointers on the best way to get this done with the command line in Linux. Thanks, Lisanne

fastq sequence • 11k views


I have edited my answer; presumably you also want to remove the first 4 characters of each quality line?

12.2 years ago by Neilfws 49k

score 9 · Answer 1 · 2012-10-25

sed is your friend.

sed '2~4s/^\(.\{4\}\)//' myfile > newfile

Translated, that says: starting at line 2 and on every 4th line after it (the sequence lines), replace the first 4 characters with nothing (i.e. remove them).

If you don't want to write to newfile, run:

sed -i '2~4s/^\(.\{4\}\)//' myfile

to edit myfile "in place".

EDIT

I think you will also want to remove the corresponding characters from the quality score lines. In that case, use the address 2~2, which matches every even-numbered line (both the sequence and the quality lines):

sed '2~2s/^\(.\{4\}\)//' myfile > newfile
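The `first~step` address form is a GNU sed extension, so the commands above may not work with other sed implementations (for example the BSD sed shipped with macOS). A minimal portable sketch of the same even-line trim using awk follows; it assumes plain 4-line FASTQ records, and the file names `demo.fq` and `demo_trimmed.fq` are made up for illustration:

```shell
# Build a one-record FASTQ file for demonstration (record taken from the question).
printf '%s\n' \
  '@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:' \
  'CATGATTTGATATTTAGGGCTT' \
  '+' \
  'HIFHIEGHIIFHGIIGHIIIDH' > demo.fq

# Even-numbered lines hold the sequence (line 2 of each record) and the
# quality string (line 4). Drop their first 4 characters; print all other
# lines unchanged.
awk 'NR % 2 == 0 { $0 = substr($0, 5) } { print }' demo.fq > demo_trimmed.fq

cat demo_trimmed.fq
```

Because `substr($0, 5)` keeps everything from the 5th character onward, the sequence and quality lines stay the same length as each other.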

Other useful command line tools for text processing: grep, awk, cut, paste, head, tail.

score 6 · Answer 2 · 2012-10-25

As an alternative approach, the FASTX-Toolkit provides a number of command line utilities for manipulating FASTQ and FASTA sequence files.

fastx_trimmer has lots of options for trimming sequences in a variety of ways. To achieve what you are after (trim first 4 bases from each read):

fastx_trimmer -f 5 -z -i infile.fq.gz -o outfile.fq.gz

The -f 5 option is the key one here: it says that the 5th nucleotide of each sequence is the first one you want to keep (i.e. the first 4 are discarded). The -z option gzip-compresses the output.
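To make the meaning of "first base to keep" concrete, the same trim on gzipped data can be imitated with standard tools. This is only a sketch of what the trim does, not a reimplementation of fastx_trimmer; the `first` variable and the `demo_in.fq.gz` and `demo_out.fq.gz` names are assumptions for illustration, and plain 4-line FASTQ records are assumed:

```shell
# Hypothetical demo input: one gzipped 4-line FASTQ record from the question.
printf '%s\n' \
  '@D5N3XBQ1:129:C0T9LACXX:1:1101:1150:2163 1:N:0:' \
  'CATGATGACATAGAAATAATTT' \
  '+' \
  'IIFIIIIIIIIIIIFIFIIIFI' | gzip -c > demo_in.fq.gz

first=5   # 1-based position of the first base to keep (the role -f 5 plays above)

# Decompress, keep each sequence/quality line from position $first onward,
# and recompress the result.
gzip -dc demo_in.fq.gz \
  | awk -v first="$first" 'NR % 2 == 0 { $0 = substr($0, first) } { print }' \
  | gzip -c > demo_out.fq.gz

# Show the trimmed record.
gzip -dc demo_out.fq.gz
```

Streaming through a pipe avoids writing an uncompressed copy of a large file to disk.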

score 1 · Answer 3 · 2012-10-25

JC 13k

Perl one-liner, trimming both the read sequence and the quality lines:

cat file.fq | perl -plane '$ln++; s/^....// if ($ln % 2 == 0)' > trimmed.fq
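Whichever command is used, a quick sanity check is to confirm that each sequence line and its quality line still have the same length, as the FASTQ format expects. A small sketch, assuming plain 4-line records; `check.fq` is a made-up demo file standing in for the trimmed output:

```shell
# Demo file: the sequence is trimmed but the quality string is left at full
# length (as in the question's desired output), so the check should flag it.
printf '%s\n' \
  '@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:' \
  'ATTTGATATTTAGGGCTT' \
  '+' \
  'HIFHIEGHIIFHGIIGHIIIDH' > check.fq

# Remember the sequence length on line 2 of each record and compare it with
# the quality length on line 4; count the records where they differ.
bad=$(awk 'NR % 4 == 2 { n = length($0) }
           NR % 4 == 0 && length($0) != n { bad++ }
           END { print bad + 0 }' check.fq)

echo "records with mismatched lengths: $bad"
```

A count of 0 on the real output means the sequence and quality lines were trimmed consistently.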
