Manipulate Sequences In Fastq Files
12.1 years ago
lsvijfhuizen ▴ 90

Dear All,

I have 20x Illumina sequence data in large FASTQ files. The reads in each file are 21 nucleotides long. I would like to remove the first 4 nucleotides from every read in the files.

For example, from

@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:
CATGATTTGATATTTAGGGCTT
+
HIFHIEGHIIFHGIIGHIIIDH
@D5N3XBQ1:129:C0T9LACXX:1:1101:1150:2163 1:N:0:
CATGATGACATAGAAATAATTT
+
IIFIIIIIIIIIIIFIFIIIFI
@D5N3XBQ1:129:C0T9LACXX:1:1101:1155:2248 1:N:0:
CATGAAGACAAAGCCTCTATGA

to

@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:
ATTTGATATTTAGGGCTT
+
HIFHIEGHIIFHGIIGHIIIDH
@D5N3XBQ1:129:C0T9LACXX:1:1101:1150:2163 1:N:0:
ATGACATAGAAATAATTT
+
IIFIIIIIIIIIIIFIFIIIFI
@D5N3XBQ1:129:C0T9LACXX:1:1101:1155:2248 1:N:0:
AAGACAAAGCCTCTATGA

I am new to bioinformatics and would appreciate a few pointers on the best way to get this done with the command line in Linux. Thanks, Lisanne

fastq sequence • 10k views

Edited my answer; presumably, you also want to remove the first 4 characters of quality score?

12.1 years ago
Neilfws 49k

sed is your friend.

sed '2~4s/^\(.\{4\}\)//' myfile > newfile

Translated, that says: starting at line 2 and on every 4th line after it, replace the first 4 characters with nothing (i.e. remove them).
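To see which lines the 2~4 address actually selects, a small check along these lines can help. It is only a sketch: it assumes GNU sed (the first~step address form is a GNU extension) and uses seq to stand in for the line numbers of three 4-line FASTQ records.

```shell
# Print only the lines matched by the address 2~4 (GNU sed extension).
# seq emits the numbers 1..12, one per line, standing in for 3 FASTQ records.
selected="$(seq 1 12 | sed -n '2~4p')"
echo "$selected"
```

The numbers printed are the positions of the sequence lines, which are the only lines the substitution touches.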

If you don't want to write to newfile, run:

sed -i '2~4s/^\(.\{4\}\)//' myfile

to edit myfile "in place".

EDIT

I think you will also want to remove the corresponding characters from the quality score lines. So you should run:

sed '2~2s/^\(.\{4\}\)//' myfile > newfile

Other useful command line tools for text processing: grep, awk, cut, paste, head, tail.
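After trimming, it may be worth confirming that every sequence line still has the same length as its quality line. The following is a minimal sketch using awk; check_fastq_lengths is a hypothetical helper name, and it assumes strict 4-line FASTQ records (no wrapped sequences).

```shell
# Hypothetical helper: report records whose sequence and quality lengths differ.
# Assumes each record is exactly 4 lines: header, sequence, "+", quality.
check_fastq_lengths() {
    awk 'NR % 4 == 2 { seqlen = length($0) }                  # remember sequence length
         NR % 4 == 0 && length($0) != seqlen {                # compare with quality length
             bad++
             print "length mismatch in record ending at line " NR
         }
         END { exit (bad > 0) }'                              # non-zero status on any mismatch
}

# Example on a single, already trimmed record read from standard input.
printf '%s\n' '@read1' 'ATTT' '+' 'IEGH' | check_fastq_lengths && echo "lengths consistent"
```

For a real file, something like check_fastq_lengths < newfile should work the same way.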

12.1 years ago

As an alternative approach, the FASTX-Toolkit provides a number of command line utilities for manipulating FASTQ and FASTA sequence files.

fastx_trimmer has lots of options for trimming sequences in a variety of ways. To achieve what you are after (trim first 4 bases from each read):

fastx_trimmer -f 5 -z -i infile.fq.gz -o outfile.fq.gz

The -f 5 option is the key one here: it says that the 5th nucleotide of each sequence is the first one you want to keep (i.e. the first 4 are discarded).
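The 1-based "first position to keep" convention described for -f is the same one cut -c uses, which may make it easier to picture. The snippet below is only an illustration on a single string taken from the question, not a replacement for fastx_trimmer.

```shell
# cut -c5- keeps characters from position 5 (1-based) to the end of the line,
# mirroring the "first base to keep" meaning described for -f 5.
kept="$(printf 'CATGATTTGATATTTAGGGCTT\n' | cut -c5-)"
echo "$kept"
```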

12.1 years ago
JC 13k

Perl one-liner that trims both the read sequence and quality lines:

cat file.fq | perl -plane '$ln++; s/^....// if ($ln % 2 == 0)' > trimmed.fq
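The same even-line logic can also be sketched in awk, which avoids depending on GNU-specific sed addresses. Here trim_fastq is a hypothetical helper name, and the sketch assumes strict 4-line FASTQ records, so that every even-numbered line is either a sequence or a quality line.

```shell
# Hypothetical helper: drop the first N characters from every even-numbered line
# (sequence and quality lines in a strict 4-line FASTQ file).
trim_fastq() {
    n="$1"
    awk -v n="$n" 'NR % 2 == 0 { $0 = substr($0, n + 1) } { print }'
}

# Example on the first record from the question.
printf '%s\n' '@read1' 'CATGATTTGATATTTAGGGCTT' '+' 'HIFHIEGHIIFHGIIGHIIIDH' | trim_fastq 4
```

On a real file this would be run as trim_fastq 4 < file.fq > trimmed.fq.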