trimming reads in fastq file
7.6 years ago

I have a FASTQ file in which every read starts with an "N". How can I get rid of that N from the command line? Here is an example:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
NGCGACCTCAGATCAGACGTGGCGACCTGGAATTCTCGGGTGCCAAGGAA
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
#<<ABGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGG
@SRR2163140.2 HISEQ:148:C670LANXX:3:1101:1440:1963 length=50
NAGGCCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATCACGATCTC
+SRR2163140.2 HISEQ:148:C670LANXX:3:1101:1440:1963 length=50
#=<BBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
NGCCGACATCGAAGGATCAATGGAATTCTCGGGTGCCAAGGAACTCCAGT
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
#<<ABFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFF
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
NACAAACCCTTGTGTCGAGGGCTGGAATTCTCGGGTGCCAAGGAACTCCA
RNA-Seq • 4.2k views

You should be doing some QC on the file anyway, so run it through FastQC and Trim Galore. Note that, as far as I know, Trim Galore's default quality trimming works from the 3' end of the read and does not clip any 5' bases, so to drop the leading N you would need its 5' clipping option (--clip_R1 1) rather than the default settings alone.

7.6 years ago
Buffo ★ 2.4k

Trim them with prinseq-lite; you can trim from the 5' end, from the 3' end, by maximum number of Ns, etc. What kind of data is that? miRNA-seq?

http://prinseq.sourceforge.net/manual.html
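
As a rough sketch of what that could look like: the snippet below assumes prinseq-lite.pl is installed and on your PATH, and that -trim_left is the option you want (it removes a fixed number of bases from the 5' end). The file names and the one-read demo input are made up for illustration.

```shell
# Hypothetical demo input: a single shortened read in a temporary directory.
workdir=$(mktemp -d)
in="$workdir/reads.fastq"
printf '%s\n' '@read1' 'NGCGACCTCA' '+' '#<<ABGGGGG' > "$in"

if command -v prinseq-lite.pl >/dev/null 2>&1; then
    # -trim_left 1 removes one base (and its quality score) from the 5' end.
    # -out_good sets the output prefix; -out_bad null skips the rejects file.
    prinseq-lite.pl -fastq "$in" -trim_left 1 \
        -out_good "$workdir/trimmed" -out_bad null
    head -n 4 "$workdir/trimmed.fastq"
else
    echo "prinseq-lite.pl not found on PATH; install it before running this" >&2
fi
```

For a real run, point -fastq at your own file and choose your own output prefix.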
7.6 years ago
Charles Plessy ★ 2.9k

You can use EMBOSS to trim the first base of sequences in many formats, including FASTQ. In the example below, I saved your sequences in a file named toto.fq. As you can see, EMBOSS discards the sequence name on the "+" lines, which makes the file noticeably smaller.

$ seqret fastq-sanger::toto.fq[2:] fastq-sanger::stdout
Read and write (return) sequences
@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
GCGACCTCAGATCAGACGTGGCGACCTGGAATTCTCGGGTGCCAAGGAA
+
<<ABGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGG
@SRR2163140.2 HISEQ:148:C670LANXX:3:1101:1440:1963 length=50
AGGCCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATCACGATCTC
+
=<BBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
GCCGACATCGAAGGATCAATGGAATTCTCGGGTGCCAAGGAACTCCAGT
+
<<ABFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFF
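
If EMBOSS is not available, the same trim can be sketched with plain awk. This is only a minimal sketch: it assumes strict four-line FASTQ records (no wrapped sequences), and the read names and shortened reads in the demo file are made up. Unlike seqret, it leaves the "+" lines untouched.

```shell
# Build a two-record demo file (hypothetical, shortened reads).
in=$(mktemp)
out=$(mktemp)
cat > "$in" <<'EOF'
@read1
NGCGACCTCA
+read1
#<<ABGGGGG
@read2
NAGGCCTGGA
+read2
#=<BBGGGGG
EOF

# Lines 2 and 4 of every 4-line record hold the bases and the qualities;
# drop the first character of both so they stay the same length.
awk 'NR % 4 == 2 || NR % 4 == 0 { print substr($0, 2); next } { print }' "$in" > "$out"

cat "$out"
```

Selecting lines by position rather than by their first character matters here, because a quality string can itself start with "@" or "+".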

That tool seems nice!

Charles, I have a slightly related question about trimming the first base. I was looking at modENCODE CAGE data, and I see that a very high percentage of reads, but not all of them, have an extra first base added to the FASTQ sequence. As is known, that first base is generally a "G". I mapped the FASTQ files and the TSS is shifted by 1 base. I tried both local and end-to-end mapping with bowtie, yet the TSS shift persists. A mismatch on the first base gives the wrong TSS. What do you think is the best way to map these reads accurately?


(I just answered in the post that you linked)
