Question

Weird base qualities and sequences from FastQ file?

0

8.0 years ago

germelcar ▴ 20

Hi everyone:

I have run into something that is new and strange to me, and I would appreciate some help or recommendations.

I noticed that the FASTQ files contain a lot of reads whose qualities are all "#" and whose bases are all "N" for the entire read. In other reads, the first several bases are called fine, but the read then finishes with a long run of "N". Below are the head and tail of the file (2 spots each) and an excerpt from the middle of the /1 read file.

head

@SRR1812885.1 HWUSI-ES1807:12:FC:7:1:7609:1000/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
######################################################################
@SRR1812885.2 HWUSI-ES1807:12:FC:7:1:10872:1000/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
######################################################################

tail

@SRR1812885.70669746 HWUSI-ES1807:12:FC:7:120:13034:23950/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
######################################################################
@SRR1812885.70669747 HWUSI-ES1807:12:FC:7:120:3941:23950/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
######################################################################

some part of the /1.fastq file

@SRR1812885.415516 HWUSI-ES1807:12:FC:7:2:4966:1022/1
NCTCAAGTCATCATGTTCTTGATGTTTACGACGATAGTCTTGTTCAAACCTATCCAATGCATGCAATTCT
+
######################################################################
@SRR1812885.415517 HWUSI-ES1807:12:FC:7:2:16684:1022/1
NGACTGGCACACCCGCTACAAAATTATCAAGGGAACCTGCGAGGGCCTAAAATATCTTCATGAGTTGATG
+
#*((*,-,-+@@@@@@@8@@################################################## 
@SRR1812885.415518 HWUSI-ES1807:12:FC:7:2:1971:1022/1
NGTCTTTGTACAATCTCTTCCACCAATACACAGCATCCATATAATGTAGGATCATCAGCAACCTGTAAAC
+
#*++)-//-/@@@@@@C@@@:<:<<25777@@@@@837997745598979:::::<<8802222211433
@SRR1812885.415519 HWUSI-ES1807:12:FC:7:2:19373:1022/1
NTGGGCATAGGTTATATCTATTTTGCCAGTCAGCATGTTGCAGCTATTTCAAGGCATGGTGTTCTATGCT
+
#)).(+,-+*+22210,-0,:::22@@0@@8::::@@5@@##############################
@SRR1812885.415520 HWUSI-ES1807:12:FC:7:2:5742:1022/1
NCGCCATCTGAGAAAAGCACGCCTTGCCACAAGCTCCTTTCCATTGCGTTCTCTGCGTGCAGCATCTGCT
+
######################################################################
@SRR1812885.415521 HWUSI-ES1807:12:FC:7:2:1502:1023/1
NAATTCCATACTTTGAATACTAGTTATGAGGTGATACTTAGGGACAAAGCAGTCTTTTCAAAAATCCAAG
+
######################################################################

Is everything OK? It just looks strange to me; I have never seen anything like it before. Thanks in advance.

RNA-Seq next-gen Assembly fastq • 2.3k views

ADD COMMENT • link updated 8.0 years ago by mastal511 ★ 2.1k • written 8.0 years ago by germelcar ▴ 20

0

How was the data obtained? Which processing steps were performed?

ADD REPLY • link 8.0 years ago by WouterDeCoster 47k

0

Hi WouterDeCoster, thanks for your reply.

Here is a verbatim copy of the "library preparation and Illumina sequencing" section of the paper:

The quality of total RNA was checked using a NanoDrop 2000 (Thermo Fisher, USA) and an Agilent 2100 Bioanalyzer (Agilent, USA). The RNA Integrity Number of RNA obtained was greater than 8. mRNA was purified from total RNA by RNA purification beads, then fragmented and primed for cDNA synthesis according to the manufacturer’s instructions (Illumina, USA). Double-stranded cDNA was synthesized using the SuperScript Double-Stranded cDNA Synthesis kit (Invitrogen, USA) and then purified using Agencourt AMPure XP beads (Beckman Coulter, Inc, USA). End repairing and 3′-ends adenylation were performed following the RNA adapters ligation. After enrichment of DNA fragments library templates were validated using the Agilent 2100 Bioanalyzer (Agilent, USA). Using TruSeq PE Cluster Kit v2 and cBot automated system (Illumina, USA) clonal clusters were created from DNA library templates. Clusters obtained were finally used to perform paired-end runs by Genome Analyzer IIx (Illumina, USA).

I am not sure if that is what you are asking for. Thanks in advance.

ADD REPLY • link 8.0 years ago by germelcar ▴ 20

score 2 · Answer 1 · 2016-12-04

2

8.0 years ago

GenoMax 147k

It is possible that your sequence provider included reads that had failed Illumina's internal quality control in this output. You can use bbduk.sh from BBMap with qtrim=rl trimq=1 to remove those N calls.

@mastal511 makes a perfect observation, which is the likely cause.
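To make the idea of removing N calls from read ends concrete, here is a minimal Python sketch. It is not BBDuk's algorithm and is not equivalent to qtrim=rl trimq=1; it only illustrates end trimming. It assumes Phred+33 encoding (the quality strings above contain characters below "@", which only Phred+33 allows), under which "#" decodes to Q2. The function names and the min_q=3 threshold are hypothetical choices for this example.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores ('#' -> 2)."""
    return [ord(c) - 33 for c in qual]


def trim_ends(seq, qual, min_q=3):
    """Trim bases from both ends while they are 'N' or below min_q.

    min_q=3 is an illustrative threshold picked so that Q2 ('#')
    bases are removed; it is not a BBDuk default.
    """
    scores = phred33(qual)

    def is_bad(i):
        return seq[i] == "N" or scores[i] < min_q

    start, end = 0, len(seq)
    while start < end and is_bad(start):
        start += 1
    while end > start and is_bad(end - 1):
        end -= 1
    return seq[start:end], qual[start:end]
```

A read made up entirely of N calls comes back empty, so it would be dropped by any minimum-length filter applied afterwards.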

ADD COMMENT • link 8.0 years ago by GenoMax 147k

score 2 · Answer 2 · 2016-12-04

2

8.0 years ago

mastal511 ★ 2.1k

I notice that the read headers start with @SRR1812885, indicating that this is data from the SRA. This looks like older sequencing data, where it was typical to find low quality reads at the edges of the flowcell, which ended up at the start and the end of the fastq files. It was quite typical to have reads made up of Ns and very low base qualities like '#' at the beginning and end of the fastq files. As long as most of the data in the middle of the file is OK, it looks normal.
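One way to check whether most of the data is OK is to count how many reads fall into each category. The following is a rough Python sketch, not a replacement for a proper QC tool; the function names and category labels are made up for this example, and it assumes a plain 4-line-per-record FASTQ file with "#" as the lowest quality character.

```python
from collections import Counter
from itertools import islice


def classify_read(seq, qual, low_char="#"):
    """Label one read by how much of it is uncalled or flagged low quality."""
    if set(seq) == {"N"}:
        return "all_N"
    if set(qual) == {low_char}:
        return "all_low_quality"
    if qual.endswith(low_char):
        return "low_quality_tail"
    return "ok"


def summarize_fastq(lines):
    """Count read categories from an iterable of FASTQ lines (4 per record)."""
    counts = Counter()
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break
        seq, qual = record[1].strip(), record[3].strip()
        counts[classify_read(seq, qual)] += 1
    return counts


# Hypothetical usage; the file name is an assumption:
# with open("SRR1812885_1.fastq") as handle:
#     print(summarize_fastq(handle))
```

If the "all_N" and "all_low_quality" counts are a small share of the total, the file matches the pattern described above.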

ADD COMMENT • link 8.0 years ago by mastal511 ★ 2.1k

0

Hi mastal511, thanks for your reply.

I do not know what you mean by older sequencing data. What does newer sequencing data look like? My understanding is that the data was sequenced on a HiSeq 2000, which as far as I know is not too old.

Do you recommend using the reads without any processing/filtering (such as quality trimming)?

Thanks in advance.

ADD REPLY • link 8.0 years ago by germelcar ▴ 20

1

No, you should definitely use something like Trimmomatic or another quality/adapter trimmer to get rid of low-quality reads/parts of reads.

I looked at the SRA page, too. In one section it said HiSeq2000, in another section it said GAII. It looks much more like what I used to see from GAII data, but it may just depend on to what extent the data has been filtered by the sequencer software.

ADD REPLY • link 8.0 years ago by mastal511 ★ 2.1k

0

Thanks for your help. The authors say the following about how the reads were filtered:

70 bp paired-end reads were prepared for assembly by Q15 filtering, removal of library adapter sequences, removal of A/T stretches and short reads (less than 60 bp). The percent of rejected short reads was 0.6 %.

My understanding is that the recommended quality threshold is >= 30 (keeping only the reads at or above that quality). Should I quality-trim with that threshold?
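For reference, a Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so the Q15 used by the authors and the Q30 mentioned here differ by a factor of about 30 in per-base error rate. A small Python sketch of the conversion (the function name is just for illustration):

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the estimated per-base error probability."""
    return 10 ** (-q / 10)


for q in (2, 15, 30):
    print(f"Q{q}: error probability ~ {phred_to_error_prob(q):.4f}")
```

Q15 works out to roughly a 3.2% chance that a base call is wrong, and Q30 to 0.1%.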

Many thanks.

ADD REPLY • link 8.0 years ago by germelcar ▴ 20

0

Try using Trimmomatic with some of its default threshold values. I'm not sure whether it has a threshold for the quality of the whole read.
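As an illustration of what a whole-read quality threshold would mean, here is a minimal Python sketch that keeps a read only if its mean Phred score reaches a cutoff. This is not how any particular trimmer implements it. The function names and the min_mean=20 cutoff are assumptions for the example, Phred+33 encoding is assumed, and taking the arithmetic mean of Phred scores is itself a simplification.

```python
def mean_quality(qual, offset=33):
    """Arithmetic mean of the Phred scores in a quality string (0.0 if empty)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def passes_mean_quality(qual, min_mean=20, offset=33):
    """True if the read's mean quality is at least min_mean (illustrative cutoff)."""
    return mean_quality(qual, offset) >= min_mean
```

With this check, a read whose quality string is entirely "#" has a mean of 2 and would be discarded at any reasonable cutoff.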

ADD REPLY • link 8.0 years ago by mastal511 ★ 2.1k

0

Thanks. I will try Trimmomatic and also BBMap, as @genomax2 suggested.

Many thanks for the help.

ADD REPLY • link 8.0 years ago by germelcar ▴ 20