Question

What could be wrong with my FASTQ files? Picard suggests that there is missing header information.

1

7.6 years ago

kmurph55 ▴ 10

Hello, I have two FASTQ files, 3D_1.fastq and 3d_2.fastq. To the best of my knowledge, the first file contains the forward reads and the second contains the reverse reads. I can confirm that the files were generated as paired-end reads, 101 base pairs in length, with Illumina/Sanger 1.9+ encoding. The data are the nucleotide sequences from a single sample, run on a highseq machine. For some reason I am getting an error message from Picard that indicates a lack of read group information in the header of my files. I used Bowtie2 to map the reads against a reference genome, then gave the sorted BAM file to Picard as input in order to validate it. These are the first few lines from my first FASTQ file:

 @SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC
NTATTTCATAGCATACTTTTCCGGGCTCGCCGGGCCTAAGAAAGTTGCAAAAATTTTTCAATCGAAATACAAATGAAATTAAAACCTACGCGCGTGTGTGG
+
DHHIIIIIIIIIHHFHIHHIHGIIICHHGHIIHIHHHEHIDGHHFEHIHGHHIIHIIHGIIIIIHIHIIECHIIGFFHHIHIHCFHIIG<<E0CFHH
@SN996:194:H5V7HBCXY:1:1108:1995:2062 1:N:0:TCTCGCGC
CATCGATATGTATTTCTATTAACAAATTGCAAACATTACGATTAAATGAAAGAGTTGTGGCGTCCCTCGTTCTTGACCCGCGGACTGACTCACAGTCCCGA

These are the first few lines from my second FASTQ file:

@SN996:194:H5V7HBCXY:1:1108:1872:2028 2:N:0:TCTCGCGC
GCCGGCGGCAGTTTGTGCATTGCTTTTGAAGTGGCAACAATTTCGCCACGATTCTCTTGGTCTTTCTTCGGTTGCTGTTGCTGGAGGAGCCTCCATTATTC
+
DDCDCIICC<ECDHHHEHIHGHEFGGHIHEHHIIIIH?GH1CHH?EGHHHCE<1D@1<<@<FEEFCF1GHHIFHC1<F<<@<E111<EEEHHIIIG1CCD1
@SN996:194:H5V7HBCXY:1:1108:1995:2062 2:N:0:TCTCGCGC
CTGACCGCAGTGAATCGGAAGGTGGCCTACGAGTACCAGTCGAATACGAAGAACGAGGCCCTCAACCAGATGAAGGAAATGCCCAACTTTATGTCGACACT

I know that the FASTQ files were generated from a single sample, so it would make sense that they contain no read group information: all the reads belong to one sample. I would assume that sequencing a single sample is fairly common, and that if this information were strictly required in the header, the sequencing company would have formatted the data so that it did not block downstream analyses. Why am I getting this error from Picard? Does anyone have a suggestion for how to move past it?

sequencing software error • 2.5k views

ADD COMMENT • link updated 7.6 years ago by Santosh Anand 5.8k • written 7.6 years ago by kmurph55 ▴ 10

0

Is the space before " @SN996" a copy-and-paste artifact from writing this post? If not, that is your problem.

ADD REPLY • link 7.6 years ago by Pierre Lindenbaum 164k

0

Yes, this was just an error that I made in my post.

ADD REPLY • link 7.6 years ago by kmurph55 ▴ 10

0

Illumina highseq for all your stoner sequencing!

ADD REPLY • link 7.6 years ago by WouterDeCoster 47k

score 3 · Answer 1 · 2017-04-07

Picard is a companion toolset to GATK, and GATK requires read group (RG) information both in the header and on every read, so Picard checks for it too. The RG info is supplied by the user, following these guidelines:

http://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups

First decide what your RG (read group) string should be, according to the guidelines above. Since you have already mapped the reads, the easiest way to add the RG info is with another Picard tool, AddOrReplaceReadGroups.
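As a rough sketch of that step, the Python below derives candidate RG fields from the first header line in the question and assembles an AddOrReplaceReadGroups command line without running it. The ID and PU layout (flowcell.lane and flowcell.lane.barcode) follows the convention in the read group guidelines; the sample name "3D", the library name "lib1", the BAM file names, and the location of picard.jar are assumptions made only for illustration.

```python
import shlex


def read_group_from_header(header, sample, library):
    """Derive read group fields from an Illumina (CASAVA 1.8+) FASTQ header."""
    # Split "@instrument:run:flowcell:lane:... read:filter:control:barcode"
    name, _, comment = header.strip().lstrip("@").partition(" ")
    instrument, run, flowcell, lane = name.split(":")[:4]
    barcode = comment.split(":")[-1] if comment else ""
    unit = f"{flowcell}.{lane}"
    return {
        "RGID": unit,
        "RGPU": f"{unit}.{barcode}" if barcode else unit,
        "RGSM": sample,    # assumed sample name, chosen by the user
        "RGLB": library,   # assumed library name, chosen by the user
        "RGPL": "ILLUMINA",
    }


def picard_command(in_bam, out_bam, rg):
    """Build (but do not run) a Picard AddOrReplaceReadGroups call."""
    cmd = ["java", "-jar", "picard.jar", "AddOrReplaceReadGroups",
           f"I={in_bam}", f"O={out_bam}"]
    cmd += [f"{key}={value}" for key, value in rg.items()]
    return cmd


header = "@SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC"
rg = read_group_from_header(header, sample="3D", library="lib1")
# Hypothetical input and output BAM names.
print(shlex.join(picard_command("3D.sorted.bam", "3D.sorted.rg.bam", rg)))
```

The printed line can be pasted into a shell once the file names and the path to picard.jar match your setup.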

Next time, you can also supply the RG info at mapping time. Bowtie2 does this with:

--rg-id <text>
Set the read group ID to <text>. This causes the SAM @RG header line to be printed, with <text> as the value associated with the ID: tag. It also causes the RG:Z: extra field to be attached to each SAM output record, with value set to <text>.
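A minimal sketch of such a mapping call, again only assembling the command in Python rather than running it. Besides --rg-id, Bowtie2 has a --rg option that adds further TAG:VAL fields (such as SM) to the @RG line, which GATK will also want. The index name "ref_index", the output name, and the RG values are assumptions for illustration.

```python
import shlex


def bowtie2_command(index, fq1, fq2, out_sam, rg_id, rg_fields):
    """Build (but do not run) a paired-end bowtie2 call that sets read group info."""
    cmd = ["bowtie2", "-x", index, "-1", fq1, "-2", fq2, "--rg-id", rg_id]
    # Each extra @RG field is passed as its own "--rg TAG:VALUE" pair.
    for tag, value in rg_fields.items():
        cmd += ["--rg", f"{tag}:{value}"]
    cmd += ["-S", out_sam]
    return cmd


cmd = bowtie2_command(
    index="ref_index",            # hypothetical bowtie2 index prefix
    fq1="3D_1.fastq",
    fq2="3d_2.fastq",
    out_sam="3D.sam",             # hypothetical output name
    rg_id="H5V7HBCXY.1",          # flowcell.lane, taken from the read names
    rg_fields={"SM": "3D", "LB": "lib1", "PL": "ILLUMINA",
               "PU": "H5V7HBCXY.1.TCTCGCGC"},
)
print(shlex.join(cmd))
```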

Remember that RG info is required for most GATK analyses.
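To see quickly whether a file already satisfies this, here is a small sketch that scans SAM text for @RG header lines and for the RG:Z: tag on each record. It is a simplified check under stated assumptions, not a replacement for Picard's validation: it expects plain SAM lines (for example the output of samtools view -h on your BAM), and the example records are made up.

```python
def check_read_groups(sam_lines):
    """Return (declared RG IDs, reads without RG tag, reads with undeclared RG)."""
    header_ids = set()
    missing = 0
    undeclared = 0
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header line: collect the ID field of every @RG entry.
            if line.startswith("@RG\t"):
                for field in line.split("\t")[1:]:
                    if field.startswith("ID:"):
                        header_ids.add(field[3:])
            continue
        # Alignment record: optional tags start at the 12th column.
        tags = line.split("\t")[11:]
        rg = next((t[5:] for t in tags if t.startswith("RG:Z:")), None)
        if rg is None:
            missing += 1
        elif rg not in header_ids:
            undeclared += 1
    return header_ids, missing, undeclared


# Made-up example: one read carries the RG tag, one does not.
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:H5V7HBCXY.1\tSM:3D",
    "r1\t0\tchr1\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:H5V7HBCXY.1",
    "r2\t0\tchr1\t5\t42\t4M\t*\t0\t0\tACGT\tIIII",
]
ids, missing, undeclared = check_read_groups(sam)
print(ids, missing, undeclared)
```

An empty set of IDs, or a non-zero count of reads without the tag, matches the situation described in the question.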