HiSeq2500, Whole Genome Resequencing, Paired-ends 101 bp length, but no barcode! Please Help
7.7 years ago
kmurph55 ▴ 10

I have two separate fastq files, one for each mate of a pair. Each file has a corresponding .txt file, but it only offers a summary of quality information. I believe my paired-end reads are multiplexed, but I have no way of identifying them on a unique basis. From my understanding, I am supposed to use a barcode.txt file to merge this information into my fastq files, but I simply do not have that file. I only discovered that this was most likely the problem because all of my downstream analysis kept failing. I am able to map the data, create BAM files, etc., but I always run into the same problem: there is no read group ID information in the header, or in other words no unique identifier associated with my reads that would let them be distinguished from one another. So my questions are ...

Is a barcode.txt file typically provided? Or is it only generated automatically when a FastQC report is run?

Is it possible to add fake barcodes to fastq files for the sake of continuing on in a given pipeline?

Is there something that I am missing because I am fairly new to bioinformatics? Here is an example of my fastq headers after joining the reads into mate pairs. Does it look like this is in the correct format?

I AM DESPERATE! please help if you can

@SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC
NTATTTCATAGCATACTTTTCCGGGCTCGCCGGGCCTAAGAAAGTTGCAAAAATTTTTCAATCGAAATACAAATGAAATTAAAACCTACGCGCGTGTGTGGGCCGGCGGCAGTTTGTGCATTGCTTTTGAAGTGGCAACAATTTCGCCACGATTCTCTTGGTCTTTCTTCGGTTGCTGTTGCTGGAGGAGCCTCCATTATTC

sequencing • 2.1k views

What is this "TCTCGCGC" in the read name?


The TCTCGCGC or other sequence following '1:N:0:' in the fastq header should be the Illumina barcode. The reads are usually demultiplexed by the sequencing center, and that process adds the barcode to the read header line.
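
To make that concrete, here is a minimal Python sketch that splits the header from the question into named fields. It assumes the CASAVA 1.8+ header layout; the `IlluminaHeader` and `parse_header` names are made up for illustration.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run: str
    flowcell: str
    lane: str
    tile: str
    x: str
    y: str
    read: str         # mate number, 1 or 2
    is_filtered: str  # Y if the read failed the filter, otherwise N
    control: str
    barcode: str      # index sequence added during demultiplexing


def parse_header(line: str) -> IlluminaHeader:
    """Split a CASAVA 1.8+ style FASTQ header into its 11 fields."""
    # The header has two space-separated parts, each colon-delimited.
    name, comment = line.lstrip("@").split()[:2]
    return IlluminaHeader(*name.split(":"), *comment.split(":"))


hdr = parse_header("@SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC")
print(hdr.flowcell, hdr.lane, hdr.barcode)  # H5V7HBCXY 1 TCTCGCGC
```

Headers written by older pipelines use a different layout, so this would need adjusting for them.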


"after joining the reads into mate pairs."

Are you sure that is the correct way? I don't know exactly what you mean by this.


I have been researching different pipelines for handling fastq files, and almost everything I have read suggests preprocessing the fastq data before mapping the reads with bowtie2, BWA, etc. By "joining the reads into mate pairs" I meant that my data are currently in separate files, where file_1.fastq contains the forward reads and file_2.fastq contains the reverse reads. Most preprocessing pipelines suggest joining the two files before performing any manipulations of the fastq data, such as removing duplicates or trimming adapter sequences.


What type of data do you have, what do you expect the insert length to be, and what type of analysis are you planning to do?

Most analyses on paired end data are done with the reads in separate files, one file for the forward reads and a second file for the reverse reads. You might want to merge the two reads of a pair if the insert length is shorter than twice the read length and you expect your forward and reverse reads to overlap.
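
As a rough illustration of that rule of thumb, the sketch below computes how many bases two mates would share for a given insert length. The function name and the insert lengths (150, 350, 80) are hypothetical; only the 101 bp read length comes from the question.

```python
def overlap_bases(insert_length: int, read_length: int) -> int:
    """Number of bases covered by both mates of a pair.

    Mates overlap only when the insert is shorter than twice the read
    length. If the insert is shorter than a single read, both mates
    cover the whole insert, so the overlap is capped at the insert length.
    """
    return max(0, min(insert_length, 2 * read_length - insert_length))


read_length = 101  # from the question title
for insert_length in (150, 350, 80):  # hypothetical insert lengths
    print(insert_length, overlap_bases(insert_length, read_length))
```

With 101 bp reads, a 150 bp insert gives 52 shared bases, while a 350 bp insert gives none, in which case merging the mates would not be expected to work.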


"Most preprocessing pipelines suggest joining the two files"

That's odd. It's not impossible, but it's not the most common workflow to my knowledge. If there is one set of guidelines that you should care about, then it's the GATK best practices. I recommend following these.

7.7 years ago
dyollluap ▴ 310

It looks like you've got everything needed to run an alignment with read group (@RG) details for the header. This page breaks down what each component of a read header line actually encodes: https://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm

For example, in a bwa mem alignment: -R "@RG\tID:your_unique_ID_of_choice\tSM:your_sample_name\tPL:ILLUMINA"

You could use the read name details if appropriate, e.g. if both files were run on the same flow cell: "@RG\tID:SN996:194:H5V7HBCXY\tSM:myspecialsample\tPL:ILLUMINA" (no $ in front of the ID, since inside double quotes the shell would try to expand it as a variable).
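
If you would rather not type the string by hand, a small Python sketch along these lines could build it from the first header of a file. The function name is hypothetical, and using flowcell.lane as the ID is an assumed convention, not something bwa requires; any ID that is unique per read group will do.

```python
def read_group_from_header(header: str, sample: str) -> str:
    """Build a value for bwa mem -R from one FASTQ header line."""
    read_name = header.lstrip("@").split()[0]
    instrument, run, flowcell, lane = read_name.split(":")[:4]
    fields = [
        "@RG",
        f"ID:{flowcell}.{lane}",  # assumed convention: one ID per flowcell lane
        f"SM:{sample}",
        "PL:ILLUMINA",            # the SAM specification lists platform names in upper case
    ]
    # bwa mem expects the two characters backslash and t here, not a real tab.
    return "\\t".join(fields)


rg = read_group_from_header(
    "@SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC", "myspecialsample"
)
print(rg)  # @RG\tID:H5V7HBCXY.1\tSM:myspecialsample\tPL:ILLUMINA
```

The printed string can then be passed in quotes after -R on the bwa mem command line.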

A barcode.txt file could be provided from a sequencing center, but if you're looking at the read names you can extract the same details anyway.
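
One way to check this yourself is to tally the index sequence at the end of every header line. The sketch below is only an illustration: the function name is made up, the second demo record is invented, and it assumes plain four-line FASTQ records.

```python
from collections import Counter
from itertools import islice


def barcode_counts(lines) -> Counter:
    """Tally the last colon-separated header field (the index sequence)."""
    # In four-line FASTQ, every 4th line starting from the first is a header.
    headers = islice(lines, 0, None, 4)
    return Counter(h.strip().rsplit(":", 1)[-1] for h in headers)


# Tiny in-memory demo; the second record is invented for illustration.
demo = [
    "@SN996:194:H5V7HBCXY:1:1108:1872:2028 1:N:0:TCTCGCGC", "NTAT", "+", "#FFF",
    "@SN996:194:H5V7HBCXY:1:1108:1950:2031 1:N:0:TCTCGCGC", "ACGT", "+", "FFFF",
]
print(barcode_counts(demo))

# For a real file, pass the open handle instead:
# with open("file_1.fastq") as fh:
#     print(barcode_counts(fh))
```

If only one barcode (or one dominant barcode plus a few variants caused by sequencing errors) shows up, the file has most likely already been demultiplexed to a single sample.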

7.7 years ago
kmurph55 ▴ 10

Thank you for the quick reply. The main problem I am facing is that, as I understand it, most NGS tools expect the read group identifier to be unique to each read, but in my fastq files every read group ID is exactly the same (apart from the x and y coordinates).


Please use ADD COMMENT to reply to earlier answers, so that this thread remains logically structured and easy to follow.


I don't think the Read Group ID should be unique to each read, I think it should be unique to each sample.
