FASTQ reads with primers and TruSeq indexed adapter
7.7 years ago
jomo018 ▴ 730

I received 150 bp single-end (SE) NextSeq FASTQ files from two similar experiments. Reads from the first experiment are structured: 5' primer -- DNA -- 3' primer -- indexed adapter (partial, no index). Reads from the second experiment are structured: DNA -- 3' primer -- indexed adapter (almost complete, index included).

The reads of the first experiment are useless as they do not contain the index within the adapter. What controls the location of the "limited view window" within the sequence?

sequencing next-gen sequence • 3.1k views

Index reads (single or dual, 1D/2D) in Illumina technology are read independently of the main reads and should never be part of the actual sequence (unless you are using in-line barcodes). If you have indexed samples, the run has to be set up as such.


I am referring to TruSeq indexed adapters, which are 63 bases long, with 6 bases in the middle acting as the index.


If you start seeing adapter sequence at the 3' end of a read, that means your inserts are shorter than the length of sequencing being done (http://nextgen.mgh.harvard.edu/IlluminaChemistry.html ).


150 bases are indeed shorter than the complete sequence. Using the colors in the bottom picture of the link you sent: in the second experiment I see black - purple - blue - yellow (partial), which is OK because I don't need the red, and the blue (de-multiplexing index) is included. In the first experiment I see red - black - purple (partial), which is no good because the blue is missing. So how can I make sure the next experiment will indeed include the blue index?


To get the "index", the sequencing run has to be set up as "multiplexed". So you would specify 150 bp x N bp (where N is the length of the index you want) as the run requirement (if you only want single-end reads). Was this not specified the first time around?


I don't know. I have access to the bcl run folder. Can this information be seen in one of the files, e.g. RunParameters.xml or RTAConfiguration.xml?


Look in the RunInfo.xml file (it should be in the top-level flowcell folder) to see if you can find a block like this:

<Reads>
      <Read Number="1" NumCycles="50" IsIndexedRead="N" />
      <Read Number="2" NumCycles="7" IsIndexedRead="Y" />   
 </Reads>
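
If you would rather not read the XML by eye, something like the following might help. It is only a minimal sketch: it assumes the `<Reads>` block sits somewhere inside a well-formed RunInfo.xml, and the `<RunInfo><Run>` wrapper and the function name `summarize_reads` are illustrative assumptions, not taken from your run folder.

```python
import xml.etree.ElementTree as ET


def summarize_reads(runinfo_xml):
    """Return sorted (read_number, num_cycles, is_index_read) tuples from RunInfo.xml text."""
    root = ET.fromstring(runinfo_xml)
    reads = []
    # iter() walks the whole tree, so the exact nesting of <Reads> does not matter.
    for read in root.iter("Read"):
        reads.append((
            int(read.get("Number")),
            int(read.get("NumCycles")),
            read.get("IsIndexedRead") == "Y",
        ))
    return sorted(reads)


# Example text built around the block shown above (the wrapper elements are assumed).
example = """<RunInfo><Run><Reads>
  <Read Number="1" NumCycles="50" IsIndexedRead="N" />
  <Read Number="2" NumCycles="7" IsIndexedRead="Y" />
</Reads></Run></RunInfo>"""

for number, cycles, is_index in summarize_reads(example):
    kind = "index read" if is_index else "sequence read"
    print(f"Read {number}: {cycles} cycles ({kind})")
```

For a real run you would pass in the contents of the file instead, e.g. `summarize_reads(open("RunInfo.xml").read())`, with the path adjusted to wherever your run folder lives.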

Both experiments have the same block:

<Reads>
     <Read Number="1" NumCycles="150" IsIndexedRead="N" />
     <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
     <Read Number="3" NumCycles="8" IsIndexedRead="Y" />
</Reads>

That shows that both of these are dual-indexed (2D) runs and should have the same kind of reads. The reads go in the order Read 1 --> Index 1 --> Index 2 --> Read 2 (which you don't have).

So you have one sequence file per sample? Can you post the first few lines of the file, using: (z)cat your_R1_fastq(.gz) | head -8
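
If a shell is not handy, a rough Python equivalent of that command could look like the sketch below. The function name `head_fastq` and the file name in the usage note are placeholders; it simply assumes a file ending in `.gz` is gzip-compressed and anything else is plain text.

```python
import gzip
import itertools


def head_fastq(path, n_lines=8):
    """Return the first n_lines lines of a plain or gzip-compressed FASTQ file."""
    # Choose the opener from the file extension (an assumption about naming).
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # islice stops after n_lines, so large files are not read in full.
        return [line.rstrip("\n") for line in itertools.islice(handle, n_lines)]
```

Usage would be along the lines of `print("\n".join(head_fastq("your_R1.fastq.gz")))`; eight lines correspond to two FASTQ records.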


Fastq files were extracted with bcl2fastq without SampleSheet. I guess I am not losing any information.

The run including index

 @NB551014:49:HHGGCAFXX:1:11101:9463:1046 1:N:0:0
 NTTTGGGGATTTGATTTAGTCGTAGTTTTTGTGAATTAATATTTGTGCGGTTTATATTTGGTGGAAGTTTTTTATTTAGTGTGCGGGGAACGAGGTTTTTTTTATATATTTAAGATTCGTCGGGAGGTAGAGGATTTGTAGGGTGAGTGA
 +
 #AAAAEEEEEEEEAEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEEAEAEEEEEEAE<EAEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEAEEE
 @NB551014:49:HHGGCAFXX:1:11101:16334:1046 1:N:0:0
 NCTGTCTCCTGCATCCAATCCATTAAACTGACCTCCCCGTGCAGAGGCGGGGATACAACCATAAGACGAGAAGACCCTATGGAGCTTTAAACTAAAGGCAACTGCCAACTTCAACCTAACCCATAAGGAAATAACAATTAAACAAGCAGA
 +
 #AAAAEEEEAEEEEEAAEEEEEEEEEEEEEEAAEEEEEEEEEEEEAEEEEEEEEEE/EEEEAEEEEAEA/EEA/EEEEEEAAEEE<EEEEEEEEEEEEEEEEE/EEE<6AE6//EAEE/<E/E/<EE/EEEEEEEEE/AEEEEEE//E/E
  

The run with two primers and no index

 @NB501025:135:HJY7VAFXX:1:11101:16969:1049 1:N:0:0
  TTAGANAAGTAAAATGATGGATAATAACGTACGGTGAAACGTAGTGTTGGGAATCGTAGATGGAAGTCGAGTATTTTTTTTATTTGTGGGGATCGGAAGAGCACACGTCTGAACTCCAGTCCATTCCTATCTCGTATGCCGTCTTCTGCT
  +
  AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEEEEEE<AE<EAEE/EEE<<EEEEEAEAEE/EE/EEAAAAEEEA/<<A<A<AAAA//6AEE<A/A<<A<</EAA/AA
  @NB501025:135:HJY7VAFXX:1:11101:13840:1049 1:N:0:0
 ATCGTNACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCCTGATACCATTAAAATCCCTAAGCATT
 +
   AAAAA#/EEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAEEEEEEEEAEEEEAEEEEEEEEEEE/EEEAEE/E/E/EEAEE</A/A6AE<EE//<EAEAAEAA/<6AEE/EAEA<A/AEE6<</<E/E/AA/AAAAAEA//AA</6
  

"Fastq files were extracted with bcl2fastq without SampleSheet. I guess I am not losing any information."

If you are not interested in separating the samples, then yes. But if you want to split the samples, you would need to re-run the demultiplexing using a proper samplesheet.

Otherwise this data is a mix of multiple samples, which you can't tell apart by the sequence you have in hand.


So you are saying that even though the TruSeq indexed adapter (with its 6-base index) is missing from the reads as I see them after bcl2fastq without a SampleSheet, bcl2fastq with a SampleSheet would still be able to do the demultiplexing? Where would it take the index from?

As a side note, I am doing the demultiplexing with a custom script targeting the 3' primer, which is always present and unique.


If you have a few minutes available, then watch this short Illumina video (starting about 2 min in). It will help clarify the order of sequencing I had posted above.

The only reason you may see the indexes in your reads (produced without a sample sheet) is that those inserts were shorter than the length of sequencing. Whoever did this run should be able to help you get a samplesheet in the right format (in fact, it should already be in the raw folder you have, assuming it is a complete copy; it should be called SampleSheet.csv).

There is no point in using custom scripts for demultiplexing data (unless one is using internal non-Illumina barcodes). Demultiplexing with bcl2fastq ensures that the demux is handled properly.


Thank you for pointing to the video. Could you possibly explain how to include a "63-base TruSeq indexed adapter with a six-base index embedded" within the SampleSheet? I have found no examples for this case and IEM doesn't help. The adapter is:

GATCGGAAGAGCACACGTCTGAACTCCAGTCACXXXXXXATCTCGTATGCCGTCTTCTGCTTG
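
Purely to illustrate the layout of that adapter (fixed 5' part, six index bases where the X's are, fixed 3' part), here is a hedged sketch of how read-through into it could be located in a read. It uses exact string matching, so a read with sequencing errors in the adapter, or one that ends before the index, will simply not match. The function name, the `min_tail` parameter and the example read are all made up for illustration.

```python
import re

# The two fixed parts of the adapter quoted above; XXXXXX is the 6-base index.
ADAPTER_5P = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
ADAPTER_3P = "ATCTCGTATGCCGTCTTCTGCTTG"
INDEX_LEN = 6


def find_inline_index(read, min_tail=8):
    """Return (insert_length, index) if adapter read-through is found, else None.

    min_tail is how many bases of the 3' adapter part must follow the index.
    """
    pattern = re.compile(
        re.escape(ADAPTER_5P)
        + "([ACGTN]{%d})" % INDEX_LEN
        + re.escape(ADAPTER_3P[:min_tail])
    )
    match = pattern.search(read)
    if match is None:
        return None
    # Everything before the adapter start is insert (primers plus DNA).
    return match.start(), match.group(1)


# Hypothetical read: a 10-base insert followed by the adapter with index CGATGT.
example_read = "ACGTACGTAC" + ADAPTER_5P + "CGATGT" + ADAPTER_3P
print(find_inline_index(example_read))
```

The first value returned is the position where the adapter starts, i.e. the insert length; whenever that is well below 150, the rest of the 150-cycle read is adapter read-through.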


You don't need to do that. Just include the indexes for the samples. You can generate an example samplesheet from IEM to get the general structure and then look at the examples here. The samplesheets for CASAVA and bcl2fastq differ slightly.

Your run data folder may already have a samplesheet (SampleSheet.csv) so look for that first.
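
To give a rough idea of what "just include the indexes" means, here is a small sketch that writes the `[Data]` part of a samplesheet. Treat it as an assumption-laden illustration: the sample names and index sequences are placeholders, the column set (`Sample_ID`, `Sample_Name`, `index`, `index2`) should be checked against the bcl2fastq documentation or an IEM-generated sheet for your version, and a real sheet usually also carries `[Header]` and `[Settings]` sections.

```python
import csv
import io


def build_data_section(samples):
    """Build the [Data] section of a samplesheet as CSV text.

    samples: iterable of (sample_id, index, index2) tuples (placeholder values here).
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["[Data]"])
    writer.writerow(["Sample_ID", "Sample_Name", "index", "index2"])
    for sample_id, index, index2 in samples:
        # Sample_Name is simply set to the sample ID in this sketch.
        writer.writerow([sample_id, sample_id, index, index2])
    return buf.getvalue()


# Hypothetical samples and index sequences, for illustration only.
placeholder_samples = [
    ("sampleA", "ACGTACGT", "TGCATGCA"),
    ("sampleB", "GGTTCCAA", "AACCGGTT"),
]
print(build_data_section(placeholder_samples))
```

Note that only the index sequences go into the sheet, not the full 63-base adapter.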
