Illumina Raw Data Multiple Read Files Discussion
5 months ago
Umer ▴ 130

Hi,

Background: We had a Fusarium oxysporum genome sequenced.

  1. Genome size: ~60 Mb
  2. Coverage: 100x
  3. Platform: Illumina NovaSeq X (paired-end, 150 bp)

GOAL: De novo genome assembly

Problem 1:

The company gave us multiple files per sample (ideally there should be just one forward and one reverse read file). When we asked, they said their main concern was delivering at least 6 Gb of data per sample. They did not reach this in the first run, so they re-ran the samples and got >6 Gb of data per sample in the second run. They then gave us the files from both runs, so for sample ILL_02 I now have the following 4 files:

ILL_02_MKDN240005763-1A_227NJ5LT4_L4_1.fq.gz
ILL_02_MKDN240005763-1A_227NJ5LT4_L4_2.fq.gz
ILL_02_MKDN240005763-1A_227NJMLT4_L8_1.fq.gz
ILL_02_MKDN240005763-1A_227NJMLT4_L8_2.fq.gz

Solution I got: The solution suggested to me was to simply merge the files using zcat, as follows:

zcat ILL_02_MKDN240005763-1A_227NJ5LT4_L4_1.fq.gz ILL_02_MKDN240005763-1A_227NJMLT4_L8_1.fq.gz > ILL_02_merged_1.fq.gz
zcat ILL_02_MKDN240005763-1A_227NJ5LT4_L4_2.fq.gz ILL_02_MKDN240005763-1A_227NJMLT4_L8_2.fq.gz > ILL_02_merged_2.fq.gz

Question: Is this really a good approach? Is there any other method to merge the datasets from the two runs?

novaseqX pairedend illumina fungus
ADD COMMENT

Was the same library re-run, or was a new set of libraries made? Did they run a single pool of libraries on all lanes? If the same library was re-run on two flowcells, then these are technical sequencing replicates. There should be little, if any, batch effect unless a different chemistry was used for the two runs.

You can use plain cat. No need to use zcat here.
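
Plain cat works because the gzip format allows several compressed members back to back in one file, and gzip-aware tools read them as one continuous stream. By contrast, redirecting zcat output to a name ending in .fq.gz writes uncompressed text into that file. A minimal sketch with two tiny made-up FASTQ files (the file names and reads are placeholders, not the real data):

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Two toy single-read FASTQ files standing in for the L4 and L8 forward reads.
printf '@r1/1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/L4_1.fq.gz"
printf '@r2/1\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/L8_1.fq.gz"

# Concatenate the compressed files directly; no decompression is needed.
cat "$workdir/L4_1.fq.gz" "$workdir/L8_1.fq.gz" > "$workdir/merged_1.fq.gz"

# The merged file is still valid gzip and holds both reads (4 lines per read).
gzip -t "$workdir/merged_1.fq.gz"
merged_reads=$(( $(gzip -dc "$workdir/merged_1.fq.gz" | wc -l) / 4 ))
echo "reads in merged file: $merged_reads"
```

Whatever order is used for the _1 files has to be used for the _2 files as well, so that read pairs stay in sync.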

ADD REPLY

The representative of the sequencing company gave me this response when I asked about the multiple files per sample. I quote:

for the samples that pass our QC, we ensure the final data output in most cases. Thus, if your samples do not generate the desired 6G in the first round of sequencing, we will perform a second and even a third round of sequencing for free if needed until we reach 6G. That is the reason why you can see more than one file for some of the samples; still, they were all treated in the same way and through the same pipeline so merging the results is totally fine.

I think they just re-ran the same library.

ADD REPLY

As long as the lower yield the first time around was not caused by some problem with the software or hardware, it should be fine to merge the data.

ADD REPLY
5 months ago
dthorbur ★ 2.5k

I would keep the files separate. That way, if one of the lanes failed, you can identify where the problematic reads are coming from. You can still use a tool like MultiQC to combine all the FastQC outputs into a single report. It also works with the outputs of many other bioinformatics tools for other quality assessment metrics.

ADD COMMENT

The thing with keeping the files separate is that I have the following files for this sample, and the QC analysis with FastQC gives the following stats.

 Sample                                          Total_Sequences   Total_Bases   Seq_Length   GC%
 ILL_02_MKDN240005763-1A_227NJ5LT4_L4_1.fq.gz            7202262   1 Gbp                150    47
 ILL_02_MKDN240005763-1A_227NJ5LT4_L4_2.fq.gz            7202262   1 Gbp                150    47
 ILL_02_MKDN240005763-1A_227NJMLT4_L8_1.fq.gz           83062254   12.4 Gbp             150    48
 ILL_02_MKDN240005763-1A_227NJMLT4_L8_2.fq.gz           83062254   12.4 Gbp             150    48

Should I completely ignore the two smaller files from the first run, or merge them? You can see the reply from the sequencing company above; they said it will be OK to merge the files.
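
For scale, the FastQC read counts in the table can be turned into an approximate sequencing depth. This is only a rough sketch: it assumes the ~60 Mb genome size from the original post and that every read is the full 150 bp, and it uses integer arithmetic, so the results are rounded down.

```shell
#!/usr/bin/env bash
set -euo pipefail

genome_size=60000000   # ~60 Mb, approximate value from the original post
read_len=150           # assumes all reads are full length

run1_pairs=7202262     # L4 files (first run)
run2_pairs=83062254    # L8 files (second run)

# bases = read pairs x 2 reads per pair x read length
run1_bases=$(( run1_pairs * 2 * read_len ))
run2_bases=$(( run2_pairs * 2 * read_len ))

# depth = bases / genome size (integer division, rounds down)
run1_depth=$(( run1_bases / genome_size ))
run2_depth=$(( run2_bases / genome_size ))
total_depth=$(( (run1_bases + run2_bases) / genome_size ))

echo "run 1: ~${run1_depth}x, run 2: ~${run2_depth}x, merged: ~${total_depth}x"
```

Under these assumptions the first run adds roughly 36x on top of roughly 415x from the second run, so it is a small fraction of the merged data.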

ADD REPLY

Should be fine to merge the files. If the assembler you are planning to use can take more than one pair then supply the files as is (in the right order).

ADD REPLY

For short-read assembly, I am planning to use SPAdes v4.

I am not sure whether SPAdes accepts multiple paired-end files.

Should I also use the --isolate and --careful options in SPAdes?
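
As far as the SPAdes manual describes it, one paired-end library can be split across several pairs of files by repeating the per-library read options, and --isolate is listed as incompatible with --careful. The sketch below only builds and prints a candidate command (a dry run). The output directory name is a placeholder, the two runs are assumed to come from the same library, and the exact option spelling should be checked against `spades.py --help` for the installed version.

```shell
#!/usr/bin/env bash
set -euo pipefail

# File names from the thread; the output directory name below is a placeholder.
r1_run1=ILL_02_MKDN240005763-1A_227NJ5LT4_L4_1.fq.gz
r2_run1=ILL_02_MKDN240005763-1A_227NJ5LT4_L4_2.fq.gz
r1_run2=ILL_02_MKDN240005763-1A_227NJMLT4_L8_1.fq.gz
r2_run2=ILL_02_MKDN240005763-1A_227NJMLT4_L8_2.fq.gz

# Both runs are listed under library 1 (assumption: same library, re-sequenced).
# --isolate is used on its own, without --careful.
cmd=(spades.py --isolate
     --pe1-1 "$r1_run1" --pe1-2 "$r2_run1"
     --pe1-1 "$r1_run2" --pe1-2 "$r2_run2"
     -o ILL_02_spades)

# Dry run: print the command for review instead of executing it.
printf '%s ' "${cmd[@]}"; echo
```

Each forward file is followed by its matching reverse file, which keeps the pairs in the right order.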

ADD REPLY
