Hi everyone,
for my project, I'm trying to merge two sets of NGS reads (for each, I want to merge R1 and R2 reads), sequenced on an Illumina NextSeq 500 instrument. One set of reads is coming from enriched genomic libraries and the other from "non-enriched"/regular libraries.
First, I've tried to do that using a simple command for 'cat' option: cat file1.fq.gz file2.fq.gz > mergedfile.fq.gz
After mapping and sorting etc., this seemed to cause a problem in the structure of .vcf file, so I wasn't able to use it for further analysis. I assume that the reason for the improper merging was the sequence identifiers and their structure. Now, I'm wondering if this is even possible to do and if yes, how?
A few examples of sequence identifiers of my fastq files:
Sample1 (Non-enriched), R1:
>A00801:156:HKYC2DSX2:1:1101:20808:13855 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:20410:13886 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:16441:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24252:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24288:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:30092:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
Sample1 (Non-enriched), R2:
>A00801:156:HKYC2DSX2:1:1101:20808:13855 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:20410:13886 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:16441:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24252:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24288:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:30092:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
Sample1 (Enriched), R1:
>A00801:156:HKYC2DSX2:2:1101:30056:27837 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:12445:27868 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:23167:27868 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:14136:27884 1:N:0:AGCGCTAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:17607:27884 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:32542:27884 1:N:0:AGCGATAGAT+AGGCTATAGT
Sample1 (Enriched), R2:
>A00801:156:HKYC2DSX2:2:1103:18801:28948 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:3423:28964 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:11297:28980 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:21450:28995 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:1714:29011 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:19985:29027 2:N:0:AGCGATAGAT+AGGCTATAGT
Just to clarify based on the example above: I want to merge Sample1 (Non-enriched) R1 with Sample1 (Enriched) R1 and Sample1 (Non-enriched) R2 with Sample1 (Enriched) R2. Hope that's clear.. Also: I'm performing my analysis on Linux Ubuntu.
Thank you in advance for your time and help :)
what kind of "problem" ?
Hi, thank you for your reply! First one was RStudio saying: "Your file appears to have 76 header elements and 77 columns in the body. This should never happen!". I believe there were no mistakes in my commandlines for mapping, sorting, creating .vcf files, because they work fine with other, non-merged reads (also produce "normal" .vcf file). After that, I deleted an extra column in the .vcf file and now, when importing it to RStudio, the software crashes (R session aborted, R encountered a fatal error ..) after I type in the command line for reading the file (vcfR package). The command starts running but finally crashes when creating a character matrix gt.
What you are describing is not strictly
merging
. It is simply concatenating two files end to end, such that contents of file 2 are tacked at the end of file 1.True read merging will overlap a read from R1 and its corresponding mate from R2 to create a single (if the reads can overlap in the middle) read.
If you wanted to do the former then carry on, but if it is latter then you need software like
bbmerge.sh
to do this. Your reads need to overlap as well.Hi, thank you for your reply and suggestions!
I don't want to merge overlapping paired-end reads, but join/merge the two R1 reads from different files (but same specimen) into one file, which I can further use for mapping.. and the same with R2 reads ofc.
Your simple
cat
approach sounds valid. Can you explain what you did with the data after merging and what was the problem with the VCF?Regardless, since you have two libraries generated with different protocols, I'm not sure merging them is a good idea...
Thank you for your reply!
Regarding the fact that the two sets of reads are based two different protocols, I understand the concern, but I also assume that the enrichment process is not "perfect" and that those reads will still contain some of non-enriched sequences, which I could maybe use for a better coverage. I think it's maybe worth a try..
Here is a rough desrciption of my command lines to create a vcf file:
indexing the reference sequence
mapping the reads to the ref sequence
filtering, sam to bam, sorting
creating one mpileup file from 68 samples
variant calling and vcf file creation (added text file with sample species names)
I'm pretty sure the problem is not with the fastq merging, as all your commands seem fine to me. Might be something with the pileup or variant calling steps. At what point did you get an error message, and what was it. Also, maybe give
bcftools call
a go - see if it does better than varscan.I will indeed try bfctools! Thanks :)
Also, I've checked the joined reads in Geneious. Here, it is visible that at the break of one set of reads into another, the identifier sequence changes and no longer has a consecutive descending number, indicating the end of one set and beginning of the other one (see attached photo). Could this be the reason the vcf file is not properly created for RStudio to read it?