indexing the reference sequence

Question

Merging two sets of NGS reads

0

Entering edit mode

3.5 years ago

Lora ▴ 10

Hi everyone,

for my project, I'm trying to merge two sets of NGS reads (for each, I want to merge R1 and R2 reads), sequenced on an Illumina NextSeq 500 instrument. One set of reads is coming from enriched genomic libraries and the other from "non-enriched"/regular libraries.

First, I've tried to do that using a simple command for 'cat' option: cat file1.fq.gz file2.fq.gz > mergedfile.fq.gz After mapping and sorting etc., this seemed to cause a problem in the structure of .vcf file, so I wasn't able to use it for further analysis. I assume that the reason for the improper merging was the sequence identifiers and their structure. Now, I'm wondering if this is even possible to do and if yes, how?

A few examples of sequence identifiers of my fastq files:

Sample1 (Non-enriched), R1:
>A00801:156:HKYC2DSX2:1:1101:20808:13855 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:20410:13886 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:16441:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24252:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24288:13933 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:30092:13933 1:N:0:AGCGATAGAT+AGGCTATAGT

Sample1 (Non-enriched), R2:
>A00801:156:HKYC2DSX2:1:1101:20808:13855 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:20410:13886 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:16441:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24252:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:24288:13933 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:1:1101:30092:13933 2:N:0:AGCGATAGAT+AGGCTATAGT

Sample1 (Enriched), R1:
>A00801:156:HKYC2DSX2:2:1101:30056:27837 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:12445:27868 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:23167:27868 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:14136:27884 1:N:0:AGCGCTAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:17607:27884 1:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1101:32542:27884 1:N:0:AGCGATAGAT+AGGCTATAGT

Sample1 (Enriched), R2:
>A00801:156:HKYC2DSX2:2:1103:18801:28948 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:3423:28964 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:11297:28980 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:21450:28995 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:1714:29011 2:N:0:AGCGATAGAT+AGGCTATAGT
>A00801:156:HKYC2DSX2:2:1103:19985:29027 2:N:0:AGCGATAGAT+AGGCTATAGT

Just to clarify based on the example above: I want to merge Sample1 (Non-enriched) R1 with Sample1 (Enriched) R1 and Sample1 (Non-enriched) R2 with Sample1 (Enriched) R2. Hope that's clear.. Also: I'm performing my analysis on Linux Ubuntu.

Thank you in advance for your time and help :)

NGS Linux fastq • 2.0k views

ADD COMMENT • link 3.5 years ago by Lora ▴ 10

0

Entering edit mode

After mapping and sorting etc., this seemed to cause a problem in the structure of .vcf file,

what kind of "problem" ?

ADD REPLY • link 3.5 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi, thank you for your reply! First one was RStudio saying: "Your file appears to have 76 header elements and 77 columns in the body. This should never happen!". I believe there were no mistakes in my commandlines for mapping, sorting, creating .vcf files, because they work fine with other, non-merged reads (also produce "normal" .vcf file). After that, I deleted an extra column in the .vcf file and now, when importing it to RStudio, the software crashes (R session aborted, R encountered a fatal error ..) after I type in the command line for reading the file (vcfR package). The command starts running but finally crashes when creating a character matrix gt.

ADD REPLY • link 3.5 years ago by Lora ▴ 10

0

Entering edit mode

merge two sets of NGS reads

What you are describing is not strictly merging. It is simply concatenating two files end to end, such that contents of file 2 are tacked at the end of file 1.

True read merging will overlap a read from R1 and its corresponding mate from R2 to create a single (if the reads can overlap in the middle) read.

If you wanted to do the former then carry on, but if it is latter then you need software like bbmerge.sh to do this. Your reads need to overlap as well.

ADD REPLY • link 3.5 years ago by GenoMax 152k

0

Entering edit mode

Hi, thank you for your reply and suggestions!

I don't want to merge overlapping paired-end reads, but join/merge the two R1 reads from different files (but same specimen) into one file, which I can further use for mapping.. and the same with R2 reads ofc.

ADD REPLY • link 3.5 years ago by Lora ▴ 10

0

Entering edit mode

Your simple cat approach sounds valid. Can you explain what you did with the data after merging and what was the problem with the VCF?
Regardless, since you have two libraries generated with different protocols, I'm not sure merging them is a good idea...

ADD REPLY • link 3.5 years ago by liorglic ★ 1.5k

0

Entering edit mode

Thank you for your reply!

Regarding the fact that the two sets of reads are based two different protocols, I understand the concern, but I also assume that the enrichment process is not "perfect" and that those reads will still contain some of non-enriched sequences, which I could maybe use for a better coverage. I think it's maybe worth a try..

Here is a rough desrciption of my command lines to create a vcf file:

indexing the reference sequence

bwa index reference.fasta

mapping the reads to the ref sequence

for i in *_R1_joined.fq.gz
do
  SAMPLE=$(echo ${i} | sed "s/_R1_\joined\.fq.gz//") 
  echo ${SAMPLE}_R1_joined.fq.gz ${SAMPLE}_R2_joined.fq.gz
bwa mem reference.fasta ${SAMPLE}_R1_joined.fq.gz ${SAMPLE}_R2_joined.fq.gz > ${SAMPLE}.sam
done

filtering, sam to bam, sorting

for i in *.sam ; do pattern=${i%%.sam} ; samtools view -F 0x4 -S -h -o ${pattern}filt.sam $i ; done ;
for i in *filt.sam ; do pattern=${i%%.sam} ; samtools view -F 0x4 -S -b -o ${pattern}.bam $i ; done ;  
for i in *.bam ; do pattern=${i%%.bam} ; samtools sort $i -o $i.sort ; done

creating one mpileup file from 68 samples

samtools mpileup -B -f reference.fasta $(printf '%s ' *.bam.sort) > myfile.mpileup

variant calling and vcf file creation (added text file with sample species names)

varscan mpileup2cns myfile.mpileup min-var-freq 0.01 --min-avg-qual 30 --min-freq-for-hom 0.9 --min-coverage 4 --output-vcf --vcf-sample-list NameList.txt --strand-filter 0  > myfileCNS.vcf

ADD REPLY • link 3.5 years ago by Lora ▴ 10

1

Entering edit mode

I'm pretty sure the problem is not with the fastq merging, as all your commands seem fine to me. Might be something with the pileup or variant calling steps. At what point did you get an error message, and what was it. Also, maybe give bcftools call a go - see if it does better than varscan.

ADD REPLY • link 3.5 years ago by liorglic ★ 1.5k

0

Entering edit mode

I will indeed try bfctools! Thanks :)

Also, I've checked the joined reads in Geneious. Here, it is visible that at the break of one set of reads into another, the identifier sequence changes and no longer has a consecutive descending number, indicating the end of one set and beginning of the other one (see attached photo). Could this be the reason the vcf file is not properly created for RStudio to read it? enter image description here

ADD REPLY • link 3.5 years ago by Lora ▴ 10