I everyone,
I am trying to realign a bam file from hg37 to hg38. The problem is that I am losing half of the reads while the process. The fastq is paired-end.
Here the samtools flagstat for both:
Raw bam (hg37)
2457420710 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
49697790 + 0 supplementary
93027061 + 0 duplicates
2453403628 + 0 mapped (99.84% : N/A)
2407722920 + 0 paired in sequencing
1203861460 + 0 read1
1203861460 + 0 read2
2354059656 + 0 properly paired (97.77% : N/A)
2400927418 + 0 with itself and mate mapped
2778420 + 0 singletons (0.12% : N/A)
33747364 + 0 with mate mapped to a different chr
17954283 + 0 with mate mapped to a different chr (mapQ>=5)
New aligned bam (hg38):
1220323060 + 0 in total (QC-passed reads + QC-failed reads)
1536660 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1198748250 + 0 mapped (98.23% : N/A)
26942266 + 0 paired in sequencing
13471133 + 0 read1
13471133 + 0 read2
4244064 + 0 properly paired (15.75% : N/A)
6058498 + 0 with itself and mate mapped
7556561 + 0 singletons (28.05% : N/A)
1669472 + 0 with mate mapped to a different chr
833174 + 0 with mate mapped to a different chr (mapQ>=5)
For the generation of fastq files from bam I am using samToBam from GATK
java -XX:ParallelGCThreads=${cores} -jar $picard SamToFastq I=$file FASTQ=$file_R1 SECOND_END_FASTQ=$file_R2
And then I made the align using bwa and sam tools for the postprocessing of the bam:
bwa mem -p -M -t $cores -R ${rg_info} -v 1 ${ref} $file_R1 $file_R2 | \
samtools view -@ $cores -Sb - | \
samtools sort -@ $cores - -o $outBam
And for marking duplicates:
java -jar -XX:ParallelGCThreads=${cores} \
-Djava.io.tmpdir=$(dirname $outBam)/tmp \
$picard MarkDuplicates \
I=${outBam} O=${outBamNonExt}_mkdup.bam \
M=${fileNonExt}_mkdup.log
samtools index -@ $cores ${outBamNonExt}_mkdup.bam
Once created the fastq I analysed using fastQValidator and I got this output:
> ERROR on Line 45: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1416:6937/1 at Lines 41 and 45 ERROR on
> Line 245: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1417:29834/1 at Lines 241 and 245 ERROR
> on Line 1013: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1420:64620/1 at Lines 1009 and 1013
> ERROR on Line 1737: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1422:44831/1 at Lines 1733 and 1737
> ERROR on Line 1893: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1422:81419/1 at Lines 1889 and 1893
> ERROR on Line 2253: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1423:73804/1 at Lines 2249 and 2253
> ERROR on Line 2321: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1423:98182/1 at Lines 2317 and 2321
> ERROR on Line 2969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1425:54077/1 at Lines 2965 and 2969
> ERROR on Line 2973: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1425:54077/1 at Lines 2965 and 2973
> ERROR on Line 3193: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1426:11367/1 at Lines 3189 and 3193
> ERROR on Line 3969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1428:27238/1 at Lines 3965 and 3969
> ERROR on Line 4153: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1428:70127/1 at Lines 4149 and 4153
> ERROR on Line 4901: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:61256/1 at Lines 4897 and 4901
> ERROR on Line 4969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:73752/1 at Lines 4965 and 4969
> ERROR on Line 5013: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:84096/1 at Lines 5009 and 5013
> ERROR on Line 5017: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:84096/1 at Lines 5009 and 5017
> ERROR on Line 5105: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:6958/1 at Lines 5101 and 5105
> ERROR on Line 5173: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:22113/1 at Lines 5169 and 5173
> ERROR on Line 5225: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:29783/1 at Lines 5221 and 5225
> ERROR on Line 5517: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:94291/1 at Lines 5513 and 5517
> Finished processing
> /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R1.fq
> with 580178304 lines containing 1218786400 sequences. There were a
> total of 14924940 errors. Returning: 1 : FASTQ_INVALID
If I look in the @RG of the original bam using
samtools view -H FI43683.bam | grep '@RG'
The output is as follows:
@RG ID:QCMG:20130118043852808 PL:ILLUMINA CN:QCMG PI:330 DT:2013-01-03T00:00:00+00:00 PM:Illumina HiSeq 2000 LB:WGS:QCMG:Library_20121203_T SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_3
@RG ID:QCMG:20130118055238928 PL:ILLUMINA CN:QCMG PI:330 DT:2013-01-03T00:00:00+00:00 PM:Illumina HiSeq 2000 LB:WGS:QCMG:Library_20121203_T SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_2
@RG ID:QCMG:20130118062743433 PL:ILLUMINA CN:QCMG PI:330 DT:2013-01-03T00:00:00+00:00 PM:Illumina HiSeq 2000 LB:WGS:QCMG:Library_20121203_T SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_5
@RG ID:QCMG:20130118064849885 PL:ILLUMINA CN:QCMG PI:330 DT:2013-01-03T00:00:00+00:00 PM:Illumina HiSeq 2000 LB:WGS:QCMG:Library_20121203_T SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_4
@RG ID:QCMG:20130118121216373 PL:ILLUMINA CN:QCMG PI:330 DT:2013-01-03T00:00:00+00:00 PM:Illumina HiSeq 2000 LB:WGS:QCMG:Library_20121203_T SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_1
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118043852808\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_3\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz - ID:bwa_0 PN:bwa VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118055238928\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_2\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz - ID:bwa_1 PN:bwa PP:bamsort_0 VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118062743433\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_5\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz - ID:bwa_2 PN:bwa PP:bamsort_1 VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118064849885\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_4\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz - ID:bwa_3 PN:bwa PP:bamsort_2 VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118121216373\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_1\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz - ID:bwa_4 PN:bwa PP:bamsort_3 VN:0.7.8-r455
Whereas in the new bam (as expected) is like this:
@RG ID:FI43683 SM:FI43683 PL:ILLUMINA
@PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:bwa mem -p -M -t 16 -R @RG\tID:FI43683\tSM:FI43683\tPL:ILLUMINA /imppc/labs/lplab/msubirana/../share/marc/refgen/hg38/hg38.fa /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R1.fq /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R2.fq
Any suggestions?
It worked perfectly thanks!
This might be a good place for me to pose my extreme-newbie question. My hg19 full genome was produced for me by Dante Labs, and my next step is to convert (re-align?) one of the resultant files into 'hg38' for optimal results when I submit it to 'YFull'.
I set my Linux machine to use the Python 'crossmap' utility for this purpose, but I have now been informed that using this tool to re-align my 'BAM' file this way is not the correct path to use -- that I should use my 'FASTQ' files instead. So now I have been seeking out some utilities to re-align my FASTQ files from 'hg19' to 'hg38' instead -- and also found references to 'BAZAM', though none of the examples I have found (so far) specifically mention this type of conversion. Your positive response makes it seem that you were successful. Can you provide some tips? Thanks!
To convert a hg19 bam to hg38 you would indeed use bazam. To align reads in fastq format to GRCh38 you would use bwa mem.
I believe bazam uses bwa mem under the hood, and handles the correct conversion to fastq so you don't have to worry about those steps.
Thanks! Since Dante Labs provided the FASTQ files, I am going to attempt to use 'bwa mem' directly to format to GRCh38. All these large files are giving my HD and CPU a good workout! :)