Question

Loss of reads during realignment bam

0

Entering edit mode

5.6 years ago

mb2subi ▴ 10

I everyone,

I am trying to realign a bam file from hg37 to hg38. The problem is that I am losing half of the reads while the process. The fastq is paired-end.

Here the samtools flagstat for both:

Raw bam (hg37)

2457420710 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
49697790 + 0 supplementary
93027061 + 0 duplicates
2453403628 + 0 mapped (99.84% : N/A)
2407722920 + 0 paired in sequencing
1203861460 + 0 read1
1203861460 + 0 read2
2354059656 + 0 properly paired (97.77% : N/A)
2400927418 + 0 with itself and mate mapped
2778420 + 0 singletons (0.12% : N/A)
33747364 + 0 with mate mapped to a different chr
17954283 + 0 with mate mapped to a different chr (mapQ>=5)

New aligned bam (hg38):

1220323060 + 0 in total (QC-passed reads + QC-failed reads)
1536660 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1198748250 + 0 mapped (98.23% : N/A)
26942266 + 0 paired in sequencing
13471133 + 0 read1
13471133 + 0 read2
4244064 + 0 properly paired (15.75% : N/A)
6058498 + 0 with itself and mate mapped
7556561 + 0 singletons (28.05% : N/A)
1669472 + 0 with mate mapped to a different chr
833174 + 0 with mate mapped to a different chr (mapQ>=5)

For the generation of fastq files from bam I am using samToBam from GATK

java -XX:ParallelGCThreads=${cores} -jar $picard SamToFastq I=$file FASTQ=$file_R1 SECOND_END_FASTQ=$file_R2

And then I made the align using bwa and sam tools for the postprocessing of the bam:

bwa mem -p -M -t $cores -R ${rg_info} -v 1 ${ref} $file_R1 $file_R2 | \
samtools view -@ $cores -Sb - | \
samtools sort -@ $cores - -o $outBam

And for marking duplicates:

java -jar -XX:ParallelGCThreads=${cores} \
    -Djava.io.tmpdir=$(dirname $outBam)/tmp \
    $picard MarkDuplicates \
    I=${outBam} O=${outBamNonExt}_mkdup.bam \
    M=${fileNonExt}_mkdup.log

samtools index -@ $cores ${outBamNonExt}_mkdup.bam

Once created the fastq I analysed using fastQValidator and I got this output:

> ERROR on Line 45: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1416:6937/1 at Lines 41 and 45 ERROR on
> Line 245: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1417:29834/1 at Lines 241 and 245 ERROR
> on Line 1013: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1420:64620/1 at Lines 1009 and 1013
> ERROR on Line 1737: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1422:44831/1 at Lines 1733 and 1737
> ERROR on Line 1893: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1422:81419/1 at Lines 1889 and 1893
> ERROR on Line 2253: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1423:73804/1 at Lines 2249 and 2253
> ERROR on Line 2321: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1423:98182/1 at Lines 2317 and 2321
> ERROR on Line 2969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1425:54077/1 at Lines 2965 and 2969
> ERROR on Line 2973: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1425:54077/1 at Lines 2965 and 2973
> ERROR on Line 3193: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1426:11367/1 at Lines 3189 and 3193
> ERROR on Line 3969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1428:27238/1 at Lines 3965 and 3969
> ERROR on Line 4153: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1428:70127/1 at Lines 4149 and 4153
> ERROR on Line 4901: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:61256/1 at Lines 4897 and 4901
> ERROR on Line 4969: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:73752/1 at Lines 4965 and 4969
> ERROR on Line 5013: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:84096/1 at Lines 5009 and 5013
> ERROR on Line 5017: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1430:84096/1 at Lines 5009 and 5017
> ERROR on Line 5105: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:6958/1 at Lines 5101 and 5105
> ERROR on Line 5173: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:22113/1 at Lines 5169 and 5173
> ERROR on Line 5225: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:29783/1 at Lines 5221 and 5225
> ERROR on Line 5517: Repeated Sequence Identifier:
> HWI-ST1407:67:C1EP1ACXX:1:1101:1431:94291/1 at Lines 5513 and 5517
> Finished processing
> /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R1.fq
> with 580178304 lines containing 1218786400 sequences. There were a
> total of 14924940 errors. Returning: 1 : FASTQ_INVALID

If I look in the @RG of the original bam using

samtools view -H FI43683.bam | grep '@RG'

The output is as follows:

@RG ID:QCMG:20130118043852808   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_3
@RG ID:QCMG:20130118055238928   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_2
@RG ID:QCMG:20130118062743433   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_5
@RG ID:QCMG:20130118064849885   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_4
@RG ID:QCMG:20130118121216373   PL:ILLUMINA CN:QCMG PI:330  DT:2013-01-03T00:00:00+00:00    PM:Illumina HiSeq 2000  LB:WGS:QCMG:Library_20121203_T  SM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 PU:QCMG:BC1EP1ACXX_1
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118043852808\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_3\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_0    PN:bwa  VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118055238928\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_2\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_1    PN:bwa  PP:bamsort_0    VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118062743433\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_5\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_2    PN:bwa  PP:bamsort_1    VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118064849885\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_4\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_3    PN:bwa  PP:bamsort_2    VN:0.7.8-r455
@PG CL:/glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/bin/PCAP-core-1.1.1/bin/bwa mem -t 8 -p -T 0 -R @RG\tID:QCMG:20130118121216373\tCN:QCMG\tDT:2013-01-03T00:00:00+00:00\tLB:WGS:QCMG:Library_20121203_T\tPI:330\tPL:ILLUMINA\tPM:Illumina HiSeq 2000\tPU:QCMG:BC1EP1ACXX_1\tSM:a4beedc3-0e96-4e1c-90b4-3674dfc01786 /glusterfs/netapp/homes1/BOCONNOR/provisioned-bundles/Workflow_Bundle_BWA_2.6.0_SeqWare_1.0.15/Workflow_Bundle_BWA/2.6.0/data/reference/bwa-0.6.2/genome.fa.gz -  ID:bwa_4    PN:bwa  PP:bamsort_3    VN:0.7.8-r455

Whereas in the new bam (as expected) is like this:

@RG ID:FI43683  SM:FI43683  PL:ILLUMINA
@PG ID:bwa  PN:bwa  VN:0.7.15-r1140 CL:bwa mem -p -M -t 16 -R @RG\tID:FI43683\tSM:FI43683\tPL:ILLUMINA /imppc/labs/lplab/msubirana/../share/marc/refgen/hg38/hg38.fa /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R1.fq /imppc/labs/lplab/share/marc/scarpa/processed/bam/hg37/FI43683_bedTools_R2.fq

Any suggestions?

gatk bwa wgs fastQValidator samtools • 2.0k views

ADD COMMENT • link 5.6 years ago by mb2subi ▴ 10

0

Entering edit mode

5.6 years ago

ATpoint 85k

Make sure your BAM is name-sorted or at least shuffled/grouped (samtools sort or collate) prior to converting to fastq. BWA expects random fastq order. For conversion I have good experience with samtools fastq -n. I think the issue was that your BAM file was indeed not name-sorted. Therefore reads not actually belonging together in terms of being mates were grouped, which is reflected by the poor percentage of properly-paired reads. Repeat everything with the above suggestions and see what comes out. Maybe test with a subset of the original BAM to have things a bit faster.

ADD COMMENT • link 5.6 years ago by ATpoint 85k

score 3 · Accepted Answer · 2019-04-24

3

Entering edit mode

5.6 years ago

WouterDeCoster 47k

I'd suggest taking a look at BAZAM: https://github.com/ssadedin/bazam

ADD COMMENT • link 5.6 years ago by WouterDeCoster 47k

0

Entering edit mode

It worked perfectly thanks!

ADD REPLY • link 5.6 years ago by mb2subi ▴ 10

0

Entering edit mode

This might be a good place for me to pose my extreme-newbie question. My hg19 full genome was produced for me by Dante Labs, and my next step is to convert (re-align?) one of the resultant files into 'hg38' for optimal results when I submit it to 'YFull'.

I set my Linux machine to use the Python 'crossmap' utility for this purpose, but I have now been informed that using this tool to re-align my 'BAM' file this way is not the correct path to use -- that I should use my 'FASTQ' files instead. So now I have been seeking out some utilities to re-align my FASTQ files from 'hg19' to 'hg38' instead -- and also found references to 'BAZAM', though none of the examples I have found (so far) specifically mention this type of conversion. Your positive response makes it seem that you were successful. Can you provide some tips? Thanks!

ADD REPLY • link 4.8 years ago by darethehair • 0

0

Entering edit mode

To convert a hg19 bam to hg38 you would indeed use bazam. To align reads in fastq format to GRCh38 you would use bwa mem.

I believe bazam uses bwa mem under the hood, and handles the correct conversion to fastq so you don't have to worry about those steps.

ADD REPLY • link 4.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks! Since Dante Labs provided the FASTQ files, I am going to attempt to use 'bwa mem' directly to format to GRCh38. All these large files are giving my HD and CPU a good workout! :)

ADD REPLY • link 4.8 years ago by darethehair • 0