Hello everybody,
I am using CNVkit on my data using hg38 as reference. The command that I am using is the following:
cnvkit.py batch sample.bam -n control.bam -m wgs -f reference.fasta --target-avg-size 1000 --output-dir results/
So, on the CNVkit documentation page is reported that
The
batch --method wgs
option uses the given reference genome’s sequencing-accessible regions (“access” BED) as the “targets” – these will be calculated on the fly if not provided.
Problem is, when it calculates accesibile regions from the reference, there are a lot of them listed in the log file that are not present in the final .bed output, and I cannot figure out why. Those are the outputs I get on a short example:
Log file:
WGS protocol: recommend '--annotate' option (e.g. refFlat.txt) to help locate genes in output files.
NC_000001.10: Scanning for accessible regions
Accessible region NC_000001.10:10000-177417 (size 167417)
Accessible region NC_000001.10:227417-267719 (size 40302)
Accessible region NC_000001.10:317719-471368 (size 153649)
Accessible region NC_000001.10:521368-2634220 (size 2112852)
Accessible region NC_000001.10:2684220-3845268 (size 1161048)
Accessible region NC_000001.10:3995268-13052998 (size 9057730)
Accessible region NC_000001.10:13102998-13219912 (size 116914)
Accessible region NC_000001.10:13319912-13557162 (size 237250)
Accessible region NC_000001.10:13607162-17125658 (size 3518496)
Accessible region NC_000001.10:17175658-29878082 (size 12702424)
Accessible region NC_000001.10:30028082-103863906 (size 73835824)
Accessible region NC_000001.10:103913906-120697156 (size 16783250)
Accessible region NC_000001.10:120747156-120936695 (size 189539)
Accessible region NC_000001.10:121086695-121485434 (size 398739)
Accessible region NC_000001.10:142535434-142731022 (size 195588)
Accessible region NC_000001.10:142781022-142967761 (size 186739)
Accessible region NC_000001.10:143117761-143292816 (size 175055)
Accessible region NC_000001.10:143342816-143544525 (size 201709)
Accessible region NC_000001.10:143644525-143771002 (size 126477)
Accessible region NC_000001.10:143871002-144095783 (size 224781)
Accessible region NC_000001.10:144145783-144224481 (size 78698)
Accessible region NC_000001.10:144274481-144401744 (size 127263)
Accessible region NC_000001.10:144451744-144622413 (size 170669)
Accessible region NC_000001.10:144672413-144710724 (size 38311)
Accessible region NC_000001.10:144810724-145833118 (size 1022394)
Accessible region NC_000001.10:145883118-146164650 (size 281532)
Accessible region NC_000001.10:146214650-146253299 (size 38649)
Accessible region NC_000001.10:146303299-148026038 (size 1722739)
Accessible region NC_000001.10:148176038-148361358 (size 185320)
Accessible region NC_000001.10:148511358-148684147 (size 172789)
Accessible region NC_000001.10:148734147-148954460 (size 220313)
Accessible region NC_000001.10:149004460-149459645 (size 455185)
Accessible region NC_000001.10:149509645-205922707 (size 56413062)
Accessible region NC_000001.10:206072707-206332221 (size 259514)
Accessible region NC_000001.10:206482221-223747846 (size 17265625)
Accessible region NC_000001.10:223797846-235192211 (size 11394365)
Accessible region NC_000001.10:235242211-248908210 (size 13665999)
Accessible region NC_000001.10:249058210-249240621 (size 182411)
NT_113878.1: Scanning for accessible regions
Accessible region NT_113878.1:0-106433 (size 106433)
NT_167207.1: Scanning for accessible regions
Accessible region NT_167207.1:0-547496 (size 547496)
NC_000002.11: Scanning for accessible regions
Accessible region NC_000002.11:10000-3529312 (size 3519312)
Accessible region NC_000002.11:3579312-5018788 (size 1439476)
Accessible region NC_000002.11:5118788-16279724 (size 11160936)
Accessible region NC_000002.11:16329724-21153113 (size 4823389)
Accessible region NC_000002.11:21178113-31705550 (size 10527437)
Accessible region NC_000002.11:31705551-31725939 (size 20388)
Accessible region NC_000002.11:31726790-31816827 (size 90037)
Accessible region NC_000002.11:31816828-31816854 (size 26)
Accessible region NC_000002.11:31816855-31816858 (size 3)
Accessible region NC_000002.11:31816859-33092197 (size 1275338)
Accessible region NC_000002.11:33093197-33141692 (size 48495)
Accessible region NC_000002.11:33142692-87668206 (size 54525514)
Accessible region NC_000002.11:87718206-89630436 (size 1912230)
Accessible region NC_000002.11:89830436-90321525 (size 491089)
Accessible region NC_000002.11:90371525-90545103 (size 173578)
Accessible region NC_000002.11:91595103-92326171 (size 731068)
Accessible region NC_000002.11:95326171-110109337 (size 14783166)
Accessible region NC_000002.11:110251337-149690582 (size 39439245)
Accessible region NC_000002.11:149790582-234003741 (size 84213159)
Accessible region NC_000002.11:234053741-239801978 (size 5748237)
Accessible region NC_000002.11:239831978-240784132 (size 952154)
Accessible region NC_000002.11:240809132-243102476 (size 2293344)
Accessible region NC_000002.11:243152476-243189373 (size 36897)
NC_000003.11: Scanning for accessible regions
Accessible region NC_000003.11:60000-17137943 (size 17077943)
NT_113878.1: Joining over small gaps
NT_167207.1: Joining over small gaps
Wrote GCF_000001405.25_GRCh37.p13_genomic.bed with 2 regions
So, multuple accessible regions are present for sequences NC_000001.10 , NC_000002.11 and NC_000003.11, but only regions for NT_113878.1 and NT_167207.1 are reported:
NT_113878.1 0 106433
NT_167207.1 0 547496
Why is that so? I cannot really figure out.
EDIT:
Following I report the .bam file header and the reference header, respectively:
.bam
@HD VN:1.6 SO:coordinate
@SQ SN:NC_000001.10 LN:249250621
@SQ SN:NT_113878.1 LN:106433
@SQ SN:NT_167207.1 LN:547496
@SQ SN:NC_000002.11 LN:243199373
@SQ SN:NC_000003.11 LN:17137943
ref.fasta
>NC_000001.10
>NT_113878.1
>NT_167207.1
>NC_000002.11
>NC_000003.11
cross-posted: https://bioinformatics.stackexchange.com/questions/20144/