2.8 years ago
Kyle
I have the following VCF file, with the following metadata:
test_vcf <- readVcf(open(VcfFile(file = "C3H_HeH.mgp.v5.snps.dbSNP142.vcf.gz",
index = "mouse-snps-all.annots.vcf.gz.tbi")))
str(test_vcf@info)
formal class 'DFrame' [package "S4Vectors"] with 6 slots
..@ rownames : NULL
..@ nrows : int 1678126
..@ listData :List of 4
.. ..$ INDEL: logi [1:1678126] FALSE FALSE FALSE FALSE FALSE FALSE ...
.. ..$ DP : int [1:1678126] 15 10 15 8 11 6 18 11 15 23 ...
.. ..$ DP4 :Formal class 'CompressedIntegerList' [package "IRanges"] with 5 slots
.. .. .. ..@ elementType : chr "integer"
.. .. .. ..@ elementMetadata: NULL
.. .. .. ..@ metadata : list()
.. .. .. ..@ unlistData : int [1:6712504] 0 0 10 5 0 0 8 2 0 0 ...
.. .. .. ..@ partitioning :Formal class 'PartitioningByEnd' [package "IRanges"] with 5 slots
.. .. .. .. .. ..@ end : int [1:1678126] 4 8 12 16 20 24 28 32 36 40 ...
.. .. .. .. .. ..@ NAMES : chr [1:1678126] "8" "9" "13" "16" ...
.. .. .. .. .. ..@ elementType : chr "ANY"
.. .. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. .. ..@ metadata : list()
.. ..$ CSQ :Formal class 'CompressedCharacterList' [package "IRanges"] with 5 slots
.. .. .. ..@ elementType : chr "character"
.. .. .. ..@ elementMetadata: NULL
.. .. .. ..@ metadata : list()
.. .. .. ..@ unlistData : chr [1:3365412] "A||||intergenic_variant||||||||" "G||||intergenic_variant||||||||" "A||||intergenic_variant||||||||"
"C||||intergenic_variant||||||||" ...
.. .. .. ..@ partitioning :Formal class 'PartitioningByEnd' [package "IRanges"] with 5 slots
.. .. .. .. .. ..@ end : int [1:1678126] 1 2 3 4 5 6 7 8 9 10 ...
.. .. .. .. .. ..@ NAMES : NULL
.. .. .. .. .. ..@ elementType : chr "ANY"
.. .. .. .. .. ..@ elementMetadata: NULL
.. .. .. .. .. ..@ metadata : list()
..@ elementType : chr "ANY"
..@ elementMetadata: NULL
..@ metadata : list()
I want to filter the VCF using values in the $CSQ column. However, this column is a ragged CompressedCharacterList: each element holds one or more strings (a character vector of length one or longer). This is biologically sensible, since the same site can have multiple consequence predictions, but it breaks every function I've tried so far (such as str_detect).
The only thing that I can get to work is iterating through the vcf and unlisting the indexed data:
csq <- info(test_vcf)$CSQ
bad_list <- logical(length(csq))
# TODO: fix this dumb loop
for (i in seq_along(csq)) {
  # csq[[i]] is a character vector; any() collapses it to a single TRUE/FALSE
  bad_list[i] <- any(stringr::str_detect(csq[[i]], "intergenic_variant"))
}
funct_SNPs <- test_vcf[!bad_list]
But this is dumb - what's a function that can handle this without making my eyes bleed?
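One loop-free way to approach this, sketched here in base R under stated assumptions: flatten the list once, run a single vectorized match over the flat character vector, then count hits per original element. The helper name `any_csq_match` and the toy `csq` list (including its gene and transcript placeholders) are made up for illustration. The sketch assumes `lengths()` and `unlist()` behave on a CompressedCharacterList as they do on a plain list, which should hold with IRanges loaded but is not verified against this file.

```r
# Toy stand-in for info(test_vcf)$CSQ: one character vector per variant.
# The annotation strings are invented placeholders, not real VEP output.
csq <- list(
  "A||||intergenic_variant||||||||",
  c("G|geneX|txX||missense_variant||||||||",
    "G||||intergenic_variant||||||||"),
  "C|geneY|txY||synonymous_variant||||||||",
  character(0)
)

# Hypothetical helper: TRUE for each element holding at least one match.
any_csq_match <- function(x, pattern) {
  n <- lengths(x)                    # number of annotations per variant
  group <- rep(seq_along(n), n)      # variant index of each flattened entry
  # One vectorized, literal (non-regex) match over all annotations at once
  hit <- grepl(pattern, unlist(x, use.names = FALSE), fixed = TRUE)
  # Count hits per variant; empty elements simply get zero
  tabulate(group[hit], nbins = length(n)) > 0
}

is_intergenic <- any_csq_match(csq, "intergenic_variant")
print(is_intergenic)
```

On the real object, the filter would then look something like `test_vcf[!any_csq_match(info(test_vcf)$CSQ, "intergenic_variant")]`. Note that this drops a variant if any of its predictions is intergenic; to drop only variants whose predictions are all intergenic, compare the per-variant hit count against `lengths(x)` instead of testing `> 0`.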