PureCN prepare environment error
2
0
Entering edit mode
4.4 years ago

I'm trying to prepare the environment and assay specific files for the R package PureCN following https://www.bioconductor.org/packages/devel/bioc/vignettes/PureCN/inst/doc/Quick.html

The R version that I am using is R 4.0.1

The command from the PureCN vignette that I am using is:

$ Rscript $PURECN/IntervalFile.R --infile baits_hg19.bed \ 
    --fasta hg19.fa --outfile $OUT_REF/baits_hg19_intervals.txt \
    --offtarget --genome hg19 \
    --export $OUT_REF/baits_optimized_hg19.bed \
    --mappability wgEncodeCrgMapabilityAlign100mer.bigWig \
    --reptiming wgEncodeUwRepliSeqK562WaveSignalRep1.bigWig

I am using hg38, so I have followed the instructions to change 19 to 38 for the previous steps, so my command from is:

$ Rscript $PURECN/IntervalFile.R --infile baits_hg38.bed \ 
    --fasta hg38.fa --outfile $OUT_REF/baits_hg38_intervals.txt \
    --offtarget --genome hg38 \
    --export $OUT_REF/baits_optimized_hg38.bed \
    --mappability wgEncodeCrgMapabilityAlign100mer.bigWig \
    --reptiming wgEncodeUwRepliSeqK562WaveSignalRep1.bigWig

The error message that I am encountering and have not been able to figure out is:

Warning message:
package 'optparse' was built under R version 4.0.2
Warning message:
****no function found corresponding to methods exports from 'BSgenome' for: 'releaseName'**
Error in normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="hg38.fa": The system cannot find the file specified**
Execution halted

Any help or suggestions would be much appreciated.

R PureCN software error • 2.8k views
ADD COMMENT
0
Entering edit mode

Hello cg_ref_database,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
1
Entering edit mode

Thank you for doing that! I'll be sure to do so for future posts.

ADD REPLY
0
Entering edit mode
4.4 years ago

Hi, you sure hg38.fa is the correct path to the FASTA file?

  path[1]="hg38.fa": The system cannot find the file specified
  
ADD COMMENT
0
Entering edit mode

Thank you Markus, that was the problem for that part of the command. A new problem while still at the IntervalFile.R stage is that it gives the following error: [E::faidx_fetch_seq2] Failed to retrieve block. (Seeking in a compressed, .gzi unindexed, file?)

Warning message:
package 'optparse' was built under R version 4.0.2
Warning message:
no function found corresponding to methods exports from 'BSgenome' for: 'releaseName'
Warning message:

package 'GenomeInfoDb' was built under R version 4.0.2
INFO [2020-07-02 19:55:07] Loading PureCN 1.19.2...
Warning messages:
1: package 'VariantAnnotation' was built under R version 4.0.2
2: package 'Rsamtools' was built under R version 4.0.2
3: package 'XVector' was built under R version 4.0.2
INFO [2020-07-02 19:55:07] Processing hglft_genome_probes_65b83_cdc6f0.bed.gz...
WARN [2020-07-02 19:55:08] Found 27928 overlapping intervals, starting at line 38.
WARN [2020-07-02 19:55:09] Target intervals were not sorted.
INFO [2020-07-02 19:55:11] Splitting 16807 large targets to an average width of 400.
INFO [2020-07-02 19:55:11] Tiling off-target regions to an average width of 200000.
INFO [2020-07-02 19:55:11] Removing following contigs from off-target regions: chr4_GL000008v2_random,chr14_GL000009v2_random,chr14_GL000225v1_random,chr15_KI270727v1_random,chr16_KI270728v1_random,chr17_KI270729v1_random,chrUn_KI270442v1,chrUn_KI270743v1,CMV
WARN [2020-07-02 19:55:12] No mappability scores provided.
WARN [2020-07-02 19:55:12] No reptiming scores provided.
INFO [2020-07-02 19:55:12] Calculating GC-content...
[E::faidx_fetch_seq2] Failed to retrieve block. (Seeking in a compressed, .gzi unindexed, file?)
Error in value[[3L]](cond) :
   record 172301 (chr13:40366455-40558971) failed
  file: GRCh38.d1.vd1.fa
Calls: preprocessIntervals ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
In addition: Warning messages:
1: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrM, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, chr1_KI270711v1_random, chr1_KI270712v1_random, chr1_KI270713v1_random, chr1_KI270714v1_random, chr2_KI270715v1_random, chr2_KI270716v1_random, chr3_GL000221v1_random, chr4_GL000008v2_random, chr5_GL000208v1_random, chr9_KI270717v1_random, chr9_KI270718v1_random, chr9_KI270719v1_random, chr9_KI270720v1_random, chr11_KI270721v1_random, chr14_GL000009v2_random, chr14_GL000225v1_random, chr14_KI270722v1_random, chr14_GL000194v1_random, chr14_KI270723v1_random, chr14_KI270724v1_random, chr14_KI270725v1_random, chr14_KI270726v1_random, chr15_KI270727v1_random, chr16_KI270728v1_random, chr17_GL000205v2_random, chr17_KI270729v1_random, chr17_KI270730v1_random, chr22_KI270731v1_random, chr22_KI270732v1_random, chr22_KI270733v1_random, chr22_KI270734v1_random, chr22_KI270735v1_random, chr22_ [... truncated]
2: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chr4_GL000008v2_random, chr14_GL000009v2_random, chr14_GL000225v1_random, chr15_KI270727v1_random, chr16_KI270728v1_random, chr17_KI270729v1_random, chrUn_KI270442v1, chrUn_KI270743v1, CMV
  - in 'y': chr1_KI270766v1_alt, chr7_KI270803v1_alt, chr17_KI270909v1_alt, chr19_KI270938v1_alt, chr22_KI270879v1_alt
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
Execution halted
ADD REPLY
0
Entering edit mode

The problem that I was encountering above was that the infile needed to be sorted, compressed and indexed:

bedtools sort -i [infile] > [infile.sorted]
bgzip -i [infile.sorted]
ADD REPLY
0
Entering edit mode

Glad you figured it out. Bioconductor should be able to work with uncompressed Fasta files if they are sorted (the official genome build Fasta files are). See the underlying ?scanFa function.

ADD REPLY
0
Entering edit mode

My apologies, I still need assistance with the prepare environment step. Any help would be very much appreciated.

The new error that I'm encountering now is:

INFO [2020-07-07 01:07:10] Loading GCA_000001405.15_GRCh38_no_alt_analysis_set_100.bw...

INFO [2020-07-07 01:08:04] Loading PureCN 1.18.0...
INFO [2020-07-07 01:08:04] Processing hglft_genome_probes_65b83_cdc6f0.sorted.bed.gz...
WARN [2020-07-07 01:08:05] Found 27928 overlapping intervals, starting at line 38.
WARN [2020-07-07 01:08:07] Target intervals were not sorted.
INFO [2020-07-07 01:08:17] Splitting 16807 large targets to an average width of 400.
INFO [2020-07-07 01:08:45] Tiling off-target regions to an average width of 200000.
INFO [2020-07-07 01:08:45] Removing following contigs from off-target regions: chr4_GL000008v2_random,chr14_GL000009v2_random,chr14_GL000225v1_random,chr15_KI270727v1_random,chr16_KI270728v1_random,chr17_KI270729v1_random,chrUn_KI270442v1,chrUn_KI270743v1
WARN [2020-07-07 01:09:46] 298 intervals without mappability score (298 on-target).
INFO [2020-07-07 01:09:46] Removing 5109 intervals with low mappability score (<0.50).
WARN [2020-07-07 01:09:46] No reptiming scores provided.
INFO [2020-07-07 01:09:46] Calculating GC-content...
Error in value[[3L]](cond) :
   record 172300 (chr14:16023139-16053427) contains invalid DNA letters
  file: GRCh38.d1.vd1.compress.bgzip.fa.gz
Calls: preprocessIntervals ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
In addition: Warning messages:
1: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrM, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, chr1_KI270711v1_random, chr1_KI270712v1_random, chr1_KI270713v1_random, chr1_KI270714v1_random, chr2_KI270715v1_random, chr2_KI270716v1_random, chr3_GL000221v1_random, chr4_GL000008v2_random, chr5_GL000208v1_random, chr9_KI270717v1_random, chr9_KI270718v1_random, chr9_KI270719v1_random, chr9_KI270720v1_random, chr11_KI270721v1_random, chr14_GL000009v2_random, chr14_GL000225v1_random, chr14_KI270722v1_random, chr14_GL000194v1_random, chr14_KI270723v1_random, chr14_KI270724v1_random, chr14_KI270725v1_random, chr14_KI270726v1_random, chr15_KI270727v1_random, chr16_KI270728v1_random, chr17_GL000205v2_random, chr17_KI270729v1_random, chr17_KI270730v1_random, chr22_KI270731v1_random, chr22_KI270732v1_random, chr22_KI270733v1_random, chr22_KI270734v1_random, chr22_KI270735v1_random, chr22_ [... truncated]
2: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': CMV
  - in 'y': chr11_KI270721v1_random, chr14_GL000194v1_random, chr14_KI270722v1_random, chr14_KI270723v1_random, chr14_KI270724v1_random, chr14_KI270725v1_random, chr14_KI270726v1_random, chr17_GL000205v2_random, chr17_KI270730v1_random, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, chr1_KI270711v1_random, chr1_KI270712v1_random, chr1_KI270713v1_random, chr1_KI270714v1_random, chr22_KI270731v1_random, chr22_KI270732v1_random, chr22_KI270733v1_random, chr22_KI270734v1_random, chr22_KI270735v1_random, chr22_KI270736v1_random, chr22_KI270737v1_random, chr22_KI270738v1_random, chr22_KI270739v1_random, chr2_KI270715v1_random, chr2_KI270716v1_random, chr3_GL000221v1_random, chr5_GL000208v1_random, chr9_KI270717v1_random, chr9_KI270718v1_random, chr9_KI270719v1_random, chr9_KI270720v1_random, chrEBV, chrM, chrUn_GL000195v1, chrUn_GL00021 [... truncated]
3: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chr4_GL000008v2_random, chr14_GL000009v2_random, chr14_GL000225v1_random, chr15_KI270727v1_random, chr16_KI270728v1_random, chr17_KI270729v1_random, chrUn_KI270442v1, chrUn_KI270743v1, CMV, chr11_KI270721v1_random, chr14_GL000194v1_random, chr14_KI270722v1_random, chr14_KI270723v1_random, chr14_KI270724v1_random, chr14_KI270725v1_random, chr14_KI270726v1_random, chr17_GL000205v2_random, chr17_KI270730v1_random, chr1_KI270706v1_random, chr1_KI270707v1_random, chr1_KI270708v1_random, chr1_KI270709v1_random, chr1_KI270710v1_random, chr1_KI270711v1_random, chr1_KI270712v1_random, chr1_KI270713v1_random, chr1_KI270714v1_random, chr22_KI270731v1_random, chr22_KI270732v1_random, chr22_KI270733v1_random, chr22_KI270734v1_random, chr22_KI270735v1_random, chr22_KI270736v1_random, chr22_KI270737v1_random, chr22_KI270738v1_random, chr22_KI270739v1_random, chr2_KI270715v1_random, chr2_KI270716v1_random, chr3_GL000221v1 [... truncated]
Execution halted
ADD REPLY
0
Entering edit mode

Okay, the problem was with the GRCh38.d1.vd1.fa that I obtained from the cancer.gov website (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/#step-1-converting-bams-to-fastqs-with-biobambam-biobambam2-2054) and that I used to process my tumor-only samples with. When I used GATK's fasta file, bundle/hg38/Homo_sapiens_assembly38.fasta, I was finally able to create the baits' txt and bed files.

My new fear is that I'll have to re-process all of the VCF files for all of the tumor-only samples using GATK's Homo_sapiens_assembly38.fasta file. I'll find out when I get to the step in the PureCN manual. Wish me luck, and thank you everyone (Markus) for any and all feedback that you provided as that helped me to finally complete this step.

To be continued...

ADD REPLY
0
Entering edit mode

Hm, this shouldn't be that difficult.

Did you run samtools faidx GRCh38.d1.vd1.fa? This should probably warn you when something is wrong with the Fasta.

ADD REPLY
0
Entering edit mode

Thank you for the samtools command to check if there was something wrong with the fasta. The fasta was fine. The problem appears to be with the liftover (37 -> 38) of the probe.bed file. A comparison between the liftover bed-file and the probe.bed in hg38 coordinates (I was unaware that this file was available, hence the liftover) revealed that there were five intervals that were not present in the liftover bed-file that were present in the bed-file that was already in hg38 coordinates.

ADD REPLY
0
Entering edit mode

Did you end up converting the input files for --mappability and --reptiming flags when you were trying to run with hg38? It seems like the input files you used were for hg19 configuration.

ADD REPLY
1
Entering edit mode

See section 2 of https://bioconductor.org/packages/release/bioc/vignettes/PureCN/inst/doc/Quick.html for links to precomputed map-ability files. Rep timing should be available from UCSC genome browser. It’s not critical and safe to run without.

ADD REPLY
0
Entering edit mode
4.4 years ago

Thanks. If you think this is something that should have been handled with a better error message, feel free to open a github issue with a minimal example to reproduce.

ADD COMMENT

Login before adding your answer.

Traffic: 1845 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6