Question

randomreads.sh only produces reads for chr1 to chr7

0

Entering edit mode

19 months ago

berndmann ▴ 10

I used randomreads.sh from bbmap to generate reads from a fasta file generated with FastaAlternateReferenceMaker from GATK. It seems no matter which options I choose the script stops generating reads and chromosome 7 despite the fasta file contains all contigs from human_g1k_v37.fasta. See:

 sed -n 's/^>//p' alternate.fasta
1 1:1-249250621
2 2:1-243199373
3 3:1-198022430
4 4:1-191154276
5 5:1-180915260
6 6:1-171115067
7 7:1-159138663
8 8:1-146364022
9 9:1-141213431
10 10:1-135534747
11 11:1-135006516
12 12:1-133851895
13 13:1-115169878
14 14:1-107349540
15 15:1-102531392
16 16:1-90354753
17 17:1-81195210
18 18:1-78077248
19 19:1-59128983
20 20:1-63025520
21 21:1-48129895
22 22:1-51304566
23 X:1-155270560
24 Y:1-59373566
25 MT:1-16569
26 GL000207.1:1-4262
27 GL000226.1:1-15008
28 GL000229.1:1-19913
29 GL000231.1:1-27386
30 GL000210.1:1-27682
31 GL000239.1:1-33824
32 GL000235.1:1-34474
33 GL000201.1:1-36148
34 GL000247.1:1-36422
35 GL000245.1:1-36651
36 GL000197.1:1-37175
37 GL000203.1:1-37498
38 GL000246.1:1-38154
39 GL000249.1:1-38502
40 GL000196.1:1-38914
41 GL000248.1:1-39786
42 GL000244.1:1-39929
43 GL000238.1:1-39939
44 GL000202.1:1-40103
45 GL000234.1:1-40531
46 GL000232.1:1-40652
47 GL000206.1:1-41001
48 GL000240.1:1-41933
49 GL000236.1:1-41934
50 GL000241.1:1-42152
51 GL000243.1:1-43341
52 GL000242.1:1-43523
53 GL000230.1:1-43691
54 GL000237.1:1-45867
55 GL000233.1:1-45941
56 GL000204.1:1-81310
57 GL000198.1:1-90085
58 GL000208.1:1-92689
59 GL000191.1:1-106433
60 GL000227.1:1-128374
61 GL000228.1:1-129120
62 GL000214.1:1-137718
63 GL000221.1:1-155397
64 GL000209.1:1-159169
65 GL000218.1:1-161147
66 GL000220.1:1-161802
67 GL000213.1:1-164239
68 GL000211.1:1-166566
69 GL000199.1:1-169874
70 GL000217.1:1-172149
71 GL000216.1:1-172294
72 GL000215.1:1-172545
73 GL000205.1:1-174588
74 GL000219.1:1-179198
75 GL000224.1:1-179693
76 GL000223.1:1-180455
77 GL000195.1:1-182896
78 GL000212.1:1-186858
79 GL000222.1:1-186861
80 GL000200.1:1-187035
81 GL000193.1:1-189789
82 GL000194.1:1-191469
83 GL000225.1:1-211173
84 GL000192.1:1-547496

Is this an expected behavior or can one change this to produce all/different chromosomes?

Best,

Berndmann

randomreads.sh bbmap • 2.2k views

ADD COMMENT • link updated 18 months ago by GenoMax 148k • written 19 months ago by berndmann ▴ 10

0

Entering edit mode

Provide the command line used. My guess is your job ran out of memory.

ADD REPLY • link 19 months ago by GenoMax 148k

0

Entering edit mode

randomreads.sh ref=alternative.fasta out=somebam.bam length=120 coverage=0.0001665369 simplenames=T illuminanames=T -Xmx100G

ADD REPLY • link 19 months ago by berndmann ▴ 10

1

Entering edit mode

Not quite sure what you are going for here. You have set a pretty low coverage value so my guess is that may be one reason you do not have all chromosomes covered. You also are creating a BAM file. I thought you wanted to get fastq reads? You are also getting singe-end reads by this method (unless you set paired=t).

My suggestion would be to generate plenty of reads using randomreads.sh to get good coverage (pay attention to parameters for SNP etc they have default values) and then sample the number of reads you need from this large pool using reformat.sh.

ADD REPLY • link 19 months ago by GenoMax 148k

0

Entering edit mode

I will give this a try! What would you suggest for the SNP values. I never had to sample reads. The main goal is to simulate reads that mimic low-coverage data. That is also why I chose this extreme low coverage. But might be a better idea to subsample the coverage in a later step.

ADD REPLY • link 18 months ago by berndmann ▴ 10

0

Entering edit mode

Even after I added your suggestions the problem persists. I can increase the memory usage further but did not observe any large spike when the script ran.

Furthermore I highly doubt that this is a problem with my physical memory limit. See :

awk '/^(MemTotal|SwapTotal)/{print $2}' /proc/meminfo
131818812
303038460

Is this maybe a problem how randomreads.sh calculates mem-usage?

randomreads.sh ref=alternate.fasta out=reads_1 length=120 coverage=30 simplenames=T illuminanames=T -Xmx100G -paired=T
    java -ea -Xmx100G -cp current/ align2.RandomReads3 build=1 ref=alternate.fasta out=reads1 length=120 coverage=30 simplenames=T illuminanames=T -Xmx100G -paired=T
    Executing align2.RandomReads3 [build=1, ref=alternate.fasta, out=reads1, length=120, coverage=30, simplenames=T, illuminanames=T, -Xmx100G, paired=T]

    Writing reference.
    Executing dna.FastaToChromArrays2 [alternate.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=500, nodisk=false]

    Set genScaffoldInfo=true
    Deleting chr6.chrom.gz
    Deleting chr3.chrom.gz
    Deleting chr5.chrom.gz
    Deleting chr1.chrom.gz
    Deleting chr4.chrom.gz
    Deleting chr2.chrom.gz
    Deleting chr7.chrom.gz
    Writing chunk 1
    Writing chunk 2
    Writing chunk 3
    Writing chunk 4
    Writing chunk 5
    Writing chunk 6
    Writing chunk 7
    Waiting for writing to finish.
    Finished.
    snpRate=0.0, max=0, unique=true
    insRate=0.0, max=0, len=(0-0)
    delRate=0.0, max=0, len=(0-0)
    subRate=0.0, max=0, len=(0-0)
    nRate  =0.0, max=0, len=(0-0)
    genome=1
    PERFECT_READ_RATIO=0.0
    ADD_ERRORS_FROM_QUALITY=true
    REPLACE_NOREF=false
    paired=true
    read length=120
    reads=387310432
    insert size=40-390

ADD REPLY • link 18 months ago by berndmann ▴ 10

0

Entering edit mode

If you are also doing paired reads then explicitly indicate an out1=R1.fq.gz and out2=R2.fq.gz otherwise the reads would be written interleaved in a single file (out=).

ADD REPLY • link 19 months ago by GenoMax 148k

0

Entering edit mode

Ok. How do I create R1.fq.gz. and R2.fq.gz from a .fasta file generated in the way described in the TO question?

ADD REPLY • link 19 months ago by berndmann ▴ 10

0

Entering edit mode

You should do

randomreads.sh ref=alternate.fasta out1=embryo1_reads_R1.fq.gz out2=embryo1_reads_R2.fq.gz   length=120 coverage=30 simplenames=T illuminanames=T -Xmx100G -paired=T

ADD REPLY • link 19 months ago by GenoMax 148k

0

Entering edit mode

Well, ok. I missed the out parameter. Sorry for that kinda stupid question.

ADD REPLY • link 19 months ago by berndmann ▴ 10

score 0 · Answer 1 · 2023-06-28

0

Entering edit mode

19 months ago

GenoMax 148k

If you are drawing your conclusion that only 7 chromosomes are included based on this message in

    Deleting chr6.chrom.gz
    Deleting chr3.chrom.gz
    Deleting chr5.chrom.gz
    Deleting chr1.chrom.gz
    Deleting chr4.chrom.gz
    Deleting chr2.chrom.gz
    Deleting chr7.chrom.gz

log then be aware that the notation used here does not reflect actual chromosomes. BBMap tools concatenate multi-fasta files into internal file structure that is labeled with chrom* but that does not actually reflect chromosome names.

If you look at a normal BBMap index for a genome you will see file names that look like following. That is simply some internal file structure that BBMap uses. You should be getting reads from full genome.

$ ls BBMapIndex/ref/genome/1/
chr1.chrom.gz  chr2.chrom.gz  chr3.chrom.gz  chr4.chrom.gz  chr5.chrom.gz  chr6.chrom.gz  chr7.chrom.gz

ADD COMMENT • link 19 months ago by GenoMax 148k

0

Entering edit mode

I would like to dive deeper into the settings of randomreads.sh should I open a new question for this or is it fine to do this here if you are willing to help further?

ADD REPLY • link 19 months ago by berndmann ▴ 10

0

Entering edit mode

Check the in-line help (run randomreads.sh on its own to view). That may be sufficient to answer your questions. If you have specific questions post here and we can see.

ADD REPLY • link 19 months ago by GenoMax 148k

0

Entering edit mode

I'm still unsure if I think about this tool in the correct way. The main goal for me is to evaluate a genotype caller for a specific setup. For this setup I had to simulate a vcf in a specific way and now I want to generate low coverage reads that support the SNPs for those reads.

Is the the following pipeline still the way to go?

library(glue)
#Use FastaAlternateReferenceMaker to generate a FASTA file 

#https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker

call = glue("java -jar Genomics_prac_guide/tools/gatk4_2.jar gatk FastaAlternateReferenceMaker -R {reference} -O {output} -V {input}")
system(call)

#generate paired end reads that support the snps in the vcf

call = glue("Genomics_prac_guide/tools/bbmap/randomreads.sh ref=alternate.fasta out1=data/1kg_hg37/out/some_reads_R1.fq.gz out2=data/1kg_hg37/out/some_reads_R2.fq.gz   length=150 coverage=30 simplenames=T illuminanames=T -Xmx100G -paired=T")

system(call)

#align the reads to the same reference genome used to generate the the artifical fasta file and subsample to an extremly low coverage

#use only 1 thread to enure a deterministic output

call =  glue("Genomics_prac_guide/tools/bbmap/bbmap.sh ref=Genomics_prac_guide/reference/unsorted/hg19.fa in=data/1kg_hg37/some_reads_R1.fq.gz in2=data/1kg_hg37/some_reads_R2.fq.gz out=some_reads_determ.bam samplerate=0.0018 threads=1")
system(call)

call = glue("samtools sort some_reads.bam > some_reads_sort.bam")
system(call)

ADD REPLY • link 18 months ago by berndmann ▴ 10

0

Entering edit mode

At initial glance this looks OK. You are generating 30x coverage (which would be average). If you are only going to use one of the reads for alignments then your coverage may drop in half so you may want to use 60x in step before. You should probably turn off maxsnps= and other insertion/deletion parameters if you want to add no additional mutations to simulated reads.

Are you getting results that look reasonable?

ADD REPLY • link 18 months ago by GenoMax 148k

0

Entering edit mode

After I generated the reads I used mpileup to extract the the single reads for each position. I compared how many positions which are supported with at least one read can be found in the vcf file which I used to generate the reads. I discovered that only 21100 of the ~1 million positions on chr22 of the sample are supported by reads. This is possibly due to the very low coverage. The big problem is: For 99.5% of those positions with reads where the sample in the vcf is actually hom-ref we observe the ALT allele as the read.

Could this by any means be a problem of samtools mpileup? If not how could I prevent this GenoMax ?

ADD REPLY • link 18 months ago by berndmann ▴ 10

0

Entering edit mode

I split the fasta file I generated from the vcf by chromosomes . Here is the chromosome 22 of the fasta file and also the vcf file for chr22 of the sample. Maybe this is helpful to replicate my problem. https://we.tl/t-hSoXiGYUoC

ADD REPLY • link 18 months ago by berndmann ▴ 10

0

Entering edit mode

GenoMax Did you find time to look into this problem? Still looking forward to simulate reads for my vcf.

ADD REPLY • link 18 months ago by berndmann ▴ 10

0

Entering edit mode

Are you doing what I had suggested in a comment earlier in this thread (randomreads.sh only produces reads for chr1 to chr7 )?

I look at a lot of things on biostars so I may have lost track of what exactly is the current issue you are facing. This parent thread has been answered as far as I can tell.

ADD REPLY • link 18 months ago by GenoMax 148k

0

Entering edit mode

randomreads.sh only produces reads for chr1 to chr7 This is the current problem. I guess this happens since a fasta file only contains one haplotype of the variant which means if I create an AlternateFastaFile 50% of the information gets lost right? Does one has to create two AlternateFastaFile one containing the REF and one the ALT allele? If not how does bbmap actually add the genotype information from the vcf to the read?

ADD REPLY • link 18 months ago by berndmann ▴ 10

0

Entering edit mode

That seems logical. Randomreads does not know about alleles. Right now it is creating reads for both strands. You could turn on

samestrand=t    Generate paired reads on the same strand.

so the reads are generated only from top strand which should restrict them to allele you have in your fasta file.

Also set adderrors=f to limit adding errors.

ADD REPLY • link 18 months ago by GenoMax 148k