Question

What is the major problem with this pipeline of SNPs analysis?

0

Entering edit mode

2.2 years ago

kgwkk2 • 0

First, I have several Aspergillus flavus (A kind of fungi species) illumina sequencing raw data as pair of fastq.gz file (sample1_filtered_1.fastq.gz and sample1_filtered_2.fastq.gz). And I wanted to assemble illumina fragment sequences and make SNP(single nucleotide polymorphism) analysis with the reference genome, Aspergillus flavus NRRL3357 as fasta file. At the end of the analysis, vcf (variant calling file) file had to be generated. So I made the pipeline based on old book(2018) and help of chatGPT.

Here is the command line I consisted.

1. Make index of reference genome.

bwa index -a bwtsw NRRL3357.fa

2. Assemble illumina fragments with reference genome and generate SAM format file.

bwa mem -R "@RG\tID:BP2-1\tSM:BP2-1\tPL:ILLUMINA\tPI:330" NRRL3357.fa BP2-1_filtered_1.fastq.gz BP2-1_filtered_2.fastq.gz > BP2-1.sam

3. Convert SAM format into BAM format.

samtools view -bhS BP2-1.sam > BP2-1.bam

4. Sorting and mark duplicated fragments

PICARD SortSam \
I=BP2-1.bam \
O=sorted_BP2-1.bam \
SO=coordinate 

PICARD MarkDuplicates \
I=sorted_BP2-1.bam \
O=dedup_sorted_BP2-1.bam \
METRICS_FILE=duplicates \
REMOVE_DUPLICATES=true \
CREATE_INDEX=True

5. Make reference genome dictionary for GATK.

PICARD CreateSequenceDictionary \
R=NRRL3357.fa \
O=NRRL3357.dict

6. Make reference genome index for GATK

samtools faidx NRRL3357.fa

7. Comparing BAM file of sample and reference genome fasta and generate VCF file.

gatk HaplotypeCaller \
-R NRRL3357.fa \
-I dedup_sorted_BP2-1.bam \
-O BP2-1.vcf
-ERC GVCF
-stand_call_conf 30.0

8. VCF filtering process.

gatk VariantFiltration \
  -V 100-8.vcf \
  --filter-expression "QD < 2.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || FS > 60.0 || SOR > 3.0" \
  --filter-name "my_snp_filter" \
  -O 100-8_filtered.vcf

The programs worked with no error, but some warning message. And the running time was too short and final VCF files contents were just like this.

#CHROM, POS, ID, RED, ALT, QUAL, FILTER, INFO, FORMAT, BP2-1
NC 054691.1, 1, -, T, <NON_REF>, -, -, END=4 GT:DP:GO:MIN_DP:PL, 0/0:1:0:1:0,0,0

The VCF file I wanted was like this.

#CHROM, POS, ID, RED, ALT, QUAL, FILTER, INFO, FORMAT, BP2-1
NC_036435.1,    330,    -, A, G, 213, -, DP=226;VDB=0.990697;SGB=164.624;RPB=0.42072;MQB=1;MQSB=1;BQB=0.0015177;MQ0F=0;ICB=0.416667;HOB=0.375;AC=2;AN=8;DP4=110,29,66,12;MQ=60, GT:PL:DP4, ./.:0,0,0:0,0,0,0

So what is the major problem with my command-line?

Genomics Fungi SNP VCF • 1.6k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 2.2 years ago by kgwkk2 • 0

0

Entering edit mode

I've formatted your post so it looks better. Please check to make sure nothing is lost in translation. - I removed a few backslashes from the 2 VCF blocks and the CreateSequenceDictionary block.

Also, how well do you know variant calling? "Old book + ChatGPT" gives us no idea of your level of knowledge.

ADD REPLY • link 2.2 years ago by Ram 45k

0

Entering edit mode

Thanks for your help :)

ADD REPLY • link 2.2 years ago by kgwkk2 • 0

0

Entering edit mode

I read basic Genome data analysis book for beginner and I just followed the command-line to generate VCF file, but there was lot of error, so I got assistance from chatGPT. That's all I can explain.

ADD REPLY • link 2.2 years ago by kgwkk2 • 0

Ram · Accepted Answer · 2023-03-31

7

Entering edit mode

2.2 years ago

Pierre Lindenbaum 166k

this is your main problem

gatk HaplotypeCaller \
-R NRRL3357.fa \
-I dedup_sorted_BP2-1.bam \
-O BP2-1.vcf
-ERC GVCF
-stand_call_conf 30.0

and call to HaplotypeCaller in GVCF mode should be followed by GATK GenotypeGVCF

So I made the pipeline based on old book(2018)

why and old tutorial when there is https://www.htslib.org/workflow/wgs-call.html

ADD COMMENT • link 2.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Thank you for your comment. So you mean

gatk GenotypeGVCFs \
   -R NRRL3357.fa \
   -V BP2-1.vcf \
   -O BP2-1.vcf.gz

should be added?

Or does that mean bcftools will fit more than gatk?

ADD REPLY • link updated 2.2 years ago by Ram 45k • written 2.2 years ago by kgwkk2 • 0

0

Entering edit mode

you can stick to gatk. But, again, using a GVCF without GenotypeGVCF is meaningless

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.
code_formatting

ADD REPLY • link 2.2 years ago by Ram 45k