Question

Modifying fastq base at specific reference location on different length reads

5.3 years ago

yryan ▴ 10

Hi folks,

I'm interested in using Oxford Nanopore's Taiyaki tool to train a new basecaller for modified bases at a known position. To train the new model I need to modify the FASTQ (or the SAM, then convert back) for each fast5 file to mark this modified base. However, I have around 10k reads and, given the MinION's inherent error rate, this is not something I can edit with a regex, as far as I know.

Does anyone know of a method or script that can use a SAM file aligned to a consensus, where I can modify the base at a specific location? That would get around the issues above.

alignment next-gen sequence nanopore • 2.0k views

GenoMax · Accepted Answer · 2020-01-16

5.3 years ago

Pierre Lindenbaum 166k

or script that can use a sam file aligned to a consensus where I can modify the base at a specific location

see How to introduce artificial mutation in bam
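
For readers who want the underlying idea rather than the linked tool, the following is a minimal sketch in plain Python of how a reference position can be mapped to an offset in each read by walking the CIGAR string, which is what makes the edit independent of read length and of indels. The function names `read_index_at` and `substitute_base` are hypothetical, the code works on text SAM records only, and it is not the jvarkit implementation.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def read_index_at(ref_pos, start, cigar):
    """Return the 0-based index in SEQ aligned to the 1-based reference
    position ref_pos, or None if the read has no base there."""
    ref = start  # 1-based reference coordinate of the next aligned base
    idx = 0      # 0-based offset into SEQ
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":      # consumes reference and read
            if ref <= ref_pos < ref + n:
                return idx + (ref_pos - ref)
            ref += n
            idx += n
        elif op in "IS":     # consumes read only
            idx += n
        elif op in "DN":     # consumes reference only
            if ref <= ref_pos < ref + n:
                return None  # position falls in a deletion or skip
            ref += n
        # H and P consume neither reference nor read
    return None


def substitute_base(sam_line, ref_pos, new_base, expected=None):
    """Return the SAM record with the base aligned to ref_pos replaced.
    If expected is given, only replace when the read base matches it."""
    fields = sam_line.rstrip("\n").split("\t")
    start, cigar, seq = int(fields[3]), fields[5], fields[9]
    if cigar == "*" or seq == "*":
        return sam_line.rstrip("\n")  # unmapped or no sequence stored
    idx = read_index_at(ref_pos, start, cigar)
    if idx is not None and (expected is None or seq[idx] == expected):
        fields[9] = seq[:idx] + new_base + seq[idx + 1:]
    return "\t".join(fields)
```

Header lines (those starting with `@`) would need to be passed through unchanged. Note also that SAM stores SEQ on the reference strand, so reverse-strand reads are reverse-complemented again when converted back to FASTQ.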

that looks like just the thing, thanks!

Reply by yryan ▴ 10, 5.3 years ago

Please flag the question as answered if it fulfills your needs (green tick on the left).

Reply by Pierre Lindenbaum 166k, 5.3 years ago

I was wondering if I could get a bit more help... When I run the command

java -jar /bioinformatics_tools/jvarkit/dist/biostar404363.jar -o modified.bam -p basecalled.vcf original.bam

The output is only partially converted: T's become N's for the first 30 or so entries, and the remainder (~6k) are unchanged, even with no AF ratio in the VCF (below), which I'd assumed would convert all T's to N's.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##samtoolsVersion=1.9+htslib-1.9
##samtoolsCommand=samtools mpileup -v -f reads.fasta basecalled/basedcalled_sorted.bam
##reference=file://reads.fasta
##contig=<ID=X,length=6000>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency among genotypes, for each ALT allele, in the same order as listed">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
X   4605    .   T   N   .   .   .

Using the samtools tview command from the link, I see that only a small proportion are converted to N's, and these are the reads at the end of the terminal output; all of those at the beginning are unchanged. Is there anything I can do to change this?

Also, I realise this may be a bit much to ask, but would it be possible to allow non-canonical bases, say Y, in this workflow? This would be very useful for creating a training set for nanopore basecalling of novel modifications.

Reply by yryan ▴ 10, 5.3 years ago (updated 5.3 years ago by GenoMax 151k)

Hard to answer without seeing the BAM and the VCF. Please open an issue at https://github.com/lindenb/jvarkit/issues and narrow the BAM to the region around the position.

Reply by Pierre Lindenbaum 166k, 5.3 years ago