pipeline for variant discovery in target resequencing
4.0 years ago

Hello everyone, I'm new to bioinformatics and I'm developing a pipeline to identify variants in target resequencing data. I planned the analysis like this: FASTQ --> quality analysis with FastQC --> alignment to hg19 with BWA-MEM --> SAM to BAM with samtools --> sorting and indexing the BAM with samtools --> variant calling with VarDict or Mutect2. I decided not to trim the reads because these are Ion Torrent data, and I read in the manual that Ion Torrent uBAM files have already been trimmed. I have also not removed duplicates, because the sequencing library is amplicon-based and I therefore believe that removing duplicates would be harmful. Unfortunately, variant calling returns too many variants, even when I restrict the search regions with a BED file and apply filters. I believe this is due to false positives, which I do not know how to eliminate except by removing the duplicates.

Could someone suggest the right way to proceed? Thanks a lot! Sara
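
For readers who want to follow the steps listed above, here is a minimal Python sketch that only assembles the command lines and runs them in order. The file names, sample name, thread count, and the `PL:IONTORRENT` read-group value are assumptions for illustration, not details from the post. It also assumes the reference has already been indexed (`bwa index`, plus the `.fai` and `.dict` files that GATK needs), and it shows only the Mutect2 branch of the variant calling step.

```python
import subprocess
from pathlib import Path


def build_commands(fastq, reference, sample, targets, outdir="results", threads=4):
    """Assemble the commands for FastQC, BWA-MEM, samtools and Mutect2."""
    out = Path(outdir)
    bam = out / f"{sample}.sorted.bam"
    vcf = out / f"{sample}.vcf.gz"
    # bwa expands the literal "\t" sequences itself; the platform tag is an assumption.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:IONTORRENT"
    return {
        "fastqc": ["fastqc", "--outdir", str(out), str(fastq)],
        "align": ["bwa", "mem", "-t", str(threads), "-R", read_group,
                  str(reference), str(fastq)],
        # "-" makes samtools sort read the alignments from standard input.
        "sort": ["samtools", "sort", "-@", str(threads), "-o", str(bam), "-"],
        "index": ["samtools", "index", str(bam)],
        "call": ["gatk", "Mutect2", "-R", str(reference), "-I", str(bam),
                 "-L", str(targets), "-O", str(vcf)],
    }


def run(cmds, outdir="results"):
    """Run the steps in order, piping bwa mem straight into samtools sort."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(cmds["fastqc"], check=True)
    aligner = subprocess.Popen(cmds["align"], stdout=subprocess.PIPE)
    subprocess.run(cmds["sort"], stdin=aligner.stdout, check=True)
    aligner.stdout.close()
    if aligner.wait() != 0:
        raise RuntimeError("bwa mem failed")
    subprocess.run(cmds["index"], check=True)
    subprocess.run(cmds["call"], check=True)
```

Calling `build_commands("sample.fastq", "hg19.fa", "S1", "panel.bed")` returns the commands without executing anything, which makes it easy to inspect them before passing the result to `run`.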

TS ion torrent variant calling • 1.1k views

Well, if the duplicates come from PCR, then I'm not sure why you would say that using amplicons negates that. At any rate, for variant calling I doubt removing duplicates would make a huge difference, unless a sizeable proportion of the duplicates were of such low quality that they were leading to a variant not being called.

I mean that since this is target resequencing and specific regions of the genome have been amplified (an Ion AmpliSeq panel was used), it is normal to have many reads that map to the same positions in the genome, and they are not necessarily duplicates generated in the emulsion PCR. I tried removing the duplicates and the number of variants found is much lower, but I don't know whether this is the right thing to do, also because the duplication level reported by FastQC is very high (about 80%). Thanks for your answer!
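
To make the point about shared positions concrete, here is a small sketch of how coordinate-based duplicate marking behaves on amplicon reads. It is a simplified model (one read kept per chromosome, 5' start, and strand), not the exact logic of any particular tool, and the coordinates in the example are made up.

```python
from collections import Counter


def coordinate_duplicate_fraction(reads):
    """Fraction of reads that a purely coordinate-based scheme would flag.

    reads: iterable of (chrom, start, strand) tuples, one per mapped read.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # One read per distinct coordinate group is kept; the rest count as duplicates.
    duplicates = total - len(counts)
    return duplicates / total


# Hypothetical panel with two amplicons, five reads each, all starting at the primer site.
toy_reads = [("chr7", 55241600, "+")] * 5 + [("chr12", 25398200, "+")] * 5
print(coordinate_duplicate_fraction(toy_reads))
```

In this toy case eight of the ten reads would be flagged even if every read came from a different original molecule, which is why coordinates alone cannot separate PCR duplicates from independent amplicon reads.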

"but they may not be duplicates that are generated in the emulsion PCR"

You can't tell that for sure unless you have a unique molecular index (UMI) design.

I know, only with UMIs can we distinguish between true duplicates and unrelated reads with the same coordinates, but here we don't have UMIs. So my questions are: can I remove duplicates without losing important information? Or is there a way to limit false positives without removing duplicates?
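
One possible approach to the second question, offered only as a sketch and not as an answer given in this thread, is to filter calls on depth and allele fraction instead of removing duplicates. The code below assumes that the INFO column of the VCF carries `DP` and `AF` keys; other callers may store these values per sample in the FORMAT column instead. The thresholds are illustrative placeholders, not recommended values.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DP=1500;AF=0.32' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def passes(vcf_line, min_depth=100, min_af=0.05):
    """Keep a record only if depth and allele fraction reach the thresholds."""
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])  # INFO is the eighth VCF column
    try:
        depth = int(info["DP"])
        # For multi-allelic records only the first AF value is checked here.
        af = float(info["AF"].split(",")[0])
    except (KeyError, ValueError):
        return False  # missing or malformed annotations are rejected
    return depth >= min_depth and af >= min_af


# Made-up records for illustration.
good = "chr7\t55241707\t.\tG\tA\t.\t.\tDP=1500;AF=0.32"
low_af = "chr7\t55241708\t.\tC\tT\t.\t.\tDP=1500;AF=0.01"
print(passes(good), passes(low_af))
```

Header lines starting with `#` would need to be skipped before calling `passes` on a real file.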

What does the ts tag mean?

By the way, if you're starting from scratch, why not use GRCh38 instead of hg19?

TS = target resequencing.

Are there big differences between GRCh38 and hg19? Thank you!

Thanks, I've never heard of the ts abbreviation.

Yes, hg19 is really really old (~13 years). Even GRCh38 was released in 2013. GRCh38 is derived from a larger pool of people and hence is a better "reference". Plus, patches to GRCh38 make unambiguous mapping better. There's also the fact that most annotation resources have switched to GRCh38 co-ordinates as defaults. hg19 is only useful for legacy reasons.

You can google the differences and why the newer version is better. Here is a good read: https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19

About TS: I read this abbreviation in this interesting article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6861594/#b0130 Thanks a lot for your answer!
