Question

generating fasta of genome sequencing reads

0

Entering edit mode

5.2 years ago

dgrace999 • 0

Hello all, Forgive me I am new to bioinformatics. I have performed shotgun sequencing on archival museum samples, so the quality and depth of sequencing is low due to the degraded quality of archival tissue from which the DNA was extracted. Yet for most samples, the full mitochondrial genome was represented in the reads at high depth. After correcting for numts, I began my analyses.

I need to generate fastas of the mitochondrial genome for use in software for generating haplotype networks (PopArt, Network), and these require sequence alignments as input.

I have used bwa mem to map the reads to the mitogenome, followed by samtools view | samtools sort | adna-ldup | picard addorreplacereadgroups. Then used bcftools mpileup | bcftools call, to call genotypes/variants. (by the way since the mitogenome is maternal, they are haploid, do not have to worry about calling alleles, I used --ploidy 1) I then used bcftools consensus on the vcf file to generate the consensus fasta of the mitogenome. However, it seems that bcftools consensus generates a sequence that is highly biased to the reference and all of my haplotype maps cluster the sequenced individuals around the reference individual. I realize this is because of the low quality of the samples. I have tried to use -M and -a options, and a variety of filtering schemes, but I am still seeing that my sequences have reference bases in them, instead of the bases that were sequenced. I have also tried samtools mpileup | vcfutils.pl vcf2fastq method on the bam files to generate the consensus, but this generates fasta with IUPAC ambiguity codes, such that variants and called bases are swamped by ambiguity, the alignments produced by them do not show any divergence between individuals, and my haplotype map is one big circle of all samples as one haplotype.

I next tried to use reference guided assembly, instead of read mapping, with SOAP2. I had thought that this would output a sequence of bases as they are assembled at each site in order, but these too I now understand I would have to convert to sam to bam to vcf, and again I am in the same boat as above.

So my question is, is there a way to compile the assembled or mapped reads into a consensus sequence where each site represents the most frequently appearing base at that site across all reads scaffolded at that site, and that will not insert reference bases. I basically want the sequence of the mitogenome as-is from the reads themselves (I do not expect there to be gaps as the breadth of coverage of the mitogenome for these samples was 100%). I have been scouring for options, but I cannot seem to find any method that will just output the sequence of the reads as they have been mapped or assembled. Does anyone know if something like this exists?

sequence • 1.1k views

ADD COMMENT • link 5.2 years ago by dgrace999 • 0