I've used a samtools/bcftools pipeline to generate a diploid consensus sequence. The consensus sequences are in FASTQ format. I expected to get two sequences in each FASTQ file, one for each homologous chromosome, but when I open them I can only see a single sequence identifier. How does FASTQ encode which base belongs to which homologous chromosome? I can also see a long string of n's in the middle of the sequence. Is that where the two are separated?
Thanks!
Edited for clarity
What? samtools is not a genome assembler; you can't possibly have assembled a genome with it.
Not impossible, but genome assemblies are almost always given in FASTA format, not FASTQ.
It doesn't. Variation may be encoded in VCF format, or maybe in FASTG or GFA; the latter two are uncommon now but will probably become prevalent in the near future. Currently, diploid genome assemblies are generally represented as haploid FASTA files.
If your files do indeed represent a genome assembly, these runs of Ns probably mark gaps inside scaffolds. A scaffold is a set of adjacent contigs with some undetermined sequence between them, and that undetermined sequence is filled in with Ns.
Sorry, I might be misusing the word 'assembled'. I'm just starting to play around with bioinformatics and I'm still wrapping my head around the jargon.
For clarity: I took a mapped .bam file from the 1000 Genomes Project, ran it through samtools mpileup, passed the output to bcftools call, and then used the vcfutils vcf2fq program to convert the resulting BCF file to FASTQ format (along with some additional filtering steps). What word should I have used rather than 'assembled'? 'Variant called'?
Thanks for your patience
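One way to see what a single-record consensus FASTQ like this actually contains is to scan it for runs of N and for IUPAC ambiguity codes. The sketch below is only an illustration, not part of samtools or bcftools. It assumes, as I understand vcf2fq's output, that a heterozygous site is written as one two-base ambiguity code (R, Y, S, W, K or M) rather than as a second sequence. The function names and the demo record are made up for this example.

```python
import re

# IUPAC codes that each stand for two possible bases.
# Assumption: the consensus writes heterozygous calls with these codes.
HET_CODES = set("RYSWKM")


def read_fastq(lines):
    """Yield (name, sequence) pairs from FASTQ lines.

    Sequence and quality strings may wrap over several lines, so the
    quality block is consumed by length rather than by line count.
    """
    it = iter(lines)
    for line in it:
        line = line.strip()
        if not line:
            continue
        if not line.startswith("@"):
            raise ValueError("expected a FASTQ header, got: " + line[:20])
        name = line[1:].strip()
        parts = []
        for line in it:
            line = line.strip()
            if line.startswith("+"):
                break
            parts.append(line)
        seq = "".join(parts)
        # Skip as many quality characters as there are bases.
        seen = 0
        while seen < len(seq):
            qual = next(it, None)
            if qual is None:
                break  # truncated file: stop quietly
            seen += len(qual.strip())
        yield name, seq


def n_runs(seq):
    """Return 0-based, half-open (start, end) spans of consecutive N or n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]


def het_sites(seq):
    """Return (position, code) for every two-base ambiguity code."""
    return [(i, b.upper()) for i, b in enumerate(seq) if b.upper() in HET_CODES]


# Made-up record, wrapped over two lines like a real consensus would be.
demo = """@chr_demo
ACGTRnnnnACGY
TTNNa
+
IIIIIIIIIIIII
IIIII
""".splitlines()

for name, seq in read_fastq(demo):
    print(name, n_runs(seq), het_sites(seq))
```

On a real file you would pass an open file handle instead of `demo`. If the assumption holds, the ambiguity codes are where the two homologues differ, and because they are unphased, nothing in the file says which allele sits on which chromosome.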
Doesn't this require haplotype phasing?