Question

How does FASTQ format show a diploid sequence?

1

7.3 years ago

Evoevo ▴ 10

I've used a samtools/bcftools pipeline to generate a diploid consensus sequence. The consensus sequences are in fastq format. I expected to get two sequences in each fastq file, one for each homologous chromosome, but when I open them I can only see a single sequence identifier. How does FASTQ encode which base belongs to which homologous chromosome? I can see a long string of n's in the middle of the sequence. Is that where they're separated?

Thanks!

Edited for clarity

Assembly sequence • 2.9k views

ADD COMMENT • link updated 6.8 years ago by FatihSarigol ▴ 260 • written 7.3 years ago by Evoevo ▴ 10

1

What?

I've used samtools to assemble some diploid genomes.

samtools is not a genome assembler, so you can't possibly have assembled a genome with it.

The assembled sequences are in fastq format.

Not impossible, but genome assemblies are almost always given in fasta format, not fastq.

How does FASTQ encode which base belongs to which homologous chromosomes?

It doesn't. Variation may be encoded in vcf format, or maybe in fastg or gfa; the latter two are uncommon now but will probably become prevalent in the near future. Currently, diploid genome assemblies are generally represented as haploid fasta files.

I can see a long string of n's in the middle of the sequence. Is that where they're separated?

If your files do indeed represent a genome assembly, these runs of Ns probably mark gaps within scaffolds: adjacent contigs are joined across an intervening stretch of undetermined sequence, which is filled with Ns.
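To see where such runs of Ns sit in a sequence, a minimal Python sketch like the one below can help. The function name `find_n_runs` and the `min_length` cutoff are made up for illustration and do not come from samtools or any other tool; matching is case-insensitive because the file in question uses lowercase n's.

```python
import re


def find_n_runs(sequence, min_length=10):
    """Return (start, end) pairs for runs of N/n at least min_length long.

    Coordinates are 0-based and half-open, like Python slices.
    min_length is an arbitrary illustrative threshold.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # Toy sequence: one long run of n's and one short run that is ignored.
    example = "ACGT" + "n" * 12 + "GGCC" + "NN" + "TT"
    print(find_n_runs(example))  # [(4, 16)]
```

The sketch only reports where the runs are. It cannot tell whether a run is a scaffold gap or a masked region, which depends on how the file was produced.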

ADD REPLY • link 7.3 years ago by h.mon 35k

0

Sorry, I might be misusing the word 'assembled'. I'm just starting to play around with bioinformatics and I'm still wrapping my head around the jargon.

For clarity, I took a mapped .bam file from the 1000 Genomes Project and ran it through samtools mpileup, passed that to bcftools call, then used the vcfutils.pl vcf2fq command to convert the resulting bcf file to fastq format (along with some additional filtering steps). What word should I have used rather than assembled? Variant called?

Thanks for your patience

ADD REPLY • link 7.3 years ago by Evoevo ▴ 10

0

Doesn't this require haplotype phasing?

ADD REPLY • link 7.3 years ago by Michael 55k

score 3 · Answer 1 · 2018-03-16

It uses the IUPAC ambiguity codes. For example, the "nucleotide" Y means that the position is heterozygous for C and T. If there is a Y in your sliding window, PSMC treats that window as heterozygous. In the prior step, SNPs are converted into these IUPAC codes so that diploid information can be represented in a single, haploid-like sequence.
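As a rough illustration of how those codes can be read back, here is a small Python sketch. The names `IUPAC_HET`, `genotype` and `heterozygous_sites` are made up for this example and are not part of samtools, bcftools or PSMC. Only the six standard two-base IUPAC codes are handled, and anything else (such as n) is treated as unknown.

```python
# Standard two-base IUPAC ambiguity codes and the pair of alleles each encodes.
IUPAC_HET = {
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}
PLAIN_BASES = set("ACGT")


def genotype(code):
    """Return the unordered pair of alleles for one consensus character.

    Plain bases are homozygous; unknown characters (e.g. N) give None.
    """
    code = code.upper()
    if code in PLAIN_BASES:
        return (code, code)
    return IUPAC_HET.get(code)


def heterozygous_sites(sequence):
    """List (0-based position, alleles) for every heterozygous character."""
    return [
        (i, IUPAC_HET[base.upper()])
        for i, base in enumerate(sequence)
        if base.upper() in IUPAC_HET
    ]


if __name__ == "__main__":
    consensus = "ACGYTTRnA"  # toy sequence, not real data
    print(genotype("Y"))                  # ('C', 'T')
    print(heterozygous_sites(consensus))  # [(3, ('C', 'T')), (6, ('A', 'G'))]
```

Note that the allele pairs are unordered. A Y says the site carries one C and one T, but not which homologous chromosome carries which, so the single sequence is an unphased diploid consensus rather than two haplotypes.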