How does the FASTQ format represent a diploid sequence?
7.3 years ago
Evoevo ▴ 10

I've used a samtools/bcftools pipeline to generate a diploid consensus sequence. The consensus sequences are in FASTQ format. I expected to get two sequences in each FASTQ file, one for each homologous chromosome, but when I open the files I can only see a single sequence identifier. How does FASTQ encode which base belongs to which homologous chromosome? I can also see a long string of n's in the middle of the sequence. Is that where the two are separated?

Thanks!

Edited for clarity

Assembly sequence • 2.9k views

What?

I've used samtools to assemble some diploid genomes.

samtools is not a genome assembler, so you can't have assembled a genome with it.

The assembled sequences are in fastq format.

Not impossible, but genome assemblies are almost always distributed in FASTA format, not FASTQ.

How does FASTQ encode which base belongs to which homologous chromosomes?

It doesn't. Variation may be encoded in VCF format, or perhaps in FASTG or GFA. The latter two are uncommon now but will probably become prevalent in the near future. Currently, diploid genome assemblies are generally represented as haploid FASTA files.

I can see a long string of n's in the middle of the sequence. Is that where they're separated?

If your files do indeed represent a genome assembly, these runs of Ns probably mark gaps within scaffolds, that is, places where adjacent contigs are joined across some undetermined intervening sequence. The undetermined sequence is filled in with Ns.
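To see where these runs sit and how long they are, something like the following minimal Python sketch could be used. It assumes the sequence has already been read into a single string; the function name n_runs and the min_len parameter are made up for illustration. It only reports where the runs are, not why they are there. In a consensus built from reads mapped to a reference, such runs may equally correspond to reference gaps or positions without usable coverage.

```python
import re

def n_runs(seq, min_len=1):
    """Return (start, end) pairs, 0-based and end-exclusive, for runs of N/n.

    min_len is a hypothetical filter so that very short runs can be ignored.
    """
    runs = []
    for match in re.finditer(r"[Nn]+", seq):
        if match.end() - match.start() >= min_len:
            runs.append((match.start(), match.end()))
    return runs

# Toy sequence, not real data
print(n_runs("ACGTnnnnACNNg"))             # [(4, 8), (10, 12)]
print(n_runs("ACGTnnnnACNNg", min_len=3))  # [(4, 8)]
```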


Sorry, I might be misusing the word 'assembled'. I'm just starting to play around with bioinformatics and I'm still wrapping my head around the jargon.

For clarity, I took a mapped .bam file from the 1000 Genomes Project, ran it through samtools mpileup, passed the output to bcftools call, and then used the vcf2fq command of vcfutils to convert the resulting BCF file to FASTQ format (along with some additional filtering steps). What word should I have used rather than "assembled"? "Variant called"?

Thanks for your patience


Doesn't this require haplotype phasing?

6.8 years ago
FatihSarigol ▴ 260

It uses the IUPAC ambiguity codes. For example, the "nucleotide" Y means that the position is heterozygous for C and T. If there is a Y in a sliding window, PSMC calls that window heterozygous. In the prior step, SNPs are converted into these IUPAC codes so that diploid information can be represented in a single, haploid-like sequence.
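As a rough illustration of how these codes can be read back, here is a minimal Python sketch. The two-base IUPAC codes are standard; the names IUPAC_TO_BASES, genotype and heterozygous_sites are made up for this example, and the input is a toy string rather than real consensus output. Note that the result is unphased: a Y says the two alleles are C and T, but not which homologous chromosome carries which.

```python
# Standard IUPAC codes for one base (homozygous) or two bases (heterozygous).
IUPAC_TO_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}

def genotype(code):
    """Return the unphased pair of alleles for a code, or None if unknown (e.g. N)."""
    bases = IUPAC_TO_BASES.get(code.upper())
    if bases is None:
        return None
    if len(bases) == 1:
        return (bases, bases)      # homozygous site
    return (bases[0], bases[1])    # heterozygous site, allele order is arbitrary

def heterozygous_sites(seq):
    """List (0-based position, alleles) for every two-base code in seq."""
    sites = []
    for pos, code in enumerate(seq):
        alleles = genotype(code)
        if alleles is not None and alleles[0] != alleles[1]:
            sites.append((pos, alleles))
    return sites

print(genotype("Y"))                   # ('C', 'T')
print(heterozygous_sites("ACYGnnRT"))  # [(2, ('C', 'T')), (6, ('A', 'G'))]
```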
