Morning everyone,
I pulled down the current genome release of D.melanogaster (r6.06) and when I indexed the fasta file, I noticed something strange.
samtools faidx dmel-all-chromosome-r6.06.fasta.gz
head dmel6.fai
211000022279114 14983 225 80 81
211000022280270 13108 15621 80 81
211000022280187 13079 29118 80 81
211000022280742 12513 42586 80 81
211000022280763 12001 55481 80 81
211000022280476 6900 67856 80 81
211000022280437 5968 75066 80 81
211000022278493 2442 81332 80 81
211000022278317 1970 84028 80 81
3Cen_mapped_Scaffold_50_D1686 23238 86290 80 81
wc -l dmel6.fai
1870 dmel6.fai
It's quite a big fasta index containing 1870 different entries. Initially, I was very confused what are these entries and was expecting normal chromosome indices, like
egrep -v '211|map' dmel6.fai
rDNA 76973 6 60 61
X 23542271 78265 60 61
2L 23513712 24012912 60 61
2R 25286936 47918524 60 61
3L 28110227 73626913 60 61
3R 32079331 102205648 60 61
4 1348131 134819638 60 61
Y 3667352 136190241 60 61
mitochondrion_genome 19524 139918738 60 61
Then, after checking the previous releases, I skimmed through the official release paper, The Release 6 reference sequence of the Drosophila melanogaster genome and it states there:
Chromosome assignments of small scaffolds There remain 1,863 scaffolds representing 6.4 Mb (5.5 Mb without N's) of genomic sequences that have not been incorporated into the chromosome arm assemblies. These scaffolds range in size from 544 bp to 88,768 bp. The sequences of 49 of these scaffolds representing 1.9 Mb (1.7 Mb without N's) have been improved by sequence finishing; the remainder are unimproved scaffolds from the WGS3 assembly (Table 2). He et al. (He et al. 2012) mapped many of the Release 5 unmapped scaffolds to chromosomal regions using a series of mutant Drosophila strains bearing large chromosomal deletions and rearrangements. They performed comparative genome hybridization using DNA isolated from embryos of specific mutant genotypes and oligonucleotide microarrays representing the entire Release 5 sequence. Oligonucleotide sequences that did not hybridize to DNA of a particular Downloaded from genome.cshlp.org on August 26, 2015 - Published by Cold Spring Harbor Laboratory Press 15 deletion genotype were inferred to map within the deleted region. These data were consistent with the Release 5 cytogenetic map, validating the method. In Release 6, the scaffolds that were not incorporated into chromosome arm sequences were assigned to "auxillary sequence files" based on the He et al. data: Xmm, 2CEN, 3CEN, Ymm, XYmm, and U (unmapped) (Methods). For each set, the corresponding Release 6 scaffolds are represented as a series of individual scaffold sequences in FASTA format.
So, I will just remove them from my fasta file so as not to complicate things and for having a streamlined downstream analysis (ChIP-Seq/RNA-Seq). I hope it helps someone in future wondering about this.
# removing these entries from fasta file, for this we can use samtool faidx again but filtering for regions
samtools faidx dmel-all-chromosome-r6.06.fasta.gz `egrep -v '211|map' dmel6.fai | cut -f1 | tr '\n' ' '` | gzip - > dmel-cleaned-chromosome-r6.06.fasta.gz
There is only one thing, I can't figure out right now. Why the number of lines differ in my cleaned fasta which I just made and the original one.
gunzip -c dmel-cleaned-chromosome-r6.06.fasta.gz | wc -l
2294087
# the original file is smaller, is the `samtools or I` am missing something!!
gunzip -c dmel-all-chromosome-r6.06.fasta.gz | wc -l
1799366
Lets check the region (number of bases and lines) covered by each chromosome.
parallel 'echo {} & samtools faidx dmel-all-chromosome-r6.06.fasta.gz {} | grep -v ">" | wc -cl' ::: `egrep -v '211|map*' dmel6.fai | cut -f1 | tr '\n' ' '`
For each chromosome, it outputs, the number of lines in the fasta file and number of characters which are essentially the number of bases covered for each entry.
rDNA
1283 78256
X
392372 23934643
2L
391896 23905608
2R
421449 25708385
3L
468504 28578731
3R
534656 32613987
4
22469 1370600
Y
61123 3728475
mitochondrion_genome
326 19850
which also slightly differs from the entries in the paper. But when I run, it's consistent with the paper.
samtools faidx dmel-cleaned-chromosome-r6.06.fasta.gz
rDNA 76973 6 60 61
X 23542271 78265 60 61
2L 23513712 24012912 60 61
2R 25286936 47918524 60 61
3L 28110227 73626913 60 61
3R 32079331 102205648 60 61
4 1348131 134819638 60 61
Y 3667352 136190241 60 61
mitochondrion_genome 19524 139918738 60 61
HTH
Cheers
Useful information. You can simplify the filter command a bit using pyfaidx:
Regarding the different number of lines, it's possible that samtools is re-wrapping your lines at a smaller length? I'm not sure if it does this, but pyfaidx preserves the input line lengths.
Thanks for the input Matt.