Blog:Why there is a long list of unique entries in the current release of Drosophila melanogaster r6.06 [flybase]
0
1
Entering edit mode
9.2 years ago

Morning everyone,

I pulled down the current genome release of D.melanogaster (r6.06) and when I indexed the fasta file, I noticed something strange.

samtools faidx dmel-all-chromosome-r6.06.fasta.gz
head dmel6.fai

211000022279114    14983    225    80    81
211000022280270    13108    15621    80    81
211000022280187    13079    29118    80    81
211000022280742    12513    42586    80    81
211000022280763    12001    55481    80    81
211000022280476    6900    67856    80    81
211000022280437    5968    75066    80    81
211000022278493    2442    81332    80    81
211000022278317    1970    84028    80    81
3Cen_mapped_Scaffold_50_D1686    23238    86290    80    81

wc -l dmel6.fai
1870 dmel6.fai

It's quite a big fasta index containing 1870 different entries. Initially, I was very confused what are these entries and was expecting normal chromosome indices, like

egrep -v '211|map' dmel6.fai
rDNA    76973    6    60    61
X    23542271    78265    60    61
2L    23513712    24012912    60    61
2R    25286936    47918524    60    61
3L    28110227    73626913    60    61
3R    32079331    102205648    60    61
4    1348131    134819638    60    61
Y    3667352    136190241    60    61
mitochondrion_genome    19524    139918738    60    61

Then, after checking the previous releases, I skimmed through the official release paper, The Release 6 reference sequence of the Drosophila melanogaster genome and it states there:

Chromosome assignments of small scaffolds There remain 1,863 scaffolds representing 6.4 Mb (5.5 Mb without N's) of genomic sequences that have not been incorporated into the chromosome arm assemblies. These scaffolds range in size from 544 bp to 88,768 bp. The sequences of 49 of these scaffolds representing 1.9 Mb (1.7 Mb without N's) have been improved by sequence finishing; the remainder are unimproved scaffolds from the WGS3 assembly (Table 2). He et al. (He et al. 2012) mapped many of the Release 5 unmapped scaffolds to chromosomal regions using a series of mutant Drosophila strains bearing large chromosomal deletions and rearrangements. They performed comparative genome hybridization using DNA isolated from embryos of specific mutant genotypes and oligonucleotide microarrays representing the entire Release 5 sequence. Oligonucleotide sequences that did not hybridize to DNA of a particular Downloaded from genome.cshlp.org on August 26, 2015 - Published by Cold Spring Harbor Laboratory Press 15 deletion genotype were inferred to map within the deleted region. These data were consistent with the Release 5 cytogenetic map, validating the method. In Release 6, the scaffolds that were not incorporated into chromosome arm sequences were assigned to "auxillary sequence files" based on the He et al. data: Xmm, 2CEN, 3CEN, Ymm, XYmm, and U (unmapped) (Methods). For each set, the corresponding Release 6 scaffolds are represented as a series of individual scaffold sequences in FASTA format.

So, I will just remove them from my fasta file so as not to complicate things and for having a streamlined downstream analysis (ChIP-Seq/RNA-Seq). I hope it helps someone in future wondering about this.

# removing these entries from fasta file, for this we can use samtool faidx again but filtering for regions
samtools faidx dmel-all-chromosome-r6.06.fasta.gz `egrep -v '211|map' dmel6.fai | cut -f1 | tr '\n' ' '` | gzip - > dmel-cleaned-chromosome-r6.06.fasta.gz

There is only one thing, I can't figure out right now. Why the number of lines differ in my cleaned fasta which I just made and the original one. ​

gunzip -c dmel-cleaned-chromosome-r6.06.fasta.gz | wc -l
2294087
# the original file is smaller, is the `samtools or I` am missing something!!
gunzip -c dmel-all-chromosome-r6.06.fasta.gz | wc -l
1799366

Lets check the region (number of bases and lines) covered by each chromosome.

parallel 'echo {} & samtools faidx dmel-all-chromosome-r6.06.fasta.gz {} | grep -v ">" | wc -cl' ::: `egrep -v '211|map*' dmel6.fai | cut -f1 | tr '\n' ' '`

For each chromosome, it outputs, the number of lines in the fasta file and number of characters which are essentially the number of bases covered for each entry.

rDNA
   1283   78256
X
392372 23934643
2L
391896 23905608
2R
421449 25708385
3L
468504 28578731
3R
534656 32613987
4
  22469 1370600
Y
  61123 3728475
mitochondrion_genome
    326   19850

which also slightly differs from the entries in the paper. But when I run, it's consistent with the paper.

samtools faidx dmel-cleaned-chromosome-r6.06.fasta.gz 
rDNA    76973    6    60    61
X    23542271    78265    60    61
2L    23513712    24012912    60    61
2R    25286936    47918524    60    61
3L    28110227    73626913    60    61
3R    32079331    102205648    60    61
4    1348131    134819638    60    61
Y    3667352    136190241    60    61
mitochondrion_genome    19524    139918738    60    61

HTH

Cheers

drosophila ngs ChIP-Seq genome fasta • 3.0k views
ADD COMMENT
1
Entering edit mode

Useful information. You can simplify the filter command a bit using pyfaidx:

faidx dmel-all-chromosome-r6.06.fasta --regex '(?!211|map).* | gzip - > dmel-cleaned-chromosome-r6.06.fasta.gz

Regarding the different number of lines, it's possible that samtools is re-wrapping your lines at a smaller length? I'm not sure if it does this, but pyfaidx preserves the input line lengths.

ADD REPLY
0
Entering edit mode

Thanks for the input Matt.

ADD REPLY

Login before adding your answer.

Traffic: 1988 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6