Hi all, I am new to bioinformatics and was hoping someone could help me with some issues I have with cellranger. I'm trying to run cellranger count on Drosophila melanogaster data, but I need a transcriptome reference to run it. I used this link to create the transcriptome reference from a genome sequence (FASTA) and gene annotations (GTF). According to it, the recommended genome file to download from Ensembl is annotated as "primary assembly"; in NCBI, it is "no alternative - analysis set." I couldn't find either of those titles on Ensembl or NCBI. I tried a couple of different GTF and FASTA files from FlyBase and NCBI, but I got errors and couldn't create a reference transcriptome from them. Then I tried the files below to create the reference:
ftp://ftp.ensemblgenomes.org/pub/metazoa/release46/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz
ftp://ftp.ensembl.org/pub/release-77/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.77.gtf.gz
I managed to create the reference, but when I run cellranger count with this reference transcriptome, I get an error for different replicates. To be more specific, the error is "Low Fraction Reads Confidently Mapped To Transcriptome", and it says I got "19.0%, but Ideal > 30%. This can indicate the use of the wrong reference transcriptome, a reference transcriptome with overlapping genes, poor library quality, poor sequencing quality, or reads shorter than the recommended minimum. Application performance may be affected."
Could you please tell me where I can find a reference transcriptome, or where I can find better GTF and FASTA files to create the reference myself? I appreciate your response, thanks!
In case it wasn't clear from the post: your GTF file uses the old Drosophila genome (dm3) coordinate system, while your FASTA file contains the sequence of the newest Drosophila genome (dm6). As a result, a huge number of gene coordinates will be incorrect in your reference, and that is the most likely reason for your low fraction of mapped reads.
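A quick way to catch this kind of mismatch before building a reference is to compare the assembly labels embedded in the two file names. The following is only a minimal sketch: it assumes Ensembl-style names of the form `<species>.<assembly>.dna.toplevel.fa.gz` and `<species>.<assembly>.<release>.gtf.gz`, and the helper names `ensembl_assembly` and `same_assembly` are hypothetical, made up for illustration. Files from other sources (FlyBase, NCBI) follow different naming schemes and would need their own check.

```python
import re


def ensembl_assembly(path_or_url: str) -> str:
    """Extract the assembly label from an Ensembl-style FASTA or GTF file name.

    Assumed naming (an assumption, not verified against every release):
      FASTA: <species>.<assembly>.dna[_rm|_sm].<type>.fa.gz
      GTF:   <species>.<assembly>.<release>.gtf[.gz]
    """
    name = path_or_url.rsplit("/", 1)[-1]

    # FASTA: the assembly is everything between the species and ".dna."
    fasta = re.match(r"^[^.]+\.(?P<asm>.+?)\.(?:dna_rm|dna_sm|dna)\.", name)
    if fasta:
        return fasta.group("asm")

    # GTF: the assembly is everything between the species and the release number
    gtf = re.match(r"^[^.]+\.(?P<asm>.+)\.\d+\.gtf(?:\.gz)?$", name)
    if gtf:
        return gtf.group("asm")

    raise ValueError(f"Not a recognised Ensembl-style file name: {name}")


def same_assembly(fasta: str, gtf: str) -> bool:
    """Strict check: True only if both files carry the identical assembly label."""
    return ensembl_assembly(fasta) == ensembl_assembly(gtf)


# The two files from the original post
fasta_url = (
    "ftp://ftp.ensemblgenomes.org/pub/metazoa/release46/fasta/"
    "drosophila_melanogaster/dna/"
    "Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz"
)
gtf_url = (
    "ftp://ftp.ensembl.org/pub/release-77/gtf/drosophila_melanogaster/"
    "Drosophila_melanogaster.BDGP5.77.gtf.gz"
)

print(ensembl_assembly(fasta_url))       # BDGP6.28
print(ensembl_assembly(gtf_url))         # BDGP5
print(same_assembly(fasta_url, gtf_url)) # False
```

The comparison is deliberately strict, so it flags any difference in the label; the safest fix is to download the FASTA and the GTF from the same release directory of the same site.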
It was very clear, thank you very much for explaining the solution!
Thank you for your help! Using the files you mentioned, I got a warning after running cellranger count on two replicates (the third one worked just fine). The warning is "Low Fraction Reads in Cells": I got 61.3%, but Ideal > 70%. The message continues: *"Application performance may be affected. Many of the reads were not assigned to cell-associated barcodes. This could be caused by high levels of ambient RNA or by a significant population of cells with a low RNA content, which the algorithm did not call as cells. The latter case can be addressed by inspecting the data to determine the appropriate cell count and using --force-cells."*
Do you think it's a good idea to use --force-cells? I would really appreciate it if you have any recommendations to fix this.
Ideal is 100%, but I frankly don't have much experience with 10X sequencing specifically. For other scRNA-seq technologies we see a huge variation in alignment percentages. If you are working with patient samples, especially in the case of disease, the cell quality is often much lower. I personally would move forward with alignment rates over 50-60%. However, it would be wise to go in and make sure that there are good correlations between all the replicates. On the other hand, if you are using something like cell lines... then this does seem a bit low.
If I were in your position, I would compare the results using "--force-cells" to the results without it, to see whether I really believe in the added cells.
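One way to start that comparison is to look at which barcodes each run calls as cells. This is only a rough sketch: it assumes each run wrote a gzipped, one-barcode-per-line file (for example `outs/filtered_feature_bc_matrix/barcodes.tsv.gz`; check the path for your Cell Ranger version), and the function names and the run directory names in the usage comment are hypothetical.

```python
import gzip


def read_barcodes(path):
    """Return the set of barcodes listed one per line in a plain or gzipped file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_calls(default_path, forced_path):
    """Summarise how the cell calls differ between a default and a --force-cells run."""
    default = read_barcodes(default_path)
    forced = read_barcodes(forced_path)
    return {
        "default": len(default),                    # cells called without --force-cells
        "forced": len(forced),                      # cells called with --force-cells
        "shared": len(default & forced),            # called in both runs
        "only_forced": sorted(forced - default),    # barcodes added by forcing
        "only_default": sorted(default - forced),   # barcodes lost by forcing
    }


# Example usage (paths are assumptions; adjust to your own run directories):
# summary = compare_calls(
#     "rep1_default/outs/filtered_feature_bc_matrix/barcodes.tsv.gz",
#     "rep1_forced/outs/filtered_feature_bc_matrix/barcodes.tsv.gz",
# )
# print(summary["default"], summary["forced"], len(summary["only_forced"]))
```

The barcodes in `only_forced` are the ones worth inspecting (UMI counts, genes detected, marker expression) before deciding whether they look like real cells or ambient RNA.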
Since the original question is about flies, we can safely eliminate that possibility :-)