2.7 years ago
adR
Hi all,
I have two questions. What is the main difference between the two genome files (Homo_sapiens.GRCh38.dna.primary_assembly.fa and Homo_sapiens.GRCh38.dna.fa) in the Ensembl database? And which one should I use for whole-exome sequence alignment?
I used Homo_sapiens.GRCh38.dna.fa for the alignment. Later, when I counted features with featureCounts as below, the whole count matrix was zero. I am wondering whether Homo_sapiens.GRCh38.dna.fa was the wrong file for my alignment.
featureCounts -t exon -g gene_id -a Homo_sapiens.GRCh38.105.gtf -o Ensembl_counts_gtf.txt *.bam
Best, amare
Have you read the README on the Ensembl website?
Yes, I did, but I was not able to understand it.
Well, first of all, I don't see any file in the repository called "Homo_sapiens.GRCh38.dna.fa", so I guess the file you have is "Homo_sapiens.GRCh38.dna.toplevel.fa". The difference between Homo_sapiens.GRCh38.dna.toplevel.fa.gz and Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz is that the second excludes the alternative sequences (haplotypes and patches). This link is old, but maybe it helps you understand the files and how to make use of them.
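If you want to see for yourself what kinds of sequences a FASTA file contains, a rough sketch like the one below tallies the headers by type. It assumes the file is uncompressed and that the second word of each header names the sequence type (for example "dna:chromosome" or "dna:scaffold"), which is how I understand Ensembl headers to be laid out. The function name is made up for illustration.

```shell
# Hypothetical helper: count FASTA headers by the type given in their second field.
# Assumes headers look like ">1 dna:chromosome chromosome:GRCh38:..." (an assumption).
fasta_header_types() {
    # Keep header lines only, tally the second whitespace-separated field,
    # then print "type<TAB>count" sorted by type.
    grep '^>' "$1" \
        | awk '{ n[$2]++ } END { for (t in n) print t "\t" n[t] }' \
        | sort
}
```

Running it on both files, e.g. `fasta_header_types Homo_sapiens.GRCh38.dna.primary_assembly.fa`, should make the difference in sequence content visible without reading the whole README.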
Make sure when you run featureCounts that the GTF and the genome FASTA file share the same chromosome names. Also see: Why is human genome FASTA file on GENCODE much smaller than that on ENSEMBL?
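One way to check the chromosome names is sketched below: it lists the sequence names that appear in both the FASTA headers and the first column of the GTF. It assumes both files are uncompressed, and the function names are hypothetical. If nothing is printed, the two files share no names (for example "1" in one and "chr1" in the other), which could explain counts that are all zero.

```shell
# Hypothetical helpers for comparing sequence names between a FASTA and a GTF.

# Names in the FASTA: text after '>' up to the first whitespace.
fasta_seqnames() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Names in the GTF: column 1, skipping '#' comment lines.
gtf_seqnames() {
    grep -v '^#' "$1" | cut -f1 | sort -u
}

# Print the names present in both files (empty output means no overlap).
shared_seqnames() {
    tmp=$(mktemp)
    fasta_seqnames "$1" > "$tmp"
    gtf_seqnames "$2" | comm -12 "$tmp" -
    rm -f "$tmp"
}
```

For example: `shared_seqnames Homo_sapiens.GRCh38.dna.primary_assembly.fa Homo_sapiens.GRCh38.105.gtf`. Since featureCounts matches the GTF against the names in the BAM files, it is also worth confirming that the BAM headers use those same names.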