Sorry, my question is where I can find GRCH38-lite.fa to download. I do not have it downloaded yet; I only have GRCH37-lite.fa. If you are asking me to grep GRCH37-lite.fa, this is how it looks:
What are the chromosomes inside those two files? DNA chromosomes.
Is lite primary alignment? Yes.
Description from the README of GRCH37-lite.fa:
GRCh37-lite is a subset of the full GRCh37 human genome assembly (assembly accession GCA_000001405.1) plus the human mitochondrial genome reference sequence (the "rCRS") from Mitomap.org. This set of sequences excludes all the alternate loci scaffolds of the full GRCh37 assembly, and has the pseudo-autosomal regions (PARs) on chromosome Y masked with Ns. This haploid representation of the genome is provided as a convenience for use in alignment pipelines that cannot handle the multiple placements expected in the PARs and in regions of the genome that are represented by the alternate loci.
The header
>1 CM000663.1 Homo sapiens chromosome 1, GRCh37 primary reference assembly
>2 CM000664.1 Homo sapiens chromosome 2, GRCh37 primary reference assembly
>3 CM000665.1 Homo sapiens chromosome 3, GRCh37 primary reference assembly
And the output of grep "^>" all_sequences.fa | head looks like the following:
You told me in your last message that there were 5000 lines in your output, but I only see counts of 84 and 123 there. 84 entries should be the 24 chromosomes plus some unplaced/unlocalized scaffolds, so GRCh37-lite.fa corresponds to what is called the primary assembly file for GRCh38; the link I sent you will be good.
I do not know what is inside all_sequences.fa. Where did you download it from? It seems to have some viruses in there.
What you are looking for is the "Primary" assembly file.
Primary assembly contains all toplevel sequence regions excluding
haplotypes and patches. This file is best used for performing sequence
similarity searches where patch and haplotype sequences would confuse
analysis.
You can find those sequences here at Ensembl (large download) or NCBI (large download).
The NCBI sequence file contains the following:
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
A gzipped file that contains FASTA format sequences for the following:
1. chromosomes from the GRCh38 Primary Assembly unit. Note: the two PAR regions on chrY have been hard-masked with Ns. The chromosome Y sequence provided therefore has the same coordinates as the GenBank sequence but it is not identical to the GenBank sequence. Similarly, duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked with Ns (locations of the unmasked copies are given below).
2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
5. Epstein-Barr virus (EBV) sequence Note: The EBV sequence is not part of the genome assembly but is included in the analysis set as a sink for alignment of reads that are often present in sequencing samples.
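To check which of the five categories above a downloaded file really contains, the headers can be grouped by name. The following is a minimal sketch, assuming the file uses UCSC-style sequence names (chr1, chr1_..._random, chrUn_..., chrM, chrEBV); the function name summarise_fasta_headers is made up for this example, and the name patterns are an assumption that should be checked against the actual headers.

```shell
#!/bin/sh
# Summarise the sequence names of a FASTA file by category.
# ASSUMPTION: UCSC-style names (chr1, chr1_..._random, chrUn_..., chrM, chrEBV).
# summarise_fasta_headers is a hypothetical helper, not part of any tool.
summarise_fasta_headers() {
    # $1: path to an uncompressed FASTA file
    grep '^>' "$1" | awk '
        {
            name = substr($1, 2)                  # first word, without ">"
            if (name == "chrM")                   kind = "mitochondrial"
            else if (name == "chrEBV")            kind = "decoy_EBV"
            else if (name ~ /^chrUn_/)            kind = "unplaced"
            else if (name ~ /_random$/)           kind = "unlocalized"
            else if (name ~ /_alt$/)              kind = "alt"
            else if (name ~ /^chr([0-9]+|X|Y)$/)  kind = "chromosome"
            else                                  kind = "other"
            count[kind]++
        }
        END { for (k in count) print k "\t" count[k] }
    ' | LC_ALL=C sort
}
```

For the gzipped download, decompress it first (for example with gunzip -c into a new file) and then call summarise_fasta_headers on the result. An "alt" or "other" line in the output would suggest the file is not the no_alt analysis set.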
No, I am not looking for the primary assembly.
I am looking for the genome-lite assembly file: the hg38 equivalent of GRCH37-lite.fa, and also the hg38 equivalent of the all_sequences.fa file.
GRCh37-lite is a subset of the full GRCh37 human genome assembly
(assembly accession GCA_000001405.1) plus the human mitochondrial
genome reference sequence (the "rCRS") from Mitomap.org. This set of
sequences excludes all the alternate loci scaffolds of the full GRCh37
assembly, and has the pseudo-autosomal regions (PARs) on chromosome Y
masked with Ns. This haploid representation of the genome is provided
as a convenience for use in alignment pipelines that cannot handle the
multiple placements expected in the PARs and in regions of the genome
that are represented by the alternate loci.
This is the file you are looking for.
If you need the equivalent of "all_sequences", i.e. including the alt haplotypes, then you should get the full sequence file from NCBI.
Note: Don't be put off by the fact that the hg38 files are not called lite or full. If you need those other viral sequences in the new file, append them to the hg38 reference.
What are the chromosomes inside those two files? Is lite primary alignment? Can you copy the output of:
Yes, I meant GRCh37-lite.fa, sorry.
I want the list of all entries in these files; just remove the head in your command, please.
If you only want primary assemblies you can take this one:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh38.primary_assembly.genome.fa.gz
From: https://www.gencodegenes.org/human/
Thanks for the links. It is a big file. Do you want me to copy all lines here?
There are more than 5000 lines; could you please tell me where I should post or upload them?
5000 lines would mean 5000 chromosomes, alternate loci, unplaced scaffolds... That is a lot. I do not understand the difference between your two files.
Could you try this:
grep -c "^>" GRCh37-lite.fa
84
grep -c "^>" all_sequence.fa
123
grep "^>" GRCh37-lite.fa | tail -10
grep "^>" all_sequences.fa | tail -10
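Beyond counting and tailing the headers, the two files can be compared directly to see which entries one has that the other lacks. A minimal sketch, assuming both files use the same naming scheme for the sequences they share (the first word of each header line); fasta_names and fasta_extra_names are helper names invented for this example.

```shell
#!/bin/sh
# Print the sequence names of a FASTA file (first word of each header,
# without the leading ">"), sorted. Hypothetical helper for this sketch.
fasta_names() {
    grep '^>' "$1" | awk '{print substr($1, 2)}' | LC_ALL=C sort
}

# Print the names that are present in the second FASTA file ($2)
# but absent from the first one ($1).
fasta_extra_names() {
    a=$(mktemp) && b=$(mktemp) || return 1
    fasta_names "$1" > "$a"
    fasta_names "$2" > "$b"
    # comm -13 hides lines unique to file 1 and lines common to both,
    # leaving only the lines unique to file 2.
    LC_ALL=C comm -13 "$a" "$b"
    rm -f "$a" "$b"
}
```

Calling fasta_extra_names GRCh37-lite.fa all_sequences.fa would then list only the entries that all_sequences.fa adds on top of the lite file, which is a much shorter list to post than every header.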
The README looks like this
I downloaded it from:
ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/special_requests/