Hi all,
I have a very simple question but I could not find the answer anywhere else.
Let's assume I have the human reference genome downloaded from Ensembl. The first line looks like this:
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
I understand that the lines after this line will have the reference DNA sequence for chromosome 1, from start position 1 to the end position 248956422. So, the fields after "GRCh38:" are {chromosome}:{start position}:{end position}:1 REF. What does the extra "1" represent after the "{end position}:"?
I am asking because I would like to extract the sequence for a subset of positions on the chromosome, and I want to make sure the header of that sequence is correct in my modified FASTA file. For example, if I want the sequence for only one gene on chromosome 1, with start position 1500 and end position 2500, would the following be the correct header for the sequence?
>1 dna:chromosome chromosome:GRCh38:1:1500:2500:1 REF
{add here the sequence between the positions 1500 and 2500}
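A minimal sketch of the extraction described above, in plain Python with no external libraries. The helper names (`read_fasta_record`, `extract_region`, `ensembl_style_header`) are hypothetical. The trailing field is passed through as `strand` on the assumption that Ensembl's location string follows the pattern coord_system:assembly:name:start:end:strand, with 1 meaning the forward strand. That is an assumption about the header, not something the file itself states.

```python
import io


def read_fasta_record(handle, name):
    """Return the full sequence of the record whose ID (first word after '>') is `name`."""
    parts = []
    found = False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if found:
                break  # reached the next record, stop reading
            found = line[1:].split()[0] == name
        elif found:
            parts.append(line)
    if not found:
        raise KeyError(f"no record named {name!r}")
    return "".join(parts)


def extract_region(seq, start, end):
    """Return seq[start..end] using 1-based, inclusive coordinates."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region falls outside the sequence")
    return seq[start - 1:end]


def ensembl_style_header(chrom, start, end, assembly="GRCh38", strand=1):
    """Build a header in the same layout as the Ensembl one (strand meaning is assumed)."""
    return f">{chrom} dna:chromosome chromosome:{assembly}:{chrom}:{start}:{end}:{strand} REF"


if __name__ == "__main__":
    # Tiny stand-in for the real genome file, just to show the calls.
    demo = io.StringIO(">1 dna:chromosome chromosome:GRCh38:1:1:20:1 REF\nACGTACGTAC\nGGGGTTTTCC\n")
    chrom_seq = read_fasta_record(demo, "1")
    print(ensembl_style_header("1", 5, 12))
    print(extract_region(chrom_seq, 5, 12))
```

Two caveats with this approach: many aligners appear to use only the text up to the first whitespace as the sequence name, so the rest of the header may be ignored anyway, and alignment coordinates will be relative to the extracted piece (position 1 of the new record corresponds to position 1500 of the chromosome).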
What does the extra :1 after the end position denote? Should I also modify it so that the alignment tools (e.g. STAR) interpret it correctly?
I would like to exclude all the remaining parts of the sequence because I know that no reads from those parts are present in my fastq files, and I don't want any read to be mapped to them by mistake. That's why I'm trying to make it look like the whole sequence of chromosome 1 consists of only the sequence of that one gene.
It would be great if anyone could give feedback on this. Thanks a lot in advance!
Some more background, if it helps: I would like to combine the reference sequence of one gene from one species with the whole genome sequence of another species, and I would like to keep the same formatting in the combined FASTA file.
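The combining step could be sketched as below, again in plain Python. The function names are hypothetical, and the check reflects one assumption: that sequence names (the first word after `>`) must be unique in the combined file, so a gene record named `1` would need renaming before being added to a genome that already has a chromosome `1`.

```python
def fasta_names(lines):
    """Collect sequence names: the first word after '>' on each header line."""
    names = []
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            names.append(words[0] if words else "")
    return names


def combine_fasta(genome_lines, extra_lines):
    """Concatenate two FASTA line sequences, refusing duplicate sequence names."""
    genome_lines = list(genome_lines)
    extra_lines = list(extra_lines)
    clash = set(fasta_names(genome_lines)) & set(fasta_names(extra_lines))
    if clash:
        raise ValueError(f"duplicate sequence names: {sorted(clash)}")
    # Drop blank lines and normalise line endings so records stay contiguous.
    kept = [line.rstrip("\n") for line in genome_lines + extra_lines if line.strip()]
    return "\n".join(kept) + "\n"
```

With real files, the two arguments can be open file handles, for example `combine_fasta(open("genome.fa"), open("gene.fa"))`, where both file names are placeholders.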
It's usually a bad idea. See the related thread: Exome Sequencing: Masking The Non-Genic Sequences ?