How to interpret the title of each sequence in a reference genome fasta file downloaded from Ensembl
19 months ago
hande • 0

Hi all,

I have a very simple question but I could not find the answer anywhere else.

Let's assume I have the human reference genome downloaded from Ensembl. The first line looks like this:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF

I understand that the lines after this line will have the reference DNA sequence for chromosome 1, from start position 1 to the end position 248956422. So, the fields after "GRCh38:" are {chromosome}:{start position}:{end position}:1 REF. What does the extra "1" represent after the "{end position}:"?

I am asking because I would like to extract the sequence for a subset of positions on the chromosome, and I want to make sure the title of that sequence is correct in my modified fasta file. For example, if I want the sequence for only one gene, located on chromosome 1 with start position 1500 and end position 2500, would the following be the correct title for the sequence?

>1 dna:chromosome chromosome:GRCh38:1:1500:2500:1 REF
{add here the sequence between the positions 1500 and 2500}

What does the extra :1 after the end position denote? Should I also modify it so that the alignment tools (e.g. STAR) interpret it correctly?

I would like to exclude all the remaining parts of the sequence because I know that no reads from those parts are present in my fastq files, and I don't want any read to be mapped to them by mistake. That is why I am trying to make it look as if the whole sequence of chromosome 1 consists only of the sequence of that one gene.

It would be great if anyone could give feedback on this. Thanks a lot in advance!

Some more background, if it helps: I would like to combine the reference sequence of one gene from one species with the whole genome sequence of another species, and I would like to keep the same formatting in the combined fasta file.
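As a rough illustration of the extraction described in this question, here is a minimal Python sketch. It assumes the coordinates in the title are 1-based and inclusive (consistent with chromosome 1 running from 1 to 248956422), and that the chromosome sequence has already been read into a single string. The function names `slice_region` and `region_header` are hypothetical helpers made up for this example, not part of any Ensembl or STAR tooling.

```python
def slice_region(sequence: str, start: int, end: int) -> str:
    """Return the bases from start to end, assuming 1-based inclusive coordinates."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(f"region {start}-{end} is outside 1-{len(sequence)}")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


def region_header(seq_id: str, assembly: str, start: int, end: int,
                  strand: int = 1, coord_system: str = "chromosome") -> str:
    """Build a title line in the same layout as the Ensembl example above."""
    location = f"{coord_system}:{assembly}:{seq_id}:{start}:{end}:{strand}"
    return f">{seq_id} dna:{coord_system} {location} REF"


if __name__ == "__main__":
    toy_chromosome = "ACGTACGTAC"  # stand-in for a real chromosome sequence
    print(region_header("1", "GRCh38", 3, 6))
    print(slice_region(toy_chromosome, 3, 6))
```

Many tools treat only the first whitespace-delimited token of the title (here `1`) as the sequence name, so it is worth checking how the aligner in use handles the rest of the line before relying on the description fields.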

reference fasta dna alignment genome sequence • 954 views

I would like to exclude all the remaining parts of the sequence because I know that no reads from those parts are present in my fastq files and I don't want any read to be mapped to those remaining parts by mistake.

It's usually a bad idea. See the related thread "Exome Sequencing: Masking The Non-Genic Sequences?"

19 months ago
Emily 24k

The final field is the strand: 1 means positive strand sequence, -1 means negative strand sequence.
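To make the field layout concrete, here is a small Python sketch that splits a title line like the one in the question into its parts. It assumes the location string always has six colon-separated fields (coordinate system, assembly, name, start, end, strand), as in the example given. The names `EnsemblHeader` and `parse_ensembl_header` are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class EnsemblHeader(NamedTuple):
    seq_id: str        # first token, e.g. "1"
    seq_type: str      # e.g. "dna:chromosome"
    coord_system: str  # e.g. "chromosome"
    assembly: str      # e.g. "GRCh38"
    name: str          # sequence region name, e.g. "1"
    start: int
    end: int
    strand: int        # 1 = positive strand, -1 = negative strand
    extra: str         # anything after the location, e.g. "REF"


def parse_ensembl_header(line: str) -> EnsemblHeader:
    """Split a title line of the form shown in the question into named fields."""
    tokens = line.strip().lstrip(">").split()
    seq_id, seq_type, location = tokens[0], tokens[1], tokens[2]
    coord_system, assembly, name, start, end, strand = location.split(":")
    return EnsemblHeader(seq_id, seq_type, coord_system, assembly, name,
                         int(start), int(end), int(strand),
                         " ".join(tokens[3:]))


if __name__ == "__main__":
    header = parse_ensembl_header(
        ">1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF")
    print(header.name, header.start, header.end, header.strand)
```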
