Hi there, I am trying to identify the genes encoded in a QTL region. I downloaded the reference genome (RefSeq) along with the GFF, GTF and CDS files, and used the sequences of the QTL flanking primers to locate the QTL region on chromosome 2. Now I want to know which genes lie in this QTL region, and to extract their full-length sequences from the genome CDS file. Any help would be appreciated!
I tried something similar, but it did not work. Since my region of interest is on chromosome 2, I used 'grep' to extract the chromosome 2 genes from the whole-genome GFF/GTF file and made a chromosome 2-specific GFF/GTF. Then I tried to use the chromosome 2 GFF file to extract the coding sequences from the whole-genome CDS file, but I could not get the gene sequences out. I am probably not doing it the right way. Could you please elaborate a bit on your suggestion?
Can you provide a few gene IDs or accession numbers (if from RefSeq)?
I only have the QTL region, which I extracted from the RefSeq chickpea genome. Now I want to use the other files associated with the chickpea genome (the GFF, GTF and CDS files) to extract the genes encoded in the QTL region.
Do you have the GCF_000331145.1_ASM33114v1_genomic.gtf.gz file from RefSeq (LINK)? Once you identify the genes that overlap your region, you can use EntrezDirect to get the sequences. Here is one example:
You can also get the feature table file for this genome (LINK).
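For the "identify the region that overlaps" step, here is a rough Python sketch that selects by coordinates instead of grepping the whole chromosome. It assumes the standard 9-column GTF layout (sequence name, source, feature, start, end, score, strand, frame, attributes) and a gene_id "..." attribute on each line. The sequence name SEQ2, the coordinates and the LOC_A/LOC_B/LOC_C ids are made-up placeholders, not real chickpea values; replace them with the chromosome 2 accession used in your GTF and the positions of your flanking primers.

```python
import gzip
import re

# GTF attributes look like: gene_id "LOC..."; transcript_id "XM_...";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def open_gtf(path):
    """Open a plain or gzip-compressed GTF file as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def genes_in_region(gtf_lines, seqid, region_start, region_end):
    """Return gene_id values of 'gene' features overlapping the region.

    Coordinates are 1-based and inclusive, as in GTF/GFF.
    """
    hits = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header / comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[0] != seqid or cols[2] != "gene":
            continue
        start, end = int(cols[3]), int(cols[4])
        # Two inclusive intervals overlap when each starts before the other ends.
        if start <= region_end and end >= region_start:
            match = GENE_ID.search(cols[8])
            if match:
                hits.append(match.group(1))
    return hits


# Tiny in-memory example with placeholder names (not real annotation).
demo = [
    "#gtf-version 2.2\n",
    'SEQ2\tGnomon\tgene\t100\t500\t.\t+\t.\tgene_id "LOC_A"; \n',
    'SEQ2\tGnomon\tgene\t900\t1500\t.\t-\t.\tgene_id "LOC_B"; \n',
    'SEQ2\tGnomon\texon\t900\t1000\t.\t-\t.\tgene_id "LOC_B"; transcript_id "XM_B";\n',
    'SEQ1\tGnomon\tgene\t100\t500\t.\t+\t.\tgene_id "LOC_C"; \n',
]
print(genes_in_region(demo, "SEQ2", 400, 950))
```

On the real file you would pass the open file handle instead of the demo list, e.g. genes_in_region(open_gtf("GCF_000331145.1_ASM33114v1_genomic.gtf.gz"), your_chr2_accession, qtl_start, qtl_end), where the last three arguments are values you supply.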
Thanks, GenoMax, for helping me out! I downloaded the feature file and grepped the chromosome 2 entries. Now, how should I pass the chickpea genome files to -db and a list of features/IDs to the -id option of efetch?
I managed to run 'efetch' and could extract the gene sequences! I extracted the chromosome 2 features from the whole-genome feature file with 'grep' and then passed the list of mRNA IDs (XM_..) with the -input option (for the list of IDs). Can I use other IDs, such as locus IDs (LOC) and protein IDs (XP_..), to extract the protein sequences? I tried 'efetch' with the other IDs, but it did not work.
You need to use -db protein with XP_ IDs, since those are proteins. I will take a look at LOC later today.
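To illustrate why the database has to match the ID type, here is a small Python sketch that picks the database from the RefSeq accession prefix and builds the corresponding NCBI E-utilities efetch URL (the web service behind EntrezDirect's efetch). The helper names db_for and efetch_fasta_url are hypothetical, and XP_000000000.1 / XM_000000000.1 are placeholder accessions, not real records. LOC ids are deliberately not handled here.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# RefSeq accession prefix -> Entrez database holding that record type.
PREFIX_TO_DB = {
    "XM_": "nuccore",  # predicted mRNA
    "NM_": "nuccore",  # curated mRNA
    "XP_": "protein",  # predicted protein
    "NP_": "protein",  # curated protein
}


def db_for(accession):
    """Return the Entrez database for a RefSeq accession (hypothetical helper)."""
    for prefix, db in PREFIX_TO_DB.items():
        if accession.startswith(prefix):
            return db
    raise ValueError(f"no database known for accession {accession!r}")


def efetch_fasta_url(accessions):
    """Build one efetch URL returning FASTA for same-type accessions."""
    if not accessions:
        raise ValueError("no accessions given")
    dbs = {db_for(acc) for acc in accessions}
    if len(dbs) != 1:
        # XM_ and XP_ ids live in different databases: one request per type.
        raise ValueError("mixed nucleotide and protein ids; fetch them separately")
    params = {
        "db": dbs.pop(),
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


# Placeholder accession; substitute the XP_ ids from your feature table.
print(efetch_fasta_url(["XP_000000000.1"]))
```

The same split applies on the command line: a list of XM_ IDs goes with the nucleotide database, and a separate list of XP_ IDs goes with -db protein.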