Question

Extracting genes coded by a region of a chromosome

2.8 years ago

nagarsaggi ▴ 40

Hi there, I am trying to identify the genes encoded by a QTL region. I downloaded the reference genome (RefSeq) along with the GFF, GTF and CDS files, and used the sequences of the QTL flanking primers to locate the QTL region on chromosome 2. Now I want to know which genes this QTL region encodes, and to extract their full-length sequences from the genome CDS file. Any help would be appreciated!

gene sequence extraction • 1.8k views

ADD COMMENT • link updated 2.8 years ago by GenoMax 151k • written 2.8 years ago by nagarsaggi ▴ 40

score 1 · Accepted Answer · 2022-08-18

2.8 years ago

lieven.sterck 15k

I would use a two-step procedure:

First, use something like bedtools intersect to select the gene features that fall within your interval (the QTL). Then use that list of IDs to select the CDSs from the CDS FASTA file (e.g. with AGAT, seqtk, grep, ...).
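The two steps can be sketched with plain awk as a stand-in for bedtools and seqtk. This is only an illustration under assumptions: the file names, the QTL interval (1 to 15000) and the placeholder sequences are hypothetical, the toy GFF lines are built from two gene coordinates on NC_021161.1, and the `[gene=...]` tag in the FASTA headers is an assumed layout that should be checked against the real CDS file.

```shell
#!/usr/bin/env bash
# Toy inputs; file names, interval and sequences are placeholders.
workdir=$(mktemp -d)
gff="$workdir/toy.gff"
cds="$workdir/toy_cds.fna"
gene_ids="$workdir/qtl_gene_ids.txt"
qtl_cds="$workdir/qtl_cds.fna"

# GFF3 column order: seqid, source, type, start, end, score, strand, phase, attributes.
printf 'NC_021161.1\ttoy\tgene\t7476\t10839\t.\t+\t.\tID=gene-LOC101492476;Name=LOC101492476\n'  > "$gff"
printf 'NC_021161.1\ttoy\tgene\t18008\t19772\t.\t-\t.\tID=gene-LOC101492809;Name=LOC101492809\n' >> "$gff"

# CDS records with an assumed [gene=...] tag in the header.
printf '>cds_1 [gene=LOC101492476]\nATGAAATAA\n>cds_2 [gene=LOC101492809]\nATGCCCTAA\n' > "$cds"

# Hypothetical QTL interval.
chr="NC_021161.1"; qtl_start=1; qtl_end=15000

# Step 1: keep gene features overlapping the interval and print their Name.
# (With bedtools, `bedtools intersect -a genes.gff -b qtl.bed -u` plays this role.)
awk -F'\t' -v c="$chr" -v s="$qtl_start" -v e="$qtl_end" '
  $1 == c && $3 == "gene" && $4 <= e && $5 >= s {
    if (match($9, /Name=[^;]+/)) print substr($9, RSTART + 5, RLENGTH - 5)
  }' "$gff" > "$gene_ids"

# Step 2: print the FASTA records whose header names one of those genes.
awk '
  NR == FNR { want["[gene=" $1 "]"] = 1; next }   # first file: gene IDs
  /^>/      { keep = 0; for (k in want) if (index($0, k)) keep = 1 }
  keep' "$gene_ids" "$cds" > "$qtl_cds"

cat "$gene_ids" "$qtl_cds"
```

Matching on a tag inside the header, rather than on the whole header line, matters because the IDs in the annotation file and the FASTA headers are usually not written identically.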

ADD COMMENT • link 2.8 years ago by lieven.sterck 15k

I tried something similar but it did not work. Since my region of interest is on chromosome 2, I used grep to pull the chromosome 2 genes out of the whole-genome GFF/GTF file and made a chromosome 2-specific GFF/GTF. I then tried to use the chromosome 2 GFF file to extract the coding sequences from the whole-genome CDS file, but could not get the gene sequences out. I am probably not doing it the right way. Could you please elaborate a bit on your suggestion?

ADD REPLY • link 2.8 years ago by nagarsaggi ▴ 40

Can you provide a few gene IDs or accession numbers (if from RefSeq)?

ADD REPLY • link 2.8 years ago by GenoMax 151k

I only have the QTL region, which I extracted from the RefSeq chickpea genome. Now I want to use the other files associated with the chickpea genome (GFF, GTF and CDS) to extract the genes encoded by the QTL region.

ADD REPLY • link 2.8 years ago by nagarsaggi ▴ 40

Do you have the GCF_000331145.1_ASM33114v1_genomic.gtf.gz file from RefSeq (LINK)? Once you identify the genes that overlap your region, you can use Entrez Direct to get the sequences. Here is one example:

$ efetch -db nuccore -id XM_004485346 -format fasta
>XM_004485346.3 PREDICTED: Cicer arietinum protein phosphatase 1 regulatory subunit INH3-like (LOC101488545), transcript variant X1, mRNA
CTTCTTGAAACCAAAAAATGACAAAATAAGAACAACAACATATCAAGACACACAAAGGCTAAAATAACAC
AACTTACACTAAGATAATATTGGACCTGCTCTTGAGTTCGACATGACATTAACTCCCGAGTTCCTGTGAA
ACTTGGATCGGTAGAAATGTTGTAATCTTGATAAACAACTTTTGTTTTTAAAGACTTGGCAAAACGCAAT

You can also get the feature table file for this genome (LINK)

$ zgrep NC_021161.1 GCF_000331145.1_ASM33114v1_feature_table.txt.gz

gene    protein_coding  GCF_000331145.1 Primary Assembly        chromosome      Ca2     NC_021161.1     7476    10839   +                                       LOC101492476    101492476               3364            
mRNA            GCF_000331145.1 Primary Assembly        chromosome      Ca2     NC_021161.1     7476    10839   +       XM_004489033.3          XP_004489090.2  uncharacterized LOC101492476    LOC101492476    101492476               3220    3220    
CDS     with_protein    GCF_000331145.1 Primary Assembly        chromosome      Ca2     NC_021161.1     7490    10342   +       XP_004489090.2          XM_004489033.3  uncharacterized protein LOC101492476    LOC101492476    101492476               2853    950     
gene    protein_coding  GCF_000331145.1 Primary Assembly        chromosome      Ca2     NC_021161.1     18008   19772   -                                       LOC101492809    101492809               1765
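To narrow the feature table to a QTL interval rather than the whole chromosome, the start and end columns can be compared against the interval. Below is a sketch with awk, assuming the usual column order of RefSeq feature_table.txt files (feature in column 1, genomic_accession in 7, start in 8, end in 9, product_accession in 11) and a hypothetical interval of 1 to 15000. The toy table reuses the rows shown above; check the header line of the real file to confirm the column positions.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
table="$workdir/toy_feature_table.txt"   # hypothetical file name

# Rebuild the four rows above as a tab-separated toy table (first 16 columns).
pre='GCF_000331145.1\tPrimary Assembly\tchromosome\tCa2\tNC_021161.1'
{
  printf "gene\tprotein_coding\t${pre}\t7476\t10839\t+\t\t\t\t\tLOC101492476\t101492476\n"
  printf "mRNA\t\t${pre}\t7476\t10839\t+\tXM_004489033.3\t\tXP_004489090.2\tuncharacterized LOC101492476\tLOC101492476\t101492476\n"
  printf "CDS\twith_protein\t${pre}\t7490\t10342\t+\tXP_004489090.2\t\tXM_004489033.3\tuncharacterized protein LOC101492476\tLOC101492476\t101492476\n"
  printf "gene\tprotein_coding\t${pre}\t18008\t19772\t-\t\t\t\t\tLOC101492809\t101492809\n"
} > "$table"

# Hypothetical QTL interval on chromosome Ca2.
acc="NC_021161.1"; qtl_start=1; qtl_end=15000

# Print the product accession of every feature of the given type
# that overlaps the interval (feature start <= QTL end and feature end >= QTL start).
overlapping() {
  awk -F'\t' -v f="$1" -v a="$acc" -v s="$qtl_start" -v e="$qtl_end" \
    '$1 == f && $7 == a && $8 <= e && $9 >= s { print $11 }' "$table"
}

overlapping mRNA > "$workdir/qtl_mrna_ids.txt"      # XM_ accessions
overlapping CDS  > "$workdir/qtl_protein_ids.txt"   # XP_ accessions

cat "$workdir/qtl_mrna_ids.txt" "$workdir/qtl_protein_ids.txt"
```

On the real file, the same awk filter would read from `zcat GCF_000331145.1_ASM33114v1_feature_table.txt.gz` instead of the toy table.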

ADD REPLY • link 2.8 years ago by GenoMax 151k

Thanks, GenoMax, for helping me out! I downloaded the feature table and grepped chromosome 2. Now, how should I pass the chickpea genome files to -db and a list of features/IDs to the -id option of efetch?

ADD REPLY • link 2.8 years ago by nagarsaggi ▴ 40

I managed to run efetch and could extract the gene sequences! I extracted the chromosome 2 features from the whole-genome feature table with grep and then passed the list of mRNA IDs (XM_...) with the -input option (for a list of IDs). Can I use other IDs, such as the locus ID (LOC) and protein IDs (XP_...), to extract the protein sequences? I tried efetch with the other IDs but it did not work.
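For anyone following along, the batch step described here can be sketched roughly as below. The file names are hypothetical, the single accession is taken from the feature table rows earlier in the thread, and the efetch call is guarded so the snippet still runs where Entrez Direct is not installed.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
ids="$workdir/chr2_mrna_ids.txt"   # hypothetical file name

# One accession per line; in practice this list comes from the feature table.
# The duplicate and the blank line mimic what a raw grep/cut can produce.
printf '%s\n' XM_004489033.3 '' XM_004489033.3 > "$ids"

# Drop blank lines and duplicates so each record is requested once.
grep -v '^$' "$ids" | sort -u > "$ids.clean"

# Fetch all mRNA records in one call; -input reads the identifiers from a file.
if command -v efetch >/dev/null 2>&1; then
  efetch -db nuccore -input "$ids.clean" -format fasta > "$workdir/chr2_mrna.fasta"
else
  echo "efetch not found; install Entrez Direct to run the download step" >&2
fi
```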

ADD REPLY • link 2.8 years ago by nagarsaggi ▴ 40

You need to use -db protein with XP_ IDs since those are proteins. I will take a look at LOC later today.
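For the XP_ accessions, the call looks the same apart from the database name. For LOC symbols, one workaround (offered as a sketch, not as the answer to the LOC question) is to skip querying by LOC and instead look up the matching XP_ accession in the CDS rows of the feature table. This assumes symbol is column 15 and product_accession is column 11, as in the rows shown earlier; the file names are hypothetical.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
table="$workdir/toy_feature_table.txt"   # hypothetical file name
xp_ids="$workdir/protein_ids.txt"

# Toy table holding the CDS row from earlier in the thread (tab-separated).
printf 'CDS\twith_protein\tGCF_000331145.1\tPrimary Assembly\tchromosome\tCa2\tNC_021161.1\t7490\t10342\t+\tXP_004489090.2\t\tXM_004489033.3\tuncharacterized protein LOC101492476\tLOC101492476\t101492476\n' > "$table"

# Map a LOC symbol to its protein accession(s) through the CDS rows.
symbol="LOC101492476"
awk -F'\t' -v sym="$symbol" '$1 == "CDS" && $15 == sym { print $11 }' "$table" \
  | sort -u > "$xp_ids"

cat "$xp_ids"

# Protein accessions need -db protein rather than -db nuccore.
if command -v efetch >/dev/null 2>&1; then
  efetch -db protein -input "$xp_ids" -format fasta > "$workdir/proteins.fasta"
else
  echo "efetch not found; install Entrez Direct to run the download step" >&2
fi
```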

ADD REPLY • link 2.8 years ago by GenoMax 151k