Extracting genes coded by a region of a chromosome
1
0
Entering edit mode
2.3 years ago
nagarsaggi ▴ 40

Hi there, I am trying to identify genes coded by a QTL region. I downloaded the reference genome (RefSeq) along with gff, gtf and cds files and used the sequence of the QTL flanking primers to locate the QTL region on chromosome2. Now I want to know the genes coded by this QTL region. I want to extract the full-length gene sequence coded by this QTL region from the genome CDS file. Any help would be appreciated!

gene sequnce extration • 1.4k views
ADD COMMENT
1
Entering edit mode
2.3 years ago

I would use a two-step procedure:

first use something like bedtools intersect to select the gene features that fall within your given interval (QTL). In a next step use that list of IDs to select CDSs from the CDS fasta file (eg with AGAT or with seqtk or grep or ... )

ADD COMMENT
0
Entering edit mode

I tried something similar but did not work. Since my region of interest is on chromosome2 so used 'grep' to extract genes coded by chromosome2 from the whole genome gff/gtf file and made a chromosome2 specific gff/gtf. Then I tried the chrmosome2 gff file to extra the coding sequence from the whole genome cds file but couldn't get the sequence of the genes extracted. Probably I might not be doing it the right way. Could you please elaborate a bit on your suggestion?

ADD REPLY
0
Entering edit mode

Can you provide a few gene ID's or accession numbers (if from RefSeq)?

ADD REPLY
0
Entering edit mode

I only have the QTL region which I extracted from a refseq chickpea genome. Now I want use other file associated with the chickpea genome which as gff. gtf and cds file to extracted gene codded by the QTL region.

ADD REPLY
1
Entering edit mode

Do you have GCF_000331145.1_ASM33114v1_genomic.gtf.gz file from RefSeq (LINK)? Then once you identify the region that overlaps you can use EntrezDirect to get the sequences. Here is one example:

$ efetch -db nuccore -id XM_004485346 -format fasta
>XM_004485346.3 PREDICTED: Cicer arietinum protein phosphatase 1 regulatory subunit INH3-like (LOC101488545), transcript variant X1, mRNA
CTTCTTGAAACCAAAAAATGACAAAATAAGAACAACAACATATCAAGACACACAAAGGCTAAAATAACAC
AACTTACACTAAGATAATATTGGACCTGCTCTTGAGTTCGACATGACATTAACTCCCGAGTTCCTGTGAA
ACTTGGATCGGTAGAAATGTTGTAATCTTGATAAACAACTTTTGTTTTTAAAGACTTGGCAAAACGCAAT

You can also get the feature table file for this genome (LINK)

$ zgrep NC_021161.1 GCF_000331145.1_ASM33114v1_feature_table.txt.gz

gene    protein_coding  GCF_000331145.1 Primary Assembly        chromosome      Ca2     NC_021161.1     7476    10839   +                                       LOC101492476    101492476               3364            
mRNA            GCF_000331145.1 Primary Assembly        chromosome      Ca2     NC_021161.1     7476    10839   +       XM_004489033.3          XP_004489090.2  uncharacterized LOC101492476    LOC101492476    101492476               3220    3220    
CDS     with_protein    GCF_000331145.1 Primary Assembly        chromosome      Ca2     NC_021161.1     7490    10342   +       XP_004489090.2          XM_004489033.3  uncharacterized protein LOC101492476    LOC101492476    101492476               2853    950     
gene    protein_coding  GCF_000331145.1 Primary Assembly        chromosome      Ca2     NC_021161.1     18008   19772   -                                       LOC101492809    101492809               1765  
ADD REPLY
0
Entering edit mode

thanks, GenoMax for helping me out! I could download the feature file and grep the chromosome2. Now, how should I pass the chickpea genome files in the -db and a list of features/ids in the -id options of the efetch.

ADD REPLY
0
Entering edit mode

I manage to run the 'efetch' and could extract the gene sequence! I extracted the feature of the chromosome2 from whole genome feature file with 'grep' and then used the list of mRNA ids (XM_..) with the option -input (for the list of ids). Can I use other ids such as locus_id (LOC) and protein ids (XP_..) to extract the protein sequence? I tried 'efetch' with the other ids but did not work.

ADD REPLY
1
Entering edit mode

You need to use -db protein with XP_ ids since those are proteins. Will take a look at LOC later today.

ADD REPLY

Login before adding your answer.

Traffic: 1160 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6