Editing Human Reference Genome by adding a CDS
7 months ago
LDT ▴ 340

I have a human genome assembly and I'd like to add a new CDS (coding sequence) as a small scaffold at the end of the genome. I then want to generate a new FASTA, GFF, and transcriptome file with this update.

I've tried using Geneious, but I have been struggling with the GFF formatting. I'm not sure that's the best approach, so I'm looking for guidance on the optimal way to achieve this.

Some key questions I have:

  • What is the best way to add a new CDS sequence as a scaffold to an existing human genome assembly?
  • How can I then regenerate the FASTA, GFF, and transcriptome files to incorporate this new scaffold?
  • Are there any particular tools or workflows you would recommend for this type of genome editing and file generation?

Any advice or suggestions would be greatly appreciated.

Thank you in advance for your help!

gff agat transcriptome • 651 views
7 months ago
Michael 55k

In principle, it shouldn't be a problem at all. Simply open your reference FASTA and GFF files in a text editor and append the new sequence. For the FASTA file, that is one extra record at the end:

>my_unplaced_scaffold
ACGT...

For the GFF file it is a bit more complicated, because you need to describe the complete gene structure (gene, transcript, exon, CDS). I won't go into detail because chances are you don't need to do it. For a single CDS, that would be four additional lines.
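As a rough illustration of those four lines, here is a minimal Python sketch. It assumes the simplest possible case: the CDS spans the whole added scaffold on the plus strand, the transcript is written with the feature type mRNA, and the IDs (my_gene, my_gene.t1), the source label "manual", and the length 900 are all made-up placeholders. Check the result against what your downstream tools expect before relying on it.

```python
def single_cds_gff3(seqid, length, gene_id="my_gene", source="manual"):
    """Return four GFF3 lines (gene, mRNA, exon, CDS) covering a whole scaffold.

    Assumptions (not from the original post): one exon, plus strand,
    coordinates 1..length, and hypothetical IDs derived from gene_id.
    """
    def row(feature_type, phase, attributes):
        # GFF3 columns: seqid, source, type, start, end, score, strand, phase, attributes
        return "\t".join(
            [seqid, source, feature_type, "1", str(length), ".", "+", phase, attributes]
        )

    tx_id = f"{gene_id}.t1"
    return [
        row("gene", ".", f"ID={gene_id};Name={gene_id}"),
        row("mRNA", ".", f"ID={tx_id};Parent={gene_id}"),
        row("exon", ".", f"ID={tx_id}.exon1;Parent={tx_id}"),
        # Phase 0: the CDS starts at the first base of a codon.
        row("CDS", "0", f"ID={tx_id}.cds;Parent={tx_id}"),
    ]


if __name__ == "__main__":
    # 900 is a placeholder; use the real length of the sequence you added.
    print("\n".join(single_cds_gff3("my_unplaced_scaffold", 900)))
```

The seqid in the first column has to match the FASTA header of the new record exactly (here my_unplaced_scaffold), otherwise tools will not be able to link the annotation to the sequence.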

The question is why, and whether, you should do that at all; I suggest it should be the absolute exception. Is the sequence really on an unplaced contig that is not part of the reference, or are you trying to avoid searching for it thoroughly? Did you search for the nucleotide sequence, and are you sure it is not in the genome? The human reference genome is pretty complete. Could it be part of an unplaced scaffold? If so, download the version of the reference that includes unplaced sequences and search there.
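A quick first pass at the "did you search for it" question can be done with an exact string search, sketched below in plain Python. This is only a sanity check under strong assumptions: it finds perfect, ungapped matches on either strand and nothing else, so a CDS that is split across introns or differs by a single base will be missed. The function names are made up for this example.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file, one record at a time."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Use the first word of the header as the sequence name.
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_exact(genome_fasta, query):
    """Return (seqid, 1-based start, strand) for every exact match of query."""
    query = query.upper()
    patterns = [("+", query)]
    rc = reverse_complement(query)
    if rc != query:  # avoid reporting palindromic queries twice
        patterns.append(("-", rc))
    hits = []
    for seqid, seq in read_fasta(genome_fasta):
        for strand, pattern in patterns:
            pos = seq.find(pattern)
            while pos != -1:
                hits.append((seqid, pos + 1, strand))
                pos = seq.find(pattern, pos + 1)
    return hits
```

Calling find_exact("reference.fasta", my_cds) returns an empty list if there is no perfect match; that alone does not prove the sequence is absent, so an aligner is still the more reliable check. Each chromosome is held in memory as one string, which is workable for a human genome but not especially fast.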

If the sequence is indeed in the reference but not annotated, use a tool like exonerate or gmap to derive its coordinates in the genome. Gmap will give you GFF output, which you can simply copy and paste into your reference annotation after some minor adaptation.

Thank you Michael. I need to add it, I think, to the genome. It's a plasmid that I've put into the cells, and I want to check its transcription together with the human genes. I need to create a genome.fasta, a GFF, and a transcriptome.fasta file to run Kallisto, since I cannot use multiple references in Kallisto. Am I right?

If it's a plasmid in a transfected cell, I would indeed add the whole plasmid sequence, including the insert. Most likely you received a sequence and annotation file for the construct from your provider; if not, I would request one. Once you have that file, export it to GFF and FASTA. Which tool to use depends on the format. FASTA and GFF files are text files, so you can simply use cat reference.fasta plasmid.fasta > ref_plasmid.fasta (same with GFF) and it will likely work.
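Plain cat can trip over two small things: a first file that lacks a trailing newline (the plasmid header then gets glued onto the last sequence line) and a repeated ##gff-version header in the middle of the merged GFF. A minimal Python sketch of the same concatenation that guards against both is shown here; the function names and file names are placeholders, and it assumes the GFF files contain no embedded ##FASTA section.

```python
def concat_text_files(paths, out_path, skip_prefixes=()):
    """Concatenate text files, making sure each one ends with a newline.

    Lines starting with any of skip_prefixes are dropped from every file
    except the first (used to avoid repeating the '##gff-version' header).
    """
    with open(out_path, "w") as out:
        for i, path in enumerate(paths):
            last_written = "\n"
            with open(path) as handle:
                for line in handle:
                    if i > 0 and line.startswith(tuple(skip_prefixes)):
                        continue
                    out.write(line)
                    last_written = line
            if not last_written.endswith("\n"):
                out.write("\n")


def fasta_ids(path):
    """Return the first word of every FASTA header line in the file."""
    with open(path) as handle:
        return [
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()
        ]


def check_no_duplicate_ids(reference_fasta, plasmid_fasta):
    """Raise ValueError if the plasmid reuses a sequence name from the reference."""
    clash = set(fasta_ids(reference_fasta)) & set(fasta_ids(plasmid_fasta))
    if clash:
        raise ValueError(f"duplicate sequence names: {sorted(clash)}")


if __name__ == "__main__":
    # File names are placeholders taken from the cat example above.
    check_no_duplicate_ids("reference.fasta", "plasmid.fasta")
    concat_text_files(["reference.fasta", "plasmid.fasta"], "ref_plasmid.fasta")
    concat_text_files(
        ["reference.gff", "plasmid.gff"],
        "ref_plasmid.gff",
        skip_prefixes=("##gff-version",),
    )
```

Since kallisto builds its index from a transcriptome FASTA, the same concatenation can be applied to the reference transcripts plus the plasmid transcript sequence(s), again making sure the names do not collide.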
