15 months ago
Arora
I wish to make a custom GTF file from a multi-line FASTA file that contains multiple transcripts, e.g.:
>NM_001282823.1 prolactin receptor (PRLR), mRNA
GCCAAGAGACTGGGAGTCAAAGAAAGTTTCTGAAATCAGTGGATTCTGCTTGAGAACAGAGCCTGGTTAT
>NM_001682822.1 SNAP25 (SNAP25), mRNA
GCCAAGAGACTGGGAGTCAAAGAAAGTTTCTGAAATCAGTGGATTCTGCTTGAGAACAGAGCCTGGTTAT
>NM_001287822.1 CACNA1F (CACNA1F), mRNA
GCCAAGAGACTGGGAGTCAAAGAAAGTTTCTGAAATCAGTGGATTCTGCTTGAGAACAGAGCCTGGTTAT
Is there a way to make a GTF file using the commands below, as described by 10x (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr#marker), so that the output contains an entry for every FASTA record rather than my adding them one by one?
cat NM_001282823.1 | grep -v "^>" | tr -d "\n" | wc -c
echo -e 'NM_001282823.1\tunknown\texon\t1\t922\t.\t+\t.\tgene_id "NM_001282823.1"; transcript_id "NM_001282823.1"; gene_name "NM_001282823.1"; gene_biotype "protein_coding";' > NM_001282823.1.gtf
You'll need to loop over the FASTA entries and write one GTF line per entry, based on its header and sequence length. Biopython might be useful here. 10x's method isn't meant for putting together an entire GTF the way you're doing right now, so that part is going to be on you.
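As a rough illustration of that loop, here is a minimal sketch in plain Python (standard library only; Biopython's `SeqIO.parse` could replace the hand-written parser). It assumes each FASTA record should become a single `exon` line spanning the whole sequence, with the accession reused as `gene_id`, `transcript_id` and `gene_name`, and `gene_biotype "protein_coding"`, exactly as in the `echo` command from the question. The function names `read_fasta`, `gtf_line` and `fasta_to_gtf` are made up for this example.

```python
def read_fasta(handle):
    """Yield (record_id, sequence_length) for each entry in a FASTA handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Record ID = first whitespace-delimited token of the header
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            # Sequence may be wrapped over several lines; sum them up
            length += len(line)
    if name is not None:
        yield name, length


def gtf_line(name, length):
    """Build one GTF exon line covering the full sequence (assumed layout)."""
    attrs = (
        f'gene_id "{name}"; transcript_id "{name}"; '
        f'gene_name "{name}"; gene_biotype "protein_coding";'
    )
    cols = [name, "unknown", "exon", "1", str(length), ".", "+", ".", attrs]
    return "\t".join(cols)


def fasta_to_gtf(fasta_path, gtf_path):
    """Write one GTF line per FASTA record into a single output file."""
    with open(fasta_path) as fin, open(gtf_path, "w") as fout:
        for name, length in read_fasta(fin):
            fout.write(gtf_line(name, length) + "\n")
```

Calling something like `fasta_to_gtf("transcripts.fa", "transcripts.gtf")` (both file names are placeholders) would then produce one combined GTF. Whether a single full-length exon per transcript is the right model for your reference is something to check against the 10x documentation.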