Hello , I have to do RNAseq analysis of human cancer cell lines , for that I need to index human genome , as a refrence genome. I index the human genome gff file from thr NCBI.. during some lecture I have heard that ncbi human genome file has some patches etc that needs to be addressed before alignment but I didnt know how . I did indexing using the following code:
STAR --runThreadN 12 --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles $ref --sjdbGTFfile $gene --sjdbOverhang 149
the variables hold the path to the particular files where the ref is the .fna file , $gene hold the path of gtf file
The problem is basically , it got executed and i did feature count etc using this file. While doing the down analysis , when I read this index file i got the warning :
Warning message:
In readGFF(filepath, version = version, filter = filter) :
the value part of some of the tag value pairs contains embedded double-quotes
now when I am plotting the MDS plots , there are some gene id against which its not the it Gene Symbols. So I am getting confused and I think my indexing has not been done properly. and the gene ids is GeneID:124904915 and GeneID:124904544 like the first few numbers are same and they refer to a gene as well when I searched GeneID:12490
I am new to bioinformatics and I have spent quite alot of time on this can some body plz guide me through this
Can you link to the assembly and annotation files you are using? Also, you need to provide the code you ran from the point of indexing, through alignment, and to the point where you got the error in R.
I have added the codes as you said . I think I would need help in how to do indexing of ncbi human genome file ,i read this somewhere: STAR requires that patches and alternative haplotypes must be removed.. and not sure how to do it . Because to my understanding the problem is originating form the index file ..I may be wrong
I have used the NCBI gff.fna file from this link [https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/][1] And i used the indexing code which is mentioned in the msg above
STAR --runThreadN 12 --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles $ref --sjdbGTFfile $gene --sjdbOverhang 14
then i did alignment using this code
` for file in "${files[@]}" do
done
echo "All done!"
`
then
i did feature count using the following code
` files=($(ls *.bam)) module load Subread
for file in "${files[@]}" do
Following is the code that I used for generating MDS plots
` for(i in 1:3){ Cl <- colnames(summary(decideTests(fit2)))[i]
glMDPlot(fit2,coef=i, counts=logCPM.filt, anno=d.filt$genes[,2:3], groups=group, samples=targets$Label, transform=F, status=decideTests(fit2), main=paste0("MD plot: ", "EP Treated Parental vs Treated Replicate",Cl), sample.cols = targets$col, html=Cl ,folder="Treated", launch=TRUE) } `
sessionInfo()`
Can you please remove all unnecessary code and format the rest properly? The question here is about STAR and reading something it returns, all the rest like DE testing etc is irrelevant and makes the post messy.