Question

Query in indexing human genome

0

Entering edit mode

19 months ago

mavy ▴ 10

Hello , I have to do RNAseq analysis of human cancer cell lines , for that I need to index human genome , as a refrence genome. I index the human genome gff file from thr NCBI.. during some lecture I have heard that ncbi human genome file has some patches etc that needs to be addressed before alignment but I didnt know how . I did indexing using the following code:

STAR --runThreadN 12 --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles $ref --sjdbGTFfile $gene --sjdbOverhang 149

the variables hold the path to the particular files where the ref is the .fna file , $gene hold the path of gtf file

The problem is basically , it got executed and i did feature count etc using this file. While doing the down analysis , when I read this index file i got the warning :

Warning message:
In readGFF(filepath, version = version, filter = filter) :
  the value part of some of the tag value pairs contains embedded double-quotes

now when I am plotting the MDS plots , there are some gene id against which its not the it Gene Symbols. So I am getting confused and I think my indexing has not been done properly. and the gene ids is GeneID:124904915 and GeneID:124904544 like the first few numbers are same and they refer to a gene as well when I searched GeneID:12490

I am new to bioinformatics and I have spent quite alot of time on this can some body plz guide me through this

human indexing STAR • 1.0k views

ADD COMMENT • link 19 months ago by mavy ▴ 10

0

Entering edit mode

Can you link to the assembly and annotation files you are using? Also, you need to provide the code you ran from the point of indexing, through alignment, and to the point where you got the error in R.

ADD REPLY • link 19 months ago by rpolicastro 13k

0

Entering edit mode

I have added the codes as you said . I think I would need help in how to do indexing of ncbi human genome file ,i read this somewhere: STAR requires that patches and alternative haplotypes must be removed.. and not sure how to do it . Because to my understanding the problem is originating form the index file ..I may be wrong

ADD REPLY • link 19 months ago by mavy ▴ 10

0

Entering edit mode

I have used the NCBI gff.fna file from this link [https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/][1] And i used the indexing code which is mentioned in the msg above

STAR --runThreadN 12 --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles $ref --sjdbGTFfile $gene --sjdbOverhang 14

then i did alignment using this code

` for file in "${files[@]}" do

    # extract filename and launch job
    input=`echo "$file" | cut -d '.' -f1`
    filename="$input"

    STAR --genomeDir /home/labs/........../FASTQ/ep/GCF_GRCh38/ \
         --outTmpDir /scratch/${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID} \
         --quantMode GeneCounts \
         --sjdbGTFfile $genes \
         --sjdbOverhang 149 \
         --outSAMtype BAM SortedByCoordinate \
         --readFilesIn /home/labs/......................../${filename}.fastq \
         --runThreadN 12 \
         --outStd BAM_SortedByCoordinate > ${filename}.trim.bam

done

echo "All done!"

`

then

i did feature count using the following code

` files=($(ls *.bam)) module load Subread

for file in "${files[@]}" do

    # extract filename and launch job
    input=`echo "$file" | cut -d '.' -f1`

    filename=$input

    /home/....../Subread/1.5.2-IGB-gcc-4.9.4/bin/featureCounts -T 4 -s 0  -t exon -g gene_id \
            -a /home/labs/................................./data/reference/GCF_000001405.40_GRCh38.p14_genomic.gtf \
            -o /home/labs/.......................featurecount_results/${filename}_featureCounts.txt \
               /home/labs/........................../GRCh38/${filename}.trim.bam

Following is the code that I used for generating MDS plots

` for(i in 1:3){ Cl <- colnames(summary(decideTests(fit2)))[i]

glMDPlot(fit2,coef=i, counts=logCPM.filt, anno=d.filt$genes[,2:3], groups=group, samples=targets$Label, transform=F, status=decideTests(fit2), main=paste0("MD plot: ", "EP Treated Parental vs Treated Replicate",Cl), sample.cols = targets$col, html=Cl ,folder="Treated", launch=TRUE) } `

sessionInfo()`

ADD REPLY • link 19 months ago by mavy ▴ 10

0

Entering edit mode

Can you please remove all unnecessary code and format the rest properly? The question here is about STAR and reading something it returns, all the rest like DE testing etc is irrelevant and makes the post messy.

ADD REPLY • link 19 months ago by ATpoint 85k