Hello Rob, I'd be super thankful if you could help me understand what's going on with my data because I'm going nuts.
I have metagenomic and metatranscriptomic data extracted from the same samples from the sea.
I want to do differential expression analysis on the metagenomes, from a mix of organisms, at different times.
My plan was to align the RNA reads to the DNA assembly, then run salmon quant, tximport and finally DESeq2. I'm so stuck and I hope you can help me get unstuck!
So what I've done so far: I did QC and assembly with the metaG data followed by annotation with MetaEuk, so I have a DNA.fasta file and a DNA.gff file. I used gffread to convert DNA.gff to DNA.gtf file. The paired metaT data I just QC'ed and merged.
I then used STAR to align the RNA.fastq to the DNA.fa using the DNA.gtf as reference, with these parameters (that I got from https://rdrr.io/bioc/BANDITS/f/README.md):
STAR --runMode genomeGenerate --runThreadN 6 --limitGenomeGenerateRAM 458444417974 --genomeSAindexNbases 13 --genomeDir path/genomeDir --genomeFastaFiles DNA.fa --sjdbGTFfile DNA.gtf
STAR --genomeDir path/genomeDir --runThreadN 6 --readFilesIn RNA.fastq --outFileNamePrefix path/STAR_Results/D1_R1_aln --outSAMtype BAM SortedByCoordinate --outSAMattributes Standard --quantMode TranscriptomeSAM
Then (not sure what this does exactly) again used gffread to "build a reference transcriptome (fasta format) compatible with the DNA fasta and gtf files used for STAR":
gffread -w cDNA.fa -g DNA.fa DNA.gtf
Should this be run with RNA.fastq instead? I tried that and it didn't work.
Then when I try to do salmon quant with the alignment:
salmon quant -t cDNA.fa -l A -a STAR_Results/D1_R1_alnAligned.sortedByCoord.out.bam -o salmon_quant -p 4 --dumpEq
It gives me the different lengths error mentioned above:
SAM file says target k75_658346 has length 629, but the FASTA file contains a sequence of length [532 or 531]
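One way to see how widespread the mismatch is would be to compare the lengths declared in the BAM header with the lengths of the sequences in cDNA.fa. Below is a minimal sketch in plain Python. It assumes the header has been dumped to text first (for example with `samtools view -H`) and that the FASTA fits in memory. The function names and the toy input are made up for illustration; only the target name and the two lengths come from the error message above.

```python
def sam_header_lengths(header_text):
    """Map reference name -> declared length, read from '@SQ' header lines."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields look like 'SN:name' and 'LN:123', separated by tabs.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def fasta_lengths(fasta_text):
    """Map sequence name (first word after '>') -> sequence length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def length_mismatches(sam_lengths, fa_lengths):
    """Return {name: (sam_length, fasta_length)} for names whose lengths differ."""
    return {
        name: (sam_len, fa_lengths[name])
        for name, sam_len in sam_lengths.items()
        if name in fa_lengths and fa_lengths[name] != sam_len
    }


# Toy input (hypothetical, not real data): one mismatching and one matching target.
toy_header = "@HD\tVN:1.4\n@SQ\tSN:k75_658346\tLN:629\n@SQ\tSN:k75_1\tLN:10\n"
toy_fasta = ">k75_658346 some description\n" + "A" * 532 + "\n>k75_1\nACGTACGTAC\n"

mismatches = length_mismatches(sam_header_lengths(toy_header), fasta_lengths(toy_fasta))
print(mismatches)  # {'k75_658346': (629, 532)}
```

If only a handful of targets disagree, the affected records in the GTF are the place to look. If nearly all of them do, the BAM and the FASTA were probably built from different feature sets.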
I think Bowtie 2 won't let me use a gtf reference, right? I thought aligning with gtf was better.
Btw I tried STAR without gtf and it also didn't work.
As for the salmon mapping-based mode, you need to build a decoy-aware transcriptome index (https://salmon.readthedocs.io/en/latest/salmon.html). I don't know how to do that here, because it requires a genome and a transcriptome, and I think it's meant for ONE organism's genome/transcriptome?
Would Stringtie or Scallop work for what I'm trying to do? I can't remember why anymore but when selecting what programmes to use, STAR and then salmon quant seemed like the best option.
As you can see I'm very lost and running out of ideas. Any explanation / more things to try would be highly appreciated!!
E
Thanks for the reply.
The portion of the GTF file containing transcript_id ENST00000618686.1 looks like this:
GL000009.2 ENSEMBL transcript 56140 58376 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic";
GL000009.2 ENSEMBL exon 56140 58376 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic";
GL000009.2 ENSEMBL CDS 58084 58308 . - 0 gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic";
GL000009.2 ENSEMBL start_codon 58306 58308 . - 0 gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic";
GL000009.2 ENSEMBL stop_codon 58081 58083 . - 0 gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic";
GL000009.2 ENSEMBL UTR 56140 58083 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic";
GL000009.2 ENSEMBL UTR 58309 58376 . - . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "BX004987.1"; transcript_type "protein_coding"; transcript_name "BX004987.1-201"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic";
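For reference, the length a GTF implies for this transcript can be checked by summing its exon features, which is, as far as I understand, what tools generally use when they build transcript sequences from an annotation. The sketch below is an illustration only: the function name is made up, the attribute column is shortened to the fields it needs, and it splits on any whitespace so it works on the records as pasted here (a real GTF is tab-separated, so a source column containing spaces would break it).

```python
import re
from collections import defaultdict


def exon_lengths_by_transcript(gtf_text):
    """Sum (end - start + 1) over 'exon' features, grouped by transcript_id."""
    totals = defaultdict(int)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Eight leading columns, then the attribute string as a single field.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        start, end = int(fields[3]), int(fields[4])
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive, hence the +1.
            totals[match.group(1)] += end - start + 1
    return dict(totals)


# The transcript and exon records quoted above, with attributes shortened.
records = (
    'GL000009.2 ENSEMBL transcript 56140 58376 . - . '
    'gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1";\n'
    'GL000009.2 ENSEMBL exon 56140 58376 . - . '
    'gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; exon_number 1;\n'
)

exon_totals = exon_lengths_by_transcript(records)
print(exon_totals)  # {'ENST00000618686.1': 2237}
```

So the single exon gives 58376 - 56140 + 1 = 2237 bases for this transcript, and that is the figure to compare against whatever length the quantification tool reports.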