gene sequence fetching from bam or fastq file of rna seq data
1
1
Entering edit mode
5.8 years ago

I have 6 wheat samples sequenced using RNA-seq. I received forward and reverse fastq files and I generated bam files by using hisat2 tool which are aligned with the reference wheat genome. I have been asked to build multiple sequence alignment for 3 genes from this sequenced rna seq data. I believe I need to select a gene sequence from all the samples and do a multiple sequence alignment. But I am struck in fetching the gene sequence from bam files. How do I select one gene sequence for all the 6 samples? Any suggestions? kidly send me commands for fetching the sequence of genes from this RNA seq data in fastq file or aligned.bam file.

rna-seq gene genome alignment sequence • 2.7k views
ADD COMMENT
0
Entering edit mode

are you looking for isoforms? read about transcript assembly with stringtie, or isoform detection, once detected or assembled you can extract the sequences and align them.

ADD REPLY
0
Entering edit mode

Hi buffo,

I have run the strigtie and got the assembled transcripts.next i want to fetch the transcripts sequences from all six sample .how can I do?

ADD REPLY
0
Entering edit mode

you can use gff_read to do that

ADD REPLY
0
Entering edit mode

Hi Buffo,

        I have run the stringtie tool and five files have been generated.

1 GTF file containing the assembled transcripts 2 Gene abundances in tab-delimited format 3 Fully covered transcripts that match the reference annotation, in GTF format 4 Files required as input to Ballgown 5 a merged GTF file from a set of GTF files. I did not use the gff_read tool. I have read the manual of it but did not understand. Can you please set the command for me. The command given in manual is gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

In this command i did not understand which genomic file and transcripts.fa file I have to use.

ADD REPLY
0
Entering edit mode

Hi Buffo,

I have run the stringtie tool and five files have been generated.

1 GTF file containing the assembled transcripts

2 Gene abundances in tab-delimited format

3 Fully covered transcripts that match the reference annotation, in GTF format

4 Files required as input to Ballgown

5 a merged GTF file from a set of GTF files.

I did not use the gff_read tool. I have read the manual of it but did not understand. Can you please set the command for me. The command given in manual is

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

In this command i did not understand which genomic file and transcripts.fa file I have to use.

ADD REPLY
0
Entering edit mode

The -w option refers to the output file name, -g needs the genome reference (you use it to map your reads) and the file transcripts.gtf is the option 1 of the above list (1 GTF file containing the assembled transcripts).

ADD REPLY
0
Entering edit mode

Hi buffo, I have genrated the sequence file (transcript.fa). this file contain sequence ids like

STRG.14.145 gene =STRG.14

Here i have write the id of first sequence in file. My gene of interest id is TraesCS3D02G273600. I want to do extract this id sequence from the file i have generated (transcript.fa) .Kindly tell me how can i extract the sequence of transcript of gene of interest. I

ADD REPLY
0
Entering edit mode

Hi buffo , please reply me

ADD REPLY
0
Entering edit mode

If you are looking for a single gene, type in the shell prompt:

grep -A 1 'gene_id' transcript.fa

It looks for the specific gene id in the transcript.fa file and shows the first line after the gene id, which is the sequence of the transcript.

ADD REPLY
0
Entering edit mode

Hi buffo , this command grep -A 1 'gene_id' transcript.fa display only one line of sequence , i want to extract the full length of gene sequence

ADD REPLY
0
Entering edit mode

Hi Buffo, Thank you.I have the gene sequences now.but these are the same sequence as present in database. After sequence alignment i did not any variation in six samples of target gene. why it is same?

ADD REPLY
0
Entering edit mode

load your bam files in IGV and check if that gene has SNPs among samples. That sequence is the same because the genome reference is exactly the same.

ADD REPLY
0
Entering edit mode
5.7 years ago

Hi Buffo, Thanks, I have run the srtigtie.got this error. kindly check this error.i did not get it and solve it .Kindly set this commandaccording to my sample.

./stringtie G1_sorted.bam -B -o G1.gtf -G Triticum_aestivum.IWGSC.42.gtf -p 4 -C G1.refs.gtf -A G1.abund.tab -WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences

ADD COMMENT
0
Entering edit mode

If you use -c

-C output a file with reference transcripts that are covered by reads

And gets

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!

It literaly means that you reads have mapped to regions without annotated transcripts. Be sure you are using the correct annotation. column 3 = Transcript

ADD REPLY
0
Entering edit mode

Now I run this command without the -c option .again got the same error. suggest me a solution .

./stringtie G1_sorted.bam -G Triticum_aestivum.IWGSC.42.gtf -l G1-Label -o G1_ST.gtf -p 15 WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences.

ADD REPLY
0
Entering edit mode

here is gtf file

!genome-build IWGSC

!genome-version IWGSC

!genome-date 2018-07

!genome-build-accession GCA_900519105.1

3B IWGSC gene 212892 214491 . - . gene_id "TraesCS3B02G000100"; gene_source "IWGSC"; gene_biotype "protein_coding"; 3B IWGSC transcript 212892 214491 . - . gene_id "TraesCS3B02G000100"; transcript_id "TraesCS3B02G000100.1"; gene_source "IWGSC"; gene_biotype "protein_coding"; transcript_source "IWGSC"; transcript_biotype "protein_coding";

ADD REPLY
0
Entering edit mode

Before to perform any bioinformatic analysis I would recommend you:

Be sure what are you looking for 
It is possible to get by in silico analysis?
what tools are available to do it? and how it works?
Do I need professional help?

Sorry but it looks like you do not have idea what are you doing, read about gtf file format. If you perform Stringtie analysis it searchsfor transcripts (column 3), and therefore, if you do not have annotated transcripts or your reads maps to not annotated regions you will get a warning like....

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences.
ADD REPLY

Login before adding your answer.

Traffic: 1561 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6