How to process reverse sequences?
9.3 years ago
fr ▴ 220

Hello!

I'm processing a set of GFF and GenBank files to extract some intergenic sequences. Some are on the forward strand, others on the reverse strand. Should the reverse-strand sequences be inverted after extraction, or handled some other way? How should I process them so that I can then extract equivalent subsequences from the leading strand and the reverse strand?

If they were only coding sequences, what would you need to do?

Note: my question is not so much about how to do something; I just don't know what kind of approach is usually taken with these sequences.

Thanks

genome sequence • 3.0k views
9.3 years ago
Lesley Sitter ▴ 610

Hi r.b.,

As far as I know, genes on the reverse strand can be considered as running downstream in the opposite direction, 3' <----- 5', relative to the reference:

Ref:                         ATGCCCTAGTAACCGGATCCCGTA
upstream gene (ATGCCCTAG)    ATGCCCTAG ->
downstream gene (ATGCCCTAG)              <- GATCCCGTA

So then it kinda depends on what you want to do with the intergenic sequences.

If you just want the sequences, you can take Ref[1-9] for gene 1 and Ref[24-16] for gene 2. Extract those from your strands and you should be done. If you want, you can add a flag to the header that tells you the gene is on the reverse strand, or you can flip the sequence so that it starts with ATG again.

When I did something like this, I just flipped them before writing them to a FASTA file. As long as you keep the same names, you can always find out what the original orientation of the gene was.
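For illustration, here is a minimal Python sketch of the toy example above: it pulls both genes by 1-based coordinates, flips the second one, and records the strand in the FASTA header. The names `extract`, `fasta_record` and the `strand=` header field are made up for this sketch. Note that it follows the toy reference literally, so "flip" here is a plain string reversal; on real double-stranded DNA you would also complement the bases, as pointed out in the other answer.

```python
# Toy reference from the example above (24 bases).
REF = "ATGCCCTAGTAACCGGATCCCGTA"


def extract(ref, start, end):
    """Return ref[start..end] using 1-based, inclusive coordinates."""
    return ref[start - 1:end]


def fasta_record(name, seq, strand):
    """Format one FASTA record; the strand flag in the header is a
    hypothetical convention, not a standard field."""
    return f">{name} strand={strand}\n{seq}\n"


gene1 = extract(REF, 1, 9)             # forward gene, used as-is
gene2_on_ref = extract(REF, 16, 24)    # reverse gene as it appears on Ref
gene2_flipped = gene2_on_ref[::-1]     # plain reversal, as in the toy example

records = fasta_record("gene1", gene1, "+") + fasta_record("gene2", gene2_flipped, "-")
print(records, end="")
```

Keeping the strand in the header means the original orientation can be recovered later even though both records start with ATG.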

9.3 years ago
Michael 55k

You need to reverse complement any sequence that lies on the negative strand with respect to the reference genome when you extract it by genomic coordinates (e.g. using a GFF file to extract sequence from a FASTA file). This doesn't depend on what the region is annotated as. However, coding sequences (and sometimes transcript sequences) are normally provided already in the 'correct' orientation ('spliced and ready to translate'), so they do not contain intergenic or intronic sequence.
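A minimal sketch of that rule in plain Python, assuming GFF-style coordinates (1-based, inclusive, start <= end) and a strand column of "+" or "-". The function names and the toy genome are assumptions for illustration only; in practice a library such as Biopython or a tool such as bedtools would usually do this for you.

```python
# Translation table for complementing DNA; N is left unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(genome, start, end, strand):
    """Extract a region by 1-based inclusive coordinates and orient it
    according to its strand ('+' or '-')."""
    region = genome[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


# Toy genome: a forward gene, a 6-base spacer, and the same gene
# encoded on the negative strand.
genome = "ATGCCCTAG" + "TAACCG" + "CTAGGGCAT"

forward_gene = extract_feature(genome, 1, 9, "+")
reverse_gene = extract_feature(genome, 16, 24, "-")
spacer_minus = extract_feature(genome, 10, 15, "-")

print(forward_gene, reverse_gene, spacer_minus)
```

The same function is applied to the gene and to the intergenic spacer, which reflects the point that the orientation step does not depend on the annotation type.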

A lot of tools search the reverse complement in addition to the input by default, so you don't need to worry about this. For sequence similarity searches using e.g. blastn, the reverse complement should always be included.
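To make "searching both strands" concrete, here is a small exact-match sketch in Python. It is only an illustration of the idea and not how blastn or any other tool works internally; `find_both_strands` and the toy sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_both_strands(seq, motif):
    """Return (1-based start, strand) for every exact occurrence of the
    motif on the given strand or on its reverse complement."""
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        pos = seq.find(pattern)
        while pos != -1:
            hits.append((pos + 1, strand))
            pos = seq.find(pattern, pos + 1)
    return sorted(hits)


# Toy sequence with the motif once on each strand.
seq = "ATGCCCTAG" + "TAACCG" + "CTAGGGCAT"
hits = find_both_strands(seq, "ATGCCCTAG")
print(hits)
```

A motif that equals its own reverse complement would be reported once per strand at the same position, so a real implementation would probably want to deduplicate those hits.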
