random sampling genome.fa
1
0
Entering edit mode
5.6 years ago
Chironex ▴ 50

I have a genome of about 2 gb composed by scaffolds I would random sample the genome.

I used reformat.sh but the output was only a scaffold. I need 1/3 of the total genome... I report only some scaffolds as example:

>LGKD01000001.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold4_contig_1, whole genome shotgun sequence
GAACAGCATGAATGTTAAAACtgaaatggatgatgatgatgatgatgatgatgatgatggcagcaacAGCCatgattatatttaatatgttgttagttataatcataataatgatgataatgttgataacaaTAATGGTTGCAATAATG
>KQ415657.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 unplaced genomic scaffold Scaffold5, whole genome shotgun sequence
tatatatatatagtcaattcgagGATGTTAGATCGACAATGGGGATTATAGAATCCCACAAAAAATTCCACTGGT
>LGKD01000032.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold12_contig_1, whole genome shotgun sequence
GAAGTGGTAAAGAGTgcgatgcgctgaaaaaagagagaacagtacttgaaatGTGGTTTCATTCTagtagtaaat
>LGKD01000033.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold16_contig_1, whole genome shotgun sequence
ctgaTCAACAGAatagggccaatcattcttcatgacaatgctcgaccacacgttttaCTAATGA
>LGKD01000034.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold22_contig_1, whole genome shotgun sequence
TTATCTATATACGagaatattatctatatataaaggaataccaaaaaaacaagaacaacgggtcattcggaattttcttt

There is a script able to do this?

genome fasta • 2.3k views
ADD COMMENT
0
Entering edit mode

You would need to collapse the entire genome sequence into one long fasta and then sample if you truly need 1/3 of total.

Edit: Can you provide exact command line you used?

ADD REPLY
0
Entering edit mode
reformat.sh in=my file.fa ou=newfile.fa samplerate=0.3
ADD REPLY
0
Entering edit mode
5.6 years ago

Something like this would do:

$ NSAMPLE=2 # Number of sequences you want
$ cat genome.fa  | tr '\n' '\t' | tr '>' '\n' | grep -v "^$" | sort -R | head -n$NSAMPLE | sed -e 's|^|>|g'  | tr '\t' '\n' | grep -v "^$" 
>LGKD01000001.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold4_contig_1, whole genome shotgun sequence
GAACAGCATGAATGTTAAAACtgaaatggatgatgatgatgatgatgatgatgatgatggcagcaacAGCCatgattatatttaatatgttgttagttataatcataataatgatgataatgttgataacaaTAATGGTTGCAATAATG
>LGKD01000034.1 Octopus bimaculoides isolate UCB-OBI-ISO-001 Scaffold22_contig_1, whole genome shotgun sequence
TTATCTATATACGagaatattatctatatataaaggaataccaaaaaaacaagaacaacgggtcattcggaattttcttt

The idea is to convert every sequence as one line including the header and the sequence itself, separated by tabs. sort -R gives you a random order, and head -n returns a certain number of sequences (i.e. samples since your input is shuffled) and the rest just converts it back to the format it had before (convert tabs back to new lines.) You can figure out the NSAMPLES to get 1/3 of the total by doing grep -c ">" genome.fa and dividing that number by 3.

ADD COMMENT
0
Entering edit mode

I have a feeling OP is not looking to get a third of fasta sequence entries but rather a third of the actual sequence. It is kind of an odd request if that is true. Perhaps OP will clarify.

ADD REPLY
0
Entering edit mode

Oh interesting. Possible solution for something like that then:

# Randomly get 1/3 of each non-header sequence
awk ' BEGIN{ srand(); }; /^[^>]/ {third=(length($1)/3); offset=(rand() * (third * 2) ); $1=substr($1,offset, offset + third) } 1' genome.fa
ADD REPLY
0
Entering edit mode

Yes, I need only a part of the fasta file. Infact I'm using RepeatScout to find ripetitive elemnts, but the genome is 2 gb and it doesn't work with such a big file. So, the idea is to obtain only a part of the genome because ripetitive sequences are interspersed in the genome

ADD REPLY
0
Entering edit mode

Yes, I need only a part of the fasta file.

I don't think that is right. You seem to have a multi-fasta file so you need as many entries as needed to make roughly third of total (e.g. if you had 15 fasta entries then you would make 5 x 3 files). That is going to be complicated by fact that the sequences are not of equal length, correct?

Why not approximately cut the file into three (or how many ever) pieces?

ADD REPLY
0
Entering edit mode

Hi genomax, I'm the user fripeki but I could not post anymore with that account (reaching max limit of 5 post for day). I Appreciated your idea. Can you suggest me a command to cut the file into 3 approximately equal parts? Sorry I'm learning bioinformatic so I don't know all the solutions

ADD REPLY
1
Entering edit mode

faSplit utility from Jim Kent will also work (linux version linked, do chmod a+x faSplit after downloading). This will be a clean solution compared to split but it may not make the files exactly equal in size (depending on what you have in your files). Change number 3 to desired pieces.

$ ./faSplit sequence genome.fa 3 splt_

this will generate files

splt_0.fa
splt_1.fa
splt_2.fa
ADD REPLY
0
Entering edit mode

ok! thank you very much for your help!

ADD REPLY
0
Entering edit mode

You can use the unix command split to split the files. Adjust the number 3 to suit your purpose.

# Following will split the file into three equal pieces. Split files will get names with prefix (spl_)

$ split -n 3 genome.fa spl_

# this will result in three equal size pieces

spl_aa 
spl_ab
spl_ac

This will most likely split the file in a way that the second and third files will not start with a valid fasta header. You can look at the file 2 and 3 by

$ head -3 splt_ab

to confirm this.

You can insert a valid fasta header by doing the following for file 2 (and 3 etc)

$ sed -i '1 i\>begin_file2' spl_ab
$ sed -i '1 i\>begin_file3' spl_ac
ADD REPLY

Login before adding your answer.

Traffic: 1654 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6