Tweaking my answer from Genomax's linked thread:
#!/bin/bash
while read line ; do
if [ ${line:0:1} == ">" ] ; then
filename=$(echo "$line" | cut -d ":" -f1 | tr -d ">")
touch ./"$filename".fasta
echo "$line" >> ./"${filename}".fasta
else
echo "$line" >> ./"${filename}".fasta
fi
done < $1
Input:
$ cat test.fa
>chr1:3054106-3054192
ATTCAGCTGAGAAAAATGCAGACCATAAAAAAGCAGCAGGCACTGCTCGAAACTAGTAGCAATGTAGACGAGATGATAGTACTTAA
>chr1:3054204-3054296
TGAAGTATCTGAAGACTTGACGACATGGGAAGACCCACTACTAAATGAAGGAAGTCTGAGGAAAAACAAATCTTCAGCACGCTGGAGAAAGT
>chr1:3054311-3054381
ATGAAAACATAGATGCTGTGTCCTGGGAAAAAAGGAGGAAAATAACAAAGCTGCCAAAATATCTCGTGAG
>chr1:3054409-3054509
AGAACTAACTGTACTGAGAAAAGAAAATTCCGCTTTAAAAGCTGAACTGTTTTTGCTAAAATTAAAGTTTGGTTTAATTAGCTCCATAGCATATGCTCAG
>chrX:119973608-119973699
AAGCATCACTCCGAGAATAAGGGCCTAGACAAAGTGGTGGAGACTCAAGCCCAAGTGGATGAACTGAAAGGAATCATGGTCAGAAACATAG
>chrX:119980524-119980592
ATCTTGTAGCTCAGCGTGGAGAAAGATTGGAATTATTGATAGACAAAACAGAAAACCTCGTGGATTCA
>chrX:119989360-119989453
TCTGTCACCTTCAAAACCACCAGCAGAAACCTTGCCCGAGCCATGTGTATGAAGAATCTCAAGCTCACTATTATCATCATCATTGTATCAATT
>chrX:119993756-119993825
GTGTTCATCTATATCATTGTCTCTCCTGCCTGTGGTGGATTTACATGGCCAAGCTGTATAAAGAAATGA
This creates a new file in the current dir for every sequence which has >chr#
at the start:
$ bash splitfasta.sh test.fa
$ ls chr*
chr1.fasta chrX.fasta
Then,
$ cat chr1.fasta
yields:
>chr1:3054106-3054192
ATTCAGCTGAGAAAAATGCAGACCATAAAAAAGCAGCAGGCACTGCTCGAAACTAGTAGCAATGTAGACGAGATGATAGTACTTAA
>chr1:3054204-3054296
TGAAGTATCTGAAGACTTGACGACATGGGAAGACCCACTACTAAATGAAGGAAGTCTGAGGAAAAACAAATCTTCAGCACGCTGGAGAAAGT
>chr1:3054311-3054381
ATGAAAACATAGATGCTGTGTCCTGGGAAAAAAGGAGGAAAATAACAAAGCTGCCAAAATATCTCGTGAG
>chr1:3054409-3054509
AGAACTAACTGTACTGAGAAAAGAAAATTCCGCTTTAAAAGCTGAACTGTTTTTGCTAAAATTAAAGTTTGGTTTAATTAGCTCCATAGCATATGCTCAG
and $ cat chrX.fasta
yields:
>chrX:119973608-119973699
AAGCATCACTCCGAGAATAAGGGCCTAGACAAAGTGGTGGAGACTCAAGCCCAAGTGGATGAACTGAAAGGAATCATGGTCAGAAACATAG
>chrX:119980524-119980592
ATCTTGTAGCTCAGCGTGGAGAAAGATTGGAATTATTGATAGACAAAACAGAAAACCTCGTGGATTCA
>chrX:119989360-119989453
TCTGTCACCTTCAAAACCACCAGCAGAAACCTTGCCCGAGCCATGTGTATGAAGAATCTCAAGCTCACTATTATCATCATCATTGTATCAATT
>chrX:119993756-119993825
GTGTTCATCTATATCATTGTCTCTCCTGCCTGTGGTGGATTTACATGGCCAAGCTGTATAAAGAAATGA
Disclaimer:
This approach requires linearised sequences (it looks like you already have this, so no need to worry, but incase people in future try to use this code). Linearisation can be achieved with the following code:
# Usage: bash scriptname.sh wrapped.fasta > linear.fasta
while read line; do
if [ "${line:0:1}" == ">" ]; then
echo -e "\n"$line;
else
echo $line | tr -d '\n';
fi;
done < $1 | tail -n+2
Or as a one-liner (replace infile.fa and outfile.fa to suit)
cat infile.fasta | while read line ; do if [ "${line:0:1}" == ">" ]; then echo -e "\n"$line ; else echo $line | tr -d '\n' ; fi ; done | tail -n+2 > outfile.fa
Split the multiple sequences file into a separate files and plenty of other solutions.
I would personally recommend
faSplit
from Jim Kent's (UCSC) utils. Unlikely that there is a more performant solution.Edit: On a closer read you want to split into chunks at chromosome level. Take a look at thread above to see if you can modify one of the answers. We will look into this some more.
What have you tried? This is something that could be done in biopython.
I don't have much programming knowledge, quite new to NGS data analysis.
I tried it
It splits each contig in to a separate file. (I don't want this) can It group all sequences starting with chr1, chr2... in to separate individual chromosome files ??
suggestions please!
None of us was born with programming skills, this would be a fine practice exercise. The first script you write is going to take a long time and will be painful. It will be easier for every script you write afterward.
Please take a look at this code I wrote, it is not that far from what you want to achieve. Try to catch how it works and change some lines
A: fastq parsing inefficiencies with python
sunnykevin97 : You can cheat a bit.
Then
cat
the files per chromosome together.cat chr1:*.fa > chr1_final.fa
throughcat chrMT:*.fa > chrMT_final.fa
Exactly I did the same this.
As a general hint, related to Genomax's other comment. If you're already achieved splitting all of the sequences in to individual files, you can put them back together again using
cat
For instance if you have:
You can just do
cat chr1:*.fasta > chr1.fasta
,cat chrX:*.fasta > chrX.fasta
and so on. If you have lots to do, you can work out how to incorporate this in to a loop, or parallel process.