I have data files with many columns, but I only need the DNA sequence. I want to convert them to FASTA and then to FASTQ, or convert them directly to FASTQ. The files are up to gigabytes in size. What should I do?
For those who may reach this thread by way of a search in the future: you can convert a FASTA file to FASTQ format using reformat.sh from the BBMap suite.
Please remember that the Q-scores created this way are fake (for example, they can be set to 35 for all bases).
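To make the "fake Q-scores" point concrete, here is a minimal Python sketch of the same idea. It is not how reformat.sh is implemented; the function name `fasta_to_fastq` and the `fake_q` parameter are made up for illustration, and it assumes Phred+33 encoding. It reads line by line, so memory use depends on the longest single sequence rather than on the total file size.

```python
import io


def write_fastq_record(out, name, seq, qual_char):
    """Write one four-line FASTQ record with a constant quality character."""
    out.write(f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n")


def fasta_to_fastq(fasta_handle, fastq_handle, fake_q=35):
    """Convert FASTA to FASTQ, giving every base the same made-up quality.

    fake_q is a hypothetical parameter: the Phred score assigned to all
    bases. It carries no real information about sequencing accuracy.
    """
    qual_char = chr(fake_q + 33)  # Phred+33 encoding, so Q35 becomes 'D'
    name, chunks = None, []
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                write_fastq_record(fastq_handle, name, "".join(chunks), qual_char)
            name, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may be wrapped over several lines
    if name is not None:
        write_fastq_record(fastq_handle, name, "".join(chunks), qual_char)


if __name__ == "__main__":
    src = io.StringIO(">read1\nACGT\nAC\n>read2\nGG\n")
    dst = io.StringIO()
    fasta_to_fastq(src, dst)
    print(dst.getvalue(), end="")
```

With real files you would pass handles from `open("in.fasta")` and `open("out.fastq", "w")` instead of the in-memory `StringIO` objects. An aligner will accept the result, but any downstream step that weighs base quality is working with invented numbers.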
This is unclear.
Please post the format of the data as you currently have it.
This is the data we have (TSV format); we need to convert it to FASTQ. Here is a screenshot of it.
Based on your Excel table, I'm very concerned that you are trying to do something you probably shouldn't. If we knew more about where this data came from and what you want to do with it, we could help you in ways other than FASTA -> FASTQ.
So, I don't think your Excel table is in a common format (that I know of) that can be converted directly into FASTA, so that will be your biggest problem. You may have to cut out just the sequences you want to use from Excel (assuming you know what each of the allele columns means, because I don't) and save them as a plain .txt. Then, perhaps someone can write you a script to convert that into FASTA.
However, one thing no one can help you do is convert FASTA to FASTQ. FASTQ is FASTA plus the quality scores from the sequencer. That sequencing-quality information doesn't seem to be in your Excel table, so I don't think that will be possible at all :(
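For the extraction step mentioned above (pulling one sequence column out of the table and writing it as FASTA), a rough Python sketch might look like the following. The function name and the column names `allele` and `id` are placeholders, not the real headers of your file, and it assumes the first non-comment line holds the column names and that comment lines start with `#`. It only reshapes text; it does not make the alleles meaningful as reads.

```python
import csv
import io


def tsv_column_to_fasta(tsv_handle, fasta_handle, seq_column, id_column=None):
    """Copy one sequence column of a tab-separated table into FASTA records.

    seq_column and id_column are whatever your header calls them; the names
    used in the demo below are hypothetical. Returns the number of records
    written.
    """
    # Assumption: lines starting with '#' are comments, not data.
    data_lines = (line for line in tsv_handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    written = 0
    for row_number, row in enumerate(reader, start=1):
        seq = (row.get(seq_column) or "").strip().upper()
        # Skip empty cells and anything that is not plain A/C/G/T/N.
        if not seq or set(seq) - set("ACGTN"):
            continue
        name = (row.get(id_column) if id_column else None) or f"row{row_number}"
        fasta_handle.write(f">{name}\n{seq}\n")
        written += 1
    return written


if __name__ == "__main__":
    table = io.StringIO("#example\nid\tallele\nx1\tacgt\nx2\t\nx3\tAC-T\n")
    out = io.StringIO()
    count = tsv_column_to_fasta(table, out, seq_column="allele", id_column="id")
    print(count)
    print(out.getvalue(), end="")
```

Because rows are processed one at a time, this should cope with gigabyte-sized tables, but check a few output records by hand before trusting it on the full file.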
We downloaded the data from the Complete Genomics website (Cancer Data Set). We need to align the DNA sequences using SparkBWA, so we need the data in FASTQ format, and then apply filters to get variants and mutations. This is my graduation project. What should I do? :(
For some reason I can't see the examples you posted. But if you downloaded data from Complete Genomics, aren't those already variant calls in tabular format?
You should be able to see the shared image from Google Drive now (check a few posts up).
No, the variant calls are in another file; this file is supposed to contain the sequence.
Please copy and paste a couple of example rows (change the sequence/identifiable info if you must) here. It is hard to see anything from the screenshot you have shared.
I'm trying to find out which file you actually downloaded. Could you share the path on the FTP site or the complete filename? I have the idea that I'm looking at something about structural variants.
This is the complete file name: <evidenceintervals-chr1-hcc1187-h-200-36-asm-n1>. The FTP site hasn't been working for some days.
Right, evidence intervals. I have my doubts that you can convert this to fast(a/q). (And I definitely have my doubts about whether it's meaningful.) If I'm not terribly mistaken, these intervals just describe the alleles at each variant site, not something you can directly use to reproduce the alignment. Could you expand on what exactly you would like to achieve, and why?
Note that you can convert evidenceDnbs-type files to SAM files (showing the alignment at variant sites) using cgatools evidence2sam. Would this help you?
At first, we needed to get DNA sequences from people with cancer and from healthy people, call variants by alignment, and then compare the variants of the healthy people with those of the cancer patients to find the cancer variants. So we need to get the sequences from this data to do the alignment and the rest of the variant discovery process, but our background in bioinformatics isn't strong, as our major is computer engineering. This is our graduation project.
If you only need to do alignments then you don't need to convert data to fastq format.
We work with SparkBWA, so we need FASTQ. We need cancer data as FASTQ or BAM. Where can we get it?
If this is data from Complete Genomics, then you would have to get the FASTQ or BAM from there (I don't have experience with CG, but I assume this may be proprietary data).
If you are looking for public cancer data, then the TCGA data portal and/or the ICGC portal would be your options.
I couldn't work with these sites, and the files I downloaded are in a strange format; I couldn't find the DNA sequence. Here is the data I found.
That appears to be just a metadata table. You may need to apply for access to get the TCGA data.
If you just want some cancer data, there are plenty of options in EBI-ENA. You will be able to find the FASTQ files (there are no aligned files here; you will have to create them yourself) once you drill down to samples.
You may also want to take a look at the cancer cell line data here. This does not require authorization.