Convert from fasta to fastq
8.4 years ago

I have data files with many columns, but I only need the DNA sequences. I want to convert them to FASTA and then to FASTQ, or convert them directly to FASTQ. The files are gigabytes in size. What should I do? This is the data we have (TSV format) that we need to convert to FASTQ; this is a screenshot of it.

sequence DNA format • 6.0k views

This is unclear.


Please post the data in the format you currently have it.


This is the data we have (TSV format). We need to convert it to FASTQ; here is a screenshot of it.


Based on your Excel table, I'm very concerned that you are trying to do something that you probably shouldn't. If we knew more about where this data came from and what you want to do with it, we'd be able to help you in other ways than FASTA -> FASTQ.

I don't think your Excel table is in a common format (that I know of) that can be converted directly into FASTA, so that will be your biggest problem. You may have to cut just the sequences you want to use out of Excel (assuming you know what each of the allele columns means, because I don't) and save them as a plain .txt file. Then perhaps someone can write you a script to convert that into FASTA.

However, one thing no one can help you do is convert FASTA to a real FASTQ. A FASTQ is the FASTA plus the per-base quality scores from the sequencer. That sequencing-quality information doesn't seem to be in your Excel table, so I don't think that will be possible at all :(
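
To make the missing piece concrete, here is a minimal Python sketch of what a FASTQ record carries beyond FASTA. It assumes the common Phred+33 quality encoding and uses a made-up record; the function name `decode_phred33` is only illustrative and does not come from any particular tool.

```python
def decode_phred33(quality_line):
    """Turn a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in quality_line]


# A made-up FASTQ record: header, bases, separator, per-base qualities.
record = ["@read1", "ACGT", "+", "II5!"]
header, bases, _, quality = record

# FASTA keeps only the header and the bases. The fourth line is the
# per-base quality reported by the sequencer, one character per base.
scores = decode_phred33(quality)
assert len(scores) == len(bases)
print(list(zip(bases, scores)))
```

Without that fourth line there is nothing to decode, which is why a table of bare sequences cannot give you genuine quality values.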


We downloaded the data from the Complete Genomics website (Cancer Data Set). We need to align the DNA sequences using SparkBWA, so we need the data in FASTQ format, and then apply filters to get the variants and mutations. This is my graduation project. What should I do? :(


For some reason I can't see the examples you posted. But if you downloaded data from Complete Genomics, aren't those already variant calls in tabular format?


You should be able to see the shared image from google drive now (check a few posts up).


No, the variant calls are in another file; this file is supposed to contain the sequences.


Please copy and paste a couple of example rows (change the sequence/identifiable info if you must) here. It is hard to see anything from the screenshot you have shared.

#ASSEMBLY_ID    HCC1187-H-200-36-ASM-N1                                                         
#CHROMOSOME chr1                                                            
#FORMAT_VERSION 2                                                           
#GENERATED_AT   2012-Jan-14 08:27:41.197287                                                         
#GENERATED_BY   ExportEvidence                                                          
#SAMPLE GS00258-DNA_F01                                                         
#SOFTWARE_VERSION   2.0.2.15                                                            
#TYPE   EVIDENCE-INTERVALS                                                          

>IntervalId Chromosome  OffsetInChromosome  Length  Ploidy  AlleleIndexes   EvidenceScoreVAF    EvidenceScoreEAF    Allele0 Allele1 Allele2 Allele3 Allele1Alignment    Allele2Alignment    Allele3Alignment        
0   chr1    552 55  2   1   2   0   0   CAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGA CAGAGGACAACGCAGCTCCGTCCTCGCGGTGCTCTCCGGGTCTGTGCTGAAGAGA CAGAGGAGAACGCAGCTCCGCCCTCGCGATGCTCTCCGGGTGTGTGCTAAGGAGA     55M 55M     
1   chr1    612 17  2   0   1   0   0   ACTCCGCCGGCGCAGGC   ACTTCACCGGCGCAGGC           17M         
2   chr1    959 41  3   1   2   3   1358    1231    GAAACTCACGTCACGGTGGCGCGGCGCAGAGACGGGTAGAA   GAAACTCACGTCACGGCGGCGCGGCGCAGAGACGGGTGAAA   GAAACTCACGTCACGGCGGCGCGGCGCAGAGACGGGTGGAA   GAACCTCACGTCACGGTGGCGCGGCGCAGAGACGGGTAGAA   41M 41M 41M

I'm trying to find out which file you actually downloaded. Could you share the path on the FTP site or the complete filename? I have the idea that I'm looking at something about structural variants.


This is the complete file name: <evidenceintervals-chr1-hcc1187-h-200-36-asm-n1>. The FTP site has not been working for some days.


Right, evidence intervals. Have my doubts you can convert this to fast(a/q). (And definitely my doubts whether it's meaningful.) If I'm not terribly mistaken these intervals just describe the alleles at each variant site, not something you can directly use to reproduce the alignment. Could you expand on what exactly you would like to achieve, and why?

Note that you can convert evidenceDnbs-type files to SAM files (showing the alignments at variant sites) using cgatools evidence2sam. Would this help you?
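
For anyone who still wants the allele strings out of an evidence-intervals file as FASTA, here is a rough Python sketch. It assumes the real file is tab-delimited, with `#` metadata lines and a column header line starting with `>`, as in the rows pasted above. The function name `evidence_intervals_to_fasta` and the record naming scheme are my own invention. Keep in mind that the output is allele sequences per interval, not sequencing reads.

```python
import io


def evidence_intervals_to_fasta(lines):
    """Yield one FASTA record per non-empty AlleleN cell, line by line."""
    header = None
    allele_cols = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#KEY value' metadata lines
        fields = [f.strip() for f in line.split("\t")]
        if header is None:
            # Assumption: the first non-metadata line is the '>' header row.
            fields[0] = fields[0].lstrip(">")
            header = {name: i for i, name in enumerate(fields)}
            # Keep Allele0..AlleleN, not AlleleIndexes or AlleleNAlignment.
            allele_cols = [(name, i) for name, i in header.items()
                           if name.startswith("Allele") and name[6:].isdigit()]
            continue
        chrom = fields[header["Chromosome"]]
        offset = fields[header["OffsetInChromosome"]]
        interval = fields[header["IntervalId"]]
        for name, i in allele_cols:
            seq = fields[i] if i < len(fields) else ""
            if seq:
                yield f">{chrom}_{offset}_interval{interval}_{name}\n{seq}\n"


# Small in-memory demo; the AlleleIndexes value "0;1" is a placeholder guess.
sample = io.StringIO(
    "#TYPE\tEVIDENCE-INTERVALS\n"
    "\n"
    ">IntervalId\tChromosome\tOffsetInChromosome\tLength\tPloidy\t"
    "AlleleIndexes\tEvidenceScoreVAF\tEvidenceScoreEAF\tAllele0\tAllele1\t"
    "Allele2\tAllele3\tAllele1Alignment\tAllele2Alignment\tAllele3Alignment\n"
    "1\tchr1\t612\t17\t2\t0;1\t0\t0\tACTCCGCCGGCGCAGGC\tACTTCACCGGCGCAGGC"
    "\t\t\t17M\t\t\n"
)
for record in evidence_intervals_to_fasta(sample):
    print(record, end="")
```

Because it is a generator that reads one line at a time, an open file handle can be passed in place of the `StringIO` object, so gigabyte-sized inputs should not need to fit in memory.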


At first, we needed DNA sequences from people with cancer and from healthy people, so that we could call variants by alignment and then compare the normal variants with the variants from the cancer patients to find the cancer variants. So we need to get the sequences out of this data to do the alignment and the rest of the variant discovery process, but our background in bioinformatics isn't strong, as our major is computer engineering. This is our graduation project.


If you only need to do alignments then you don't need to convert data to fastq format.


We work with SparkBWA, so we need FASTQ. We need cancer data as FASTQ or BAM; where can we get it?


If this is data from complete genomics then you would have to get the fastq or bam from there (I don't have experience with CG but I assume this may be proprietary data).

If you are looking for public Cancer data then TCGA data portal and/or ICGC portal would be your options.


I couldn't work out how to use these sites, and the files I downloaded are in a strange format in which I couldn't find any DNA sequence. Here is the data I found:

sample_id   sample_type matched_sample_id   donor_id    diagnosis_id    sex age_at_diagnosis    age_at_recruitment  biobank_id  consent consent_version irb_approval_acquired   last_follow_up_date donor_record_created_date   donor_record_last_update_date   donor_record_release_date   therapy_type    therapy_response    disease_site    tumour_sample_anatomic_location primary_tumour_type primary_metastatic_recurrent    clinical_staging    pre_or_post_tx_collected    tissue_type diagnosis_record_created_date   diagnosis_record_last_update_date   diagnosis_record_release_date   quantity_on_hand    collection_date sample_freezing_method  sample_record_created_date  sample_record_last_update_date  sample_status   optical_image_stained_section   pathological_m  pathological_n  pathological_t  pathology_stage_grouping    percent_intact_tumour_cells storage_medium  tissue_fixation_protocol    
749710  cell line   914566  649719  668681  female      52      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749711  cell line   911798  649720  668682  female      41      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749712  cell line   911799  649721  668683  female      43      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749713  cell line   911800  649722  668684  female      44      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749714  cell line   911801  649723  668685  female      23      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749709  cell line   911802  649718  668680  female      61      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25  2010-02-25                                      
749715  cell line   911803  649724  668686  female      48      Yes     Yes     2010-02-25  2010-02-25              Breast      Breast cancer   primary         malignant   2010-02-25  2010-02-25                  2010-02-25

That appears to be just a metadata table. You may need to apply for access to get the TCGA data.

If you just want some cancer data, there are plenty of options in EBI-ENA. You will be able to find the FASTQ files once you drill down to the samples (there are no aligned files there; you will have to create them yourself).

You may also want to take a look at Cancer cell line data here. This does not require authorization.

8.4 years ago
GenoMax 147k

For those who may reach this thread by way of a search in the future: you can convert a FASTA file to FASTQ format using reformat.sh from the BBMap suite.

Please remember that the Q-scores created here are fake (example below sets Q-scores to 35 for all bases).

reformat.sh in=test.fa out=fake.fq qfake=35
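
If BBMap is not at hand, the same idea can be sketched in a few lines of Python. This is not how reformat.sh is implemented, just an illustration of the principle: every base gets the same made-up quality, written in Phred+33 encoding (an assumption), so Q35 becomes the character `D`. The function name `fasta_to_fastq` is hypothetical.

```python
def fasta_to_fastq(lines, fake_q=35):
    """Yield FASTQ records from FASTA lines, using one fake quality value."""
    if not 0 <= fake_q <= 93:
        raise ValueError("fake_q must fit in printable Phred+33 (0..93)")
    qchar = chr(fake_q + 33)  # Phred+33: Q35 -> 'D'
    name, chunks = None, []

    def emit():
        seq = "".join(chunks)
        return f"@{name}\n{seq}\n+\n{qchar * len(seq)}\n"

    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield emit()  # flush the previous record
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # multi-line sequences are joined
    if name is not None:
        yield emit()


# Made-up two-record FASTA, with the first sequence wrapped over two lines.
demo = [">r1\n", "ACGT\n", "AC\n", ">r2\n", "GG\n"]
for fastq_record in fasta_to_fastq(demo):
    print(fastq_record, end="")
```

As with reformat.sh, the qualities produced this way are fake and should not be used for anything that depends on real base-call confidence.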
