fastq2fasta by sed
I've been having fun with sed this morning. I'm testing using Illumina data as input to Newbler, just to see how it works. As the data is paired-end, it requires some pre-processing following the guidelines for Sanger reads in the manual for gsAssembler (A.K.A newbler). Yesterday I did this using the method proposed in the manual with a final cleanup using sed so the procedure was fastq-->fasta-->perl script to make .acc file --> use fnafile to produce fasta file --> final tidy using sed. That was all messy and horribly slow. So today I decided to do all of those steps using sed.
Here's the end result:
1) Convert fastq2fasta using sed
sed '/^@/!d;s//>/;N' asp5_leaf_read1.fq > asp5_leaf_read1.fna
sed '/^@/!d;s//>/;N' asp5_leaf_read2.fq > asp5_leaf_read2.fna
2) Format fasta header so newbler can match the pairs up
sed -e 's/>\(.*\)#0\/\(.*\)/>\1 template=\1 dir=\2 library=asp5/' -e 's/dir=1/dir=f/' asp5_leaf_read1.fna > asp5_leaf_read1_newbler.fna
sed -e 's/>\(.*\)#0\/\(.*\)/>\1 template=\1 dir=\2 library=asp5/' -e 's/dir=2/dir=r/' asp5_leaf_read2.fna > asp5_leaf_read1_newbler.fna
3) Produce qual files with read names matching the fasta files
sed '/^+/!d;s//>/;N' asp5_leaf_read1.fq | sed 's/>\(.*\)#0\/\(.*\)/>\1/' > asp5_leaf_read1_newbler.qual
sed '/^+/!d;s//>/;N' asp5_leaf_read2.fq | sed 's/>\(.*\)#0\/\(.*\)/>\1/' > asp5_leaf_read2_newbler.qual
Those qual values will then need rescaling to phred33 .... still to do. Steps 1 and 2 can also be piped if you don't want to keep the unmodified .fna files (but I did - for testing Inchworm). I also need to check whether to reverse compliment the reads for either paired-end or mate-pair data.
Those steps took about 15 mins to complete. The attempt yesterday took over six hours. Thank you sed!
相关推荐:
- NCBI在线BLAST使用方法与结果详解 2941
- 神经网络术语:Epoch、Batch Size和迭代 527
- Consed的安装与使用教程 465
- 陈连福的NGS生物信息学培训教材V2.1 277
- WGCNA分析使用教程 272
-
***As for the first question, 1970-01-01 08:00#2
最新创建圈子
-
原料药研发及国内外注册申报
2019-01-25 10:41圈主:caolianhui 帖子:33 -
制药工程交流
2019-01-25 10:40圈主:polysciences 帖子:30 -
健康管理
2019-01-25 10:40圈主:neuromics 帖子:20 -
发酵技术
2019-01-25 10:39圈主:fitzgerald 帖子:17 -
医学肿瘤学临床试验
2019-01-25 10:39圈主:bma 帖子:58
hi, ‘asp5′ and ‘leaf’ is just a filename without special means.