糜子基因组组装手记(PacBio+NGS+HiC) 2019-04-15 技术 暂无评论 3271 次阅读 相关结果已经发表在: > Changsong Zou, Leiting Li, Daisuke Miki, Delin Li, **Qiming Tang**, Lihong Xiao, Santosh Rajput, Ping Deng, Li Peng, Wei Jia, Ru Huang, Meiling Zhang, Yidan Sun, Jiamin Hu, Xing Fu, Patrick S. Schnable , Yuxiao Chang, Feng Li, Hui Zhang, Baili Feng, Xinguang Zhu, Renyi Liu , James C. Schnable, Jian-Kang Zhu, Heng Zhang [The genome of broomcorn millet](https://www.nature.com/articles/s41467-019-08409-5 "The genome of broomcorn millet"). ***Nature Communications***. 2019 ------------ ### **CONTENT** * [1. Genome Assembly I (*de novo*, PacBio®)](#1) * [1.1 Sequencing QC](#1.1) * [1.2 *k*-mer Statistics](#1.2) * [1.3 Usage of CANU](#1.3) * [1.4 Usage of Pilon](#1.4) * [1.5 Evaluate the Assembly](#1.5) * [2. Genome Assembly II (BioNano®, Hi-C)](#2) * [2.1 Construct BioNano® Scaffold](#2.1) * [2.2 Chromosome-level Assmebly using Hi-C](#2.2) * [3. Genome Annotation](#3) * [3.1 Eukaryotes Genome(*maker*)](#3.1) * [3.2 Prokaryotes Genome(*prokka*)](#3.2) --- 1. Genome Assembly I (de novo, PacBio®) 1.1 Sequencing QC > The script `sequence_statistics.py` could make a basic statitics on the sequencing data, include GC%, N50, mean and longest read's length, etc. * Locate: `/home/qmtang/sequence_statistics.py` * Require: `python 2.7.x` with `numpy` package * Usage: First, add your sample name by modify the script: ```python name = '' ``` then run the script: python sequence_statistics.py .txt In `filespath.txt`, each line contains a **fasta** file's path(absolute path recommonded, not support fastq file). For example, to get all fasta file of a PacBio® run, we could run the command: ``` $ ls */Analysis_Results/*.fasta > .txt ``` * Output: script would print the amount of reads, A, T, C and G separately, and `_smry.txt` contains all statistical results, `_ori.txt` record the sorted read length. We could use the data in `_ori.txt` to plot the firgure of reads length with Excel, R, Python or any statistical tools. 1.2 k-mer Statistics > * We use `GCE(genomic charactor estimator)` program to estimate the genome size. This program could also estimate the genomic repeat content and the heterozygsis rate of the sequencing sample. > * Reference paper could be found at: Binghang Liu, et al. [Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects](https://arxiv.org/abs/1308.2012) * Locate: `/home/qmtang/Programs/gce-1.0.0/` * Require: maybe need `g++` to recompile the soruce code. * Usage&Output: please refer to offical manual: `./README.txt` 1.3 Usage of CANU > * We use `canu` program to assemble the PacBio® sequencing data due to its outstanding performance in our previous tests. > * Github: [http://github.com/marbl/canu/](http://github.com/marbl/canu/) > * Citation: Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. [Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation](https://doi.org/10.1101/gr.215087.116). Genome Research. (2017). * Locate: `/home/qmtang/Programs/canu1.3/`, `/home/qmtang/Programs/canu-1.5/`, `/home/qmtang/Programs/canu-1.6/`. There 3 versions used in previous projects, of course the latest is better if you don't need to consider the **repeatability** of projects. * Require: *Nothing special needs to be prepared.* * Usage: * Launch to SGE: ``` /home/qmtang/Programs/canu \ gnuplotTested=true \ gridOptions="-q small.q -S /bin/bash" \ gridEngineThreadsOption="-pe orte THREADS" \ gridEngineMemoryOption="-l mem_free=MEMORY" \ -p -d genomeSize= \ -pacbio-raw \ .fastq)> ``` * CANU will auto detect the cluster's hardware configuration and submit jobs to cluster. For small genome, there is little need for subsequent adjustments. While if the genome size is large, the jobs could failed because the insufficient memory, and manual adjustment is acquired to re-submit jobs. * If canu is unexpected aborted, there is a way to restart the pipeline: in folder `./canu-scripts/`, there would lots of scripts named as `canu.xx.sh` and `xx` is a number. If the maximum is `MAX`, you could restart the pipeline with re-run the script `./canu-scripts/canu..sh` at ``. 1.4 Usage of Pilon > * We use Pilon to correct the PacBio® assembly with Illumina® high quality paired reads. > * HomePage: [http://software.broadinstitute.org/software/pilon/](http://software.broadinstitute.org/software/pilon/) > * Citation: Bruce J. Walker, et al. [Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112963) PLoS ONE (2014) * Loacte: `/home/qmtang/Programs/pilon-1.18.jar` * Require: Java runtime > 1.7 * Usage (based on offical [wiki](https://github.com/broadinstitute/pilon/wiki/Requirements-&-Usage): * Index build: ``` bowtie2-build --threads .fasta ``` * Mapping: ``` lane=(1,2,3,4) for ID in lane do bowtie2 -p --reorder --mm -x -1 /path/to/NGS/_L${ID}_R1_paired.fastq.gz -2 _L${ID}_R2_paired.fastq.gz -S _L${ID}_paired.sam samtools view -b -S _L${ID}_paired.sam > _L${ID}_paired.bam samtools sort _L${ID}_paired.bam -o _L${ID}_sorted.bam -T ./L${ID} --threads samtools index _L${ID}_sorted.bam samtools idxstats _L${ID}_sorted.bam > _L${ID}_sorted.bam.idxstats done ``` * Run Pilon:(Recommand running on biocomp1) ``` java -Xmx1024G -jar pilon-1.18.jar \ --genome .fasta \ --frags _L1_sorted.bam \ --frags _L2_sorted.bam \ --frags _L3_sorted.bam \ --frags _L4_sorted.bam \ --threads --changes \ --outdir \ --output _pilon ``` * Output: * `/_pilon.changes` recorded all changes during pilon pipeline. We can count the base error rate though this file. * `/_pilon.fasta` is the final corrected assembly. 1.5 Evaluate the Assembly > Scripts `sequence_statistics.py` and `sequence_print.py` could be used to evaluate the assembly briefly. * Locate: `/home/qmtang/sequence_statistics.py`, `/home/qmtang/sequence_print.py` * Require: `python 2.7.x` with `numpy` package * Usage: * the usage of `sequence_statistics.py` please refer to [1.1 Sequencing QC](#1.1); * the usage of `sequence_print.py` is same as `sequence_statistics.py`, but the output is different. `sequence_print.py` would generate `_orisc.txt` containing each sequence(contig)'s name and its length. * Both two scripts could be modified by replace: ```python with open(name+'.txt') as file: for line in file: src += [line.strip('\n')] ``` with: ```python src=[sys.argv[1]] name = sys.argv[1].split('/')[-1].split('.')[0] ``` to suit for reading single one fasta file though command line directly. python sequence_statistics.py .fasta --- 2. Genome Assembly II (BioNano®, Hi-C) 2.1 Construct BioNano® Scaffold > * BioNano® offers a new optical genetic mapping technology to construct genome hybrid scaffolds. > * Homepage: [https://bionanogenomics.com/](https://bionanogenomics.com/) refer to: ![IrysView_Software_Training_Guide_24.pdf-3920.9kB][1] 2.2 Chromosome-level Assmebly using Hi-C > * The old Hi-C assmebly pipeline is based on LACHESIS (`/home/qmtang/Programs/LACHESIS/`), while this program is no longer to update and the lastest version has bug in compiling. So a new assembly pipeline is recorded here instead of LACHESIS, which developed by same lab. > * Github: [https://github.com/theaidenlab/juicer](https://github.com/theaidenlab/juicer) and [https://github.com/theaidenlab/3d-dna](https://github.com/theaidenlab/3d-dna) * Locate: `/home/qmtang/Millet/HiC-new/software/` * Require: Python 2.7 with `scipy` package * Usage: * [Juicer] build genome reference.(refer to [https://github.com/theaidenlab/juicer/wiki/Usage#adding-a-new-genome](https://github.com/theaidenlab/juicer/wiki/Usage#adding-a-new-genome) * [Juicer] execute the juicer pipeline.(refer to [https://github.com/theaidenlab/juicer/wiki/Running-Juicer-on-a-cluster](https://github.com/theaidenlab/juicer/wiki/Running-Juicer-on-a-cluster)): ``` /home/qmtang/Millet/HiC-new/software/juicer/CPU/juicer.sh -t 32 \ -D /home/qmtang/Millet/HiC-new -g Millet \ -d /home/qmtang/Millet/HiC-new -s "HindIII" \ -y /home/qmtang/Millet/HiC-new/restriction_sites/Millet_HindIII.txt \ -z /home/qmtang/Millet/HiC-new/references/millet_scaffolds.fasta \ -p /home/qmtang/Millet/HiC-new/restriction_sites/Millet.chrom.sizes ``` * [3d-dna] execute the 3d-dna pipeline(refer to [https://github.com/theaidenlab/3d-dna](https://github.com/theaidenlab/3d-dna)): ``` /home/qmtang/Millet/HiC-new/software/3d-dna/run-pipeline.sh \ -c 18 \ /home/qmtang/Millet/HiC-new/references/millet_scaffolds.fasta \ /home/qmtang/Millet/HiC-new/aligned/merged_nodups.txt ``` --- 3. Genome Annotation 3.1 Eukaryotes Genome(maker) > * To annotate eukaryotes genome, servals published pipeline used in our previous projects, including `TEdenovo`, `TEannot` and `maker`. > * REPET pipeline(`TEdenovo` and `TEannot`) refer to: [http://urgi.versailles.inra.fr/Tools/REPET](http://urgi.versailles.inra.fr/Tools/REPET) > * maker pipeline refer to [http://www.yandell-lab.org/software/maker.html](http://www.yandell-lab.org/software/maker.html) * Loacte: REPET:`/home/qmtang/Programs/REPET-2.5/` * Usage: all files locate: `/home/qmtang/MilletCHP_Anno/` * STEP 1. TEdenovo(`/home/qmtang/MilletCHP_Anno/01.TEdenovo/`). Execute the commands below one by one(logged in `/home/qmtang/MilletCHP_Anno/01.TEdenovo/0.Repet_deNovo.sh`): ``` cd ~/MilletCHP_Anno/01.TEdenovo TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 1 TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 2 -s Blaster TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 2 --struct TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 3 -s Blaster -c Grouper TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 3 -s Blaster -c Recon TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 3 -s Blaster -c Piler TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 3 --struct TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 4 --struct -m Map LaunchRepeatScout.py -i MilletCHP.fa TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map --struct TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map --struct TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map --struct TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f Blastclust --struct TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f MCL --struct ``` * STEP 2. TEannot(Build a ref TE library with simplified pipeline, `/home/qmtang/MilletCHP_Anno/02.TEanno/`):Execute the commands below one by one(logged in `/home/qmtang/MilletCHP_Anno/02.TEanno/0.TEanno.sh` and `/home/qmtang/MilletCHP_Anno/02.TEanno/step7+.sh`): ``` cd /home/qmtang/MilletCHP_Anno/02.TEanno ########build a ref TE library###### TEannot.py -P MilletCHP -C TEannot.cfg -S 1 TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a BLR TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a RM TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a BLR -r TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a RM -r TEannot.py -P MilletCHP -C TEannot.cfg -S 3 -c BLR+RM TEannot.py -P MilletCHP -C TEannot.cfg -S 7 PostAnalyzeTELib.py -a 3 -p MilletCHP_chr_allTEs_nr_noSSR_join_path -s MilletCHP_refTEs_seq -g 907084234 GetSpecificTELibAccordingToAnnotation.py -i MilletCHP_chr_allTEs_nr_join_path_statsPerTE.txt -t MilletCHP_refTEs_seq ln -s MilletCHP_chr_allTEs_nr_join_path_statsPerTE_FullLengthFrag.fa /home/qmtang/MilletCHP_Anno/04.re-TEanno/MilletCHP_refTEs.fa ``` * STEP 3: maker(Gene annotation, `/home/qmtang/MilletCHP_Anno/03-1024.Maker/` and `/home/qmtang/MilletCHP_Anno/03.Maker/`): * `/home/qmtang/MilletCHP_Anno/03-1024.Maker/pre-maker.py` split the genome into seperate fasta files for each scaffold/contig, then submit the maker jobs to cluster. Make sure `maker.sh`, `maker_bopts.ctl`, `maker_exe.ctl` and `maker_opts.ctl` exist in same folder. * To combine all jobs' output(`.gff`) together, please modify and execute the script `/home/qmtang/MilletCHP_Anno/03.Maker/combine_gff.py`. The input file `gffs_path` could be get though `ls $i/*/*/*/*/*/*gff > `. * STEP 4: TEanno(Full pipeline to get TE annotation, `/home/qmtang/MilletCHP_Anno/04.re-TEanno/`): ``` TEannot.py -P MilletCHP -C TEannot.cfg -S 1 TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a BLR TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a RM TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a BLR -r TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a RM -r TEannot.py -P MilletCHP -C TEannot.cfg -S 3 -c BLR+RM TEannot.py -P MilletCHP -C TEannot.cfg -S 4 -s TRF TEannot.py -P MilletCHP -C TEannot.cfg -S 4 -s Mreps TEannot.py -P MilletCHP -C TEannot.cfg -S 4 -s RMSSR TEannot.py -P MilletCHP -C TEannot.cfg -S 5 TEannot.py -P MilletCHP -C TEannot.cfg -S 6 -b tblastx TEannot.py -P MilletCHP -C TEannot.cfg -S 6 -b blastx TEannot.py -P MilletCHP -C TEannot.cfg -S 7 TEannot.py -P MilletCHP -C TEannot.cfg -S 8 -o GFF3 ``` * STEP 5: t-RNA and r-RNA annotation * t-RNA refer to `/home/qmtang/Programs/tRNAscan-SE-1.3.1/` * r-RNA genes annotated by mapping *Arabidopsis thaliana*(TAIR10) rRNA gene sequence to genome.(`/home/qmtang/MilletCHP_Anno/07.rRNA/`) 3.2 Prokaryotes Genome(prokka) > * We use prokka to annotate prokaryotes genome. > * Github: [https://github.com/tseemann/prokka](https://github.com/tseemann/prokka) > * Citation: Seemann T. [Prokka: rapid prokaryotic genome annotation](http://www.ncbi.nlm.nih.gov/pubmed/24642063), Bioinformatics, 2014 * Locate: `/home/qmtang/Programs/prokka/` * Requrie: refer to `/home/qmtang/Programs/prokka/README.md` * Usage: ``` /home/qmtang/Programs/prokka/bin/prokka --cpus 0 \ --outdir /home/qmtang/Bacillus/prokka-TG1_E1/prokka_modify --prefix TG1-E1 \ --addgenes --rfam \ --gram + \ --genus Bacillus_TG1-E1 \ --species Bacillus \ --strain TG1-E1 \ --proteins /home/qmtang/Bacillus/ref_ydsun/prot2003-2014.fa \ --hmms /home/qmtang/Bacillus/Pfam31.0-A.hmm \ /home/qmtang/Bacillus/prokka-TG1_E1/TG1-E1_correctedGenome.fasta ``` * Output: refer to [https://github.com/tseemann/prokka#output-files](https://github.com/tseemann/prokka#output-files) [1]: http://static.zybuluo.com/Syaoran-Tang/hpu6kmys0o3vd6jyjdf92ady/IrysView_Software_Training_Guide_24.pdf 文章目录 1. Genome Assembly I (de novo, PacBio®) 1.1 Sequencing QC 1.2 k-mer Statistics 1.3 Usage of CANU 1.4 Usage of Pilon 1.5 Evaluate the Assembly 2. Genome Assembly II (BioNano®, Hi-C) 2.1 Construct BioNano® Scaffold 2.2 Chromosome-level Assmebly using Hi-C 3. Genome Annotation 3.1 Eukaryotes Genome(maker) 3.2 Prokaryotes Genome(prokka) 标签: 生物信息学 本作品采用 知识共享署名-相同方式共享 4.0 国际许可协议 进行许可。