糜子基因组组装手记（PacBio+NGS+HiC）

相关结果已经发表在：
> Changsong Zou, Leiting Li, Daisuke Miki, Delin Li, **Qiming Tang**, Lihong Xiao, Santosh Rajput, Ping Deng, Li Peng, Wei Jia, Ru Huang, Meiling Zhang, Yidan Sun, Jiamin Hu, Xing Fu, Patrick S. Schnable , Yuxiao Chang, Feng Li, Hui Zhang, Baili Feng, Xinguang Zhu, Renyi Liu , James C. Schnable, Jian-Kang Zhu, Heng Zhang [The genome of broomcorn millet](https://www.nature.com/articles/s41467-019-08409-5 "The genome of broomcorn millet"). ***Nature Communications***. 2019

------------

### **CONTENT**

* [1. Genome Assembly I (*de novo*, PacBio<sup>®</sup>)](#1)
    * [1.1 Sequencing QC](#1.1)
    * [1.2 *k*-mer Statistics](#1.2)
    * [1.3 Usage of CANU](#1.3)
    * [1.4 Usage of Pilon](#1.4)
    * [1.5 Evaluate the Assembly](#1.5)

* [2. Genome Assembly II (BioNano<sup>®</sup>, Hi-C)](#2)
    * [2.1 Construct BioNano<sup>®</sup> Scaffold](#2.1)
    * [2.2 Chromosome-level Assmebly using Hi-C](#2.2)
* [3. Genome Annotation](#3)
    * [3.1 Eukaryotes Genome(*maker*)](#3.1)
    * [3.2 Prokaryotes Genome(*prokka*)](#3.2)

---
<h3 id="1"><a class="cl-offset" name="cl-1"></a><b>1. Genome Assembly I (<i>de novo</i>, PacBio<sup>®</sup>)</b></h3>
<h4 id="1.1"><a class="cl-offset" name="cl-2"></a>1.1 Sequencing QC</h4>
> The script `sequence_statistics.py` could make a basic statitics on the sequencing data, include GC%, N50, mean and longest read's length, etc.

* Locate: `/home/qmtang/sequence_statistics.py`
* Require: `python 2.7.x` with `numpy` package
* Usage: 
First, add your sample name by modify the script:
```python
name = '<name>'
```
then run the script:

python sequence_statistics.py <name>.txt
        
In `filespath.txt`, each line contains a **fasta** file's path(absolute path recommonded, not support fastq file). For example, to get all fasta file of a PacBio<sup>®</sup> run, we could run the command:
```
$ ls */Analysis_Results/*.fasta > <name>.txt
```
* Output: script would print the amount of reads, A, T, C and G separately, and `<name>_smry.txt` contains all statistical results, `<name>_ori.txt` record the sorted read length. We could use the data in  `<name>_ori.txt` to plot the firgure of reads length with Excel, R, Python or any statistical tools.

<br />

<h4 id="1.2"><a class="cl-offset" name="cl-3"></a>1.2 <i>k-</i>mer Statistics</h4>
> * We use `GCE(genomic charactor estimator)` program to estimate the genome size. This program could also estimate the genomic repeat content and the heterozygsis rate of the sequencing sample. 
> * Reference paper could be found at: Binghang Liu, et al. [Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects](https://arxiv.org/abs/1308.2012)

* Locate: `/home/qmtang/Programs/gce-1.0.0/`
* Require: maybe need `g++` to recompile the soruce code.
* Usage&Output: please refer to offical manual: `./README.txt`

<br />

<h4 id="1.3"><a class="cl-offset" name="cl-4"></a>1.3 Usage of CANU</h4>
> * We use `canu` program to assemble the PacBio<sup>®</sup> sequencing data due to its outstanding performance in our previous tests.
> * Github: [http://github.com/marbl/canu/](http://github.com/marbl/canu/)
> * Citation: Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. [Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation](https://doi.org/10.1101/gr.215087.116). Genome Research. (2017).

* Locate: `/home/qmtang/Programs/canu1.3/`, `/home/qmtang/Programs/canu-1.5/`, `/home/qmtang/Programs/canu-1.6/`. There 3 versions used in previous projects, of course the latest is better if you don't need to consider the **repeatability** of projects.
* Require: *Nothing special needs to be prepared.*
* Usage: 
 * Launch to SGE:

```
    /home/qmtang/Programs/canu \
        gnuplotTested=true \
        gridOptions="-q small.q -S /bin/bash" \
        gridEngineThreadsOption="-pe orte THREADS" \
        gridEngineMemoryOption="-l mem_free=MEMORY" \
        -p <projectName> -d <workdir> genomeSize=<GenomeSize> \
        -pacbio-raw \
        <All your subreads fastq files path(.subread.<1/2/3>.fastq)>
    ```
    * CANU will auto detect the cluster's hardware configuration and submit jobs to cluster. For small genome, there is little need for subsequent adjustments. While if the genome size is large, the jobs could failed because the insufficient memory, and manual adjustment is acquired to re-submit jobs.
    * If canu is unexpected aborted, there is a way to restart the pipeline: in folder `./canu-scripts/`, there would lots of scripts named as `canu.xx.sh` and `xx` is a number. If the maximum is `MAX`, you could restart the pipeline with re-run the script `./canu-scripts/canu.<MAX-1>.sh` at `<workdir>`.

<br />
<h4 id="1.4"><a class="cl-offset" name="cl-5"></a>1.4 Usage of Pilon</h4>
> * We use Pilon to correct the PacBio<sup>®</sup> assembly with Illumina<sup>®</sup> high quality paired reads.
> * HomePage: [http://software.broadinstitute.org/software/pilon/](http://software.broadinstitute.org/software/pilon/)
> * Citation: Bruce J. Walker, et al. [Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112963) PLoS ONE (2014)

* Loacte: `/home/qmtang/Programs/pilon-1.18.jar`
* Require: Java runtime > 1.7
* Usage (based on offical [wiki](https://github.com/broadinstitute/pilon/wiki/Requirements-&-Usage):
    * Index build:

```
    bowtie2-build --threads <N> <assmebly_contigs>.fasta <assmebly_name>
    ```
    * Mapping:
    
    ```
lane=(1,2,3,4)
for ID in lane
do
        bowtie2 -p <T> --reorder --mm -x <assmebly_name> -1 /path/to/NGS/<speices>_L${ID}_R1_paired.fastq.gz -2 <speices><speices>_L${ID}_R2_paired.fastq.gz -S <speices>_L${ID}_paired.sam
    samtools view -b -S <speices>_L${ID}_paired.sam > <speices>_L${ID}_paired.bam
    samtools sort <speices>_L${ID}_paired.bam -o <speices>_L${ID}_sorted.bam -T ./L${ID} --threads <T>
        samtools index <speices>_L${ID}_sorted.bam
    samtools idxstats <speices>_L${ID}_sorted.bam > <speices>_L${ID}_sorted.bam.idxstats
done
    ```
    * Run Pilon:(Recommand running on biocomp1)
    
    ```
    java -Xmx1024G -jar pilon-1.18.jar \
	    --genome <assmebly_contigs>.fasta \
	    --frags <speices>_L1_sorted.bam \
	    --frags <speices>_L2_sorted.bam \
	    --frags <speices>_L3_sorted.bam \
	    --frags <speices>_L4_sorted.bam \
	    --threads <T> --changes \
	    --outdir <pilon_workdir> \
        --output <assmebly_contigs>_pilon
    ```

* Output:
    * `<pilon_workdir>/<assmebly_contigs>_pilon.changes` recorded all changes during pilon pipeline.  We can count the base error rate though this file.
    * `<pilon_workdir>/<assmebly_contigs>_pilon.fasta` is the final corrected assembly.
    
<br />
<h4 id="1.5"><a class="cl-offset" name="cl-6"></a>1.5 Evaluate the Assembly</h4>
> Scripts `sequence_statistics.py` and `sequence_print.py` could be used to evaluate the assembly briefly.

* Locate: `/home/qmtang/sequence_statistics.py`, `/home/qmtang/sequence_print.py`
* Require: `python 2.7.x` with `numpy` package
* Usage: 
    * the usage of `sequence_statistics.py` please refer to [1.1 Sequencing QC](#1.1);
    * the usage of  `sequence_print.py` is same as `sequence_statistics.py`, but the output is different. `sequence_print.py` would generate `<name>_orisc.txt` containing each sequence(contig)'s name and its length.
    * Both two scripts could be modified by replace:

```python
    with open(name+'.txt') as file:
	for line in file:
		src += [line.strip('\n')]
    ```
    with:
    ```python
    src=[sys.argv[1]]
    name = sys.argv[1].split('/')[-1].split('.')[0]
    ```
    to suit for reading single one fasta file though command line directly.
    
        python sequence_statistics.py <yoursequence>.fasta
    
---
<h3 id="2"><a class="cl-offset" name="cl-7"></a><b>2. Genome Assembly II (BioNano<sup>®</sup>, Hi-C)</b></h3>
<h4 id="2.1"><a class="cl-offset" name="cl-8"></a>2.1 Construct BioNano<sup>®</sup> Scaffold</h4>
> * BioNano<sup>®</sup> offers a new optical genetic mapping technology to construct genome hybrid scaffolds.
> * Homepage: [https://bionanogenomics.com/](https://bionanogenomics.com/)

refer to： ![IrysView_Software_Training_Guide_24.pdf-3920.9kB][1]

<br />
<h4 id="2.2"><a class="cl-offset" name="cl-9"></a>2.2 Chromosome-level Assmebly using Hi-C</h4>
> * The old Hi-C assmebly pipeline is based on LACHESIS (`/home/qmtang/Programs/LACHESIS/`), while this program is no longer to update and the lastest version has bug in compiling. So a new assembly pipeline is recorded here instead of LACHESIS, which developed by same lab.
> * Github: [https://github.com/theaidenlab/juicer](https://github.com/theaidenlab/juicer) and [https://github.com/theaidenlab/3d-dna](https://github.com/theaidenlab/3d-dna)

* Locate: `/home/qmtang/Millet/HiC-new/software/`
* Require: Python 2.7 with `scipy` package
* Usage:
 * [Juicer] build genome reference.(refer to [https://github.com/theaidenlab/juicer/wiki/Usage#adding-a-new-genome](https://github.com/theaidenlab/juicer/wiki/Usage#adding-a-new-genome)
 * [Juicer] execute the juicer pipeline.(refer to [https://github.com/theaidenlab/juicer/wiki/Running-Juicer-on-a-cluster](https://github.com/theaidenlab/juicer/wiki/Running-Juicer-on-a-cluster)):
 
    ```
    /home/qmtang/Millet/HiC-new/software/juicer/CPU/juicer.sh -t 32 \
    -D /home/qmtang/Millet/HiC-new -g Millet \
    -d /home/qmtang/Millet/HiC-new -s "HindIII" \
    -y /home/qmtang/Millet/HiC-new/restriction_sites/Millet_HindIII.txt \
    -z /home/qmtang/Millet/HiC-new/references/millet_scaffolds.fasta \
    -p /home/qmtang/Millet/HiC-new/restriction_sites/Millet.chrom.sizes
    ```

* [3d-dna] execute the 3d-dna pipeline(refer to [https://github.com/theaidenlab/3d-dna](https://github.com/theaidenlab/3d-dna)):
 
  ```
  /home/qmtang/Millet/HiC-new/software/3d-dna/run-pipeline.sh \
  -c 18 \
  /home/qmtang/Millet/HiC-new/references/millet_scaffolds.fasta \
  /home/qmtang/Millet/HiC-new/aligned/merged_nodups.txt
  ```

---
<h3 id="3"><a class="cl-offset" name="cl-10"></a><b>3. Genome Annotation</b></h3>
<h4 id="3.1"><a class="cl-offset" name="cl-11"></a>3.1 Eukaryotes Genome(<i>maker</i>)</h4>
> * To annotate eukaryotes genome, servals published pipeline used in our previous projects, including `TEdenovo`, `TEannot` and `maker`.
> * REPET pipeline(`TEdenovo` and `TEannot`) refer to:  [http://urgi.versailles.inra.fr/Tools/REPET](http://urgi.versailles.inra.fr/Tools/REPET)
> * maker pipeline refer to [http://www.yandell-lab.org/software/maker.html](http://www.yandell-lab.org/software/maker.html)

* Loacte: REPET:`/home/qmtang/Programs/REPET-2.5/`
* Usage: all files locate: `/home/qmtang/MilletCHP_Anno/`

* STEP 1. TEdenovo(`/home/qmtang/MilletCHP_Anno/01.TEdenovo/`). Execute the commands below one by one(logged in `/home/qmtang/MilletCHP_Anno/01.TEdenovo/0.Repet_deNovo.sh`):

```
cd ~/MilletCHP_Anno/01.TEdenovo
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 1
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 2 -s Blaster
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 2 --struct
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 3 -s Blaster -c Grouper
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 3 -s Blaster -c Recon
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 3 -s Blaster -c Piler
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 3 --struct
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 4 --struct -m Map
LaunchRepeatScout.py -i MilletCHP.fa
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map --struct
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map --struct
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map --struct
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f Blastclust --struct
TEdenovo.py -P MilletCHP -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f MCL --struct
```

* STEP 2. TEannot(Build a ref TE library with simplified pipeline, `/home/qmtang/MilletCHP_Anno/02.TEanno/`):Execute the commands below one by one(logged in `/home/qmtang/MilletCHP_Anno/02.TEanno/0.TEanno.sh` and `/home/qmtang/MilletCHP_Anno/02.TEanno/step7+.sh`):

```
cd /home/qmtang/MilletCHP_Anno/02.TEanno 
########build a ref TE library######
TEannot.py -P MilletCHP -C TEannot.cfg -S 1
TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a BLR
TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a RM
TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a BLR -r
TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a RM -r
TEannot.py -P MilletCHP -C TEannot.cfg -S 3 -c BLR+RM
TEannot.py -P MilletCHP -C TEannot.cfg -S 7

PostAnalyzeTELib.py -a 3 -p MilletCHP_chr_allTEs_nr_noSSR_join_path -s MilletCHP_refTEs_seq -g 907084234
GetSpecificTELibAccordingToAnnotation.py -i MilletCHP_chr_allTEs_nr_join_path_statsPerTE.txt -t MilletCHP_refTEs_seq
ln -s MilletCHP_chr_allTEs_nr_join_path_statsPerTE_FullLengthFrag.fa /home/qmtang/MilletCHP_Anno/04.re-TEanno/MilletCHP_refTEs.fa
```
    
* STEP 3: maker(Gene annotation, `/home/qmtang/MilletCHP_Anno/03-1024.Maker/` and `/home/qmtang/MilletCHP_Anno/03.Maker/`):
 * `/home/qmtang/MilletCHP_Anno/03-1024.Maker/pre-maker.py` split the genome into seperate fasta files for each scaffold/contig, then submit the maker jobs to cluster. Make sure `maker.sh`, `maker_bopts.ctl`, `maker_exe.ctl` and `maker_opts.ctl` exist in same folder.
 * To combine all jobs' output(`.gff`) together, please modify and execute the script `/home/qmtang/MilletCHP_Anno/03.Maker/combine_gff.py`. The input file `gffs_path` could be get though `ls $i/*/*/*/*/*/*gff > <gffs>`.
     
* STEP 4: TEanno(Full pipeline to get TE annotation, `/home/qmtang/MilletCHP_Anno/04.re-TEanno/`):

```
TEannot.py -P MilletCHP -C TEannot.cfg -S 1
TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a BLR
TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a RM
TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a BLR -r
TEannot.py -P MilletCHP -C TEannot.cfg -S 2 -a RM -r
TEannot.py -P MilletCHP -C TEannot.cfg -S 3 -c BLR+RM
TEannot.py -P MilletCHP -C TEannot.cfg -S 4 -s TRF
TEannot.py -P MilletCHP -C TEannot.cfg -S 4 -s Mreps
TEannot.py -P MilletCHP -C TEannot.cfg -S 4 -s RMSSR
TEannot.py -P MilletCHP -C TEannot.cfg -S 5
TEannot.py -P MilletCHP -C TEannot.cfg -S 6 -b tblastx
TEannot.py -P MilletCHP -C TEannot.cfg -S 6 -b blastx
TEannot.py -P MilletCHP -C TEannot.cfg -S 7
TEannot.py -P MilletCHP -C TEannot.cfg -S 8 -o GFF3
```

* STEP 5: t-RNA and r-RNA annotation
 * t-RNA refer to `/home/qmtang/Programs/tRNAscan-SE-1.3.1/`
 * r-RNA genes annotated by mapping *Arabidopsis thaliana*(TAIR10) rRNA gene sequence to genome.(`/home/qmtang/MilletCHP_Anno/07.rRNA/`)
<br />
<h4 id="3.2"><a class="cl-offset" name="cl-12"></a>3.2 Prokaryotes Genome(<i>prokka)</h4>
> * We use prokka to annotate prokaryotes genome.
> * Github: [https://github.com/tseemann/prokka](https://github.com/tseemann/prokka)
> * Citation: Seemann T. [Prokka: rapid prokaryotic genome annotation](http://www.ncbi.nlm.nih.gov/pubmed/24642063), Bioinformatics, 2014

* Locate: `/home/qmtang/Programs/prokka/`
* Requrie: refer to `/home/qmtang/Programs/prokka/README.md`
* Usage:

```
/home/qmtang/Programs/prokka/bin/prokka --cpus 0 \
    --outdir /home/qmtang/Bacillus/prokka-TG1_E1/prokka_modify --prefix TG1-E1 \
    --addgenes --rfam \
    --gram + \
    --genus Bacillus_TG1-E1 \
    --species Bacillus \
    --strain TG1-E1 \
    --proteins /home/qmtang/Bacillus/ref_ydsun/prot2003-2014.fa \
    --hmms /home/qmtang/Bacillus/Pfam31.0-A.hmm \
    /home/qmtang/Bacillus/prokka-TG1_E1/TG1-E1_correctedGenome.fasta
```

* Output: refer to [https://github.com/tseemann/prokka#output-files](https://github.com/tseemann/prokka#output-files)
<br />

[1]: http://static.zybuluo.com/Syaoran-Tang/hpu6kmys0o3vd6jyjdf92ady/IrysView_Software_Training_Guide_24.pdf

文章目录

糜子基因组组装手记（PacBio+NGS+HiC）

添加新评论

最近回复

分类

标签

功能