CG-Pipeline


CG-Pipeline is a tool for assembling sequence data (sff or fasta) and running feature prediction and annotation tools on the assembly. It is divided into three pipelines (assembly, prediction, and annotation), which may be used individually or run in sequence automatically.

Conceptual Overview

Setup

These instructions will go into effect for release 0.3.0. Meanwhile refer to doc/INSTALL for the current installation procedure.

Prerequisites

CG-Pipeline runs a series of existing applications. It is up to you to install each one and make sure it is in your PATH (so that you can run the application from any working directory).
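
As a quick manual check, separate from the dependency check that make test performs, a small shell function can report which commands are not yet in your PATH. This is only a sketch: the function name check_in_path is made up for illustration, and the command names in the example call are assumed placeholders, not a list taken from this page.

```shell
# Report which of the named commands cannot be found on PATH.
# Prints one "missing: NAME" line per absent command and
# returns non-zero if at least one command is missing.
check_in_path() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (placeholder names; substitute the executables
# your installation actually needs):
# check_in_path glimmer3 blastall
```

The function relies only on the standard command -v lookup, so it should behave the same way in any POSIX shell.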

The following packages are required for full functionality. Their installation may present some difficulties, but some parts of the pipeline can still function without them.

Install

The latest version is available here: CG-Pipeline source archive

Unpack it (replace x.x.x with the latest version)

$ tar zxvf cg_pipeline-x.x.x.tgz

Change to the directory cg_pipeline-x.x.x

$ cd cg_pipeline-x.x.x

Check for missing dependencies; install anything that is missing.

$ make test

Then install using one of the following two options.

  • Option 1: Install the package on your system AND download the (very large) databases from SwissProt, NCBI, etc., in one step.

$ make install  # install in the default directory /opt/cg_pipeline
-OR-
$ make install DESTDIR=/usr/local/bin/cg_pipeline  # use an alternative installation directory

  • Option 2: Install the package on your system first, then download the (very large) databases in a separate step.

$ make install-app DESTDIR=/usr/local/bin/cg_pipeline
$ cd /usr/local/bin/cg_pipeline  # the installation directory; /opt/cg_pipeline if DESTDIR was not changed
$ make init_databases

Finally, manually add the pipeline scripts folder to your PATH. Refer to your system's documentation for how to modify the PATH environment variable.
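
For Bourne-style shells, the PATH change might look like the sketch below. It assumes the scripts live in a subfolder named scripts under the installation directory; that folder name, the CG_HOME variable, and the helper name path_append are assumptions for illustration, so check your installation for the real location.

```shell
# Append a directory to PATH unless it is already listed there.
path_append() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already present, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
  export PATH
}

# Assumption: scripts are in "scripts" under the installation directory.
# Set CG_HOME first if you installed somewhere other than /opt/cg_pipeline.
path_append "${CG_HOME:-/opt/cg_pipeline}/scripts"
```

To make the change permanent, the same lines can be placed in your shell startup file (for example ~/.profile).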

Configuration

The configuration file is /opt/cg_pipeline/conf/cgpipelinerc and it might look like this:

classification = "Neisseria meningitidis Neisseria Neisseriaceae Neisseriales Betaproteobacteria Proteobacteria Bacteria"
prediction_blast_db = "/opt/cg_pipeline/data/uniprot_sprot"
prediction_transl_table = 11
prediction_use_genemark = 1
gms_datadir = "/usr/local/share/gms"
annotation_blast_db = "/opt/cg_pipeline/data/uniprot_sprot_trembl"
annotation_uniprot_db3 = "/opt/cg_pipeline/data/cgpipeline.db3"
annotation_uniprot_evidence_db3 = "/opt/cg_pipeline/data/cgpipeline.evidence.db3"
reporting_email = none
vfdb_blast_db = "/opt/cg_pipeline/data/vfdb_CP_VFs_aa"
min_vfdb_aa_coverage = 0.8
min_vfdb_aa_identity = 0.95

Edit the config file to specify the location of each database or tool to be used. You may edit the file manually or, since CG-Pipeline version 0.3.0, run make config in the installation directory (/opt/cg_pipeline by default):

$ cd /opt/cg_pipeline # or your custom installation directory
$ make config
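
After editing, it can be useful to confirm that a BLAST database setting points at files that really exist. The sketch below reads one value from a file in the key = value layout shown above and checks for the .phr, .pin, and .psq files of a protein BLAST database. The function names cg_get_setting and blast_db_present are hypothetical helpers, not part of CG-Pipeline, and the check assumes a single-volume database (large databases split into several volumes are not handled).

```shell
# Print the value of KEY from a cgpipelinerc-style file
# (lines of the form: key = value). Surrounding double quotes
# are stripped. Prints nothing if the key is absent.
cg_get_setting() {
  conf_file=$1
  key=$2
  sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$conf_file" |
    sed -e 's/[[:space:]]*$//' -e 's/^"\(.*\)"$/\1/' |
    head -n 1
}

# Succeed only if the three protein BLAST database files
# exist for the given path prefix (database name without extension).
blast_db_present() {
  for ext in phr pin psq; do
    [ -f "$1.$ext" ] || return 1
  done
}

# Example use (commented out so the sketch has no side effects):
# db=$(cg_get_setting /opt/cg_pipeline/conf/cgpipelinerc prediction_blast_db)
# blast_db_present "$db" || echo "BLAST database files not found for: $db"
```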

Settings

classification: Taxonomy of your organism.
prediction_blast_db: Location of the BLAST database used for homology search in the Prediction pipeline. This 'path' includes the BLAST database name, which is the name of the database files minus the file extension. For the example above, the files uniprot_sprot.phr, uniprot_sprot.pin, and uniprot_sprot.psq should be in the directory /opt/cg_pipeline/data/. See the BLAST documentation.
prediction_transl_table: BLAST setting for the codon translation table (11 for most prokaryotes).
prediction_use_genemark: Use GeneMark (1) or not (0). This affects how gene predictions are filtered by the pipeline. If GeneMark is used, the pipeline accepts a predicted coding region only if at least two of the three tools (Glimmer, GeneMark, and BLAST) predict it. If GeneMark is not used, the pipeline accepts genes predicted by either Glimmer or BLAST.
gms_datadir: The path to the GeneMark directory (usually the same as the location of the GeneMark binaries).
annotation_blast_db: The path to the BLAST database used by the Annotation pipeline. See the notes for prediction_blast_db concerning BLAST database names.
annotation_uniprot_db3: Location of the file cgpipeline.db3.
annotation_uniprot_evidence_db3: Location of the file cgpipeline.evidence.db3.
vfdb_blast_db: Virulence Factor Database BLAST database. See the notes for prediction_blast_db concerning BLAST database names.
min_vfdb_aa_coverage: Minimum coverage for the Virulence Factor Database BLAST search.
min_vfdb_aa_identity: Minimum identity for the Virulence Factor Database BLAST search.
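
The filtering rule described for prediction_use_genemark can be written out as a small decision function. This only illustrates the rule as stated on this page; it is not the pipeline's own code, and the function name accept_gene and its argument order are invented for the example.

```shell
# accept_gene USE_GENEMARK GLIMMER GENEMARK BLAST
# Each tool argument is 1 if that tool predicted the region, 0 otherwise.
# Exit status 0 means the region would be accepted under the stated rule.
accept_gene() {
  use_genemark=$1 glimmer=$2 genemark=$3 blast=$4
  if [ "$use_genemark" -eq 1 ]; then
    # GeneMark enabled: at least two of the three tools must agree.
    [ $((glimmer + genemark + blast)) -ge 2 ]
  else
    # GeneMark disabled: Glimmer or BLAST alone is enough;
    # the GeneMark argument is ignored.
    [ $((glimmer + blast)) -ge 1 ]
  fi
}
```

For example, accept_gene 1 1 0 0 fails (only Glimmer predicts the region while GeneMark is enabled), whereas accept_gene 0 1 0 0 succeeds.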

Using run_pipeline

In the most likely scenario, you will have installed CG-Pipeline on your computer, and you will run several pipeline projects, one project for each data set. Running the pipeline means invoking the application run_pipeline with one of the available commands (build, create, assemble, predict, annotate). Details are given in the usage message:

$ run_pipeline -h  # show help message

It is possible, for a given input data set, to run the entire pipeline with one command:

$ run_pipeline build -i input_file.sff

A project folder is created in the current directory, and the pipeline attempts to run assembly, prediction, and annotation in sequence.

However, you may want to run only one component of the pipeline, or to stop and verify the results at each stage, as described next.

Stepwise Procedure

Create a Project

Regardless of which pipeline tools you want to run or what kind of input data you have, running individual commands requires you to specify the name of the project, i.e. the name of a project directory, which is created thus:

$ run_pipeline create -p GenomeProject1

This creates a directory named GenomeProject1 and several subdirectories:

GenomeProject1/build ## working directory for sub-pipelines
GenomeProject1/build/assembly
GenomeProject1/build/prediction
GenomeProject1/build/annotation
GenomeProject1/annotation ## data files in pipe-delimited format from each tool in the annotation pipeline
GenomeProject1/log ## log messages from each sub-pipeline

For all commands from this point on, for this project, always include -p GenomeProject1 when invoking run_pipeline.

Assembly

De novo assembly:

$ run_pipeline assemble -p GenomeProject1 -i myinput.sff

Reference assembly (supply the reference sequence with -r):

$ run_pipeline assemble -p GenomeProject1 -i myinput.sff -r ref.fna

The final output will be the FASTA file GenomeProject1/assembly.fasta.

Prediction

If you have just assembled your own data as in the previous section, continue the pipeline with feature prediction.

$ run_pipeline predict -p GenomeProject1

-OR- If you are using pre-assembled input, specify the input file:

$ run_pipeline predict -p GenomeProject1 -i myinput.fasta

The final output will be the GenBank file GenomeProject1/prediction.gb.

Annotation

$ run_pipeline annotate -p GenomeProject1

The final output will be the GenBank file GenomeProject1/annotation.gb.
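
The stepwise procedure above can be collected into one shell function that stops as soon as a stage fails or does not leave its expected output file. This is a sketch under two assumptions: that run_pipeline returns a non-zero exit status on failure, and that the output files appear at the paths named in the sections above. The function name run_stepwise is made up for the example.

```shell
# run_stepwise PROJECT INPUT
# Runs create, assemble, predict, and annotate in order. Returns non-zero
# at the first stage that fails or leaves no (or an empty) output file.
run_stepwise() {
  project=$1
  input=$2

  run_pipeline create -p "$project" || return 1

  run_pipeline assemble -p "$project" -i "$input" || return 1
  [ -s "$project/assembly.fasta" ] || { echo "no assembly output" >&2; return 1; }

  run_pipeline predict -p "$project" || return 1
  [ -s "$project/prediction.gb" ] || { echo "no prediction output" >&2; return 1; }

  run_pipeline annotate -p "$project" || return 1
  [ -s "$project/annotation.gb" ] || { echo "no annotation output" >&2; return 1; }
}

# Example call (commented out so nothing runs on its own):
# run_stepwise GenomeProject1 myinput.sff
```

Checking each output file before moving on follows the advice to verify results at every stage; for a closer inspection, run the individual commands by hand instead.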

Communication
