CG-Pipeline
CG-Pipeline is a tool for assembling sequence data (SFF or FASTA) and running feature prediction and annotation tools on the assembly. It is divided into three pipelines (assembly, prediction, and annotation), which may be run individually or automatically in sequence.
Conceptual Overview
Setup
These instructions take effect with release 0.3.0. Until then, refer to doc/INSTALL for the current installation procedure.
Prerequisites
CG-Pipeline runs a series of existing applications. You must install each one yourself and make sure it is in your PATH (so that you can run the application from any working directory).
- BioPerl version 1.6 or later (bioperl.org)
- NCBI BLAST (ftp site)
- AMOS (AMOS Project)
- Glimmer3 (cbcb.umd.edu)
- tRNAscan-SE (lowelab.ucsc.edu)
The following packages are needed for the full pipeline. However, they can be difficult to install, and some parts of the pipeline will still function without them.
- InterProScan (European Bioinformatics Institute)
- TMHMM (cbs.dtu.dk)
- SignalP (cbs.dtu.dk)
- Newbler (Roche 454 offInstrumentApps): required only for the 454 pyrosequencing mode of the assembly stage. Contact Roche to obtain this software.
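One way to confirm that the prerequisites are reachable from your PATH is a small shell check like the sketch below. The function name missing_tools is hypothetical, and the executable names in the usage comment are assumptions; substitute the names that your installed versions actually provide.

```shell
# Print the name of every given executable that cannot be found in PATH.
# Prints nothing when all of them are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call (executable names are assumptions, adjust to your installation):
#   missing_tools blastall glimmer3 tRNAscan-SE
```

An empty result means every listed executable was found. The check says nothing about versions, so the BioPerl version requirement still has to be verified separately.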
Install
The latest version is available here: CG-Pipeline source archive
Unpack it (replace x.x.x with the latest version)
$ tar zxvf cg_pipeline-x.x.x.tgz
Change to the directory cg_pipeline-x.x.x
$ cd cg_pipeline-x.x.x
Check for missing dependencies; install anything that is missing.
$ make test
- Option 1: Install the package on your system and download the (very large) databases from SwissProt, NCBI, etc. in one step.
$ make install   # install in the default directory /opt/cg_pipeline
-OR-
$ make install DESTDIR=/usr/local/bin/cg_pipeline   # use an alternative installation directory
- Option 2: Install the package on your system first, then download the (very large) databases as a separate step.
$ make install-app DESTDIR=/usr/local/bin/cg_pipeline
$ cd /usr/local/bin/cg_pipeline   # if DESTDIR was changed, as in this example
$ make init_databases
Finally, manually add the pipeline scripts folder to your PATH. Refer to your system's documentation for modifying the PATH environment variable.
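For Bourne-style shells, the PATH change might look like the following sketch. The helper add_to_path is hypothetical, and the directory in the usage comment is a placeholder, since the exact location of the scripts folder depends on your installation directory.

```shell
# Append a directory to PATH unless it is already listed there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$PATH:$1" ;;     # otherwise append it
    esac
    export PATH
}

# Example call (placeholder path, replace with your pipeline scripts folder):
#   add_to_path /path/to/cg_pipeline/scripts
```

To make the change permanent, the call would typically go into your shell startup file (for example ~/.profile), which is the part that varies between systems.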
Configuration
The configuration file is /opt/cg_pipeline/conf/cgpipelinerc and it might look like this:
classification = "Neisseria meningitidis Neisseria Neisseriaceae Neisseriales Betaproteobacteria Proteobacteria Bacteria"
prediction_blast_db = "/opt/cg_pipeline/data/uniprot_sprot"
prediction_transl_table = 11
prediction_use_genemark = 1
gms_datadir = "/usr/local/share/gms"
annotation_blast_db = "/opt/cg_pipeline/data/uniprot_sprot_trembl"
annotation_uniprot_db3 = "/opt/cg_pipeline/data/cgpipeline.db3"
annotation_uniprot_evidence_db3 = "/opt/cg_pipeline/data/cgpipeline.evidence.db3"
reporting_email = none
vfdb_blast_db = "/opt/cg_pipeline/data/vfdb_CP_VFs_aa"
min_vfdb_aa_coverage = 0.8
min_vfdb_aa_identity = 0.95
Edit the config file to specify the location of each database or tool to be used. You may edit the file manually or, since CG-Pipeline version 0.3.0, run make config in the installation directory (/opt/cg_pipeline/):
$ cd /opt/cg_pipeline   # or your custom installation directory
$ make config
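After editing, it can be handy to read a single setting back from the shell to confirm what the pipeline will see. The sketch below assumes the file holds one key = value setting per line with optional double quotes around the value; the function name cgp_setting is hypothetical and not part of CG-Pipeline.

```shell
# Print the value of one setting from a cgpipelinerc-style file.
# Usage: cgp_setting KEY [FILE]
# Assumes one "key = value" pair per line; surrounding double quotes are removed.
cgp_setting() {
    key=$1
    file=${2:-/opt/cg_pipeline/conf/cgpipelinerc}
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" |
        sed -e 's/[[:space:]]*$//' -e 's/^"\(.*\)"$/\1/' |
        head -n 1
}

# Example call:
#   cgp_setting prediction_blast_db
```

Because the key must be followed directly by the equals sign, a lookup of annotation_uniprot_db3 does not accidentally match annotation_uniprot_evidence_db3.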
Settings
| classification | Taxonomy of your organism |
| prediction_blast_db | Location of the BLAST database used for homology search in the Prediction pipeline. The path includes the BLAST database name, i.e. the name of the database files without the file extension. For the example above, the files uniprot_sprot.phr, uniprot_sprot.pin, and uniprot_sprot.psq should be in the directory /opt/cg_pipeline/data/. See the BLAST documentation. |
| prediction_transl_table | BLAST setting for codon translation table (11 for most prokaryotes) |
| prediction_use_genemark | Use GeneMark (1) or not (0). This affects how the pipeline filters gene predictions. If GeneMark is used, the pipeline accepts a predicted coding region only when at least two of the three tools (Glimmer, GeneMark, and BLAST) predict it. If GeneMark is not used, the pipeline accepts genes predicted by either Glimmer or BLAST. |
| gms_datadir | The path to the GeneMark directory (usually the same as the location of GeneMark binaries) |
| annotation_blast_db | The path to the BLAST data used by the Annotation pipeline. See notes for prediction_blast_db concerning BLAST database names. |
| annotation_uniprot_db3 | Location of the file cgpipeline.db3 |
| annotation_uniprot_evidence_db3 | Location of the file cgpipeline.evidence.db3 |
| vfdb_blast_db | Virulence Factor Database BLAST database. See notes for prediction_blast_db concerning BLAST database names. |
| min_vfdb_aa_coverage | Virulence Factor Database BLAST search minimum coverage setting |
| min_vfdb_aa_identity | Virulence Factor Database BLAST search minimum identity setting |
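Since the BLAST database settings name a file prefix and not a file, a typo is easy to miss. The following sketch checks for the three protein database files named in the prediction_blast_db description. The function name blast_db_missing is hypothetical, and the check assumes a single-volume protein database; it does not cover databases that BLAST has split into several volumes.

```shell
# Print each expected BLAST protein database file that is missing
# for the given database path prefix (the path without file extension).
blast_db_missing() {
    db=$1
    for ext in phr pin psq; do
        [ -f "$db.$ext" ] || echo "$db.$ext"
    done
}

# Example call, using the value from the sample configuration:
#   blast_db_missing /opt/cg_pipeline/data/uniprot_sprot
```

No output means all three files are present for that prefix.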
Using run_pipeline
In the most likely scenario, you will have installed CG-Pipeline on your computer, and you will run several pipeline projects, one project for each data set. Running the pipeline means invoking the application run_pipeline with one of the available commands (build, create, assemble, predict, annotate). Details are given in the usage message:
$ run_pipeline -h # show help message
It is possible, for a given input data set, to run the entire pipeline with one command:
$ run_pipeline build -i input_file.sff
# (A project folder is created in the current directory, and the pipeline attempts to run assembly, prediction, and annotation in sequence.)
However, you may want to run only one component of the pipeline, or to stop and verify the results at each stage, as described next.
Stepwise Procedure
Create a Project
Regardless of which pipeline tools you want to run or what kind of input data you have, running individual commands requires you to specify the name of the project, i.e. the name of a project directory, which is created thus:
$ run_pipeline create -p GenomeProject1
This creates a directory named GenomeProject1 and several subdirectories:
GenomeProject1/build              ## working directory for sub-pipelines
GenomeProject1/build/assembly
GenomeProject1/build/prediction
GenomeProject1/build/annotation
GenomeProject1/annotation         ## data files in pipe-delimited format from each tool in the annotation pipeline
GenomeProject1/log                ## log messages from each sub-pipeline
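To confirm that a project directory has the layout listed above, a check along these lines could be used. It is only a sketch: the function name check_project_layout is hypothetical, and the subdirectory list is copied from the listing above, so it may need adjusting for other releases.

```shell
# Verify that a project directory contains the expected subdirectories.
# Reports each missing directory on stderr and returns non-zero if any is absent.
check_project_layout() {
    project=$1
    status=0
    for sub in build build/assembly build/prediction build/annotation annotation log; do
        if [ ! -d "$project/$sub" ]; then
            echo "missing: $project/$sub" >&2
            status=1
        fi
    done
    return $status
}

# Example call:
#   check_project_layout GenomeProject1
```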
For all commands from this point on, for this project, always include -p GenomeProject1 when invoking run_pipeline.
Assembly
De novo assembly:
$ run_pipeline assemble -p GenomeProject1 -i myinput.sff
Reference assembly:
$ run_pipeline assemble -p GenomeProject1 -i myinput.sff -r ref.fna
The final output will be the FASTA file GenomeProject1/assembly.fasta.
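Before moving on to prediction, a quick sanity check of the assembly output can catch an empty or missing file. The sketch below counts FASTA header lines, which equals the number of sequences in the file; the function name count_contigs is hypothetical.

```shell
# Print the number of sequences (header lines starting with ">") in a FASTA file.
# Returns non-zero if the file is missing or empty.
count_contigs() {
    fasta=$1
    if [ ! -s "$fasta" ]; then
        echo "no assembly found at $fasta" >&2
        return 1
    fi
    grep -c '^>' "$fasta"
}

# Example call:
#   count_contigs GenomeProject1/assembly.fasta
```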
Prediction
If you have just assembled your own data as in the previous section, continue the pipeline with feature prediction.
$ run_pipeline predict -p GenomeProject1
-OR- If you are using pre-assembled input, specify the input file:
$ run_pipeline predict -p GenomeProject1 -i myinput.fasta
The final output will be the GenBank file GenomeProject1/prediction.gb.
Annotation
$ run_pipeline annotate -p GenomeProject1
The final output will be the GenBank file GenomeProject1/annotation.gb.
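The stepwise procedure can also be wrapped in a small function that stops at the first failing stage, which leaves room to inspect intermediate results before rerunning. This is a sketch that uses only the commands shown above. The function name run_stepwise and the RUN_PIPELINE override variable are hypothetical, and the sketch assumes run_pipeline returns a non-zero exit status on failure.

```shell
# Run create, assemble, predict, and annotate in order for one project,
# stopping as soon as any stage fails.
# Usage: run_stepwise PROJECT INPUT_FILE
run_stepwise() {
    project=$1
    input=$2
    # RUN_PIPELINE is a hypothetical override, useful for dry runs
    runner=${RUN_PIPELINE:-run_pipeline}
    "$runner" create   -p "$project"             || return 1
    "$runner" assemble -p "$project" -i "$input" || return 1
    "$runner" predict  -p "$project"             || return 1
    "$runner" annotate -p "$project"             || return 1
}

# Example call:
#   run_stepwise GenomeProject1 myinput.sff
```

Setting RUN_PIPELINE=echo prints the four command lines without executing them, which is a cheap way to review what would run.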
Communication
- Subscribe to the user mailing list
