NPG-explorer, Nucleotide PanGenome explorer

Instruction

Download NPG-explorer

Download prebuilt static executables for Windows and Linux.

Warning. The installer and the program do not work from a directory with non-ASCII characters in its name.

BLAST and other dependencies are included, except Qt 4 in the Linux version. Warning. To use the Linux version, you must install the Qt 4 library. On Debian or Ubuntu, Qt 4 can be installed with the command sudo apt-get install libqtgui4.

In the following instructions, replace x.y.z with the version of NPG-explorer you use.

For Windows 32 bit, download and run file npge_x.y.z_win32.exe as administrator.

For Windows 64 bit, download and run file npge_x.y.z_win64.exe as administrator.

For Linux 64 bit, download and unpack file npge_x.y.z_lin64.tar.gz (using command tar -xf npge_x.y.z_lin64.tar.gz).

The Windows version adds itself to PATH, so you can use the commands npge and qnpge in the command line. On Linux, you need to add the unpacked directory npge-x.y.z to PATH yourself. If you use bash, open ~/.bashrc in your favorite text editor and add the following line to the end: export PATH=$PATH:/path/to/npge-x.y.z. Do not forget to replace /path/to/npge-x.y.z with the actual path.
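The PATH line can also be appended from the command line. The following is a minimal sketch, not part of NPG-explorer: the helper name add_path_line is hypothetical, and it only appends the line if exactly the same line is not already in the file, so it can be run more than once.

```shell
# add_path_line DIR RC_FILE: append "export PATH=$PATH:DIR" to RC_FILE
# unless exactly this line is already present (hypothetical helper).
add_path_line() {
    line="export PATH=\$PATH:$1"
    touch "$2"
    grep -qxF -- "$line" "$2" || printf '%s\n' "$line" >> "$2"
}

# Usage (replace the path with the real location of the unpacked directory):
# add_path_line /path/to/npge-x.y.z ~/.bashrc
```

Open a new terminal (or run source ~/.bashrc) for the change to take effect.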

Create a directory for the results

All data files and output files have fixed names. This is why it is recommended to make a separate working directory for each task.

Prepare files with genomes

Input file: table of genomes

Create file genomes.tsv of the form:

all:embl:CP003176 BRUAO chr1 c Brucella abortus A13334 chr 1
all:embl:CP003177 BRUAO chr2 c Brucella abortus A13334 chr 2
fasta:refseqn:NC_016778.1 BRUCA chr1 c Brucella canis HSK A52141 chr 1
features:embl:CP003174 BRUCA chr1 c Brucella canis HSK A52141 chr 1
fasta:file:BRUCA.fasta BRUCA chr1 c Brucella canis HSK A52141 chr 1
features:file:BRUCA.fasta BRUCA chr1 c Brucella canis HSK A52141 chr 1
fasta:file:base.fasta[CP002459] BRUMM chr1 c Brucella melitensis
features:file:base.embl[CP002459] BRUMM chr1 c Brucella melitensis

Fields are:

The program downloads input data using dbfetch.

Annotations in the following formats are parsed by the program: GenBank, EMBL.

The string CP003175 BRUCA chr2 c corresponds to the EMBL entry CP003175, which is represented by the short genome name BRUCA and the chromosome name chr2, and is circular.

You can use contigs instead of chromosomes if a genome is not fully assembled. Set circularity to 'l' in this case.

Create an empty directory and, inside it, create the file genomes.tsv with the table of genomes to be used to build the pangenome. npge creates files and sub-folders in the current directory. You can change the location of output files using command line options. To see all options, add -h to a command. To set the path to the table file (instead of genomes.tsv), pass the option --table to the commands GetFasta, GetGenes and Rename.
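Before running npge, the table can be given a quick sanity check. The sketch below is not part of NPG-explorer, and the helper name check_genomes_table is hypothetical. It relies only on what the examples above show: each row has at least four whitespace-separated fields (source, genome name, chromosome name, circularity), and circularity is either c or l. Skipping blank lines is a choice of this sketch, not a statement about what npge accepts.

```shell
# check_genomes_table FILE: report rows that do not look like
# "source genome chromosome circularity [description...]".
# Returns non-zero if any row is rejected (hypothetical helper).
check_genomes_table() {
    awk '
        NF == 0 { next }    # blank lines are ignored by this check
        NF < 4 {
            printf "line %d: expected at least 4 fields, got %d\n", NR, NF
            bad = 1
            next
        }
        $4 != "c" && $4 != "l" {
            printf "line %d: circularity must be c or l, got %s\n", NR, $4
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage, from the working directory:
# check_genomes_table genomes.tsv
```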

Examples of genomes.tsv

Examples of genomes.tsv files can be found in directory examples of the source code.

Here we provide a table for 3 Brucella genomes:

all:embl:CP002459   BRUMM   chr1    c   Brucella melitensis M28 chromosome 1
all:embl:CP002460   BRUMM   chr2    c   Brucella melitensis M28 chromosome 2
all:embl:CP003176   BRUAO   chr1    c   Brucella abortus A13334 chromosome 1
all:embl:CP003177   BRUAO   chr2    c   Brucella abortus A13334 chromosome 2
all:embl:CP002078   BRUPB   chr1    c   Brucella pinnipedialis B2/94 chromosome 1
all:embl:CP002079   BRUPB   chr2    c   Brucella pinnipedialis B2/94 chromosome 2

... and for 17 Brucella genomes as well:

all:embl:CP003176   BRUAO   chr1    c   Brucella abortus A13334 chromosome 1
all:embl:CP003177   BRUAO   chr2    c   Brucella abortus A13334 chromosome 2
all:embl:CP003174   BRUCA   chr1    c   Brucella canis HSK A52141 chromosome 1
all:embl:CP003175   BRUCA   chr2    c   Brucella canis HSK A52141 chromosome 2
all:embl:CP002459   BRUMM   chr1    c   Brucella melitensis M28 chromosome 1
all:embl:CP002460   BRUMM   chr2    c   Brucella melitensis M28 chromosome 2
all:embl:CP001851   BRUM5   chr1    c   Brucella melitensis M5-90 chromosome I
all:embl:CP001852   BRUM5   chr2    c   Brucella melitensis M5-90 chromosome II
all:embl:CP002931   BRUML   chr1    c   Brucella melitensis NI chromosome I
all:embl:CP002932   BRUML   chr2    c   Brucella melitensis NI chromosome II
all:embl:CP002078   BRUPB   chr1    c   Brucella pinnipedialis B2/94 chromosome 1
all:embl:CP002079   BRUPB   chr2    c   Brucella pinnipedialis B2/94 chromosome 2
all:embl:CP003128   BRUSS   chr1    c   Brucella suis VBI22 chromosome I
all:embl:CP003129   BRUSS   chr2    c   Brucella suis VBI22 chromosome II
all:embl:AE017223   BRUAB   chr1    c   Brucella abortus biovar 1 str. 9-941 chromosome I
all:embl:AE017224   BRUAB   chr2    c   Brucella abortus biovar 1 str. 9-941 chromosome II
all:embl:CP000887   BRUA1   chr1    c   Brucella abortus S19 chromosome 1
all:embl:CP000888   BRUA1   chr2    c   Brucella abortus S19 chromosome 2
all:embl:AM040264   BRUA2   chr1    c   Brucella melitensis biovar Abortus 2308 chromosome I
all:embl:AM040265   BRUA2   chr2    c   Brucella melitensis biovar Abortus 2308 chromosome II
all:embl:CP000872   BRUC2   chr1    c   Brucella canis ATCC 23365 chromosome I
all:embl:CP000873   BRUC2   chr2    c   Brucella canis ATCC 23365 chromosome II
all:embl:CP001488   BRUMB   chr1    c   Brucella melitensis ATCC 23457 chromosome I
all:embl:CP001489   BRUMB   chr2    c   Brucella melitensis ATCC 23457 chromosome II
all:embl:AE008917   BRUME   chr1    c   Brucella melitensis bv. 1 str. 16M chromosome I
all:embl:AE008918   BRUME   chr2    c   Brucella melitensis 16M chromosome II
all:embl:CP001578   BRUMC   chr1    c   Brucella microti CCM 4915 chromosome 1
all:embl:CP001579   BRUMC   chr2    c   Brucella microti CCM 4915 chromosome 2
all:embl:CP000708   BRUO2   chr1    c   Brucella ovis ATCC 25840 chromosome I
all:embl:CP000709   BRUO2   chr2    c   Brucella ovis ATCC 25840 chromosome II
all:embl:AE014291   BRUSU   chr1    c   Brucella suis 1330 chromosome I
all:embl:AE014292   BRUSU   chr2    c   Brucella suis 1330 chromosome II
all:embl:CP000911   BRUSI   chr1    c   Brucella suis ATCC 23445 chromosome I
all:embl:CP000912   BRUSI   chr2    c   Brucella suis ATCC 23445 chromosome II

The latter table (17 genomes) is used below.
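To confirm that a table describes the expected number of genomes, the distinct genome names in the second column can be counted. This is a small sketch with a hypothetical helper name (count_genomes). It assumes the second whitespace-separated field is the short genome name, as in the examples above.

```shell
# count_genomes FILE: print the number of distinct genome names
# found in column 2 of the table (hypothetical helper).
count_genomes() {
    awk 'NF >= 2 { print $2 }' "$1" | sort -u | wc -l | tr -d ' '
}

# Usage, from the working directory:
# count_genomes genomes.tsv
```

For the larger table above, which lists two chromosomes for each genome, this should print 17.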

Open the command line

Further steps are performed in the command line.

Linux users are expected to be familiar with the command line.

How to launch the command line in Windows:

Prepare sequences and genes

Run the following command:

$ npge Prepare

The following files are created by this command:

The files genomes-raw.fasta and features.embl contain unprocessed input data. They are not used by the following steps, so you can safely remove them.

Examine prepared sequences

Run the following command:

$ npge Examine

The following files are created by this command in directory examine:

This step gathers some information about the input genomes. This information can be used in the next step (configuration).

Set values of global options

To change values of global options, generate the file npge.conf with the command npge -g npge.conf, then edit this file. The generated npge.conf contains the default values compiled into the program. Sometimes they have to be changed to improve results.

Configuration file looks like this:

MIN_IDENTITY = Decimal('0.9')
MIN_LENGTH = 100

Decimal values are specified using the syntax above. Accuracy of decimal values is 4 digits after the point.
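The file can be edited in any text editor. For scripted runs, a value can also be rewritten from the command line. The sketch below is not part of NPG-explorer: the helper name set_option is hypothetical, and the values 200 and 0.85 in the usage lines are only illustrations, not recommendations. It assumes each option occupies one line of the form NAME = VALUE, as shown above.

```shell
# set_option FILE NAME VALUE: rewrite the line "NAME = ..." in FILE
# so that it reads "NAME = VALUE" (hypothetical helper).
set_option() {
    tmp_conf=$(mktemp) || return 1
    sed "s|^$2[[:space:]]*=.*|$2 = $3|" "$1" > "$tmp_conf" &&
        cat "$tmp_conf" > "$1"
    status=$?
    rm -f "$tmp_conf"
    return $status
}

# Usage (illustrative values only):
# set_option npge.conf MIN_LENGTH 200
# set_option npge.conf MIN_IDENTITY "Decimal('0.85')"
```

The value is substituted as-is, so it must not contain the characters |, & or a backslash.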

The program applies the following configuration files (if they exist):

Build nucleotide pangenome

$ npge MakePangenome

This command creates file pangenome/pangenome.bs. The file is in BlockSet format.

Check nucleotide pangenome (optional)

$ npge CheckPangenome

This command makes sure that nucleotide pangenome in file pangenome/pangenome.bs satisfies pangenome criteria.

The command prints whether the pangenome is OK and may print some comments about the pangenome.

This step is also performed by post-processing, and its result is saved to the file check/isgood.

Run post-processing of nucleotide pangenome

$ npge PostProcessing

This command produces many files, some of them are located in sub-folders.

Files *.bs contain blocksets, *.bi contain tables of blocks' properties, *.ba contain blockset alignments (cells are fragments), and *.blocks contain blockset alignments (cells are blocks).

Columns of files *.bi:

Files *.bi also contain the number of occurrences of a block in each genome. Each genome adds one column to the table. To add similar columns with the number of occurrences of a block in each sequence, add the option --info-count-seqs=1 to the command npge PostProcessing. The file pangenome/pangenome-small.bi is a short version of pangenome/pangenome.bi that lacks the columns with occurrences in genomes.

Files produced by npge MakePangenome and npge PostProcessing are as follows:

How to view .tre files using FigTree: open a file with FigTree, set the branch label to "Diagnostic positions" in the pop-up window, go to the "Branch Labels" section of the left menu and enable the section's checkbox. Abstract distances between nodes are shown under branches. To show the number of diagnostic positions between corresponding clades, select "Diagnostic positions" in the drop-down list.
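The command-line steps described so far (Prepare, Examine, MakePangenome, PostProcessing) can be chained in a small script. This is a sketch only: the function name run_pipeline and the NPGE variable are hypothetical, and it assumes that npge exits with a non-zero status when a step fails, which this document does not state. CheckPangenome is left out because post-processing performs the same check and saves the result to check/isgood.

```shell
# Command to run; set NPGE to point at another executable if needed.
NPGE="${NPGE:-npge}"

# run_pipeline: run the steps in order and stop at the first step
# that exits with a non-zero status (hypothetical helper).
run_pipeline() {
    for step in Prepare Examine MakePangenome PostProcessing; do
        echo "== $step" >&2
        "$NPGE" "$step" || { echo "step $step failed" >&2; return 1; }
    done
}

# Usage, from the working directory that contains genomes.tsv:
# run_pipeline
```

If global options need to be changed, run the steps one by one instead, so that npge.conf can be edited after Examine.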

View results in graphical user interface

$ qnpge
Graphical User Interface of NPG-explorer

This command uses pangenome/pangenome.bs and some of files created by PostProcessing.

The program window is split into 3 parts:

Columns of blocks table:

You can filter blocks by block name, gene name or sequence using the input fields located above the blocks table. By default, patterns are matched as wildcards. ^ before the pattern and $ after the pattern correspond to the start and end of the name (as in regular expressions). To hide blocks consisting of one fragment, click the checkbox "only blocks of >= 2 fragments". The blocks table can be sorted by any column.

The blockset alignment table shows the alignment of fragments on genomes. A chromosome can be selected using the drop-down list located above the blockset table. Each sequence is represented as a row of the blockset table. The name of a sequence and its orientation relative to the alignment are written in the first column. Fragments of a sequence are represented by cells of the blockset table. Fragments of one block are coloured similarly. The orientation of a fragment relative to the alignment is indicated by '<' and '>'.

When you navigate in the blocks table and the blockset alignment, the alignment of the corresponding block is shown in the bottom part of the program window. The fragment name is shown to the left of the alignment itself. Background colors in the alignment correspond to nucleotide types. The name of the selected gene is shown in the read-only input field located above the alignment. You can disable the display of genes completely by unchecking the checkbox "show genes". Genes are shown with a white foreground color. Genes on the reverse strand (relative to the fragment orientation) are underlined. Overlapping genes are coloured purple. Start codons are coloured black, stop codons are coloured gray. The consensus of the block is shown above the alignment. Identical columns without gaps are coloured black, identical columns with gaps are coloured gray, non-identical columns are white. Column numbers are shown above the consensus.

Column numbers of low similarity regions are coloured red. Low similarity regions represent parts of blocks with unreliable alignment. There are 3 possible reasons for the occurrence of low similarity regions:

* these sequences are not related,
* recombination,
* deletion and insertion in the same position.

You can use the arrow keys to navigate through the alignment. The corresponding fragment is selected in the blockset alignment. Use the keys "Home" and "End" to go to the first and last columns of the alignment respectively. If you move past the edge of the alignment, the program switches to the corresponding neighbouring block. Press Ctrl + Arrow Right or Ctrl + Arrow Left to go to the next gene boundary. Press Shift + Arrow Right or Shift + Arrow Left to go to the next low similarity region.

To change order of sequences in blockset alignment and block alignment, select some rows (you can use Ctrl to select multiple rows) and press Ctrl + Arrow Up or Ctrl + Arrow Down.

Requirements of a good pangenome

Blocks types:

blocks types

Major blocks = non-minor blocks.

Build and Install

The main executables are the command line tool src/tool/npge (or src/tool/npge.exe) and the GUI tool src/gui/qnpge (or src/gui/qnpge.exe).

To change compiled-in default settings, run ccmake . in build directory.

To generate config file, run npge -g npge.conf and change generated file npge.conf.

Requirements

Warning. Make sure you use the same Lua which luabind was linked against. Otherwise it compiles but doesn't work: PANIC: unprotected error in call to Lua API (attempt to index a nil value)

Optional:

Linux

To build static Linux package in fresh Debian Wheezy, install curl and sudo and run curl -L https://git.io/vmB9P | sh

Install build requirements (on Debian):

% ./linux/requirements.sh

Build the program as static executables (Qt 4 is not static!):

$ ./linux/build.sh

The program is built in the directory npge-build-linux.

Create distribution .tar.gz file: go into npge-build-linux and run:

$ ./linux/package.sh

How to build manually:

$ ./src/init_lua-npge.sh
$ mkdir build
$ cd build
$ cmake ..
$ make

Pass the argument -DNPGE_STATIC_LINUX:BOOL=1 to cmake to get static executables (on Debian, Qt 4 is not static!).

Build README.html:

$ pandoc -s -o README.html README.md

Run tests:

$ make test

Windows

To build static Windows packages in fresh Debian Wheezy, install curl and sudo and run curl -L https://git.io/vmB9v | sh

Windows executables are cross-compiled from Linux using MinGW cross-compiler.

For 64-bit Windows you need to export MXE_TARGET variable:

$ export MXE_TARGET=x86_64-w64-mingw32.static

Install build requirements (on Debian):

% ./windows/requirements.sh

Build the program as static executables:

$ ./windows/build.sh

The program is built in the directories npge-build-windows32 and npge-build-windows64 (the first contains the 32-bit version and the second contains the 64-bit version).

To create a ZIP file and an installation wizard for Windows, go into the build directory (npge-build-windows32 or npge-build-windows64) and run:

$ ./windows/package.sh

Changelog

This work was presented at ECCB'14 conference: abstract (ru) and poster.

Corresponding author: Boris Nagaev, email: bnagaev@gmail.com

Copyright (C) 2012-2016 Boris Nagaev