VDB-2 Configuration Files

Type:
file format

Revision History:
2012-Jan-23 sra-tools
2012-Jan-30 sra-tools
2012-Feb-13 sra-tools

Background

Most database software runs as a server, with separate client-side libraries that provide APIs to the application. The app sends requests through the API to the server, and the server responds to the app through the API.

Both the application and the server maintain separate configurations. It is as unreasonable to expect that an application would be able to configure the database server, as it would be for the server to configure the application. Each is maintained by separate authorities who need their own control over their own resources.

VDB is provided as a series of user-space libraries that implement both the client APIs and the server code within the application's process. Although the server is not running in a separate process, it still requires its own configuration.

VDB is configured through the use of its own configuration files. This document will address the structure of these files, their required and optional content, and the processes used to locate them.

Structure

Simple file example

# path to VDB schema files
vdb/schema/paths = "/home/me/schema"

# path to locally downloaded references
refseq/paths = "/data/refseq"

Configuration files contain traditional name=value pairs, where the name in this case is formed as a hierarchical path. Generally, library sub-sections organize their configuration information under headings, much like a directory.
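
As a rough illustration of the directory analogy, the following Python sketch folds flat path=value pairs into a nested dictionary. The function name build_tree is hypothetical and is not part of VDB; the sketch assumes that no path is used both as a value and as a parent of other paths.

```python
def build_tree(pairs):
    """Fold {"a/b/c": value} style pairs into nested dictionaries."""
    root = {}
    for path, value in pairs.items():
        parts = path.strip("/").split("/")
        node = root
        # Walk (or create) one dictionary level per intermediate path node.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # The last path node holds the value itself.
        node[parts[-1]] = value
    return root


tree = build_tree({
    "vdb/schema/paths": "/home/me/schema",
    "refseq/paths": "/data/refseq",
})
```

With the two entries from the simple file example, tree["vdb"]["schema"]["paths"] holds the schema path and tree["refseq"]["paths"] holds the reference path.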

Comments

Config files can contain two types of comments.

## shell-style comments
#  all text from the hash to the end of line is ignored

The most commonly used comment is the hash-style line comment. All characters including the hash through the end of line are ignored.

/* C-style comments
   allow for multiple lines
   terminate with star-slash ( NOT end of line )
 */

C-style comments are also supported and are useful for both multi-line and partial line cases. All characters starting from slash-star up to and including a balancing star-slash are ignored.

BCPL-style line comments ( // ), popularly known as C++-style comments, are NOT supported: they could conflict with network path patterns, and they duplicate functionality already provided by hash comments.
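
The comment rules above can be sketched in Python as follows. This is an illustration only, not the VDB scanner: the name strip_comments is hypothetical, quoted strings are passed through untouched so that a hash or slash-star inside a value is preserved, and an unterminated C-style comment is simply left in place.

```python
import re

# Quoted strings are matched first so that comment markers inside a value
# are never mistaken for comments.
_COMMENT_OR_STRING = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')
  | (?P<block>/\*.*?\*/)
  | (?P<line>\#[^\n]*)
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_comments(text):
    """Remove hash and C-style comments, keeping quoted strings intact."""
    def replace(match):
        if match.group("string") is not None:
            return match.group("string")
        if match.group("block") is not None:
            # A comment that spans lines still ends the current line.
            return "\n" if "\n" in match.group("block") else " "
        return ""  # hash comment: drop everything up to the end of line
    return _COMMENT_OR_STRING.sub(replace, text)
```

Because two slashes are not a comment marker, a value such as "//host/share" passes through unchanged.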

Defining path values

VDB-2 config files consist of path=value pairs. These are like name=value pairs, except that the name is hierarchical.

simple_name = "string value"

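The left-hand portion of the definition gives a simple name in this example. A name normally starts with a letter and can be followed by any repeating combination of letters, numerals, underscore ( _ ) and dash ( - ). The v1.0 scanner described under Grammar is slightly more permissive: it also accepts a numeral or underscore as the first character, and a period ( . ) in later positions.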

In version 1.0, the assignment operator is a single equal sign ( = ). Surrounding white space is ignored.

In version 1.0, the value is given as a string, which must be quoted using either double quotes ( " " ) or single quotes ( ' ' ).

a/hierarchical/name = 'another string value'

In this example, the left-hand side of the assignment is a hierarchical name, also known as a path. It behaves the same as a simple name, except that it allows for creation of namespaces. This is the approach used exclusively by VDB.
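
A minimal Python sketch of reading such definitions is shown here. It is not the VDB parser: parse_pairs is a hypothetical name, only whole-line hash comments are skipped, and values containing backslash escapes are rejected rather than interpreted.

```python
import re

_NODE = r"[A-Za-z_0-9][-.A-Za-z_0-9]*"
_PAIR = re.compile(
    rf"""(?P<path>/?{_NODE}(?:/{_NODE})*)\s*=\s*"""
    rf"""(?P<quote>["'])(?P<value>(?:(?!(?P=quote))[^\\\n])*)(?P=quote)"""
)


def parse_pairs(text):
    """Return a dict mapping each path to its (unescaped, raw) string value."""
    config = {}
    for number, line in enumerate(text.splitlines(), 1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or whole-line hash comment
        match = _PAIR.fullmatch(stripped)
        if match is None:
            raise ValueError(f"line {number}: expected path = 'value'")
        config[match.group("path")] = match.group("value")
    return config


config = parse_pairs(
    'simple_name = "string value"\n'
    "a/hierarchical/name = 'another string value'\n"
)
```

A later definition of the same path overwrites an earlier one in this sketch; whether VDB does the same is not stated here.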

File Content

While there is no prescribed content requirement for well-formed vdb-2 configuration files, VDB-2 itself does make use of certain paths.

Within NCBI there are requirements that impact the ability to locate objects within our archives. Outside of NCBI there are requirements to process objects compressed by reference.

Absence of configuration for these nodes will not cause an error in and of itself, but it may prevent VDB from operating as expected, e.g. sam-dump may fail due to the inability to locate a reference sequence.

VDB library

vdb/module/paths - OPTIONAL
a hard-path to architecture-specific read-only modules
normally omitted (see below)

vdb/wmodule/paths - OPTIONAL
a hard-path to architecture-specific read/write modules used for loaders
normally omitted (see below)

vdb/schema/paths
a hard-path to root of schema includes
needed by loaders and vdb-copy

VDB modules can be located either by a hard-coded path in configuration (as shown above) or by proximity to the vdb library itself. The latter approach is normally used, since it allows the same configuration to be used with any architecture or build settings, e.g. 32-bit vs. 64-bit, release vs. debug.

Examples:

# path to VDB read-only external modules
vdb/module/paths = "/home/vdb/mod"

# path to VDB read/write external modules
vdb/wmodule/paths = "/home/vdb/wmod"

# path to VDB schema files
vdb/schema/paths = "/home/vdb/schema"

SRA repository

needed to access repository directly
not currently useful outside of INSDC

sra/servers
colon-separated list of paths to file servers

sra/ncbi/volumes
colon-separated list of volumes upon file servers
these volumes use NCBI-style sub-directories

sra/ebi/volumes
colon-separated list of volumes upon file servers
these volumes use EBI-style sub-directories

sra/ddbj/volumes
colon-separated list of volumes upon file servers
these volumes use DDBJ-style sub-directories, identical to NCBI-style

The SRA is implemented on top of VDB, which needs this information to resolve accessions to full paths within the repository.

NCBI and DDBJ use the same approach for creating sub-directories upon an SRA volume: the accession is split into ASCII prefix and number, and the number is split into banks of 1024 accessions. The volume-relative path for accession SRR012345 would be:

  prefix = "SRR"
  bank   = 012345 / 1024 = 000012 # ( integer division )
  path   = "SRR/000012/SRR012345"

EBI uses a slightly different approach designed for human navigation. They divide the accession number by 1000, and keep the prefix in the middle. The volume-relative path for accession ERR021966 would be:

  prefix = "ERR"
  bank   = "ERR" + 021966 / 1000 = "ERR021"
  path   = "ERR/ERR021/ERR021966"
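
Both layouts can be sketched in Python. The function names are hypothetical, and the zero-padding widths ( six digits for NCBI/DDBJ banks, three digits for EBI banks ) are assumptions inferred from the two worked examples above.

```python
import re


def split_accession(accession):
    """Split an accession such as SRR012345 into ("SRR", "012345")."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", accession)
    if match is None:
        raise ValueError(f"not an accession: {accession!r}")
    return match.group(1), match.group(2)


def ncbi_relative_path(accession):
    """NCBI/DDBJ style: banks of 1024 accessions."""
    prefix, digits = split_accession(accession)
    bank = int(digits) // 1024  # integer division
    return f"{prefix}/{bank:06d}/{accession}"


def ebi_relative_path(accession):
    """EBI style: banks of 1000 accessions, prefix repeated in the bank name."""
    prefix, digits = split_accession(accession)
    bank = int(digits) // 1000  # integer division
    return f"{prefix}/{prefix}{bank:03d}/{accession}"
```

For the accessions used above, ncbi_relative_path("SRR012345") gives "SRR/000012/SRR012345" and ebi_relative_path("ERR021966") gives "ERR/ERR021/ERR021966".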

Examples:

# one or more paths to SRA file servers
sra/servers = "/servers/main"

# SRA volumes upon servers
sra/ncbi/volumes = "sra4:sra3:sra2:sra1:sra0"
sra/ebi/volumes = "era1:era0"
sra/ddbj/volumes = "dra0"
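
How servers, volumes and volume-relative paths are combined is not spelled out above. The Python sketch below assumes the simplest reading, server/volume/relative-path, with candidates tried in the order listed; candidate_paths is a hypothetical name.

```python
import posixpath
from itertools import product


def candidate_paths(servers, volumes, relative_path):
    """List every server/volume/relative_path combination, in listed order.

    servers and volumes are colon-separated strings as stored in the
    configuration, e.g. "/servers/main" and "era1:era0".
    """
    return [
        posixpath.join(server, volume, relative_path)
        for server, volume in product(servers.split(":"), volumes.split(":"))
    ]


candidates = candidate_paths("/servers/main", "era1:era0", "ERR/ERR021/ERR021966")
```

A resolver built on this would presumably return the first candidate that exists on disk.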

Reference sequence repository

needed to access reference sequences
required to read most runs compressed by reference

refseq/paths
colon-separated list of paths to local repositories

refseq/servers - DEPRECATED
colon-separated list of file servers

refseq/volumes - DEPRECATED
colon-separated list of volumes upon file servers

The preferred means of indicating paths to reference sequence objects is via the refseq/paths node. The older method can be used to construct full paths from the servers and volumes, much like SRA.

A description of where reference sequence objects may be found is required by VDB any time it is reconstructing sequences that were compressed against external references, which is the majority of cases. Absence of this information will cause queries to fail.

Examples:

# one or more paths to reference sequence objects
refseq/paths = "/servers/main/refseq"

Finding Files

One main task of configuration files is to tell VDB where to locate other files and objects. But there is an issue as to where exactly it should find configuration files.

VDB employs three strategies for automatic configuration. They are listed below as default, installed, and override strategies. Although listed in this order, they are actually tested in exactly the reverse order.

Default bootstrap

The method used when all else fails

Many of our users will either run pre-built binaries or will build tools from sources. In most cases they will not have the ability to perform any sort of proper installation due to lack of administrative privileges.

To facilitate boot-strapping without any administrative support, VDB will trace the location of its configuration code back to the file system, and use the path to its binary as a starting point.

Specifically, it traces the location of the configuration library, libkfg, to the directory holding its binary**. For example:

    # location of configuration library
    /usr/local/ncbi/vdb/libs/libkfg.so

    # starting point for locating configuration files
    /usr/local/ncbi/vdb/libs

** NB - if the configuration library has been statically linked into your binary, then the starting location will be taken from the path to your binary.

The configuration library will then search in its own vicinity for configuration files:

  1. test for existence of directory {starting-path}/ncbi
  2. scan directory {starting-path}/ncbi for all files with extension ".kfg"
  3. process each file found in host file system's alphabetical order

In the example above, one might expect to find:

    /usr/local/ncbi/vdb/libs/libkfg.so
    /usr/local/ncbi/vdb/libs/ncbi/config.kfg
    /usr/local/ncbi/vdb/libs/ncbi/refseq.kfg
    /usr/local/ncbi/vdb/libs/ncbi/sra.kfg
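
The default search can be sketched in Python as follows. The function names are hypothetical, the path to the library is passed in by the caller rather than discovered, and Python's sorted() stands in for the host file system's alphabetical order.

```python
import os


def find_kfg_files(directory):
    """Return the ".kfg" files in directory, sorted by name ([] if absent)."""
    if not os.path.isdir(directory):
        return []
    names = sorted(n for n in os.listdir(directory) if n.endswith(".kfg"))
    return [
        os.path.join(directory, name)
        for name in names
        if os.path.isfile(os.path.join(directory, name))
    ]


def default_config_files(library_path):
    """Look for {starting-path}/ncbi/*.kfg next to the given library or binary."""
    starting_path = os.path.dirname(os.path.abspath(library_path))
    return find_kfg_files(os.path.join(starting_path, "ncbi"))
```

Calling default_config_files("/usr/local/ncbi/vdb/libs/libkfg.so") would scan /usr/local/ncbi/vdb/libs/ncbi, matching the layout shown above.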

Installed bootstrap

The second method attempted
Used when not overridden or when the override did not work

With administrative privileges, VDB can be installed with its configuration files located in a canonical location:

    # canonical location of VDB configuration files
    /etc/ncbi

The configuration library will search /etc/ncbi for configuration files:

  1. test for existence of directory /etc/ncbi
  2. scan directory /etc/ncbi for all files with extension ".kfg"
  3. process each file found in host file system's alphabetical order

In the example above, one might expect to find:

    /etc/ncbi/config.kfg
    /etc/ncbi/refseq.kfg
    /etc/ncbi/sra.kfg

Override rules

The first method attempted
Useful when no administrative privileges are available
Also useful when creating/controlling specific configurations

Overrides are accomplished by setting an environment variable and making it available to VDB:

    # variable to export to VDB
    VDB_CONFIG

By setting this variable to one or more paths, VDB is given the chance to look for configuration files. The process is:

  1. test for existence of $VDB_CONFIG
  2. split $VDB_CONFIG contents on ':' path separator
  3. test each path for existence and type
  4. scan each directory for all files with extension ".kfg"
  5. process each file found in host file system's alphabetical order
  6. process directly each path that names a file

For example:

    # notice that directly named files do NOT require .kfg extension
    export VDB_CONFIG=$PRODUCTION/ncbi:$HOME/my-special-config.txt

Within NCBI, the systems team has traditionally used this approach within ~/.ncbi_hints for the vdb facilities entry. It would typically set up an environment:

    # the first variable is currently NOT exported or used by VDB
    VDB_ROOT=/net/snowman/vol/projects/trace_software/vdb
    export VDB_CONFIG=$VDB_ROOT/config:$VDB_ROOT/linux/config

By specifying two directories, we place common configuration within the first config, and Linux-specific configuration within the second. They all combine to populate the VDB internal configuration object.
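
The override rules and the fallback order of the three strategies can be sketched together in Python. All names are hypothetical. The sketch assumes that an override "did not work" when it yields no files, that entries in $VDB_CONFIG are handled in the order given, and that a strategy yielding no files falls through to the next one.

```python
import os


def scan_directory(directory):
    """Return the ".kfg" files of a directory, sorted by name."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".kfg"))
    return [os.path.join(directory, name) for name in names]


def override_files(value):
    """Expand the contents of $VDB_CONFIG into a list of files to process."""
    files = []
    for path in value.split(":"):
        if os.path.isdir(path):
            files.extend(scan_directory(path))
        elif os.path.isfile(path):
            files.append(path)  # directly named files need no .kfg extension
    return files


def locate_config_files(environ, library_dir, installed_dir="/etc/ncbi"):
    """Try override, then installed, then default; return the first hit."""
    value = environ.get("VDB_CONFIG")
    if value:
        files = override_files(value)
        if files:
            return files
    for directory in (installed_dir, os.path.join(library_dir, "ncbi")):
        if os.path.isdir(directory):
            files = scan_directory(directory)
            if files:
                return files
    return []
```

A caller would typically pass os.environ as environ and the directory holding libkfg (or the statically linked binary) as library_dir.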

Grammar

Version 1.0 (current)

This grammar describes configuration files current as of January, 2012. It will be extended as more needs are recognized.

config
    : name_value_pairs
    | /* empty */
    ;

name_value_pairs
    : name_value_pair
    | name_value_pairs name_value_pair
    ;

name_value_pair
    : path assign_op value line_end
    | EOLN
    ;

path
    : ABS_PATH
    | REL_PATH
    ;

assign_op
    : '='
    ;

value
    : STRING
    | ESCAPED_STRING
    ;

line_end
    : EOLN
    | END_INPUT
    ;

The lexical tokens are represented in UPPER CASE in the grammar, and are defined below:

 /* NAMED EXPRESSIONS */

 /* node/name in a path */
path_node                       [A-Za-z_0-9][-.A-Za-z_0-9]*

%%

 /* RULES */

 /* multi-line comments */
\/\*                                            PUSH ( CMT_SLASH_STAR )
<CMT_SLASH_STAR,CMT_MULTI_LINE>[^*\n]+          IGNORE
<CMT_SLASH_STAR,CMT_MULTI_LINE>\*+[^*/\n]+      IGNORE
<CMT_SLASH_STAR,CMT_MULTI_LINE>\**\n            ENTER (CMT_MULTI_LINE )
<CMT_SLASH_STAR>\*+\/                           POP & IGNORE
<CMT_MULTI_LINE>\*+\/                           POP & EOLN

 /* line comments */
#.*                                             IGNORE

[ \t\f\v\r]                                     IGNORE

 /* end of line */
\n                                              EOLN

 /* normal, POSIX-style paths */
\/{path_node}(\/{path_node})*                   ABS_PATH
{path_node}(\/{path_node})*                     REL_PATH

 /* values */
'[^\\'\f\r\n]*'                                 STRING
'(\\.|[^\\'\f\r\n])+'                           ESCAPED_STRING
\"[^\\"\f\r\n]*\"                               STRING
\"(\\.|[^\\"\f\r\n])+\"                         ESCAPED_STRING

 /* punctuation */
"="                                             '='

%%

The grammar above accepts input having zero or more path=value pairs.

It also allows for partial, single or multi-line comments as well as empty lines.

The scanner is responsible for forming correct POSIX-style paths on the left-hand side of assignments, and in version 1.0 the only value allowed on the right-hand side is an uninterpreted character string.
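
For readers more familiar with regular expressions than with lex, the path rules above translate directly into Python. This is a transcription of the path_node, ABS_PATH and REL_PATH patterns only; classify_path is a hypothetical helper, not a VDB function.

```python
import re

# Direct transcription of the named expression and the two path rules.
PATH_NODE = r"[A-Za-z_0-9][-.A-Za-z_0-9]*"
ABS_PATH = re.compile(rf"/{PATH_NODE}(?:/{PATH_NODE})*")
REL_PATH = re.compile(rf"{PATH_NODE}(?:/{PATH_NODE})*")


def classify_path(text):
    """Return "ABS_PATH", "REL_PATH", or None if text is not a valid path."""
    if ABS_PATH.fullmatch(text):
        return "ABS_PATH"
    if REL_PATH.fullmatch(text):
        return "REL_PATH"
    return None
```

Under these rules "vdb/schema/paths" is a REL_PATH and "/vdb/schema/paths" is an ABS_PATH, while "a//b" and "-name" are rejected because every node must be non-empty and may not begin with a dash.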

The scanner recognizes a single assignment operator for v1.0. Its token id is the character itself ( '=' ).

The scanner recognizes escape sequences but does not attempt to convert or limit them to reasonable or meaningful sequences. This means that any character can be escaped to mean itself, while special characters will be converted, as discussed below:

\n
converted to newline character ( 0x0A )

\t
converted to tab character ( 0x09 )

\r
converted to return character ( 0x0D )

\0 ( zero )
converted to NUL character ( 0x00 )

\a
converted to alert ( bell ) character ( 0x07 )

\v
converted to vertical tab character ( 0x0B )

\f
converted to form-feed character ( 0x0C )

\[xX][:xdigit:][:xdigit:]
converted to UTF-8 byte

\[uU][:xdigit:][:xdigit:][:xdigit:][:xdigit:]
converted to UCS-2 character

Noticeably absent from the escape sequences above is \e ( the escape character, 0x1B ), which was an oversight.
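
The v1.0 conversions listed above can be sketched in Python. The name unescape is hypothetical, and the sketch simplifies in one respect: a \xHH sequence is mapped to the character with that code, which matches a single UTF-8 byte only for values up to 0x7F.

```python
import re

# Single-character escapes defined for v1.0 ( note: no "e" entry ).
_SIMPLE = {"n": "\n", "t": "\t", "r": "\r", "0": "\0",
           "a": "\a", "v": "\v", "f": "\f"}

_ESCAPE = re.compile(
    r"\\(?:[xX]([0-9A-Fa-f]{2})|[uU]([0-9A-Fa-f]{4})|(.))", re.DOTALL
)


def unescape(value):
    """Convert the escape sequences in the body of an ESCAPED_STRING."""
    def replace(match):
        if match.group(1) is not None:   # \xHH
            return chr(int(match.group(1), 16))
        if match.group(2) is not None:   # \uHHHH
            return chr(int(match.group(2), 16))
        char = match.group(3)
        # Any other escaped character simply means itself.
        return _SIMPLE.get(char, char)
    return _ESCAPE.sub(replace, value)
```

Because \e has no entry in the v1.0 table, unescape(r"\e") yields a plain letter e, which illustrates the omission described above.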

Version 1.1 (under development)

This grammar describes proposed extensions to configuration files.
It is intended to be backward compatible with v1.0.

config
    : config_entries
    | /* empty */
    ;

config_entries
    : config_entry
    | config_entries config_entry
    ;

config_entry
    : path_name path_definition
    | EOLN
    ;

path_definition
    : assign_op expression line_end
    | '{' config_entries '}' line_end
    | EOLN '{' config_entries '}' line_end
    ;

path_name
    : ABS_PATH
    | REL_PATH
    ;

assign_op
    : '='
    | PLUS_EQUALS
    ;

expression
    : 

unary_expression
    : postfix_expression
    | '-' unary_expression
    ;

postfix_expression
    : primary_expression
    | primary_expression size_modifier
    ;

primary_expression
    : path
    | string
    | integer
    | float
    | SYMBOL
    | '(' expression ')'
    ;

path
    : ABS_PATH
    | REL_PATH
    | DOS_ABS_PATH
    | DOS_REL_PATH
    | UNC_PATH
    ;

string
    : STRING
    | ESCAPED_STRING
    ;

integer
    : DECIMAL
    | HEX
    ;

float
    : FLOAT
    ;

line_end
    : EOLN
    | END_INPUT
    ;

The lexical tokens are represented in UPPER CASE in the grammar, and are defined below:

 /* NAMED EXPRESSIONS */

 /* node/name in a path */
path_node                       [A-Za-z_0-9][-.A-Za-z_0-9]*

%%

 /* RULES */

 /* multi-line comments */
\/\*                                            PUSH ( CMT_SLASH_STAR )
<CMT_SLASH_STAR,CMT_MULTI_LINE>[^*\n]+          IGNORE
<CMT_SLASH_STAR,CMT_MULTI_LINE>\*+[^*/\n]+      IGNORE
<CMT_SLASH_STAR,CMT_MULTI_LINE>\**\n            ENTER (CMT_MULTI_LINE )
<CMT_SLASH_STAR>\*+\/                           POP & IGNORE
<CMT_MULTI_LINE>\*+\/                           POP & EOLN

 /* line comments */
#.*                                             IGNORE

[ \t\f\v\r]                                     IGNORE

 /* end of line */
\n                                              EOLN

 /* normal, POSIX-style paths */
\/{path_node}(\/{path_node})*                   ABS_PATH
{path_node}(\/{path_node})*                     REL_PATH

 /* values */
'[^\\'\f\r\n]*'                                 STRING
'(\\.|[^\\'\f\r\n])+'                           ESCAPED_STRING
\"[^\\"\f\r\n]*\"                               STRING
\"(\\.|[^\\"\f\r\n])+\"                         ESCAPED_STRING

 /* punctuation */
"="                                             '='

%%

The grammar above accepts input having zero or more path=value pairs.

It also allows for partial, single or multi-line comments as well as empty lines.

The scanner is responsible for forming correct POSIX-style paths on the left-hand side of assignments.

The scanner recognizes an assignment operator where the meaning is a fully recursive value string, i.e. symbolic values will be evaluated upon retrieval.

The scanner recognizes escape sequences but does not attempt to convert or limit them to reasonable or meaningful sequences. This means that any character can be escaped to mean itself, while special characters will be converted, as discussed below:

\n
converted to newline character ( 0x0A )

\t
converted to tab character ( 0x09 )

\r
converted to return character ( 0x0D )

\0 ( zero )
converted to NUL character ( 0x00 )

\a
converted to alert ( bell ) character ( 0x07 )

\v
converted to vertical tab character ( 0x0B )

\f
converted to form-feed character ( 0x0C )

\e
converted to escape character ( 0x1B )

\[xX][:xdigit:][:xdigit:]
converted to UTF-8 byte

\[uU][:xdigit:][:xdigit:][:xdigit:][:xdigit:]
converted to UCS-2 character


NCBI VDB-2 Documentation