VDB-2 - AlignmentIterator

Type:
interface

Header:
align/iterator.h

Revision History:
2012-Feb-02  initial
2012-Feb-06  rodarmer
2012-Feb-28  reflect API modifications to constructors

Description

The AlignmentIterator is an interface that allows for walking an aligned sequence (or sub-sequence) described by an alignment record. Its implementation will handle the intricacies of traversing an alignment transcript, i.e. the results of comparison.

The interface is called an iterator partly out of tradition in C and its offspring, but that same tradition may be good reason to change the name. The name iterator implies an ability to iterate across a sequence, but specifies neither the direction nor whether multiple passes are supported. Nor does the word explicitly imply random access.

In Java this type of interface is known as an Enumeration, which is meant to imply access to a single element at a time, in forward order and in series. This is similar to a result set and a stream in that there is no concept of addressing elements, only a window onto the current element.

It is important to design an interface with this restriction in order to avoid imposing possibly severe requirements upon any implementation. We can demonstrate that random or semi-random access can be provided with interfaces built upon an enumerating interface.
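The claim above can be illustrated with a minimal sketch: an adapter that pulls from a forward-only enumerator on demand and remembers what it has seen, so earlier elements can be revisited by index. Every name here (Enumerator, CachedView and their functions) is hypothetical and stands in for any enumerating interface; none of it is part of the VDB-2 API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical forward-only enumerator over an int array. */
typedef struct Enumerator
{
    const int *data;
    size_t len;
    size_t next;    /* index of the element the next call will expose */
} Enumerator;

/* Advance by one element; returns false when the sequence is exhausted. */
static bool EnumeratorNext ( Enumerator *self, int *elem )
{
    if ( self -> next >= self -> len )
        return false;
    *elem = self -> data [ self -> next ++ ];
    return true;
}

/* Random-access adapter: caches every element pulled so far. */
typedef struct CachedView
{
    Enumerator *src;
    int *seen;
    size_t count, capacity;
} CachedView;

static void CachedViewInit ( CachedView *self, Enumerator *src )
{
    self -> src = src;
    self -> seen = NULL;
    self -> count = self -> capacity = 0;
}

static void CachedViewWhack ( CachedView *self )
{
    free ( self -> seen );
    self -> seen = NULL;
    self -> count = self -> capacity = 0;
}

/* Fetch element idx, advancing the enumerator only as far as needed.
   Returns false if idx is past the end or memory cannot be obtained. */
static bool CachedViewGet ( CachedView *self, size_t idx, int *elem )
{
    assert ( self != NULL && elem != NULL );
    while ( self -> count <= idx )
    {
        /* make room before pulling so no element is ever dropped */
        if ( self -> count == self -> capacity )
        {
            size_t cap = self -> capacity ? self -> capacity * 2 : 8;
            int *grown = realloc ( self -> seen, cap * sizeof ( int ) );
            if ( grown == NULL )
                return false;
            self -> seen = grown;
            self -> capacity = cap;
        }
        if ( ! EnumeratorNext ( self -> src, & self -> seen [ self -> count ] ) )
            return false;
        ++ self -> count;
    }
    *elem = self -> seen [ idx ];
    return true;
}
```

The cost of random access (memory for the cache) is paid only by callers that ask for it, while the underlying enumerator stays as simple as the requirements below demand.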

Requirements

  1. Must operate within the garbage collection paradigm of the code base (reference counting)
  2. Must provide a mechanism for advancing forward by exactly one position in reference coordinate space
  3. Must maintain a static and non-volatile view of the alignment at a single reference position
  4. Must not provide any other means of modifying position on the reference
  5. Must not provide access to any other part of the alignment than that which corresponds to the current position

Interface

Make

ask the alignment manager to create an iterator from individual components

rc_t AlignMgrMakeAlignmentIterator ( const AlignMgr *self,
    AlignmentIterator **iter, bool copy,
    uint64_t ref_pos, uint32_t ref_len,
    const INSDC_4na_bin *read, uint32_t read_len,
    const bool *has_mismatch, const bool *has_ref_offset,
    const int32_t *ref_offset, uint32_t ref_offset_len );

iter - OUT
return parameter for the iterator

copy
if true, alignment mgr should copy data rather than use pointers
otherwise, lifetime of data must meet or exceed that of iterator

ref_pos
starting position of alignment in reference coordinates

ref_len
length of projection onto reference in reference space

read
full sequence of aligned read in base space
this should probably be only mismatches - see discussion

read_len
length in bases of read

has_mismatch
array of 8-bit bool values
one per base in read
value is true when base at position differs from reference

has_ref_offset
array of 8-bit bool values
one per base in read
value is true when relative position must be adjusted

ref_offset
array of 32-bit signed integer values
one for every true value in has_ref_offset
value is used to adjust relative alignment of read against reference

ref_offset_len
length in elements of ref_offset

While a bit unwieldy, this factory message allows an iterator to be created in isolation from external data. That benefit comes at a cost: because the iterator must maintain a static and non-volatile view, it requires either copying the input or some guarantee about the lifetime of its inputs.

The full read sequence is specified in this interface, rather than only the mismatched bases, because of the need to describe insertions that happen to be identical to surrounding bases and therefore do not generate a mismatch. The problem is that the code which generates a full read combines mismatched bases with the reference at the expense of a sub-select, when in many situations we may already have all of the required information cached and at our fingertips. We should evaluate the performance impact of this.

AddRef

duplicate an existing reference

rc_t AlignmentIteratorAddRef ( const AlignmentIterator *self );

The object is defined as being reference counted. In VDB-2, references are direct pointers to objects and the objects maintain a reference counter.

Release

release an existing reference
potentially whacks object

rc_t AlignmentIteratorRelease ( const AlignmentIterator *self );

The object is defined as being reference counted. In VDB-2, references are direct pointers to objects and the objects maintain a reference counter.

NULL pointers are ignored.
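The AddRef/Release pattern can be sketched with a toy object. ToyObj, its functions and the toy_whack_count counter are hypothetical and exist only to make the behavior observable; the sketch is not thread-safe, uses int in place of rc_t, and treating NULL as a no-op in AddRef is an assumption (the text above states it only for Release).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object. */
typedef struct ToyObj
{
    unsigned int refcount;
} ToyObj;

/* Counts destroyed objects so the effect of Release can be observed. */
static unsigned int toy_whack_count = 0;

/* Create an object holding one reference, owned by the caller. */
static ToyObj *ToyObjMake ( void )
{
    ToyObj *obj = malloc ( sizeof * obj );
    if ( obj != NULL )
        obj -> refcount = 1;
    return obj;
}

/* Duplicate a reference. The pointer is const, as in the API above,
   so the counter is updated through a cast. */
static int ToyObjAddRef ( const ToyObj *self )
{
    if ( self != NULL )
        ++ ( ( ToyObj * ) self ) -> refcount;
    return 0;
}

/* Drop a reference; the object is whacked when the last one goes.
   NULL pointers are ignored. */
static int ToyObjRelease ( const ToyObj *self )
{
    if ( self != NULL )
    {
        ToyObj *obj = ( ToyObj * ) self;
        assert ( obj -> refcount > 0 );
        if ( -- obj -> refcount == 0 )
        {
            free ( obj );
            ++ toy_whack_count;
        }
    }
    return 0;
}
```

Each AddRef must eventually be balanced by one Release; only the final Release frees the object.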

Next

advance position by +1 in reference space
must be called initially to advance to first element

rc_t AlignmentIteratorNext ( AlignmentIterator *self );

This is the main message for iterating across an alignment. Each invocation causes an advance in reference space, and the prior window is permanently lost.

The implementation will detect whether the new position has an insert or delete, followed by a match, mismatch or skip. This information is obtained via the State message.

When the iterator is initially created, its window is invalid. This is to facilitate use of the Next message within a loop to simultaneously advance the pointer and return a result code to indicate the validity of the new location.

TBD - the return code for invalid position should be made explicit.

State

returns a bitmap of state bits and codes at the current position

int32_t AlignmentIteratorState ( const AlignmentIterator *self,
    INSDC_coord_zero *seq_pos );

enum
{
    align_iter_match      = 64,
    align_iter_skip       = 128
};

seq_pos
optional return parameter for the current position within the sequence
NB - this coordinate is within sequence space, not reference space

Most of the interesting information about the alignment at the current position is returned in a single, highly overloaded integer.

The least significant byte contains one of 17 values:
  A ( 1 )                   mismatched base
  C ( 2 )                   mismatched base
  M ( 3 )                   mismatched base
  G ( 4 )                   mismatched base
  R ( 5 )                   mismatched base
  S ( 6 )                   mismatched base
  V ( 7 )                   mismatched base
  T ( 8 )                   mismatched base
  W ( 9 )                   mismatched base
  Y ( 10 )                  mismatched base
  H ( 11 )                  mismatched base
  K ( 12 )                  mismatched base
  D ( 13 )                  mismatched base
  B ( 14 )                  mismatched base
  N ( 15 )                  mismatched base
  align_iter_match ( 64 )   match
  align_iter_skip ( 128 )   skip

The remainder of the state word contains zero or more bits:

enum
{
    align_iter_insert     = ( 1 <<  8 ),
    align_iter_delete     = ( 1 <<  9 ),
    align_iter_first      = ( 1 << 10 ),
    align_iter_last       = ( 1 << 11 ),
    align_iter_invalid    = ( 1 << 31 )
};

When the state word has its align_iter_invalid bit set, the remainder of the word is invalid and every access to the single position window of the iterator is invalid. This will occur before the first Next message and after all positions have been visited. Notice that the bit chosen for this is also the sign bit so that a test of state < 0 can be used.

An insert represents one or more bases in the sequence that are not present in the reference, and therefore occupies no length upon the reference sequence. When an insert is detected, it is indicated by setting the align_iter_insert bit in the state word, at which time the caller may discover both the size and value of the insertion through the BasesInserted message. The insertion is considered to be placed immediately before a match or mismatch.

A deletion represents absence of one or more bases in the sequence that are present in the reference. When a deletion is detected, the absent bases are indicated by returning a value of align_iter_skip in the lower byte of state, while the end of a deletion is indicated by setting the align_iter_delete bit in the state word along with either a match or mismatch in the lower byte. When this happens the caller may discover the size and starting position of the deletion through the BasesDeleted message. The deletion is reported immediately before a match or mismatch which ends the deletion.
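Unpacking the state word might look like the sketch below. The enum values are those given in this section, except that align_iter_invalid is written as INT32_MIN, which has the same bit pattern as ( 1 << 31 ) on a 32-bit int without shifting into the sign bit. The helper functions and the lookup string are hypothetical conveniences, not part of the interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum
{
    align_iter_match   = 64,
    align_iter_skip    = 128,
    align_iter_insert  = ( 1 <<  8 ),
    align_iter_delete  = ( 1 <<  9 ),
    align_iter_first   = ( 1 << 10 ),
    align_iter_last    = ( 1 << 11 ),
    align_iter_invalid = INT32_MIN   /* sign bit, same pattern as 1 << 31 */
};

/* The invalid bit is the sign bit, so a sign test is sufficient. */
static bool StateIsValid ( int32_t state )
{
    return state >= 0;
}

/* Extract the least significant byte: a base code, match or skip. */
static int StateCode ( int32_t state )
{
    return state & 0xFF;
}

/* True when the low byte holds one of the 15 mismatched-base codes. */
static bool StateIsMismatch ( int32_t state )
{
    int code = StateCode ( state );
    return StateIsValid ( state ) && code >= 1 && code <= 15;
}

/* Map a mismatched-base code to its letter, following the table above. */
static char StateMismatchChar ( int32_t state )
{
    static const char letters [] = "?ACMGRSVTWYHKDBN";
    assert ( StateIsMismatch ( state ) );
    return letters [ StateCode ( state ) ];
}
```

A caller would typically test validity first, then the insert and delete bits with a bitwise AND, and finally interpret the low byte.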

Position

return current position of iterator relative to reference

rc_t AlignmentIteratorPosition ( const AlignmentIterator *self, uint64_t *pos );

pos - OUT
return parameter for position in reference coordinate space

Will produce non-zero return code if position is invalid.

BasesInserted

return the number of bases inserted
optionally returns the values of the inserted bases

uint32_t AlignmentIteratorBasesInserted
    ( const AlignmentIterator *self, const INSDC_4na_bin **bases );

bases - OUT, NULL OKAY
optional return parameter for pointer to internally held inserted bases
not to be freed by caller - owned by iterator

Returns the length of the insertion in bases, and optionally the bases themselves.

BasesDeleted

return the number of bases deleted
optionally returns the starting position on the reference of the deletion

uint32_t AlignmentIteratorBasesDeleted ( const AlignmentIterator *self, uint64_t *pos );

pos - OUT, NULL OKAY
optional return parameter for start of deletion on reference

Returns the length of the deletion in bases, and optionally the starting position of the deletion on the reference. This can be used to retrieve bases from the reference to show the value of the deletion if so desired.
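Retrieving the deleted bases from the reference, as suggested above, could be sketched as follows. The function is hypothetical and assumes the caller already holds a window of reference bases in memory; Base4na is a local stand-in for INSDC_4na_bin, assumed here to be one byte per base.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t Base4na;    /* stand-in for INSDC_4na_bin (assumption) */

/* Hypothetical helper: copy the deleted bases out of a reference window.
   ref holds ref_len bases starting at reference coordinate ref_start;
   del_pos and del_len are the values reported by BasesDeleted.
   Returns the number of bases copied, or 0 if the deletion is empty,
   lies outside the window, or does not fit in out. */
static uint32_t CopyDeletedBases ( const Base4na *ref, uint64_t ref_start,
    uint64_t ref_len, uint64_t del_pos, uint32_t del_len,
    Base4na *out, uint32_t out_cap )
{
    uint64_t offset;

    assert ( ref != NULL && out != NULL );

    if ( del_len == 0 || del_len > out_cap )
        return 0;
    if ( del_pos < ref_start )
        return 0;

    /* offset of the deletion within the window */
    offset = del_pos - ref_start;
    if ( offset > ref_len || del_len > ref_len - offset )
        return 0;

    memcpy ( out, ref + offset, del_len );
    return del_len;
}
```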


NCBI VDB-2 Documentation