Type:
interface
Header:
align/iterator.h
Revision History:
2012-Feb-02 | • | initial |
2012-Feb-06 | • | rodarmer |
2012-Feb-28 | • | reflect API modifications to constructors |
Contents:
The AlignmentIterator is an interface that allows for walking an aligned sequence (or sub-sequence) described by an alignment record. Its implementation will handle the intricacies of traversing an alignment transcript, i.e. the results of comparison.
The interface is called an iterator partly out of the tradition in C and its offspring, but that same tradition may be good reason to change its name. The name iterator implies an ability to iterate across a sequence, but does not specify the direction, whether multiple passes are supported, and so on. Nor does the word explicitly imply random access.
In Java this type of interface is known as an Enumeration, which is meant to imply access to a single element at a time, in forward order and in series. This is similar to a result set and a stream in that there is no concept of addressing elements, only a window onto the current element.
It is important to design an interface with this restriction in order to avoid imposing possibly severe requirements upon any implementation. We can demonstrate that random or semi-random access can be provided with interfaces built upon an enumerating interface.
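The claim that random or semi-random access can be layered on an enumerating interface can be illustrated with a minimal sketch. Every name here (Enumerator, CachedAccess, enumerator_next, cached_get, CACHE_MAX) is hypothetical and not part of align/iterator.h; the sketch assumes a small fixed-size cache and integer elements purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical forward-only enumerator: yields one element at a time. */
typedef struct Enumerator {
    const int *data;   /* backing elements (stand-in for a real source) */
    size_t count;      /* total number of elements */
    size_t next;       /* index of the next element to yield */
} Enumerator;

/* Advance the enumerator; returns false once it is exhausted. */
static bool enumerator_next(Enumerator *e, int *out)
{
    if (e->next >= e->count)
        return false;
    *out = e->data[e->next++];
    return true;
}

/* Hypothetical adapter: remembers every element already seen so that
   any index up to the furthest point reached can be revisited. */
enum { CACHE_MAX = 64 };

typedef struct CachedAccess {
    Enumerator *src;
    int cache[CACHE_MAX];
    size_t cached;     /* number of elements currently held in cache */
} CachedAccess;

/* Fetch element idx, pulling from the enumerator only as far as needed. */
static bool cached_get(CachedAccess *c, size_t idx, int *out)
{
    if (idx >= CACHE_MAX)
        return false;              /* beyond what this sketch can hold */
    while (c->cached <= idx) {
        int v;
        if (!enumerator_next(c->src, &v))
            return false;          /* source ran out before idx */
        c->cache[c->cached++] = v;
    }
    *out = c->cache[idx];
    return true;
}
```

The enumerator itself never moves backward; only the adapter pays the memory cost of revisiting, which is the reason for keeping the base interface restricted.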
ask the alignment manager to create an iterator from individual components
rc_t AlignMgrMakeAlignmentIterator ( const AlignMgr *self, AlignmentIterator **iter, bool copy,
    uint64_t ref_pos, uint32_t ref_len, const INSDC_4na_bin *read, uint32_t read_len,
    const bool *has_mismatch, const bool *has_ref_offset, const int32_t *ref_offset, uint32_t ref_offset_len );
iter - OUT
return parameter for the iterator
copy
if true, alignment mgr should copy data rather than use pointers
otherwise, lifetime of data must meet or exceed that of iterator
ref_pos
starting position of alignment in reference coordinates
ref_len
length of projection onto reference in reference space
read
full sequence of aligned read in base space
this should probably be only mismatches - see discussion
read_len
length in bases of read
has_mismatch
array of 8-bit bool values
one per base in read
value is true when base at position differs from reference
has_ref_offset
array of 8-bit bool values
one per base in read
value is true when relative position must be adjusted
ref_offset
array of 32-bit signed integer values
one for every true value in has_ref_offset
value is used to adjust relative alignment of read against reference
ref_offset_len
length in elements of ref_offset
While a bit unwieldy, this factory message allows an iterator to be created in isolation from external data. That is its benefit. At the same time, because the iterator must maintain a static and non-volatile view, it requires either copying of the input or some guarantee of the lifetime of its inputs.
Specification of the full read sequence in this interface, rather than only the mismatched bases, is due to the need to describe insertions that happen to be identical to surrounding bases and therefore do not generate a mismatch. The problem is that the code which generates a full read combines mismatched bases with the reference at the expense of a sub-select, when in many situations we may already have all of the required information cached and at our fingertips. We should evaluate the performance impact of this.
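The parameter list states that ref_offset holds one element for every true value in has_ref_offset. A caller might check that relationship before invoking the factory. The helper below is a hypothetical sketch, not part of the interface, and uses only plain C types so that it stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true when ref_offset_len equals the
   number of true values in has_ref_offset (one flag per read base). */
static bool ref_offsets_consistent(const bool *has_ref_offset,
                                   uint32_t read_len,
                                   uint32_t ref_offset_len)
{
    uint32_t i, needed = 0;

    for (i = 0; i < read_len; ++i) {
        if (has_ref_offset[i])
            ++needed;          /* each true flag consumes one offset */
    }
    return needed == ref_offset_len;
}
```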
duplicate an existing reference
rc_t AlignmentIteratorAddRef ( const AlignmentIterator *self );
The object is defined as being reference counted. In VDB-2, references are direct pointers to objects and the objects maintain a reference counter.
release an existing reference
potentially whacks object
rc_t AlignmentIteratorRelease ( const AlignmentIterator *self );
The object is defined as being reference counted. In VDB-2, references are direct pointers to objects and the objects maintain a reference counter.
NULL pointers are ignored.
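The AddRef and Release behaviour described above can be sketched with a stand-in object. This is only an illustration of the general reference-counting pattern: the Counted type and its functions are hypothetical, take a non-const pointer for simplicity, and return values chosen for the sketch rather than rc_t.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted object. */
typedef struct Counted {
    unsigned refcount;
} Counted;

/* Create an object; the creator holds the first reference. */
static Counted *counted_make(void)
{
    Counted *obj = malloc(sizeof *obj);
    if (obj != NULL)
        obj->refcount = 1;
    return obj;
}

/* Duplicate a reference. The NULL guard is a defensive assumption;
   the text only specifies NULL handling for release. */
static void counted_add_ref(Counted *obj)
{
    if (obj != NULL)
        ++obj->refcount;
}

/* Release a reference, freeing the object when the last one goes.
   Returns the remaining count (0 if freed or if obj was NULL). */
static unsigned counted_release(Counted *obj)
{
    if (obj == NULL)
        return 0;              /* NULL pointers are ignored */
    if (--obj->refcount == 0) {
        free(obj);
        return 0;
    }
    return obj->refcount;
}
```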
advance position by +1 in reference space
must be called initially to advance to first element
rc_t AlignmentIteratorNext ( AlignmentIterator *self );
This is the main message for iterating across an alignment. Each invocation causes an advance in reference space, and the prior window is permanently lost.
The implementation will detect whether the new position has an insert or delete, followed by a match, mismatch or skip. This information is obtained via the State message.
When the iterator is initially created, its window is invalid. This is to facilitate use of the Next message within a loop to simultaneously advance the pointer and return a result code to indicate the validity of the new location.
TBD - the return code for invalid position should be made explicit.
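The loop idiom that the initially invalid window is meant to support can be sketched with a mock. MockIter, mock_next and mock_count_positions are hypothetical, and because the return code for an invalid position is still to be decided, the sketch simply assumes 0 for a valid position and nonzero otherwise.

```c
#include <stdint.h>

/* Hypothetical mock of an iterator over len reference positions. */
typedef struct MockIter {
    uint64_t start;   /* first reference position of the alignment */
    uint32_t len;     /* number of reference positions covered */
    int64_t  cur;     /* -1 before the first advance: window invalid */
} MockIter;

/* Advance by +1 in reference space.
   Assumed convention: 0 while the new position is valid, nonzero after. */
static int mock_next(MockIter *it)
{
    if (it->cur + 1 >= (int64_t)it->len) {
        it->cur = (int64_t)it->len;   /* park past the end */
        return 1;
    }
    ++it->cur;
    return 0;
}

/* Advance and test validity in one step, as the loop idiom intends.
   Returns the number of positions visited. */
static uint32_t mock_count_positions(MockIter *it)
{
    uint32_t visited = 0;

    while (mock_next(it) == 0)
        ++visited;
    return visited;
}
```

Starting with an invalid window means the same call both positions the iterator on the first element and reports an empty alignment, so no separate "has elements" test is needed.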
returns a bitmap of state bits and codes at the current position
int32_t AlignmentIteratorState ( const AlignmentIterator *self, INSDC_coord_zero *seq_pos );
enum { align_iter_match = 64, align_iter_skip = 128 };
seq_pos
optional return parameter for the current position within the sequence
NB - this coordinate is within sequence space, not reference space
Most of the interesting information about the alignment at the current position is returned in a single, highly overloaded integer.
The least significant byte contains one of 17 values:
A ( 1 ) | mismatched base
C ( 2 ) | mismatched base
M ( 3 ) | mismatched base
G ( 4 ) | mismatched base
R ( 5 ) | mismatched base
S ( 6 ) | mismatched base
V ( 7 ) | mismatched base
T ( 8 ) | mismatched base
W ( 9 ) | mismatched base
Y ( 10 ) | mismatched base
H ( 11 ) | mismatched base
K ( 12 ) | mismatched base
D ( 13 ) | mismatched base
B ( 14 ) | mismatched base
N ( 15 ) | mismatched base
align_iter_match ( 64 ) | match
align_iter_skip ( 128 ) | skip
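The table above maps directly onto a small lookup. The function below is a hypothetical convenience for display purposes, not part of the interface; the choice of '=' for a match, '-' for a skip and '?' for anything else is an assumption made for this sketch.

```c
#include <stdint.h>

enum { align_iter_match = 64, align_iter_skip = 128 };

/* Hypothetical helper: translate the least significant byte of a
   state word into a printable character. */
static char state_low_byte_to_char(int32_t state)
{
    /* index = 4na code; index 0 is unused and maps to '?' */
    static const char iupac[] = "?ACMGRSVTWYHKDBN";
    uint32_t code = (uint32_t)state & 0xFF;

    if (code == align_iter_match)
        return '=';            /* assumed symbol for a match */
    if (code == align_iter_skip)
        return '-';            /* assumed symbol for a skip */
    if (code >= 1 && code <= 15)
        return iupac[code];    /* mismatched base */
    return '?';                /* not one of the 17 listed values */
}
```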
The remainder of the state word contains zero or more bits:
enum { align_iter_insert = ( 1 << 8 ), align_iter_delete = ( 1 << 9 ), align_iter_first = ( 1 << 10 ), align_iter_last = ( 1 << 11 ), align_iter_invalid = ( 1 << 31 ) };
When the state word has its align_iter_invalid bit set, the remainder of the word is invalid and every access to the single-position window of the iterator is invalid. This will occur before the first Next message and after all positions have been visited. Notice that the bit chosen for this is also the sign bit, so that a test of state < 0 can be used.
An insert represents one or more bases in the sequence that are not present in the reference, and can occupy no length upon the reference sequence. When an insert is detected, it is indicated by setting this bit in the state word, at which time the caller may discover both the size and value of the insertion through the BasesInserted message. The insertion is considered to be placed immediately before a match or mismatch.
A deletion represents absence of one or more bases in the sequence that are present in the reference. When a deletion is detected, the absent bases are indicated by returning a value of align_iter_skip in the lower byte of state, while the end of a deletion is indicated by setting the align_iter_delete bit in the state word along with either a match or mismatch in the lower byte. When this happens the caller may discover the size and starting position of the deletion through the BasesDeleted message. The deletion is reported immediately before a match or mismatch which ends the deletion.
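Putting the low byte and the flag bits together, a caller might unpack the state word as sketched below. DecodedState and decode_state are hypothetical names introduced for illustration; the enum values are copied from the text, and the invalid bit is tested as state < 0 as suggested above rather than by naming the bit.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    align_iter_match  = 64,
    align_iter_skip   = 128,
    align_iter_insert = ( 1 << 8 ),
    align_iter_delete = ( 1 << 9 ),
    align_iter_first  = ( 1 << 10 ),
    align_iter_last   = ( 1 << 11 )
};

/* Hypothetical unpacked view of a state word. */
typedef struct DecodedState {
    bool valid;        /* false when the invalid (sign) bit is set */
    bool has_insert;   /* insertion placed before this position */
    bool ends_delete;  /* this match or mismatch ends a deletion */
    bool is_first;
    bool is_last;
    bool in_deletion;  /* low byte is align_iter_skip */
    bool is_match;     /* low byte is align_iter_match */
    uint8_t mismatch;  /* 4na code 1..15 for a mismatch, else 0 */
} DecodedState;

static DecodedState decode_state(int32_t state)
{
    DecodedState d = { false, false, false, false, false, false, false, 0 };
    uint32_t low;

    if (state < 0)
        return d;      /* remainder of the word is meaningless */

    low = (uint32_t)state & 0xFF;
    d.valid       = true;
    d.has_insert  = (state & align_iter_insert) != 0;
    d.ends_delete = (state & align_iter_delete) != 0;
    d.is_first    = (state & align_iter_first) != 0;
    d.is_last     = (state & align_iter_last) != 0;
    d.in_deletion = (low == align_iter_skip);
    d.is_match    = (low == align_iter_match);
    if (low >= 1 && low <= 15)
        d.mismatch = (uint8_t)low;
    return d;
}
```

When has_insert or ends_delete is set, the caller would follow up with the BasesInserted or BasesDeleted message respectively to obtain the details.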
return current position of iterator relative to reference
rc_t AlignmentIteratorPosition ( const AlignmentIterator *self, uint64_t *pos );
pos - OUT
return parameter for position in reference coordinate space
Will produce non-zero return code if position is invalid.
return the number of bases inserted
optionally returns the values of the inserted bases
uint32_t AlignmentIteratorBasesInserted ( const AlignmentIterator *self, const INSDC_4na_bin **bases );
bases - OUT, NULL OKAY
optional return parameter for pointer to internally held inserted bases
not to be freed by caller - owned by iterator
Returns the length of the insertion in bases, and optionally the bases themselves.
return the number of bases deleted
optionally returns the starting position on the reference of the deletion
uint32_t AlignmentIteratorBasesDeleted ( const AlignmentIterator *self, uint64_t *pos );
pos - OUT, NULL OKAY
optional return parameter for start of deletion on reference
Returns the length of the deletion in bases, and optionally the starting position of the deletion on the reference. This can be used to retrieve bases from the reference to show the value of the deletion if so desired.
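Retrieving the deleted bases from the reference, as mentioned above, amounts to a bounded copy. The sketch below assumes the reference is available as an in-memory character array indexed from the same origin as the reported position; copy_deleted_bases is a hypothetical helper, not part of the interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy del_len bases starting at del_pos from an
   in-memory reference into out as a NUL-terminated string.
   Returns the number of bases copied, or 0 if the range falls outside
   the reference or the output buffer is too small. */
static uint32_t copy_deleted_bases(const char *ref, uint64_t ref_len,
                                   uint64_t del_pos, uint32_t del_len,
                                   char *out, size_t out_size)
{
    if (del_pos > ref_len || del_len > ref_len - del_pos)
        return 0;                       /* deletion not within reference */
    if (out_size < (size_t)del_len + 1)
        return 0;                       /* no room for bases plus NUL */

    memcpy(out, ref + del_pos, del_len);
    out[del_len] = '\0';
    return del_len;
}
```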