VDB-2 - PlacementIterator

Type:
interface

Header:
align/iterator.h

Revision History:
2012-Feb-02  initial
2012-Feb-28  reflect API modifications to constructors, structures
2012-May-04  documented negative placement starting coords

Description

The PlacementIterator is an interface that allows for walking a window of placements along the reference of a single run. On each iteration, one or more placements become available at a position until the placements are exhausted within the window.

Requirements

  1. Must operate within the garbage collection paradigm of the code base
  2. Must provide a mechanism for determining the position and length of the next available placement
  3. Must provide a mechanism for obtaining all of the placements at a stated position
  4. Placements must minimally contain row-id, position on the reference, and length on the reference
  5. Placements may optionally contain any other data as chosen by user
  6. Must provide a mechanism for allowing user to add data to a placement record
  7. Placements must be accessed in canonical order
  8. Canonical order is ascending position, then descending length, then ascending id
  9. Placement objects must be provided if requested
  10. Placement ids may be provided if requested
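
The canonical order in requirements 7 and 8 can be expressed as a comparison function. The following is only a sketch: PlacementKey, PlacementKeyCmp and PlacementKeySort are hypothetical names, and the struct is a stand-in holding the three mandatory fields rather than the real PlacementRecord.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in with only the fields that requirement 4
   makes mandatory; this is not the real PlacementRecord. */
typedef struct PlacementKey
{
    int64_t  id;   /* row-id in the alignment table */
    int32_t  pos;  /* zero-based start on the reference, may be negative */
    uint32_t len;  /* length on the reference */
} PlacementKey;

/* qsort-compatible comparison for the canonical order:
   ascending position, then descending length, then ascending id. */
static int PlacementKeyCmp ( const void *a, const void *b )
{
    const PlacementKey *x = a;
    const PlacementKey *y = b;

    if ( x -> pos != y -> pos )
        return ( x -> pos < y -> pos ) ? -1 : 1;
    if ( x -> len != y -> len )
        return ( x -> len > y -> len ) ? -1 : 1;   /* longer first */
    if ( x -> id != y -> id )
        return ( x -> id < y -> id ) ? -1 : 1;
    return 0;
}

/* Sort an array of keys into canonical order. */
static void PlacementKeySort ( PlacementKey *keys, size_t count )
{
    assert ( keys != NULL || count == 0 );
    if ( count > 1 )
        qsort ( keys, count, sizeof keys [ 0 ], PlacementKeyCmp );
}
```

Explicit comparisons are used instead of subtraction so that large or negative values cannot overflow the int result.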

PlacementRecord

The placement record is described as an open structure as part of the requirement to allow a user to extend this record.

Structure

open structure of PlacementRecord - to be extended by user

struct PlacementRecord
{
    DLNode n;
    int64_t id;
    const ReferenceObj *ref;
    INSDC_coord_zero pos;
    INSDC_coord_len len;
    int32_t mapq;
};

n
the structure is designed for inclusion in a doubly-linked list

id
the row-id of the placement (alignment) within its alignment table

ref
object representing reference sequence
each record gets its own counted reference

pos
the starting position of the placement on the reference
coordinates are zero-based
NB - pos can be negative (see below)

len
the length of the placement on the reference

mapq
stated mapping quality of alignment

The idea of this structure is to provide an interface both for its consumer and its producer, to be handled by the iterator.

When the iterator is used only for walking placements but not for looking within at the actual alignment, this structure is unlikely to be extended, since it gives the user the ability to quickly determine spatial relationships without detail at each position.

However, when zooming in on base-per-base alignments, the mode of operation will shift toward creation of richly populated records that can be individually examined at the resolution of a single base position.

The inclusion of mapq here is for the purposes of denormalization, giving the earliest possible filtering.

There is a case when alignment placements may be given with a negative starting coordinate. This happens when an alignment has been found to wrap around a circular reference and terminate at a lower coordinate than where it starts. These alignments are linearized by subtracting the length of the circular reference from the starting coordinate. This keeps the start < end.
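
The linearization described above amounts to a single subtraction. The sketch below assumes the alignment is given by a start and a length in circular coordinates; LinearizeCircularStart is a hypothetical helper and is not part of this interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the documented API.
   start   : zero-based start of the alignment, 0 <= start < ref_len
   len     : length of the alignment on the reference, 0 < len <= ref_len
   ref_len : total length of the circular reference
   Returns the linearized start, which is negative when the alignment
   runs past the end of the reference and wraps to its beginning. */
static int64_t LinearizeCircularStart ( int64_t start, int64_t len, int64_t ref_len )
{
    assert ( ref_len > 0 );
    assert ( start >= 0 && start < ref_len );
    assert ( len > 0 && len <= ref_len );

    if ( start + len > ref_len )
        return start - ref_len;   /* wraps: shift one full turn to the left */
    return start;                 /* does not wrap: unchanged */
}
```

For example, on a reference of 16569 bases an alignment starting at 16500 with length 100 ends at circular coordinate 31. Its linearized start is 16500 - 16569 = -69, and -69 + 100 = 31, so start < end holds.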

Cast

cast a placement record to one of two possible extension objects
allows up to three independent classes to be combined

Whack

douse a placement record
calls user code if provided

void PlacementRecordWhack ( const PlacementRecord *self );

If the user provided a whack function (automatically stored within the object), it will be called to clean up and dispose of the record.

Otherwise, the implementation will simply call free() to release memory.

PlacementIterator

Make

ask the alignment manager to create an iterator from individual components

rc_t AlignMgrMakePlacementIterator ( const AlignMgr *self,
    PlacementIterator **iter, uint64_t ref_pos, uint32_t ref_len,
    int64_t starting_ref_row, uint32_t ref_row_count,
    const VCursor *ref, const VCursor *align, bool secondary,
    rc_t ( * CC populate ) ( PlacementRecord **rec, const VCursor *align,
        int64_t id, uint64_t pos, uint32_t len, void *data ), void *data,
    void ( * CC whack ) ( void *obj ) );

iter - OUT
return parameter for the iterator

ref_pos
starting position of alignment in reference coordinates

ref_len
length of projection onto reference in reference space

starting_ref_row
starting row within ref cursor
externally determined to include desired window

ref_row_count
the number of rows to read from ref cursor

ref
cursor onto REFERENCE table of run
will be modified as necessary to include required columns
will be opened by iterator

align
cursor onto either the PRIMARY_ALIGNMENT or the SECONDARY_ALIGNMENT table of run
which one is indicated by secondary param
will be modified as necessary to include required columns
will be opened by iterator

secondary
boolean true if align cursor is on SECONDARY_ALIGNMENT table

populate - NULL OKAY
optional callback function to generate richly populated PlacementRecord

data - OPAQUE
user data sent in callback to populate function

whack - NULL OKAY
optional destructor/deallocator function
may be ignored if populate is NULL

The user will translate the position and length of the window onto the reference into a range of row-ids within the REFERENCE table. This range should be sufficiently ample to discover placements that may begin BEFORE the window but still intersect with it.
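
One way to perform this translation is sketched below. Everything in it is an assumption made for illustration: that each REFERENCE row covers a fixed number of bases (chunk_len), that the row-id of the first row of the reference of interest (first_row) is known, and that the caller picks a lookback distance at least as long as the longest placement expected. RefRowRange and WindowToRefRows are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the two values handed to the iterator
   as starting_ref_row and ref_row_count. */
typedef struct RefRowRange
{
    int64_t  starting_ref_row;
    uint32_t ref_row_count;
} RefRowRange;

/* Hypothetical helper, not part of the documented API.
   ref_pos, ref_len : window on the reference
   first_row        : row-id of the first REFERENCE row of this reference
   chunk_len        : bases covered by one REFERENCE row (assumed fixed)
   lookback         : how far before the window a placement may start
                      and still intersect it */
static RefRowRange WindowToRefRows ( uint64_t ref_pos, uint32_t ref_len,
    int64_t first_row, uint32_t chunk_len, uint32_t lookback )
{
    RefRowRange r;
    uint64_t search_start, last_pos;

    assert ( chunk_len > 0 );
    assert ( ref_len > 0 );

    /* widen the window to the left, clamping at the reference start */
    search_start = ( ref_pos > lookback ) ? ref_pos - lookback : 0;
    last_pos = ref_pos + ref_len - 1;

    r . starting_ref_row = first_row + ( int64_t ) ( search_start / chunk_len );
    r . ref_row_count =
        ( uint32_t ) ( last_pos / chunk_len - search_start / chunk_len + 1 );
    return r;
}
```

The sketch does not clamp the range to the end of the reference; a caller would need to do so from the reference length.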

The user will create two read-only cursors for a given cSRA object - one on the REFERENCE table and another on one of the two possible alignment tables, depending upon whether primary or secondary alignments are being examined. These will be used to construct the iterator object.

Indication of whether the align table is primary or secondary affects the iterator's query onto the reference table, which is why it is supplied as a stand-alone parameter.

If the user intends to examine a placement in any greater detail than its id, position and length projected upon the reference, then a callback function should be supplied. This function will allocate a structure having as its first member a PlacementRecord and should initialize any additional members within the function:

struct MyPlacementRecord
{
    PlacementRecord dad;

    const INSDC_dna_text *read;
};

static
rc_t MyPopulateFunc ( PlacementRecord **recp, const VCursor *align,
    int64_t id, uint64_t pos, uint32_t len, void *data )
{
    rc_t rc;
    struct MyPlacementRecord *rec;

    /* allocate structure - error handling omitted... */
    rec = malloc ( sizeof * rec );

    /* id, pos and len are provided for convenience,
       but I don't have to use them or fill out dad. */

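    /* initialize my part of the record */
    rc = read_and_copy_READ ( align, & rec -> read );
    if ( rc != 0 )
    {
        /* don't leak or hand back a half-built record */
        free ( rec );
        * recp = NULL;
        return rc;
    }

    /* return to iterator */
    * recp = & rec -> dad;
    return 0;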
}

static
void MyWhackFunc ( void *obj )
{
    struct MyPlacementRecord *rec = obj;
    /* the record owns this copy, so cast away const to free it */
    free ( ( void * ) rec -> read );
    free ( rec );
}

As shown above, a custom populate function will often call for a custom destructor/deallocator function. NB: if you provide such a function, it MUST deallocate the object itself, not just the members it owns.

AddRef

duplicate an existing reference

rc_t PlacementIteratorAddRef ( const PlacementIterator *self );

The object is defined as being reference counted. In VDB-2, references are direct pointers to objects and the objects maintain a reference counter.

Release

release an existing reference
potentially whacks object

rc_t PlacementIteratorRelease ( const PlacementIterator *self );

The object is defined as being reference counted. In VDB-2, references are direct pointers to objects and the objects maintain a reference counter.

NULL pointers are ignored.
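
The AddRef/Release contract can be illustrated with a generic counted object. This is a sketch of the pattern only and says nothing about how PlacementIterator is implemented: Counted and its functions are hypothetical, the counter is not thread-safe, and ignoring NULL in AddRef as well as Release is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical counted object; the flag lets a caller observe
   the moment the object is destroyed. */
typedef struct Counted
{
    unsigned int refcount;
    int *whacked;
} Counted;

/* Create an object holding one reference, owned by the caller. */
static Counted *CountedMake ( int *whacked_flag )
{
    Counted *self = malloc ( sizeof * self );
    if ( self != NULL )
    {
        self -> refcount = 1;
        self -> whacked = whacked_flag;
    }
    return self;
}

/* Duplicate a reference: the pointer is reused, the counter grows. */
static int CountedAddRef ( const Counted *self )
{
    if ( self != NULL )
        ++ ( ( Counted * ) self ) -> refcount;
    return 0;
}

/* Drop a reference; the object is destroyed with the last one. */
static int CountedRelease ( const Counted *self )
{
    if ( self != NULL )
    {
        Counted *obj = ( Counted * ) self;
        if ( -- obj -> refcount == 0 )
        {
            if ( obj -> whacked != NULL )
                * obj -> whacked = 1;
            free ( obj );
        }
    }
    return 0;
}
```

Both functions take a const pointer, as the prototypes above do, because holding or dropping a reference does not change the logical state of the object.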

NextAvailPos

check the next available position on reference having one or more placements
returns position and optionally length

rc_t PlacementIteratorNextAvailPos ( const PlacementIterator *self,
    uint64_t *pos, uint64_t *len );

pos - OUT
the reference position where the next available placement starts
NB - can be negative if the alignment wraps around

len - OUT, NULL OKAY
optional parameter returning the length of the next available placement

This message returns information about the next available placement, or if none are available, causes the iterator to search for more in its open cursors.

If no further placements are found, a non-zero return code will be issued. The exact code is TBD.

The exact position returned is used to read placement records using either NextRecordAt or NextIdAt.

The optional returned length is useful for performing a merge-sort on the available placements from several iterators. This message may be safely invoked any number of times, where the only side-effect possible is a single attempt at retrieving more data (on the initial invocation).
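
When merging several iterators, the position and length reported by each one are enough to decide which to consume from next. The sketch below assumes the caller has copied each iterator's answer into a small snapshot; IterHead and PickNextHead are hypothetical names, and the position is held in a signed type here because it may be negative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of what one iterator reported. */
typedef struct IterHead
{
    int      exhausted;  /* non-zero if no further placements were found */
    int64_t  pos;        /* next available position */
    uint64_t len;        /* length of the next available placement */
} IterHead;

/* Return the index of the head to consume next: lowest position,
   ties broken by longest length, then by lowest index.
   Returns -1 when every iterator is exhausted. */
static int PickNextHead ( const IterHead *heads, size_t count )
{
    int best = -1;
    size_t i;

    assert ( heads != NULL || count == 0 );

    for ( i = 0; i < count; ++ i )
    {
        const IterHead *h = & heads [ i ];
        if ( h -> exhausted )
            continue;
        if ( best < 0
             || h -> pos < heads [ best ] . pos
             || ( h -> pos == heads [ best ] . pos
                  && h -> len > heads [ best ] . len ) )
        {
            best = ( int ) i;
        }
    }
    return best;
}
```

After consuming from the chosen iterator, the caller would refresh that iterator's snapshot and pick again.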

NextRecordAt

retrieve and consume next available PlacementRecord

rc_t PlacementIteratorNextRecordAt ( PlacementIterator *self,
    uint64_t pos, const PlacementRecord **rec );

pos
the exact position returned by NextAvailPos
identifies location being queried

rec - OUT
return parameter for the next available placement at pos

This message allows a single record to be obtained on each invocation, where the intent is that the caller will loop until no further records are found at the stated position.

By looping, the code is not forced to create lists of placements that align at the exact same starting point, which further allows using multiple iterators in a sort-merge configuration.

As mentioned before, the record is designed to be held in a doubly-linked list and freed independently. The caller obtains locally sorted records from this iterator and places them into the list.
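
The intended calling pattern is an outer loop over positions and an inner loop over the records at each position. The sketch below shows only the shape of that loop against a mock: MockIter, MockRec and the Mock functions are hypothetical stand-ins over an in-memory array, not the real iterator, and they return a plain int instead of rc_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct MockRec { int64_t id; int64_t pos; uint32_t len; } MockRec;

/* Hypothetical stand-in for the iterator: a cursor over an array
   that is already in canonical order. */
typedef struct MockIter { const MockRec *recs; size_t count; size_t next; } MockIter;

/* Analogous to NextAvailPos: reports without consuming.
   Returns 0 on success, non-zero when exhausted. */
static int MockNextAvailPos ( const MockIter *self, int64_t *pos, uint32_t *len )
{
    if ( self -> next >= self -> count )
        return -1;
    * pos = self -> recs [ self -> next ] . pos;
    if ( len != NULL )
        * len = self -> recs [ self -> next ] . len;
    return 0;
}

/* Analogous to NextRecordAt: consumes one record, but only if it
   starts exactly at pos. */
static int MockNextRecordAt ( MockIter *self, int64_t pos, const MockRec **rec )
{
    if ( self -> next >= self -> count
         || self -> recs [ self -> next ] . pos != pos )
        return -1;
    * rec = & self -> recs [ self -> next ++ ];
    return 0;
}

/* Visit every record in order, storing up to max ids in out.
   Returns the number of records visited. */
static size_t DrainIds ( MockIter *it, int64_t *out, size_t max )
{
    size_t n = 0;
    int64_t pos;

    assert ( it != NULL );

    while ( MockNextAvailPos ( it, & pos, NULL ) == 0 )
    {
        const MockRec *rec;
        /* loop until no further records are found at this position */
        while ( MockNextRecordAt ( it, pos, & rec ) == 0 )
        {
            if ( n < max )
                out [ n ] = rec -> id;
            ++ n;
        }
    }
    return n;
}
```

With the real interface the caller would typically link each record into its own list at the point where this sketch stores the id.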

NextIdAt

retrieve information from the next available PlacementRecord
douse the record upon return

rc_t PlacementIteratorNextIdAt ( PlacementIterator *self,
    uint64_t pos, int64_t *row_id, uint64_t *len );

pos
the exact position returned by NextAvailPos
identifies location being queried

row_id - OUT
return parameter for the next placement's id

len - OUT, NULL OKAY
optional return parameter for the next placement's length

This message simply extracts information held within internal records. See NextRecordAt.


NCBI VDB-2 Documentation