SRA Toolkit File Formats
Database file formats
The contents of the SRA archives are not suitable for off-the-shelf database management systems
because of the large size of the archives. Compressing the contents was critical to keeping the
database online cost-effectively with fast retrieval.
To increase compression, a column-major format was used rather than the more common row-major
format, in which all the contents of a row's tuple are kept together. Like types compress
better than unlike types because their content is more predictable. Duplicate values in contiguous
rows of a column are common and can be compressed with run-length encoding. Because a column holds
like types, further compression can be made more effective by predicting the range and volatility
of the data within the column.
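The effect of column-major storage on run-length encoding can be seen in a small sketch. The rows, the field names and the helper functions below are hypothetical and chosen only for illustration; they are not the encoders the SRA Toolkit actually uses.

```python
from itertools import groupby

# Hypothetical rows for illustration: (spot_id, platform, read_length).
rows = [
    (1, "ILLUMINA", 100),
    (2, "ILLUMINA", 100),
    (3, "ILLUMINA", 100),
    (4, "ILLUMINA", 150),
]

def to_columns(rows):
    """Transpose row-major tuples into column-major lists."""
    return [list(col) for col in zip(*rows)]

def rle_encode(values):
    """Run-length encode a sequence as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

columns = to_columns(rows)
encoded = [rle_encode(col) for col in columns]

# The platform column collapses to a single run, while the spot_id
# column (all distinct values) gains nothing from run-length encoding.
for original, pairs in zip(columns, encoded):
    assert rle_decode(pairs) == original
```

In row-major order the same values would be interleaved with unlike neighbors, so no runs would form.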
The native file system of the target platform is also used, unlike a DBMS, which manages all the
space within a large file or even an entire disk volume. This improves efficiency both in development
and at run time: only those files that need to be opened are opened, and multiple processes acting on
separate database objects do not have to coordinate reads and writes.
Columns
A column is kept in its own directory along with any supporting files needed such as various index files.
Tables
A table is a collection of related columns. It is also kept in its own directory. A row is made up of one entry from each column, forming
an N-tuple. Most older DNA file formats are stored within the SRA system as a table.
Databases
In the SRA Toolkit a database is typically a set of tables, though it can contain additional columns(?)
or other databases. Again, it is kept in its own directory.
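The directory-per-object layout described above can be sketched as follows. Every name here (`example_db`, `SEQUENCE`, `READ`, `QUALITY`, and the `data` file) is a hypothetical placeholder, and storing values as newline-separated text is an assumption made for brevity; the real on-disk layout and encoding are not specified in this document.

```python
import tempfile
from pathlib import Path

def write_column(table_dir, name, values):
    """Store one column in its own directory (hypothetical layout)."""
    col_dir = Path(table_dir) / name
    col_dir.mkdir(parents=True, exist_ok=True)
    (col_dir / "data").write_text("\n".join(values))

def read_column(table_dir, name):
    """Open only the file belonging to the requested column."""
    return (Path(table_dir) / name / "data").read_text().split("\n")

def read_row(table_dir, names, index):
    """Assemble row `index` as an N-tuple, one entry per column."""
    return tuple(read_column(table_dir, n)[index] for n in names)

with tempfile.TemporaryDirectory() as root:
    # A database directory containing one table directory.
    table = Path(root) / "example_db" / "SEQUENCE"
    write_column(table, "READ", ["ACGT", "TTGA"])
    write_column(table, "QUALITY", ["IIII", "HHHH"])

    second_row = read_row(table, ["READ", "QUALITY"], 1)
    layout = sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*"))
```

A reader that needs only `READ` never touches the `QUALITY` directory, which is the run-time benefit of relying on the native file system.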
Archive file
The SRA Toolkit's native file format is a single-file archive that contains the files and directories needed
for the database object as described above.
Superficially the format resembles the older tar file format or a zip compressed archive, but with differences
that allow efficient processing without extraction. Unlike a tar file, the directory information for the whole
file is kept together so it can be read and processed at one time, rather than by scanning the whole file to
find directory information spread throughout it. Unlike a zip file, the individual columns are already
compressed, and the archive makes no futile attempt to compress them further.
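To illustrate why keeping the directory information together helps, here is a minimal sketch of a toy archive with a single directory block at the front. The byte layout (an 8-byte length, a JSON directory, then raw payloads) and all function and member names are assumptions invented for this example; this is not the actual SRA archive format.

```python
import io
import json
import struct

def pack_archive(members):
    """Build a toy archive: 8-byte directory length, JSON directory, payloads."""
    directory, payload, offset = {}, b"", 0
    for name, data in members.items():
        directory[name] = [offset, len(data)]  # position within the payload area
        payload += data                        # stored as-is, no recompression
        offset += len(data)
    toc = json.dumps(directory).encode("utf-8")
    return struct.pack("<Q", len(toc)) + toc + payload

def read_directory(f):
    """Read the whole directory in one pass from the start of the file."""
    f.seek(0)
    (toc_len,) = struct.unpack("<Q", f.read(8))
    directory = json.loads(f.read(toc_len).decode("utf-8"))
    return directory, 8 + toc_len

def read_member(f, name):
    """Seek straight to one member without extracting or scanning the rest."""
    directory, data_start = read_directory(f)
    offset, size = directory[name]
    f.seek(data_start + offset)
    return f.read(size)

archive = io.BytesIO(pack_archive({
    "READ/data": b"ACGTTTGA",
    "QUALITY/data": b"IIIIHHHH",
}))
quality = read_member(archive, "QUALITY/data")
```

A tar-style reader would instead have to walk header by header through the file to locate `QUALITY/data`.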
SRA
An SRA file is a single-file table in an archive format.
cSRA
A cSRA file is a single-file database with several expected tables in an archive format.
Extraction data formats
NCBI does not encourage routinely extracting the data from the archive, as our archive formats are compact
and easy to use. If other formats are needed for specific uses, however, the data can be extracted in full or
in part to those formats. These formats have many variations, and options to the extraction programs
select specific forms of each format.
Input formats?
Are we documenting loaders yet?