2008 Apr 27 - Sun
HDF Group's Hierarchical Data Format (HDF5) Library
I've been working with the HDF Group's HDF5 (Hierarchical Data Format) library for the last
little while. It is a mechanism for managing self-described data collections, no matter how
large or complicated. From their website, here are a few features:
- A versatile data model that can represent very complex data objects and a wide variety
of metadata.
- A completely portable file format with no limit on the number or size of data objects
in the collection.
- A software library that runs on a range of computational platforms, from laptops to
massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and
Java interfaces.
- A rich set of integrated performance features that allow for access time and storage
space optimizations.
- Tools and applications for managing, manipulating, viewing, and analyzing the data in
the collection.
I'm using the HDF5 library in a stock market research and trading platform I'm developing
in C++. The library is used to store Bars, Quotes, Trades, and MarketDepth. Each of these
data types uses ptime from the Boost DateTime library for time referencing.
I've been able to use C++'s container and iterator concepts to write a read/write
container with a custom random-access iterator. This allows me to use STL (Standard
Template Library) algorithms such as upper_bound, lower_bound, and equal_range to quickly
search for selected sub-ranges of the various data types.
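As a rough illustration of the sub-range searches described above, here is a minimal
sketch. It is not my actual container: a std::vector stands in for the HDF5-backed
container, a plain integer stands in for ptime, and the names Bar, ByTime, barsInWindow,
and countAt are made up for the example. The point is only that once a container exposes
random-access iterators over time-sorted records, the standard algorithms do the searching.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical record. The real platform uses boost::posix_time::ptime for the
// timestamp; an integer (e.g. seconds since an epoch) stands in for it here.
struct Bar {
    std::int64_t time;
    double open;
    double close;
    std::int64_t volume;
};

// Comparator that lets the algorithms compare a Bar against a bare timestamp.
struct ByTime {
    bool operator()(const Bar& b, std::int64_t t) const { return b.time < t; }
    bool operator()(std::int64_t t, const Bar& b) const { return t < b.time; }
};

// Returns the half-open index range [first, last) of bars with from <= time < to.
// The bars must already be sorted ascending by time.
inline std::pair<std::size_t, std::size_t>
barsInWindow(const std::vector<Bar>& bars, std::int64_t from, std::int64_t to) {
    assert(from <= to);
    auto first = std::lower_bound(bars.begin(), bars.end(), from, ByTime{});
    auto last = std::lower_bound(first, bars.end(), to, ByTime{});
    return std::make_pair(static_cast<std::size_t>(first - bars.begin()),
                          static_cast<std::size_t>(last - bars.begin()));
}

// Returns how many bars carry exactly the timestamp t.
inline std::size_t countAt(const std::vector<Bar>& bars, std::int64_t t) {
    auto range = std::equal_range(bars.begin(), bars.end(), t, ByTime{});
    return static_cast<std::size_t>(range.second - range.first);
}
```

Because the algorithms only need random-access iterators, the same calls should work
against a dataset-backed container as long as its iterator supports constant-time
arithmetic and the records are stored in time order.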
From a version perspective, I started out with the relatively new 1.8.0 rc5 HDF5 release,
and have recently upgraded to the 1.9.3 HDF5 release. The more recent 1.9.4 HDF5 release
appears to have link problems. The web pages show downloads for 1.8.0, but with a little
extra digging, there is an HDF5 snapshot server available.
Building the HDF5 library on Windows is not too difficult. The hardest part is finding the
build documentation, which is located in the /release_docs directory of the extraction. I
used tar on my Cygwin install to expand/extract the HDF5 distribution file, but recent
versions of WinZip or 7-Zip should also be able to handle it on a Windows machine. Building
the 1.9.3 version of HDF5 was easier than building the 1.8.0 rc5 version, where I had
several missing file issues.
One key point is to download both zlib and szlib and put them in directories, otherwise
the HDF5 library won't build. Two environment variables are required:
- HDF5_EXT_SZIP=szlibdll.lib
- HDF5_EXT_ZLIB=zlib1.lib
To start the build process, run the copy_hdf.bat file. Then in Visual Studio, open the
windows/proj/all/all.sln file, select the build/debug/library options, and build the
solution. After the build, run installhdf5lib.bat and you'll find the libraries and
includes in hdf5lib/debug et al. I copy the .dlls into my project's debug directory, and
use tools->options->c++ general->include files to point to the include file directory.
In order to use the library, one has to be aware of dataspaces (the rank and size of
structures), compound types (e.g., a bar is composed of time, open, close, and volume),
datasets (the data as stored on the drive), and properties (descriptors for tuning how the
data is stored).
I've been able to write a vector of Bar objects out to a dataset by being particularly
careful in describing the in-memory datatype vs the on-drive datatype. HDF5 then takes care
of handling the various offsets of the base values (time, double, int) as they are written
from the class to the drive and back again. This self-described dataset allows an HDF5
datafile to be created on a little-endian machine and then read from a big-endian machine
with no problems.
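To make the in-memory vs on-drive distinction a little more concrete, here is a small
sketch of the idea only. It does not use the HDF5 API at all; the Field table, kBarFields,
packBar, and unpackBar are names invented for this example, and the Bar layout is an
assumption. It describes each member by its in-memory offset and size, then writes a
packed, little-endian record so that padding and host byte order stay out of the file.
It also assumes doubles use the same byte order as integers, which holds on the common
platforms I know of.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory layout; the compiler is free to add padding.
struct Bar {
    std::int64_t time;
    double open;
    double close;
    std::int32_t volume;
};

// One entry per member: where it lives in memory and how wide it is.
struct Field {
    const char* name;
    std::size_t memOffset;
    std::size_t size;
};

static const Field kBarFields[] = {
    {"time",   offsetof(Bar, time),   sizeof(std::int64_t)},
    {"open",   offsetof(Bar, open),   sizeof(double)},
    {"close",  offsetof(Bar, close),  sizeof(double)},
    {"volume", offsetof(Bar, volume), sizeof(std::int32_t)},
};

// Packed on-drive record: members back to back, no padding.
constexpr std::size_t kDiskRecordSize =
    sizeof(std::int64_t) + 2 * sizeof(double) + sizeof(std::int32_t);

inline bool hostIsLittleEndian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Copies one member, reversing its bytes when host and file byte order differ.
inline void copyField(unsigned char* dst, const unsigned char* src,
                      std::size_t n, bool swap) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[swap ? n - 1 - i : i];
}

// In-memory Bar -> packed little-endian record.
inline std::vector<unsigned char> packBar(const Bar& bar) {
    std::vector<unsigned char> out(kDiskRecordSize);
    const unsigned char* mem = reinterpret_cast<const unsigned char*>(&bar);
    const bool swap = !hostIsLittleEndian();
    std::size_t diskOffset = 0;
    for (const Field& f : kBarFields) {
        copyField(&out[diskOffset], mem + f.memOffset, f.size, swap);
        diskOffset += f.size;
    }
    assert(diskOffset == kDiskRecordSize);
    return out;
}

// Packed little-endian record -> in-memory Bar.
inline Bar unpackBar(const std::vector<unsigned char>& rec) {
    assert(rec.size() == kDiskRecordSize);
    Bar bar{};
    unsigned char* mem = reinterpret_cast<unsigned char*>(&bar);
    const bool swap = !hostIsLittleEndian();
    std::size_t diskOffset = 0;
    for (const Field& f : kBarFields) {
        copyField(mem + f.memOffset, &rec[diskOffset], f.size, swap);
        diskOffset += f.size;
    }
    return bar;
}
```

HDF5 does this bookkeeping itself once the memory and file datatypes have been described,
which is why the descriptions have to be exact.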
Another interesting capability of the HDF5 library is in how the data is stored. As
mentioned before, compression can be enabled with zlib (szlib has some limitations in that
it is unable to work with clustered data). Further compression can be enabled through what
the library calls the shuffle filter. I've been using data records which are identical in
length. When you look at a series of records, you'll find that a number of byte positions
are identical: they could be all zeros, or some other value if the data falls within a
narrow range of values across a series of records. Grouping these columns of bytes together
serves as a convenient first-order level of compression before using the more generic zlib
flavor of compression. Large datasets can use minimal data storage when using these two
compression concepts. I haven't done heavy testing, but I think I've seen a 50% reduction
in space usage when I turned these on. Probably with cluster size tuning (a cluster, or
'chunk' in HDF5 terms, being a specific number of records in a block), I could further
reduce storage requirements. But of course, there will be access time considerations to
handle as well.
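The byte-column idea above is easy to try outside the library. The following is a
standalone sketch, not HDF5's own code: shuffleBytes, unshuffleBytes, and countRuns are
names invented here, and counting runs of equal bytes is only a crude stand-in for how
well a real compressor would do. In HDF5 itself the corresponding settings are, as far as
I know, enabled on the dataset creation property list with H5Pset_shuffle and
H5Pset_deflate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Regroups fixed-length records by byte position: byte k of record r moves to
// index k * numRecords + r, so equal "columns" of bytes end up side by side.
inline std::vector<unsigned char>
shuffleBytes(const std::vector<unsigned char>& in, std::size_t recordSize) {
    assert(recordSize > 0 && in.size() % recordSize == 0);
    const std::size_t numRecords = in.size() / recordSize;
    std::vector<unsigned char> out(in.size());
    for (std::size_t r = 0; r < numRecords; ++r)
        for (std::size_t k = 0; k < recordSize; ++k)
            out[k * numRecords + r] = in[r * recordSize + k];
    return out;
}

// Inverse of shuffleBytes: restores the original record-by-record layout.
inline std::vector<unsigned char>
unshuffleBytes(const std::vector<unsigned char>& in, std::size_t recordSize) {
    assert(recordSize > 0 && in.size() % recordSize == 0);
    const std::size_t numRecords = in.size() / recordSize;
    std::vector<unsigned char> out(in.size());
    for (std::size_t r = 0; r < numRecords; ++r)
        for (std::size_t k = 0; k < recordSize; ++k)
            out[r * recordSize + k] = in[k * numRecords + r];
    return out;
}

// Counts runs of identical consecutive bytes. Fewer runs suggests the data
// will compress better; this is a rough proxy, not a compression ratio.
inline std::size_t countRuns(const std::vector<unsigned char>& data) {
    if (data.empty()) return 0;
    std::size_t runs = 1;
    for (std::size_t i = 1; i < data.size(); ++i)
        if (data[i] != data[i - 1]) ++runs;
    return runs;
}
```

For three 4-byte records holding the values 1, 2, and 3 with the high bytes all zero, the
record-by-record layout has six runs, while the shuffled layout has four, with all nine
zero bytes in a single run.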
It has taken some time to understand the concepts and subtleties of the HDF5 library, but
now that I have, when coupled with C++ class and metaprogramming capabilities, and with
suitable abstractions, quite powerful data analytics can be built.
As one more highlight, there is a Java program available called HDFView which can be used
to view any HDF5 datafile. It shows just how well the self-described concept works, and it
is useful as a debugging aid when creating data descriptions and data sets.
2008 Apr 25 - Fri
Open Source Site of the Day: MapServer -- Spatially-enabled Internet Applications
I came across MapServer after someone mentioned it on the Wt mailing list.
I haven't looked at all entries in the example galleries, but it might be interesting to
import the existing maps of Bermuda into this and create a tourist attraction. People could
upload points of interest, pictures, voice-overs, videos, and text. It could become an
ideal 'living historical document' of Bermuda, with everything geo-referenced.
The resulting database could be used by tourists to find points of interest, learn about
them, and record their own thoughts.
With some extra work, and for those with PDAs, walking tours with roving maps could be
implemented. And if the PDA happens to have GPS in it, the roaming tour would truly be self
guided, interactive, random, and beneficial to those with a specific focus.