One Unified Global Perspective
Communications with a Global Perspective

2008 Apr 27 - Sun

HDF Group's Hierarchical Data Format (HDF5) Library

I've been working with the HDF Group's HDF5 (Hierarchical Data Format) library for the last little while. It is a mechanism for managing self-describing data collections, no matter how large or complicated. From their website, here are a few features:

  • A versatile data model that can represent very complex data objects and a wide variety of metadata.
  • A completely portable file format with no limit on the number or size of data objects in the collection.
  • A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
  • A rich set of integrated performance features that allow for access time and storage space optimizations.
  • Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.

I'm using the HDF5 library in a stock market research and trading platform I'm developing in C++. The library is used to store Bars, Quotes, Trades, and MarketDepth. Each of these data types uses ptime from the Boost DateTime library for time referencing.

I've been able to use C++'s container and iterator concepts to write a read/write container with a custom random-access iterator. This allows me to use STL (Standard Template Library) algorithms such as upper_bound, lower_bound, and equal_range to quickly search for selected sub-ranges of the various data types.
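As a rough illustration of that kind of search, here is a minimal sketch that runs the same algorithms over a plain std::vector rather than an HDF5-backed container. The Bar layout, the BarTimeLess comparator, and the barsInRange helper are all hypothetical names made up for this example, and an integer timestamp stands in for Boost's ptime so the code needs nothing beyond the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Bar type. The real class uses Boost's ptime;
// an integer timestamp keeps this sketch free of external dependencies.
struct Bar {
    std::int64_t time;  // assumed: seconds since some epoch
    double open;
    double close;
    long volume;
};

// Comparator that lets the algorithms compare a Bar against a bare timestamp.
struct BarTimeLess {
    bool operator()(const Bar& lhs, std::int64_t rhs) const { return lhs.time < rhs; }
    bool operator()(std::int64_t lhs, const Bar& rhs) const { return lhs < rhs.time; }
};

using BarIter = std::vector<Bar>::const_iterator;

// Returns the half-open iterator range covering every bar whose time lies in
// the closed interval [begin, end]. Assumes bars is sorted by time.
inline std::pair<BarIter, BarIter> barsInRange(const std::vector<Bar>& bars,
                                               std::int64_t begin,
                                               std::int64_t end) {
    assert(begin <= end);
    BarIter first = std::lower_bound(bars.begin(), bars.end(), begin, BarTimeLess{});
    BarIter last = std::upper_bound(first, bars.end(), end, BarTimeLess{});
    return {first, last};
}
```

Calling barsInRange(bars, t0, t1) yields a pair of iterators that can be handed straight to other algorithms. Any container that exposes random-access iterators over time-ordered records can be searched the same way, which is the point of writing the custom iterator.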

From a version perspective, I started out with the relatively new 1.8.0 rc5 HDF5 release, and have recently upgraded to the 1.9.3 HDF5 release. The more recent 1.9.4 HDF5 release appears to have link problems. The web pages show downloads for 1.8.0, but with a little extra digging, you'll find that there is an HDF5 snapshot server available.

Building the HDF5 library on Windows is not too difficult. The hardest part is finding the build documentation, which is located in the /release_docs directory of the extraction. I used tar on my Cygwin install to extract the HDF5 distribution file, but recent versions of WinZip or 7-Zip should also be able to handle it on a Windows machine. Building the 1.9.3 version of HDF5 was easier than building the 1.8.0 rc5 version, where I had several missing file issues.

One key point is to download both zlib and szip and put them in directories; otherwise the HDF5 library won't build. Two environment variables are required:

  • HDF5_EXT_SZIP=szlibdll.lib
  • HDF5_EXT_ZLIB=zlib1.lib

To start the build process, run the copy_hdf.bat file. Then, in Visual Studio, open the windows/proj/all/all.sln file, select the build/debug/library options, and build the solution. After the build, run installhdf5lib.bat, and you'll find the libraries and includes in hdf5lib/debug et al. I copy the .dlls into my project's debug directory, and use tools->options->c++ general->include files to point to the include file directory.

In order to use the library, one has to be aware of dataspaces (the rank and size of the stored structures), compound types (e.g., a bar is composed of time, open, close, and volume), datasets (the data as stored on the drive), and properties (descriptors for tuning storage behavior).

I've been able to write a vector of Bar objects out to a dataset by being particularly careful in describing the in-memory datatype versus the on-drive datatype. HDF5 then takes care of handling the various offsets of the base values (time, double, int) as they are written from the class to the drive and back again. This self-described dataset allows an HDF5 datafile to be created on a little-endian machine and then read on a big-endian machine with no problems.
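To make the offset handling concrete, here is a minimal sketch of the idea that does not call HDF5 at all. It records the name, offset, and size of each member of a hypothetical Bar struct, then uses that description to copy the members into a packed record with each value's bytes reversed, standing in for a file laid out for a machine of the opposite endianness. In the real library, the description would be built with H5Tcreate(H5T_COMPOUND, ...) and one H5Tinsert call per member, typically using the HOFFSET macro for the offsets. The Member, barLayout, packSwapped, and unpackSwapped names are assumptions made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical in-memory record; the compiler may add padding after volume.
struct Bar {
    std::int64_t time;
    double open;
    double close;
    std::int32_t volume;
};

// One entry of the layout description: where a member lives in memory.
struct Member {
    std::string name;
    std::size_t offset;
    std::size_t size;
};

inline std::vector<Member> barLayout() {
    return {
        {"time", offsetof(Bar, time), sizeof(std::int64_t)},
        {"open", offsetof(Bar, open), sizeof(double)},
        {"close", offsetof(Bar, close), sizeof(double)},
        {"volume", offsetof(Bar, volume), sizeof(std::int32_t)},
    };
}

// Copies each member from its in-memory offset into a packed record with no
// padding, reversing the byte order of every value along the way.
inline std::vector<unsigned char> packSwapped(const Bar& bar,
                                              const std::vector<Member>& layout) {
    const unsigned char* base = reinterpret_cast<const unsigned char*>(&bar);
    std::vector<unsigned char> out;
    for (const Member& m : layout) {
        std::reverse_copy(base + m.offset, base + m.offset + m.size,
                          std::back_inserter(out));
    }
    return out;
}

// Inverse of packSwapped: restores host byte order and in-memory offsets.
inline Bar unpackSwapped(const std::vector<unsigned char>& in,
                         const std::vector<Member>& layout) {
    Bar bar{};
    unsigned char* base = reinterpret_cast<unsigned char*>(&bar);
    std::size_t pos = 0;
    for (const Member& m : layout) {
        assert(pos + m.size <= in.size());
        std::reverse_copy(in.data() + pos, in.data() + pos + m.size, base + m.offset);
        pos += m.size;
    }
    return bar;
}
```

Because the layout is data rather than code, the same two loops work for any record type once its members are described. That is roughly the service a self-described file format provides, although HDF5's own conversion machinery is considerably more general than this sketch.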

Another interesting capability of the HDF5 library is in how the data is stored. As mentioned before, compression can be enabled with zlib (szip has some limitations in that it is unable to work with clustered data). Further compression can be enabled through what they call 'shuffling'. I've been using data records which are identical in length. When you look at a series of records, you'll find that a number of byte positions are identical: they could be all zeros, or some other value if the data falls within a narrow range of values across a series of records. These columns of bytes serve as a convenient first-order level of compression before using the more generic zlib flavor of compression. Large datasets can use minimal data storage when using these two compression concepts. I haven't done heavy testing, but I think I've seen a 50% reduction in space usage when I turned these on. With cluster size tuning (a cluster, which HDF5 calls a chunk, being a specific number of records in a block), I could probably reduce storage requirements further. But of course, there will be access time considerations to handle as well.
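The byte-column idea can be sketched in a few lines. In HDF5 the transformation is exposed as the shuffle filter, enabled with H5Pset_shuffle on a dataset creation property list alongside H5Pset_chunk and H5Pset_deflate. The code below is not HDF5's implementation, only an illustration of the rearrangement under the assumption of fixed-length records; shuffleBytes, unshuffleBytes, and countRuns are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rearranges fixed-length records so that byte 0 of every record comes first,
// then byte 1 of every record, and so on. Assumes in.size() is a multiple of
// recordSize.
inline std::vector<unsigned char> shuffleBytes(const std::vector<unsigned char>& in,
                                               std::size_t recordSize) {
    assert(recordSize > 0 && in.size() % recordSize == 0);
    const std::size_t count = in.size() / recordSize;
    std::vector<unsigned char> out(in.size());
    for (std::size_t r = 0; r < count; ++r) {
        for (std::size_t b = 0; b < recordSize; ++b) {
            out[b * count + r] = in[r * recordSize + b];
        }
    }
    return out;
}

// Inverse of shuffleBytes: restores the original record-by-record order.
inline std::vector<unsigned char> unshuffleBytes(const std::vector<unsigned char>& in,
                                                 std::size_t recordSize) {
    assert(recordSize > 0 && in.size() % recordSize == 0);
    const std::size_t count = in.size() / recordSize;
    std::vector<unsigned char> out(in.size());
    for (std::size_t r = 0; r < count; ++r) {
        for (std::size_t b = 0; b < recordSize; ++b) {
            out[r * recordSize + b] = in[b * count + r];
        }
    }
    return out;
}

// Counts runs of identical adjacent bytes; fewer runs generally means the
// data is friendlier to a general-purpose compressor such as zlib.
inline std::size_t countRuns(const std::vector<unsigned char>& data) {
    if (data.empty()) return 0;
    std::size_t runs = 1;
    for (std::size_t i = 1; i < data.size(); ++i) {
        if (data[i] != data[i - 1]) ++runs;
    }
    return runs;
}
```

Comparing countRuns before and after shuffleBytes on a block of similar records gives a quick feel for why the rearrangement helps: the bytes that barely change from record to record end up next to each other, where a compressor can collapse them.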

It has taken some time to understand the concepts and subtleties of the HDF5 library, but now that I have, I can see that, when coupled with C++ class and metaprogramming capabilities and suitable abstractions, it can be used to build quite powerful data analytics.

As one more highlight, there is a Java program available called HDFView which can be used to view any HDF5 datafile. It shows just how well the self-described concept works, and it is useful as a debugging aid when creating data descriptions and datasets.



2008 Apr 25 - Fri

Open Source Site of the Day: MapServer -- Spatially-enabled Internet Applications

I came across MapServer through a mention by someone on the Wt mailing list.

I haven't looked at all the entries in the example galleries, but it might be interesting to import the existing maps of Bermuda into this and create a tourist attraction. People could upload points of interest, pictures, voice-overs, videos, and text. It could become an ideal 'living historical document' of Bermuda, with everything geo-referenced.

The resulting database could be used by tourists to find points of interest, learn about them, and record their own thoughts.

With some extra work, and for those with PDAs, walking tours with roving maps could be implemented. And if the PDA happens to have GPS in it, the roaming tour would truly be self-guided, interactive, random, and beneficial to those with a specific focus.




Blog Content ©2008
Ray Burkholder
All Rights Reserved


Disclaimer: This site may include market analysis. All ideas, opinions, and/or forecasts, expressed or implied herein, are for informational purposes only and should not be construed as a recommendation to invest, trade, and/or speculate in the markets. Any investments, trades, and/or speculations made in light of the ideas, opinions, and/or forecasts, expressed or implied herein, are committed at your own risk, financial or otherwise.