2008 Apr 27 - Sun
HDF Group's Hierarchical Data Format (HDF5) Library
I've been working with
the HDF Group's HDF5 (Hierarchical Data
Format) library for the last little while. It is a mechanism for managing
self-described data collections, no matter how large or complicated. From their website,
here are a few features:
- A versatile data model that can represent very complex data objects and a wide variety
of metadata.
- A completely portable file format with no limit on the number or size of data objects
in the collection.
- A software library that runs on a range of computational platforms, from laptops to
massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and
Java interfaces.
- A rich set of integrated performance features that allow for access time and storage
space optimizations.
- Tools and applications for managing, manipulating, viewing, and analyzing the data in
the collection.
I'm using the HDF5 library in a stock market research and trading platform I'm developing
in
C++. The library is used to store Bars, Quotes, Trades, and MarketDepth. Each of these
data types uses ptime from the Boost DateTime library for time referencing.
I've been able to use C++'s container and iterator concepts to write a read/write
container with appropriate custom random iterator capabilities. This allows me to use STL
(Standard Template Library) Algorithms such as upper_bound, lower_bound, and equal_range
to quickly search for selected sub-ranges of the various data types.
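As a minimal sketch of that kind of search, assuming the records are sorted by time: the Bar struct, the BarTimeLess comparator, and the bars_between helper below are hypothetical names made up for illustration, and a plain integer timestamp stands in for ptime so the example needs only the standard library. Any container that exposes random access iterators could be passed in place of the std::vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical record type; an integer timestamp stands in for ptime here.
struct Bar {
    std::int64_t time;   // e.g. seconds since some epoch
    double open;
    double close;
    long volume;
};

// Compares a Bar against a bare timestamp in either order, which is what
// lower_bound, upper_bound, and equal_range need for a heterogeneous search.
struct BarTimeLess {
    bool operator()(const Bar& b, std::int64_t t) const { return b.time < t; }
    bool operator()(std::int64_t t, const Bar& b) const { return t < b.time; }
};

// Returns the half-open range of bars with begin <= time < end.
// The input range must already be sorted by time.
template <typename Iter>
std::pair<Iter, Iter> bars_between(Iter first, Iter last,
                                   std::int64_t begin, std::int64_t end) {
    assert(begin <= end);
    Iter lo = std::lower_bound(first, last, begin, BarTimeLess{});
    Iter hi = std::lower_bound(lo, last, end, BarTimeLess{});
    return {lo, hi};
}
```

With random access iterators each search is a binary search, so the number of records touched grows with the logarithm of the dataset size rather than with the dataset size itself.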
From a version perspective, I started out with the relatively new 1.8.0 rc5 HDF5 release,
and have recently upgraded to the 1.9.3 HDF5 release. The more recent 1.9.4 HDF5 release
appears to have link problems. The web pages show downloads for 1.8.0, but with a little
extra digging, there is an HDF5 snapshot server available.
Building the HDF5 library on Windows is not too difficult. The hardest part is finding
the build documentation, which is located in the /release_docs directory of the extraction. I
used tar on my Cygwin install to expand/extract the HDF5 distribution file, but recent
versions of Winzip or 7Zip should also be able to handle it on a Windows machine. Building
the 1.9.3 version of HDF5 was easier than building the 1.8.0 rc5 version, where I had run
into several missing file issues.
One key point is to download both zlib and szip (szlib) and put them in directories, otherwise
the HDF5 library won't build. Two environment variables are required:
- HDF5_EXT_SZIP=szlibdll.lib
- HDF5_EXT_ZLIB=zlib1.lib
To start the build process, run the copy_hdf.bat file. Then in Visual Studio, open the
windows/proj/all/all.sln file, select build/debug/library options and then build the
solution. After the build, run installhdf5lib.bat and you'll find the libraries and
includes in hdf5lib/debug et al. I copy the .dlls into my project's debug directory, and
use tools->options->c++ general->include files to point to the include file directory.
In order to use the library, one has to be aware of dataspaces (rank size of structures),
composite types (i.e., a bar is composed of time, open, close, and volume), datasets (the data
as stored on the drive), and properties (some descriptors for tuning storage abilities).
I've been able to write a vector of Bar objects out to a dataset by being particularly
careful in describing the in-memory datatype vs on-drive datatype. HDF5 then takes care of
handling the various offsets of the base values (time, double, int) as they are written from
the class to drive and back again. This self-described dataset allows an HDF5 datafile to
be created on a little-endian machine and then read from a big-endian machine with no
problems.
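The offset bookkeeping can be illustrated without the library. The following is only a sketch of the idea, not the HDF5 API: Bar, Field, bar_fields, pack, and unpack are hypothetical names, and the field table plays the role of a compound type description (member name, offset in the in-memory struct, size). Packing walks the table to copy each member into a padding-free record, and unpacking reverses it. Byte-order conversion, which HDF5 also performs, is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical in-memory record; compilers usually add padding after volume.
struct Bar {
    std::int64_t time;
    double open;
    double close;
    std::int32_t volume;
};

// One member of the compound description: name, in-memory offset, byte size.
struct Field {
    std::string name;
    std::size_t offset;
    std::size_t size;
};

// The description of Bar, built from the compiler's own layout.
const std::vector<Field>& bar_fields() {
    static const std::vector<Field> fields = {
        {"time",   offsetof(Bar, time),   sizeof(std::int64_t)},
        {"open",   offsetof(Bar, open),   sizeof(double)},
        {"close",  offsetof(Bar, close),  sizeof(double)},
        {"volume", offsetof(Bar, volume), sizeof(std::int32_t)},
    };
    return fields;
}

// Size of one record once padding is removed.
std::size_t packed_size(const std::vector<Field>& fields) {
    std::size_t total = 0;
    for (const Field& f : fields) total += f.size;
    return total;
}

// Copies each member from its in-memory offset into a packed record.
std::vector<unsigned char> pack(const Bar& bar, const std::vector<Field>& fields) {
    std::vector<unsigned char> out(packed_size(fields));
    const unsigned char* src = reinterpret_cast<const unsigned char*>(&bar);
    std::size_t pos = 0;
    for (const Field& f : fields) {
        assert(f.offset + f.size <= sizeof(Bar));
        std::memcpy(out.data() + pos, src + f.offset, f.size);
        pos += f.size;
    }
    return out;
}

// Rebuilds the in-memory struct from a packed record.
Bar unpack(const std::vector<unsigned char>& in, const std::vector<Field>& fields) {
    assert(in.size() == packed_size(fields));
    Bar bar{};
    unsigned char* dst = reinterpret_cast<unsigned char*>(&bar);
    std::size_t pos = 0;
    for (const Field& f : fields) {
        std::memcpy(dst + f.offset, in.data() + pos, f.size);
        pos += f.size;
    }
    return bar;
}
```

Because the description travels with the data, a reader does not need to share the writer's struct layout; it only needs to map each named member to wherever that member lives in its own in-memory type.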
Another interesting capability of the HDF5 library is in how the data is stored. As
mentioned before, compression can be enabled with zlib (szip (szlib) has some limitations in
that it is unable to work with clustered data). Further compression can be enabled through
what HDF5 calls the 'shuffle' filter. I've been using data records which are identical in
length. When you look at a series of records, you'll find that a number of byte positions are
identical: they could be all zeros, or some other value if the data falls within a narrow
range of values across a series of records. Grouping these columns of bytes together serves as
a convenient first order level of compression before using the more generic zlib flavor of
compression. Large datasets can use minimal data storage when using these two compression
concepts. I haven't done heavy testing, but I think I've seen a 50% reduction in space usage
when I turned these on. Probably with cluster size tuning (a cluster, which HDF5 calls a
chunk, being a specific number of records in a block), I could further reduce storage
requirements. But of course, there will be access time considerations to handle as well.
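The byte-column rearrangement can be sketched in a few lines. This is an illustration of the technique only, not HDF5's implementation (the library provides it as its shuffle filter), and shuffle_bytes and unshuffle_bytes are hypothetical names. Given N fixed-length records, the first function emits byte 0 of every record, then byte 1 of every record, and so on, so that bytes which rarely change end up in long runs that a general-purpose compressor handles well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Regroups fixed-length records by byte position: all first bytes, then all
// second bytes, and so on. data.size() must be a multiple of record_size.
std::vector<unsigned char> shuffle_bytes(const std::vector<unsigned char>& data,
                                         std::size_t record_size) {
    assert(record_size > 0 && data.size() % record_size == 0);
    const std::size_t count = data.size() / record_size;
    std::vector<unsigned char> out(data.size());
    for (std::size_t r = 0; r < count; ++r) {
        for (std::size_t b = 0; b < record_size; ++b) {
            out[b * count + r] = data[r * record_size + b];
        }
    }
    return out;
}

// Inverse of shuffle_bytes: restores the original record-by-record order.
std::vector<unsigned char> unshuffle_bytes(const std::vector<unsigned char>& data,
                                           std::size_t record_size) {
    assert(record_size > 0 && data.size() % record_size == 0);
    const std::size_t count = data.size() / record_size;
    std::vector<unsigned char> out(data.size());
    for (std::size_t r = 0; r < count; ++r) {
        for (std::size_t b = 0; b < record_size; ++b) {
            out[r * record_size + b] = data[b * count + r];
        }
    }
    return out;
}
```

For three 4-byte records holding the small values 1, 2, and 3, the shuffled output is the three low bytes followed by nine zero bytes. The step saves no space on its own; the gain comes when the regrouped bytes are handed to zlib.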
It has taken some time to understand the concepts and subtleties of the HDF5 library, but
now that I have, when coupled with C++ class and meta programming capabilities, and with
suitable abstractions, quite powerful data analytics can be built.
As one more highlight, there is a Java program available called HDFView which can be used
to view any HDF5 datafile. It shows just how well the self-described concept works, as
well as being useful as a debugging aid when creating data descriptions and data sets.
2008 Apr 25 - Fri
Open Source Site of the Day: Map Server -- Spatially-enabled Internet Applications
I came across
MapServer
based upon a mention by someone on the Wt mailing list.
I haven't looked at all entries in the example galleries, but it might be interesting to
import the existing maps of Bermuda into this and create a tourist attraction. People could
upload points of interest, pictures, voice-overs, videos, and text. It could become an
ideal 'living historical document' of Bermuda, with everything geo-referenced.
The resulting database could be used by tourists to find points of interest, learn about
points of interest, and record their own thoughts.
With some extra work, and for those with PDAs, walking tours with roving maps could be
implemented. And if the PDA happens to have GPS in it, the roaming tour would truly be self
guided, interactive, random, and beneficial to those with a specific focus.
2008 Mar 24 - Mon
Open Source Site of the Day -- Processing
This should actually be filed under a few different headings (which is something I'll do
once I get the new blog software in place). Those headings being programming, software
development, visualization, animation, imaging, interacting, and open source.
It is some software called
Processing of which I speak. It is
hard to put into words what this is. I first took a look at some animation samples that
were produced for a FOX station's movie slot. They need to be seen to be believed. My next
step was to look at programming examples. The programming examples included visualization
of programming techniques... self referential, eh! More advanced examples provided the
ground work for how to do some of the flocking behaviour I had seen in the first animation
samples.
This site caters to the young and the old, programmers and visualizers. If only for the
eye-candy, it is a site to be viewed.
2008 Feb 05 - Tue
Open Source Site of the Day -- SOCI: The C++ Database Access Library
Back in October 2007, I wrote an article about
PostgreSQL, C++
and the lack of binary data and wrappers.
I stand corrected. On two counts. The first is binary access. After looking through the
PostgreSQL API, I see that I can indeed get binary data into and out of a database easily. The C based library,
libpq, has an API call named PQexecParams which allows full separation of binary values from the query
statement. This is a good step towards preventing SQL injection attacks.
The second item upon which I stand corrected is a full binary C++ wrapper around PostgreSQL. I encountered
one today: SOCI - The C++ Database Access Library. It
handles Oracle, PostgreSQL, MySQL, as well as a number of others. The library is templated and allows
extensions of user defined binary values.
If I'm not mistaken, it also provides a capability similar to what LINQ does in CLR languages, inlining of
SQL code. Or close enough that the difference does not matter.
While on the subject of databases, I'll make mention that, for a trading application in a Windows environment
that I've been developing, I've been starting to use the
HDF Group's HDF5 Database as a repository for historical
trade information. It handles custom binary record types, compresses on the fly with a clever byte
pre-ordering strategy, and copes with huge file sizes. I'll write more on it later once I've had a chance to break it in.
This is a C++ based library and works in Windows as well as Linux environments. I'm also going to try this
library out for managing SNMP based data from a custom network monitoring application.
SQL based engines are good for generic query applications, but when one wants a self contained application,
it is tough to get a user to install or allow installation of a separate engine. I suppose Microsoft's embedded
SQL engine is a counter example to that argument. But I was looking for something lighter in weight. For the
trading application I mentioned earlier, I've been trying out the
Berkeley DB C++
Library. It doesn't carry the overhead of SQL queries, but does all the low level table stuff, along with
transactions and secondary indexes. The code is a little more complicated, but it makes for a nice light weight
engine for storing queryable records. The HDF5 engine is good for storing large quantities of binary data, but
lacks a query mechanism. I think the division of labour between the two storage mechanisms for a trading and a
network management application works well.
2008 Jan 28 - Mon
Open Source Site of the Day -- TrueCrypt: Free open-source disk encryption software for Windows Vista/XP/2000 and Linux
When trying to keep things private and personal, and to carry things around securely on
USB keys, or even hard drives, nothing beats the simplicity and flexibility of
TrueCrypt. It is free and it
is Open Source. A new release is scheduled for February 4, 2008.
Here is a feature list from the main web page:
- Creates a virtual encrypted disk within a file and mounts it as a real disk.
- Encrypts an entire hard disk partition or a storage device such as USB flash drive.
- Encryption is automatic, real-time (on-the-fly) and transparent.
- Provides two levels of plausible deniability
- Encryption algorithms: AES-256, Serpent, and Twofish. Mode of operation: LRW.
The software is well documented and has a good startup tutorial.
I had seen this software a while ago, and hadn't done much with it. I was recently
reminded of this through
A Nice Surprise, an article written by a SANS NewsBites editor regarding
data loss and theft. As the author says, with software so simple and easy, why isn't it
used more often?
The same author has a page titled
The Six Dumbest Ideas in Computer Security. In brief, the points are:
- Default Permit
- Enumerating Badness
- Penetrate and Patch
- Hacking is Cool
- Educating Users
- Action is Better Than Inaction
The article is well worth the read. It is an excellent dissertation on how one should
change one's security philosophy to get at problems at the source, rather than attempting to
make the symptoms go away.
2007 Oct 06 - Sat
Open Source Site of the Day -- OSSWAD: Open Source Savvy Web Application Developer
An OSSWAD, almost sounds like a forbidden word. Based upon
Bob Zurek's column, there are Osswads, and then there are OSSWADs. Many
developers of web sites 'out there' already use Open Source tools to get the job done.
Apache, MySQL, PostgreSQL, Perl, and PHP are commonly used tools by Osswads.
On the other hand, one gets the impression that real, enterprise savvy OSSWADs, use some
real hard core tools like:
- hadoop: large scale
distributed compute clusters. Amazon has a developer resource center with a
Hadoop/MapReduce Paper.
- lucene: not just search
software, but something that provides the basis for getting meaningful context. There is
an informative online book to go
along with it.
- nutch: adds web
specifics to lucene: web crawling, link-graphs, and parsers.
These Open Source projects fill in more of the gaps of the thinking I've been doing on
providing some contextual search products in specific subject matter areas. It looks like I
won't have to do as much ground up development as I thought I might need to do.
2007 Aug 22 - Wed
Open Source Site of the Day -- Some Sites Referencing SNMP Sites
2007 Aug 04 - Sat
Open Source Site of the Day -- Xen: OS Virtualization
VMWare is probably the pre-eminent hypervisor. Microsoft is
trying to catch up with their VM. At one point, you couldn't get Linux based
guests to run (something that has now changed due to popular demand).
On the Open Source front, Xen appears to be doing well. And through contributions from both Intel
and AMD, unmodified guest operating systems such as MS Windows can be run within a Xen
environment. (Intel VT or AMD-V microprocessor capabilities required).
From an installation point of view, .bootstrap has an article Installing Xen
on Debian Etch 4.0. Russell Coker has a couple of how-to's on installing Xen, which
I'll just reference through his Popular Posts page.
Open Source Site of the Day -- PowerTOP: Saving Power with Linux on Intel Platforms
Intel spent some time analyzing the causes of power usage of Linux applications on
laptops. To help users maximize laptop battery life, Intel came up with a helpful utility
called PowerTOP.
Based upon what I've read, certain USB components and products are the worst contributors
to bad battery life. Slashdot readers have various and sundry things to say about battery
performance in general. Matthew has things to say about it in particular, especially about USB
stuff not conforming to published specifications.