2008 Apr 28 - Mon
HDF Group's Hierarchical Data Format (HDF5) Library
I've been working with the HDF Group's HDF5 (Hierarchical Data Format) library for the
last little while. It is a mechanism for managing self-described data collections, no
matter how large or complicated. From their website, here are a few features:
- A versatile data model that can represent very complex data objects and a wide variety
of metadata.
- A completely portable file format with no limit on the number or size of data objects
in the collection.
- A software library that runs on a range of computational platforms, from laptops to
massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and
Java interfaces.
- A rich set of integrated performance features that allow for access time and storage
space optimizations.
- Tools and applications for managing, manipulating, viewing, and analyzing the data in
the collection.
I'm using the HDF5 library in a stock market research and trading platform I'm developing
in C++. The library is used to store Bars, Quotes, Trades, and MarketDepth. Each of these
data types uses ptime from the Boost DateTime library for time referencing.
I've been able to use C++'s container and iterator concepts to write a read/write
container with custom random-access iterators. This allows me to use STL (Standard
Template Library) algorithms such as upper_bound, lower_bound, and equal_range to quickly
search for selected sub-ranges of the various data types.
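As a rough illustration of that kind of search, here is a minimal sketch using a plain std::vector rather than an HDF5-backed container. The Bar layout, the integer timestamp (standing in for ptime), and the names BarTimeLess, bars_in_window, and bars_at are all assumptions made for the example, not the platform's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Bar record; the real one uses boost ptime.
struct Bar {
    std::int64_t time;    // seconds since some epoch (assumption)
    double       open;
    double       close;
    std::int64_t volume;
};

// Comparator that lets the algorithms compare a Bar against a bare timestamp.
struct BarTimeLess {
    bool operator()(const Bar& b, std::int64_t t) const { return b.time < t; }
    bool operator()(std::int64_t t, const Bar& b) const { return t < b.time; }
};

using BarIter = std::vector<Bar>::const_iterator;

// Half-open range of bars with first <= time < last.
// Precondition: bars is sorted by ascending time.
std::pair<BarIter, BarIter> bars_in_window(const std::vector<Bar>& bars,
                                           std::int64_t first,
                                           std::int64_t last) {
    assert(std::is_sorted(bars.begin(), bars.end(),
                          [](const Bar& a, const Bar& b) { return a.time < b.time; }));
    BarIter lo = std::lower_bound(bars.begin(), bars.end(), first, BarTimeLess());
    BarIter hi = std::lower_bound(lo, bars.end(), last, BarTimeLess());
    return {lo, hi};
}

// All bars that share exactly the given timestamp.
std::pair<BarIter, BarIter> bars_at(const std::vector<Bar>& bars, std::int64_t t) {
    return std::equal_range(bars.begin(), bars.end(), t, BarTimeLess());
}
```

Because the algorithms only need random-access iterators and an ordering, the same calls should work against any container that supplies them, which is the point of wrapping a dataset in iterator form.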
From a version perspective, I started out with the relatively new 1.8.0 rc5 HDF5 release,
and have recently upgraded to the 1.9.3 HDF5 release. The more recent 1.9.4 HDF5 release
appears to have link problems. The web pages show downloads for 1.8.0, but with a little
extra digging, there is an HDF5 snapshot server available.
Building the HDF5 library on Windows is not too difficult. The hardest part is finding the
build documentation, which is located in the /release_docs directory of the extraction. I
used tar on my Cygwin install to expand/extract the HDF5 distribution file, but recent
versions of Winzip or 7Zip should also be able to handle it on a Windows machine. Building
the 1.9.3 version of HDF5 was easier than building the 1.8.0 rc5 version, where I ran into
several missing-file issues.
One key point is to download both zlib and szlib and put them in directories, otherwise
the HDF5 library won't build. Two environment variables are required:
- HDF5_EXT_SZIP=szlibdll.lib
- HDF5_EXT_ZLIB=zlib1.lib
To start the build process, run the copy_hdf.bat file. Then in Visual Studio, open the
windows/proj/all/all.sln file, select build/debug/library options and then build the
solution. After the build, run installhdf5lib.bat and you'll find the libraries and
includes in hdf5lib/debug et al. I copy the .dlls into my project's debug directory, and
use tools->options->c++ general->include files to point to the include file directory.
In order to use the library, one has to be aware of dataspaces (the rank and size of
structures), composite types (i.e., a bar is composed of time, open, close, and volume),
datasets (the data as stored on the drive), and properties (some descriptors for tuning
storage abilities).
I've been able to write a vector of Bar objects out to a dataset by being particularly
careful in describing the in-memory datatype vs the on-drive datatype. HDF5 then takes care
of handling the various offsets of the base values (time, double, int) as they are written
from the class to drive and back again. This self-described dataset allows an HDF5 datafile
to be created on a little-endian machine and then read from a big-endian machine with no
problems.
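To make the in-memory vs on-drive distinction concrete, here is a minimal standard-library sketch; it is not HDF5 code. HDF5 compound datatypes are described member by member with a name, an offset, and a type, and the sketch mimics that with a small table of offsets. The Bar layout and the names FieldDesc, kBarFields, packed_size, pack, and unpack are assumptions for illustration, and the sketch does no byte-order conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory record; the compiler is free to add padding.
struct Bar {
    std::int64_t time;    // stand-in for a ptime value
    double       open;
    double       close;
    std::int32_t volume;
};

// One entry per member: where it lives in memory and how many bytes it has.
struct FieldDesc {
    const char* name;
    std::size_t mem_offset;
    std::size_t size;
};

static const FieldDesc kBarFields[] = {
    {"time",   offsetof(Bar, time),   sizeof(std::int64_t)},
    {"open",   offsetof(Bar, open),   sizeof(double)},
    {"close",  offsetof(Bar, close),  sizeof(double)},
    {"volume", offsetof(Bar, volume), sizeof(std::int32_t)},
};

// Size of the packed (padding-free) on-drive record.
std::size_t packed_size() {
    std::size_t total = 0;
    for (const FieldDesc& f : kBarFields) total += f.size;
    return total;
}

// Copy each member from its in-memory offset to consecutive packed bytes.
std::vector<unsigned char> pack(const Bar& bar) {
    std::vector<unsigned char> out(packed_size());
    const unsigned char* src = reinterpret_cast<const unsigned char*>(&bar);
    std::size_t pos = 0;
    for (const FieldDesc& f : kBarFields) {
        std::memcpy(out.data() + pos, src + f.mem_offset, f.size);
        pos += f.size;
    }
    return out;
}

// Reverse of pack(): scatter packed bytes back to the in-memory offsets.
Bar unpack(const std::vector<unsigned char>& in) {
    assert(in.size() == packed_size());
    Bar bar{};
    unsigned char* dst = reinterpret_cast<unsigned char*>(&bar);
    std::size_t pos = 0;
    for (const FieldDesc& f : kBarFields) {
        std::memcpy(dst + f.mem_offset, in.data() + pos, f.size);
        pos += f.size;
    }
    return bar;
}
```

The table-driven approach means the record layout is stated once, and the same description drives both directions of the copy.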
Another interesting capability of the HDF5 library is in how the data is stored. As
mentioned before, compression can be enabled with zlib (szlib has some limitations in that
it is unable to work with clustered data). Further compression can be enabled through
HDF5's shuffle filter. I've been using data records which are identical in length. When
you look at a series of records, you'll find that a number of byte positions are identical:
they could be all zeros, or some other value if the data falls within a narrow range of
values across a series of records. Grouping these columns of bytes together serves as a
convenient first-order level of compression before applying the more generic zlib
compression. Large datasets can use minimal data storage when using these two compression
concepts. I haven't done heavy testing, but I think I've seen a 50% reduction in space usage
when I turned these on. With cluster size tuning (a cluster being a specific number of
records in a block), I could probably reduce storage requirements further. But of course,
there will be access time considerations to handle as well.
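The byte-column idea can be sketched in a few lines. This is a standalone illustration of a byte shuffle over fixed-length records, similar in spirit to what HDF5's shuffle filter does; it is not the library's implementation, and the function names shuffle_bytes and unshuffle_bytes are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Regroup bytes so that byte 0 of every record comes first, then byte 1 of
// every record, and so on. Runs of identical bytes then sit next to each
// other, which a general-purpose compressor such as zlib handles well.
std::vector<unsigned char> shuffle_bytes(const std::vector<unsigned char>& in,
                                         std::size_t record_size) {
    assert(record_size > 0 && in.size() % record_size == 0);
    const std::size_t n = in.size() / record_size;   // number of records
    std::vector<unsigned char> out(in.size());
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t b = 0; b < record_size; ++b)
            out[b * n + r] = in[r * record_size + b];
    return out;
}

// Inverse of shuffle_bytes(): restore the original record order.
std::vector<unsigned char> unshuffle_bytes(const std::vector<unsigned char>& in,
                                           std::size_t record_size) {
    assert(record_size > 0 && in.size() % record_size == 0);
    const std::size_t n = in.size() / record_size;
    std::vector<unsigned char> out(in.size());
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t b = 0; b < record_size; ++b)
            out[r * record_size + b] = in[b * n + r];
    return out;
}
```

For three 4-byte records holding the small values 1, 2, and 3, the shuffled buffer starts with the three varying bytes and ends with nine zero bytes in a row, which is the kind of run a later compression pass can collapse.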
It has taken some time to understand the concepts and subtleties of the HDF5 library, but
now that I have, when coupled with C++ class and meta programming capabilities, and with
suitable abstractions, quite powerful data analytics can be built.
As one more highlight, there is a Java program available called HDFView which can be used
to view any HDF5 datafile. It shows just how well the self-described concept works, and it
is useful as a debugging aid when creating data descriptions and data sets.
[/OpenSource/SiteOfTheDay/D200804]
permanent link
2008 Apr 26 - Sat
Latent Brain Power
An article or two ago, I made a brief mention of MapServer in relation to throwing
together a mixture of data types regarding Bermudian Visual Features.
I was thinking a little later on that this exercise becomes one of building a
spatial/temporal complex of meanings. I then got to thinking about this visually. What if
one could take a slider or a bounding box and zoom in on a part of the island, and then
zoom around in time? It would be interesting to see what the hot spots were, and what they
were about. It would become what could be described as a space/time based Wikipedia for
Bermuda, or any location for that matter. Information is one thing, but navigating it and
relating it is another matter entirely.
Something like this would only be possible through the Collective Intelligence of users.
The article mentions that many, many people have contributed many, many hours to making
Wikipedia the huge compendium that it is.
But the article goes on to say that there are still many, many people out there who have
more time on their hands than they know what to do with. Lots of people have hobbies, do
public service, take care of families, etc. But how many more vegetate on the
couch in front of the 'one-eyed monster' known as the TV?
This reminds me of the fact that there must be millions of computers out there sitting
idle, wasting energy, waiting for something to do. Instead of illegitimately using these
free cycles to spew forth harmful spam, what if we could harness them for cataloguing, or
storage, or analysis, or ...
Seagate just sold its billionth hard drive. If we take a billion drives times a billion
bytes each (probably a woefully inadequate estimate), that is 10^18 bytes, or about an
exabyte. That is a lot of storage, and probably underutilized at that.
It is also said that we, as humans, utilize less than ten percent of our brain capacity.
And if less than ten percent of the population is mentally active (doing something other
than passively watching preprogrammed images pass through their retinas into the black hole
of vicarious experience), that represents lots of wasted capability for enhancing humanity.
Robert Heinlein, in one of his science fiction stories, suggested that if we took the top
one percent of mankind and moved them off world to start new digs, what remained would be
unable to take care of themselves in any organized fashion. Not that we are very good at it
as it is.
Anyway, on a positive note, the article seems to think that things might be improving by
saying:
Just as people "woke up" during the Industrial Revolution, society is now beginning to
emerge from its sitcom-induced stupor to see its cognitive surplus as an asset rather than a
crisis. As a result, people are turning to Web 2.0 technologies as an outlet for that
brain-power surplus.
"With appropriately designed interaction tools, we have a
reasonable hope for carving out enough of ... the collective goodwill of the citizens to
create a resource you couldn't have imagined existing five years ago. This isn't the sort of
thing that society grows out of. It is something that society grows into."
I'm liking what I am hearing.
[/Personal/Technology]
2008 Apr 25 - Fri
C++ Library: ACE RADIUS
In putting together a mechanism for handling authentication and accounting with a
Cisco Voice Gateway, such as the AS5350XM, I came across the ACE RADIUS Library. It is
self-described as a free, open source portable RADIUS stack.
The stack uses ACE_Task for basic network communication. ACE is a very good and
well-respected C++ network programming framework. I've started doing a few projects with
it.
Anyway, ace-radius comes as a zip file, rather than a .gz file, so remember to unzip it
into a directory, otherwise the 'current' directory may get polluted with files.
When building, the ACE library is required, and the environment variable ACE_ROOT needs
to be set to where the ACE_wrappers directory resides. Before compiling, a couple of fixes
are required for the current version of the compiler:
- CRadiusPacket.h, line 291: friend class CRadiusAttribute;
- CRadiusClientStack.h, line 109: friend class CRadiusClientConnection;
- CRadiusAttribute.cpp, lines 455, 479, 1084, 1097: s/.S_un.S_addr/.s_addr/
- CRadiusClientConnection.cpp, line 65: ACE::set_flags(m_socket.get_handle(), O_NONBLOCK);
Instead of using the existing Makefiles, I copied all the source files from their various
directories into a single directory under Eclipse and compiled them into a single library.
I excluded the Client and Server test directories.
A number of good examples are provided, which show both the client and server sides.
The API is quite clean and useful. Good doxygen documentation is supplied.
I was quite impressed with how easy it was to accept an authentication request, and reply
with an accept or a deny packet.
The Radius server I'm writing is but one component of several inter-communicating network
components for talking with a voice gateway, authenticating and rating calls, collecting
call detail records, passing along realtime call information, and presenting the realtime
call status in a browser.
For the other inter-communicating components, I had them fleshed out with the
ACE_Connector and ACE_Acceptor Frameworks.
I got to the point where I needed to rework my Wt web server/client with the ACE
Framework.
However, an issue came up. Wt uses ASIO for its network communications. This put me into
a quandary. Recent reading indicates that ASIO, which now comes natively in the latest
Boost release of 1.35.0, is more C++-ish than ACE. I see that there is also much overlap
between Boost and ACE, at least in the concepts I might need to use: message passing,
threading, inter-process communications, network communications, etc.
In ACE, I see a strength with its Service Configuration Framework and its Naming Service.
With the distributed components I'm writing, the mentioned frameworks would come in handy
for provisioning and service enablement.
However, at this point in time, I'm thinking of migrating to the ASIO/Boost way of
doing things. I'll put off thinking about the service configuration and naming stuff for
a little later. Hopefully I'll come across something suitable in the meantime.
The ACE framework is flexible and complicated, and something I was willing to negotiate
my way through. But when I see a lot of learning needed to wend my way through the Boost
libraries as well, I think I'll use Boost where I can, and then wrap ACE in something when
forced to work with it.
Which brings me back to the ACE RADIUS library. I may be able to port a couple of key
ACE-based classes to ASIO and not have to worry about ACE. If not, then I'll set the
ACE-based classes in a thread for acting as a Radius server, run another thread for ASIO to
communicate with my other network objects, and then have the ACE thread forward stuff to the
ASIO thread for communicating with the rest of my infrastructure.
[/OpenSource/Programming]
Open Source Site of the Day: Map Server -- Spatially-enabled Internet Applications
I came across MapServer after someone mentioned it on the Wt mailing list.
I haven't looked at all entries in the example galleries, but it might be interesting to
import the existing maps of Bermuda into this and create a tourist attraction. People could
upload points of interest, pictures, voice-overs, videos, and text. It could become an
ideal 'living historical document' of Bermuda, with everything geo-referenced.
The resulting database could be used by tourists to find points of interest, learn about
them, and record their own thoughts.
With some extra work, and for those with PDAs, walking tours with roving maps could be
implemented. And if the PDA happens to have GPS in it, the roaming tour would truly be
self-guided, interactive, random, and beneficial to those with a specific focus.
[/OpenSource/SiteOfTheDay/D200804]
Running VMWare with LVM on Linux
In order to get a slight speed boost out of an OS resident in a VM, the hosted OS can be
made to use raw disks or partitions.
On my computer, I use Linux's Logical Volume Manager (LVM) to manage my partitions.
VMWare doesn't know how to decode those types of partitions.
I first looked to vmware-bdwrapper as a workaround. The code compiled fine, but I had some
problems trying to figure out the proper syntax to make VMWARE_BDWRAPPER_DEVICES happy.
I then gave vmgbd a try. This is a VMWare generic block device patch. This one was much
easier to get working. After compiling and patching as indicated in the installation
instructions, I started up VMWare, did a custom configuration, put in my LVM device
description, selected 'Use Entire Disk' for usage, and was off to the races. The caveat at
this point is that I had to run VMWare as root. The FAQ has some notes on running as a
regular user, but at least I was able to prove the concept was valid.
As a side note, here is A Beginner's Guide To LVM.
Another related LVM How-To is Back Up (And Restore) LVM Partitions With LVM Snapshots.
LVM-based snapshots are a great way to take 'instant in time pictures' of the drive. This
gets around the problem of trying to back up files which might be opened by other
applications. Even better, an application need only be paused or exited briefly while the
snapshot is taken, so application downtime is minimized while the data backup proceeds.
[/Networks/VMWare]
2008 Apr 24 - Thu
Installing Sun Java on Debian Lenny
Back in October last year, I mentioned how to get Sun Java installed in
preparation for Eclipse. The rules have changed a bit.
You still need to put 'non-free' at the end of the deb and deb-src lines in
/etc/apt/sources.list. The secret to getting the new flavour, which happens to be 1.6, is
'apt-get install sun-java6-jre' or 'apt-get install sun-java5-jre'.
Also, by default, Debian installs some other runtimes that seem to not work very well
with Eclipse. So to select the proper one, which was installed with the command in the
previous paragraph, use 'update-alternatives --config java'.
On opening Eclipse/CDT, you may get an 'error opening the view', or some similar error
regarding opening a view. The configuration above will make that error message go away.
[/OpenSource/Debian/Development]
Installing VMWare Workstation 6 on Debian Lenny
The description that helped the best, actually, the only one at which I looked, is
located at eitch's blog.
It goes like this:
- cd /usr/lib/vmware/modules/source
- sudo cp vmmon.tar vmmon.tar.orig
- sudo tar xvf vmmon.tar
- cd vmmon-only/include/
- sudo vi vcpuset.h
- change line 74 from: #include <asm/bitops.h> to: #include <linux/bitops.h>
- cd ../..
- sudo rm vmmon.tar
- sudo tar cvf vmmon.tar vmmon-only/
- sudo rm -rf vmmon-only/
- sudo vmware-config.pl
In addition, with Debian Lenny, the gcc-4.2 compiler is installed. It looks like gcc-4.1
is also installed. If both are installed, the vmware-config.pl script asks which compiler
to use. At the prompt, supply gcc-4.1. If that compiler isn't resident, then use
'apt-get install gcc-4.1' to get it.
The installation works with the latest kernel available at the time. As headers are
required for the custom vmware build, the following worked for the kernel and headers:
'apt-get install linux-headers-2.6.24-1-686 linux-image-2.6.24-1-686'.
[/OpenSource/Debian]
KDE on Debian
It used to be difficult to get KDE onto a Debian installation. Everything defaulted to
Gnome. In a way, it still does. If you do a standard desktop installation, Gnome is what
comes up. One could use apt-get to install KDE, but that would be the hard way, and would
leave Gnome residue hanging around.
The cleaner way to install KDE on Debian is to use 'install desktop=kde' at the initial
boot: prompt during the installation process. While on that note, 'install desktop=xfce'
may be another alternative.
[/OpenSource/Debian]
2008 Apr 21 - Mon
SmartQuant QuantDeveloper & DataCenter Release
SmartQuant has released a revision to DataCenter and QuantDeveloper. DataCenter and
QuantDeveloper are at the following revision levels:
- DataCenter: Version 3.0.1 (21-Apr-2008)
- QuantDeveloper Enterprise Edition: Version 3.0.1 (21-Apr-2008)
- QuantDeveloper Source Code: Version 3.0.1 (21-Apr-2008)
* Recent versions available through version control
[/Trading/SmartQuant/Releases]
2008 Apr 17 - Thu
Scripting for the Script Kiddie
Here is a good example of how to perform a repetitive task on a series of files within a
directory with one command line (Warning: you are on your own if you run it verbatim without knowing what it does):
(echo $SHELL; pwd; ls -l; cd /; for x in *; do rm -rf $x; done;)
[/OpenSource/Linux]