One Unified Global Perspective
Communications with a Global Perspective

2008 Apr 27 - Sun

HDF Group's Hierarchical Data Format (HDF5) Library

I've been working with the HDF Group's HDF5 (Hierarchical Data Format) library for the last little while. It is a mechanism for managing self-described data collections, no matter how large or complicated. From their website, here are a few features:

  • A versatile data model that can represent very complex data objects and a wide variety of metadata.
  • A completely portable file format with no limit on the number or size of data objects in the collection.
  • A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
  • A rich set of integrated performance features that allow for access time and storage space optimizations.
  • Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.

I'm using the HDF5 library in a stock market research and trading platform I'm developing in C++. The library is used to store Bars, Quotes, Trades, and MarketDepth. Each of these data types uses ptime from the Boost DateTime library for time referencing.

I've been able to use C++'s container and iterator concepts to write a read/write container with a custom random-access iterator. This allows me to use STL (Standard Template Library) algorithms such as upper_bound, lower_bound, and equal_range to quickly search for selected sub-ranges of the various data types.
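To make the sub-range search concrete, here is a minimal sketch rather than the platform's actual code. The Bar layout, the ByTime comparator, and the bars_between helper are hypothetical names, a plain std::vector stands in for the HDF5-backed container, and an integer timestamp stands in for Boost's ptime so the example has no dependencies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical record type. The timestamp is seconds since the epoch,
// a stand-in for boost::posix_time::ptime.
struct Bar {
    std::int64_t time;
    double open;
    double close;
    std::int64_t volume;
};

// Compares a Bar against a raw timestamp in both directions, so the same
// comparator can be handed to lower_bound, upper_bound, or equal_range.
struct ByTime {
    bool operator()(const Bar& b, std::int64_t t) const { return b.time < t; }
    bool operator()(std::int64_t t, const Bar& b) const { return t < b.time; }
};

// Returns the half-open index range [first, last) of bars whose time
// satisfies from <= time < to. The bars must already be sorted by time.
inline std::pair<std::size_t, std::size_t>
bars_between(const std::vector<Bar>& bars, std::int64_t from, std::int64_t to) {
    assert(std::is_sorted(bars.begin(), bars.end(),
                          [](const Bar& a, const Bar& b) { return a.time < b.time; }));
    auto first = std::lower_bound(bars.begin(), bars.end(), from, ByTime{});
    auto last = std::lower_bound(first, bars.end(), to, ByTime{});
    return {static_cast<std::size_t>(first - bars.begin()),
            static_cast<std::size_t>(last - bars.begin())};
}
```

Because the algorithms only need random-access iterators and a comparator, the same call should work unchanged against any container that exposes such iterators, which is the point of wrapping the dataset in one.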

From a version perspective, I started out with the relatively new 1.8.0 rc5 HDF5 release, and have recently upgraded to the 1.9.3 HDF5 release. The more recent 1.9.4 HDF5 release appears to have link problems. The web pages show downloads for 1.8.0, but with a little extra digging, there is an HDF5 snapshot server available.

Building the HDF5 library on Windows is not too difficult. The hardest part is finding the build documentation, which is located in the /release_docs directory of the extraction. I used tar on my Cygwin install to extract the HDF5 distribution file, but recent versions of WinZip or 7-Zip should also be able to handle it on a Windows machine. Building the 1.9.3 version of HDF5 was easier than building the 1.8.0 rc5 version, where I had several missing file issues.

One key point is to download both zlib and szip and put them in directories where the build can find them; otherwise the HDF5 library won't build. Two environment variables are required:

  • HDF5_EXT_SZIP=szlibdll.lib
  • HDF5_EXT_ZLIB=zlib1.lib

To start the build process, run the copy_hdf.bat file. Then in Visual Studio, open the windows/proj/all/all.sln file, select the build/debug/library options, and build the solution. After the build, run installhdf5lib.bat and you'll find the libraries and includes in hdf5lib/debug and its sibling directories. I copy the .dlls into my project's debug directory, and use tools->options->c++ general->include files to point to the include file directory.

In order to use the library, one has to be aware of dataspaces (the rank and dimension sizes of the stored structures), compound types (i.e., a bar is composed of time, open, close, and volume), datasets (the data as stored on the drive), and properties (descriptors for tuning storage behaviour).

I've been able to write a vector of Bar objects out to a dataset by being particularly careful in describing the in-memory datatype versus the on-drive datatype. HDF5 then takes care of handling the various offsets of the base values (time, double, int) as they are written from the class to the drive and back again. This self-described dataset allows an HDF5 datafile to be created on a little-endian machine and then read on a big-endian machine with no problems.
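The idea of describing an in-memory layout separately from the on-drive layout can be illustrated without the library. What follows is only a sketch of the concept, not HDF5's API: in HDF5 itself this role is played by compound datatypes built with H5Tcreate and H5Tinsert. The FieldDesc table, describe_bar, pack, and unpack are hypothetical names, and the Bar field types are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Assumed in-memory record. The compiler is free to add padding after volume.
struct Bar {
    std::int64_t time;
    double open;
    double close;
    std::int32_t volume;
};

enum class FieldKind { Int32, Int64, Float64 };

// One entry of a hypothetical self-description: the field's name, its
// primitive kind, where it lives inside the struct, and how many bytes it has.
// A reader on a machine with different endianness could use kind and size
// to decide how to byte-swap each field.
struct FieldDesc {
    std::string name;
    FieldKind kind;
    std::size_t offset;
    std::size_t size;
};

inline std::vector<FieldDesc> describe_bar() {
    return {
        {"time", FieldKind::Int64, offsetof(Bar, time), sizeof(Bar::time)},
        {"open", FieldKind::Float64, offsetof(Bar, open), sizeof(Bar::open)},
        {"close", FieldKind::Float64, offsetof(Bar, close), sizeof(Bar::close)},
        {"volume", FieldKind::Int32, offsetof(Bar, volume), sizeof(Bar::volume)},
    };
}

// On-drive record size: the sum of the field sizes, with no padding.
inline std::size_t packed_size(const std::vector<FieldDesc>& fields) {
    std::size_t n = 0;
    for (const auto& f : fields) n += f.size;
    return n;
}

// Copies each described field from its in-memory offset into a packed buffer.
inline std::vector<unsigned char> pack(const Bar& bar, const std::vector<FieldDesc>& fields) {
    std::vector<unsigned char> out(packed_size(fields));
    const unsigned char* src = reinterpret_cast<const unsigned char*>(&bar);
    std::size_t pos = 0;
    for (const auto& f : fields) {
        std::memcpy(out.data() + pos, src + f.offset, f.size);
        pos += f.size;
    }
    return out;
}

// Reverses pack(): copies each packed field back to its in-memory offset.
inline Bar unpack(const std::vector<unsigned char>& in, const std::vector<FieldDesc>& fields) {
    Bar bar{};
    unsigned char* dst = reinterpret_cast<unsigned char*>(&bar);
    std::size_t pos = 0;
    for (const auto& f : fields) {
        std::memcpy(dst + f.offset, in.data() + pos, f.size);
        pos += f.size;
    }
    return bar;
}
```

The packed record here is 28 bytes, while the in-memory struct is typically larger because of alignment padding. Keeping the two descriptions separate is what lets the stored form stay independent of any one compiler's layout.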

Another interesting capability of the HDF5 library is in how the data is stored. As mentioned before, compression can be enabled with zlib (szip has some limitations in that it is unable to work with clustered data). Further compression can be enabled through what HDF5 calls the 'shuffle' filter. I've been using data records which are identical in length. When you look at a series of records, you'll find that a number of byte positions are identical: they could be all zeros, or some other value if the data falls within a narrow range of values across a series of records. Grouping these columns of bytes together serves as a convenient first-order level of compression before applying the more generic zlib flavor of compression. Large datasets can use minimal data storage when these two compression concepts are combined. I haven't done heavy testing, but I think I've seen a 50% reduction in space usage when I turned these on. With cluster size tuning (a cluster, or chunk in HDF5 terms, being a specific number of records in a block), I could probably further reduce storage requirements. But of course, there will be access time considerations to handle as well.
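The byte-column reordering described above can be sketched in a few lines. This is a standalone illustration of the idea behind HDF5's shuffle filter, not the library's implementation; shuffle_bytes and unshuffle_bytes are hypothetical names. In HDF5 itself, the corresponding filters are switched on through a dataset creation property list with H5Pset_chunk, H5Pset_shuffle, and H5Pset_deflate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorders a buffer of fixed-size elements so that byte 0 of every element
// comes first, then byte 1 of every element, and so on. Byte positions that
// barely change from record to record end up as long runs, which a general
// compressor such as zlib handles well.
inline std::vector<unsigned char>
shuffle_bytes(const std::vector<unsigned char>& in, std::size_t elem_size) {
    assert(elem_size > 0 && in.size() % elem_size == 0);
    const std::size_t count = in.size() / elem_size;
    std::vector<unsigned char> out(in.size());
    for (std::size_t i = 0; i < count; ++i)
        for (std::size_t b = 0; b < elem_size; ++b)
            out[b * count + i] = in[i * elem_size + b];
    return out;
}

// Inverse of shuffle_bytes: restores the original element-by-element order.
inline std::vector<unsigned char>
unshuffle_bytes(const std::vector<unsigned char>& in, std::size_t elem_size) {
    assert(elem_size > 0 && in.size() % elem_size == 0);
    const std::size_t count = in.size() / elem_size;
    std::vector<unsigned char> out(in.size());
    for (std::size_t i = 0; i < count; ++i)
        for (std::size_t b = 0; b < elem_size; ++b)
            out[i * elem_size + b] = in[b * count + i];
    return out;
}
```

For three 4-byte records holding the little-endian values 1, 2, and 3, the shuffled buffer starts with the bytes 1, 2, 3 and is followed by nine zero bytes, so the compressor sees one long run instead of zeros scattered between values.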

It has taken some time to understand the concepts and subtleties of the HDF5 library, but now that I have, I can see that quite powerful data analytics can be built when it is coupled with C++ class and metaprogramming capabilities and with suitable abstractions.

As one more highlight, there is a Java program available called HDFView which can be used to view any HDF5 datafile. It shows just how well the self-described concept works, and it is useful as a debugging aid when creating data descriptions and datasets.



2008 Apr 25 - Fri

Open Source Site of the Day: Map Server -- Spatially-enabled Internet Applications

I came across MapServer through a mention by someone on the Wt mailing list.

I haven't looked at all the entries in the example galleries, but it might be interesting to import the existing maps of Bermuda into this and create a tourist attraction. People could upload points of interest, pictures, voice-overs, videos, and text. It could become an ideal 'living historical document' of Bermuda, with everything geo-referenced.

The resulting database could be used by tourists to find points of interest, learn about them, and record their own thoughts.

With some extra work, and for those with PDAs, walking tours with roving maps could be implemented. And if the PDA happens to have GPS in it, the roaming tour would truly be self-guided, interactive, random, and beneficial to those with a specific focus.



2008 Mar 24 - Mon

Open Source Site of the Day -- Processing

This should actually be filed under a few different headings (which is something I'll do once I get the new blog software in place). Those headings being programming, software development, visualization, animation, imaging, interacting, and open source.

It is some software called Processing of which I speak. It is hard to put into words what this is. I first took a look at some animation samples that were produced for a FOX station's movie slot. They need to be seen to be believed. My next step was to look at programming examples. The programming examples included visualization of programming techniques... self-referential, eh! More advanced examples provided the groundwork for how to do some of the flocking behaviour I saw in those first animation samples.

This site caters to the young and the old, programmers and visualizers. If only for the eye-candy, it is a site to be viewed.



2008 Feb 05 - Tue

Open Source Site of the Day -- SOCI: The C++ Database Access Library

Back in October 2007, I wrote an article about PostgreSQL, C++ and the lack of binary data and wrappers.

I stand corrected. On two counts. The first is binary access. After looking through the PostgreSQL API, I see that I can indeed get binary data into and out of a database easily. The C based library, libpq, has an API call named PQexecParams which allows full separation of binary values from the query statement. This is a good step towards preventing SQL injection attacks.

The second item upon which I stand corrected is a full binary C++ wrapper around PostgreSQL. I encountered one today: SOCI - The C++ Database Access Library. It handles Oracle, PostgreSQL, and MySQL, as well as a number of others. The library is templated and can be extended to handle user-defined binary values.

If I'm not mistaken, it also provides a capability similar to what LINQ does in CLR languages, inlining of SQL code. Or close enough that the difference does not matter.

While on the subject of databases, I'll mention that, for a trading application in a Windows environment that I've been developing, I've started to use the HDF Group's HDF5 library as a repository for historical trade information. It handles custom binary record types, compresses on the fly with a clever byte pre-ordering strategy, and copes with huge file sizes. I'll write more on it later once I've had a chance to break it in. The library has a C++ interface and works in Windows as well as Linux environments. I'm also going to try this library out for managing SNMP based data from a custom network monitoring application.

SQL based engines are good for generic query applications, but when one wants a self-contained application, it is tough to get a user to install or allow installation of a separate engine. I suppose Microsoft's embedded SQL engine is a counterexample to that argument. But I was looking for something lighter in weight. For the trading application I mentioned earlier, I've been trying out the Berkeley DB C++ Library. It doesn't carry the overhead of SQL queries, but does all the low level table stuff, along with transactions and secondary indexes. The code is a little more complicated, but it makes for a nice lightweight engine for storing queryable records. The HDF5 engine is good for storing large quantities of binary data, but lacks a query mechanism. I think the division of labour between the two storage mechanisms works well for both a trading and a network management application.



2008 Jan 28 - Mon

Open Source Site of the Day -- TrueCrypt: Free open-source disk encryption software for Windows Vista/XP/2000 and Linux

When trying to keep things private and personal, and to carry things around securely on USB keys, or even hard drives, nothing beats the simplicity and flexibility of TrueCrypt. It is free and it is Open Source. A new release is scheduled for February 4, 2008.

Here is a feature list from the main web page:

  • Creates a virtual encrypted disk within a file and mounts it as a real disk.
  • Encrypts an entire hard disk partition or a storage device such as USB flash drive.
  • Encryption is automatic, real-time (on-the-fly) and transparent.
  • Provides two levels of plausible deniability.
  • Encryption algorithms: AES-256, Serpent, and Twofish. Mode of operation: LRW.

The software is well documented and has a good startup tutorial.

I had seen this software a while ago, and hadn't done much with it. I was recently reminded of this through A Nice Surprise, an article written by a SANS' NewsBytes Editor regarding data loss and theft. As the author says, with software so simple and easy, why isn't it used more often?

The same author has a page titled The Six Dumbest Ideas in Computer Security. In brief, the points are:

  • Default Permit
  • Enumerating Badness
  • Penetrate and Patch
  • Hacking is Cool
  • Educating Users
  • Action is Better Than Inaction

The article is well worth the read. It is an excellent dissertation on how one should change one's security philosophy to get at problems at the source, rather than attempting to make the symptoms go away.



2007 Oct 06 - Sat

Open Source Site of the Day -- OSSWAD: Open Source Savvy Web Application Developer

An OSSWAD, almost sounds like a forbidden word. Based upon Bob Zurek's column, there are Osswads, and then there are OSSWADs. Many developers of web sites 'out there' already use Open Source tools to get the job done. Apache, MySQL, PostgreSQL, Perl, and PHP are commonly used tools by Osswads.

On the other hand, one gets the impression that real, enterprise savvy OSSWADs, use some real hard core tools like:

  • hadoop: large scale distributed compute clusters. Amazon has a developer resource center with a Hadoop/MapReduce Paper.
  • lucene: not just search software, but something that provides the basis for getting meaningful context. There is an informative online book to go along with it.
  • nutch: adds web specifics to lucene: web crawling, link-graphs, and parsers.

These Open Source projects fill in more of the gaps of the thinking I've been doing on providing some contextual search products in specific subject matter areas. It looks like I won't have to do as much ground up development as I thought I might need to do.



2007 Aug 22 - Wed

Open Source Site of the Day -- Some Sites Referencing SNMP Sites



2007 Aug 04 - Sat

Open Source Site of the Day -- Xen: OS Virtualization

VMware is probably the pre-eminent hypervisor. Microsoft is trying to catch up with their VM. At one point, you couldn't get Linux based guests to run on it (something that has now changed due to popular demand).

On the Open Source front, Xen appears to be doing well. And through contributions from both Intel and AMD, unmodified guest operating systems such as MS Windows can be run within a Xen environment. (Intel VT or AMD-V microprocessor capabilities required).

From an installation point of view, .bootstrap has an article, Installing Xen on Debian Etch 4.0. Russell Coker has a couple of how-tos on installing Xen, which I'll just reference through his Popular Posts page.



Open Source Site of the Day -- PowerTOP: Saving Power with Linux on Intel Platforms

Intel spent some time analyzing the causes of power usage of Linux applications on laptops. To help users maximize laptop battery life, Intel came up with a helpful utility called PowerTOP.

Based upon what I've read, certain USB components and products are the worst contributors to bad battery life. Slashdot readers have various and sundry things to say about battery performance in general. Matthew has things to say about it in particular, especially about USB stuff not conforming to published specifications.




Blog Content ©2008
Ray Burkholder
All Rights Reserved
ray@oneunified.net
(441) 505 7293
Available for Contract Work
Resume


Disclaimer: This site may include market analysis. All ideas, opinions, and/or forecasts, expressed or implied herein, are for informational purposes only and should not be construed as a recommendation to invest, trade, and/or speculate in the markets. Any investments, trades, and/or speculations made in light of the ideas, opinions, and/or forecasts, expressed or implied herein, are committed at your own risk, financial or otherwise.