HDF5 vs. text file for storing data?

Batman28 · Jan 29, 2009

I hear a lot about using HDF5 for storing tick data for back-testing etc., but isn't it far simpler to read and write a plain .txt file? One can easily load data into memory for analysis and save it again, so the question is: what are the advantages of HDF5 over a normal text file?

appreciate any serious input, thanks

377OHMS · Jan 29, 2009

HDF5 is a binary format, nice small files. Very good for huge data.

Nothing wrong with ASCII files, but HDF5 is more appropriate for really large data.

Batman28 · Jan 29, 2009

Quote from 377OHMS:

HDF5 is a binary format, nice small files. Very good for huge data.

Nothing wrong with ASCII files, but HDF5 is more appropriate for really large data.

Could you please elaborate? How significant would this be for storing tick data, say one year of 1-minute data?

Also, are there any functional advantages?

thanks

377OHMS · Jan 29, 2009

Well, the files read into MATLAB with just a few lines of code and load almost instantaneously. Some of my ASCII processes can take a minute or two to open large files (~750 MB).

HDF5 files are easier to amend, as each entry is a field with attributes. Changing a few rows in an ASCII file using MATLAB is difficult, and people usually resort to Perl scripts if they are automatically amending large ASCII data files.

Dunno, that's all I can think of right now because I just woke up.

Just search Google for HDF5 and you'll see some good format descriptions. You might also consider SQL as an alternative. It's nice to be able to query the data from automated processes.

erdewit · Jan 29, 2009

I've played with HDF5 through its Python
bindings (called PyTables) and found it to be
unsuitable for a tick database.

First of all, it is slower than just working with
regular files.

Second, the HDF5 database is one huge file that
is easily corruptible. For example, if you accidentally
let two processes have write access to
the database then it will become corrupted.
Repairing didn't always work for me, so then all
data would be lost.

Third, HDF5 is a hierarchical database where
objects are retrieved via a path-like key. This is
the same as with a regular filesystem and offers
no advantage whatsoever over just a plain
regular filesystem.

The only use case for HDF5 is when working with
datasets that are too large to fit into memory.
Tick data does not fall into this category.

What I am using is just simple files. One tick file
per instrument per day. It's easy to see what's
going on, it's easy to compress files and make
incremental backups. Reading speed is 1 M ticks/s
for text files and 10-20 M ticks/s for binary files.
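
To make the text-versus-binary comparison concrete, here is a minimal sketch of a one-file-per-instrument-per-day layout with both encodings. It is not erdewit's actual code: the record fields (timestamp, price, size), the directory layout, and all function names are assumptions made for illustration. The point is that a binary record has a fixed size and needs no string parsing, which is one plausible reason for the speed gap quoted above.

```python
import struct
from datetime import date
from pathlib import Path

# Hypothetical record layout: epoch timestamp (float64), price (float64),
# size (uint32). "<" means little-endian with no padding: 20 bytes per tick.
TICK = struct.Struct("<ddI")


def tick_path(root: Path, symbol: str, day: date, ext: str) -> Path:
    """One file per instrument per day, e.g. root/EURUSD/2009-01-29.bin."""
    return root / symbol / f"{day.isoformat()}.{ext}"


def write_text(path: Path, ticks) -> None:
    """Write one comma-separated line per tick."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        for ts, price, size in ticks:
            # repr() of a float round-trips exactly through float()
            f.write(f"{ts!r},{price!r},{size}\n")


def write_binary(path: Path, ticks) -> None:
    """Write one fixed-size packed record per tick."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        for tick in ticks:
            f.write(TICK.pack(*tick))


def read_text(path: Path) -> list:
    """Parse every line back into a (timestamp, price, size) tuple."""
    out = []
    with path.open() as f:
        for line in f:
            ts, price, size = line.strip().split(",")
            out.append((float(ts), float(price), int(size)))
    return out


def read_binary(path: Path) -> list:
    """Decode fixed-size records; no string parsing is involved."""
    return list(TICK.iter_unpack(path.read_bytes()))
```

Because each day is a separate file, compressing or backing up a single day never touches the rest of the data, which matches the workflow described above.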

377OHMS · Jan 29, 2009

Quote from erdewit:

I've played with HDF5 through its Python
bindings (called PyTables) and found it to be
unsuitable for a tick database.

First of all, it is slower than just working with
regular files.

Second, the HDF5 database is one huge file that
is easily corruptible. For example, if you accidentally
let two processes have write access to
the database then it will become corrupted.
Repairing didn't always work for me, so then all
data would be lost.

Third, HDF5 is a hierarchical database where
objects are retrieved via a path-like key. This is
the same as with a regular filesystem and offers
no advantage whatsoever over just a plain
regular filesystem.

The only use case for HDF5 is when working with
datasets that are too large to fit into memory.
Tick data does not fall into this category.

What I am using is just simple files. One tick file
per instrument per day. It's easy to see what's
going on, it's easy to compress files and make
incremental backups. Reading speed is 1 M ticks/s
for text files and 10-20 M ticks/s for binary files.

Mostly nonsense.

I work with terabytes of 20 Hz data. HDF5 is universally preferred for large data. It is not suitable as a database file. You read the entire thing into memory and then work on it. As I said, you might be better off with an SQL database, but if you are just bulk-storing large data, most people in the scientific community use HDF5.

Batman28 · Jan 29, 2009

Quote from erdewit:

I've played with HDF5 through its Python
bindings (called PyTables) and found it to be
unsuitable for a tick database.

First of all, it is slower than just working with
regular files.

Second, the HDF5 database is one huge file that
is easily corruptible. For example, if you accidentally
let two processes have write access to
the database then it will become corrupted.
Repairing didn't always work for me, so then all
data would be lost.

Third, HDF5 is a hierarchical database where
objects are retrieved via a path-like key. This is
the same as with a regular filesystem and offers
no advantage whatsoever over just a plain
regular filesystem.

The only use case for HDF5 is when working with
datasets that are too large to fit into memory.
Tick data does not fall into this category.

What I am using is just simple files. One tick file
per instrument per day. It's easy to see what's
going on, it's easy to compress files and make
incremental backups. Reading speed is 1 M ticks/s
for text files and 10-20 M ticks/s for binary files.

That sounds good. Can you describe your set-up and what language you use to read/write the files? Do you use any specific framework/IDE?

Thanks in advance

377OHMS, I'm totally not interested in SQL. It's just useless for time series data, the way I see it. Anyway, what exactly do you do with terabytes of data? Is this tick data? What sort of analysis do you do, if you don't mind me asking? Thanks

Batman28 · Jan 29, 2009

Quote from 377OHMS:

Mostly nonsense.

I work with terabytes of 20 Hz data.

20 Hz??! What on earth are you doing?

dsss27 · Jan 29, 2009

Quote from Batman28:

20 Hz??! What on earth are you doing?

Based on the significance of 377 ohms, likely beta waves or maybe a close harmonic of the Schumann resonance.

sorry I couldn't resist!

erdewit · Jan 29, 2009

Quote from Batman28:

That sounds good. Can you describe your set-up and what language you use to read/write the files? Do you use any specific framework/IDE?

For the language I'm using Python, with performance-critical parts implemented in C. The C extensions are not written by hand but generated by Cython. The general design and ideas do not depend on it, though.

In my design there is an abstract DataStore that gets implemented by a FileStore, an Hdf5Store and a SqlStore. The FileStore has two modes: text and binary. I use the text filestore for capturing ticks during the day because simple text files are just so damn reliable. I can do a tail -f on a file and see the ticks scroll by. The text files are zipped at the end of the day and backed up.
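
A rough sketch of what such a design could look like, for readers who want something concrete. Only the names DataStore and FileStore come from the post; the method names (`append`, `read`, `archive`), the tick fields, the file layout, and the choice of gzip for the "zipped" step are all assumptions, and only the text mode is shown.

```python
import gzip
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class DataStore(ABC):
    """Abstract interface; the method names are hypothetical."""

    @abstractmethod
    def append(self, symbol: str, day: str, tick: tuple) -> None: ...

    @abstractmethod
    def read(self, symbol: str, day: str) -> list: ...


class TextFileStore(DataStore):
    """Text-mode file store: one plain text file per instrument per day."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, symbol, day):
        return self.root / symbol / f"{day}.txt"

    @staticmethod
    def _parse(line):
        ts, price, size = line.strip().split(",")
        return float(ts), float(price), int(size)

    def append(self, symbol, day, tick):
        path = self._path(symbol, day)
        path.parent.mkdir(parents=True, exist_ok=True)
        ts, price, size = tick
        # Reopening per tick is slow but simple: every line is written out
        # when the file closes, so `tail -f` on the file shows it right away.
        with path.open("a") as f:
            f.write(f"{ts!r},{price!r},{size}\n")

    def read(self, symbol, day):
        path = self._path(symbol, day)
        gz = path.with_name(path.name + ".gz")
        # Prefer the archived copy if the day has already been compressed.
        f = gzip.open(gz, "rt") if gz.exists() else path.open("r")
        with f:
            return [self._parse(line) for line in f]

    def archive(self, symbol, day):
        """End of day: compress the text file and remove the plain copy."""
        path = self._path(symbol, day)
        gz = path.with_name(path.name + ".gz")
        with path.open("rb") as src, gzip.open(gz, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()
        return gz
```

An Hdf5Store or SqlStore would implement the same two abstract methods, so backtesting code written against DataStore would not need to know which backend is in use.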

The zipped text files are too slow for backtesting though so I use another filestore with binary files for caching. The binary files are memory mapped (mmap) for ultimate speed.
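
For illustration, a minimal sketch of reading a memory-mapped binary cache with Python's standard `mmap` and `struct` modules. The record layout and the function names are assumptions, not the actual cache format described here.

```python
import mmap
import struct

# Hypothetical fixed-size record: timestamp (float64), price (float64),
# size (uint32), little-endian without padding.
TICK = struct.Struct("<ddI")


def write_cache(path, ticks):
    """Build the binary cache file from an iterable of tick tuples."""
    with open(path, "wb") as f:
        for tick in ticks:
            f.write(TICK.pack(*tick))


def read_tick(mm, index):
    """Decode the index-th record straight from the mapping (random access)."""
    return TICK.unpack_from(mm, index * TICK.size)


def scan_prices(path):
    """Map the file read-only and pull the price out of every record.

    Note: mmap cannot map an empty file, so callers should skip those.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = len(mm) // TICK.size
            return [read_tick(mm, i)[1] for i in range(count)]
```

With a mapping, the operating system pages the file in on demand and keeps it in its page cache, so repeated backtests over the same days avoid both parsing and explicit read calls.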

I don't use the SqlStore because it's too slow and I don't use the Hdf5Store because it gets corrupted too easily.

k377: Not sure what you think is nonsense. I'm not dissing HDF5 in general, only its use in a tick database.