Quote from nitro:
I see, so your point is, [queries] has to be parallizable over many columns and tables [on a join], and just doing the column oriented part is only half the problem?
Column oriented solves only the reading from the disk problem - to read only the needed data and read them faster because of better compression. This works only if you have an algorithm for faster decompression, better if it is a parallel decompression.
Yes they have to be properly grouped to feed the SIMD registers for max performance (for example Structures of Arrays), properly sliced together to fit CPU cache, arranged in an appropriate memory access patterns to get the max performance.
I believe we have a powerful computers which are still underutilized. Vectorized SIMD operations were introduced 10 years ago but still not utilized well. If you look at the .ASM code generated from MS or Intel C++ compilers you can see even if you set the /O2 option for max speed, the SIMD operations generated are far from optimal. Too often the code generated accesses the memory many times but if you do it properly you can store these data temporarily in SIMD registers. For reuse. Accessing memory being much slower than accessing registers hurts performance. And that is when you work with C++. If you work on .NET it is getting worse because you have a lot overhead, context switching and garbage collection. I made extensive benchmarks on increments of 1MB of data and the time for processing is raising exponentially. When data gets ~100MB in size it gets very slow and not practical to use. Performance meter shows .NET actually is using available processors (2 CPU's on my server) and is doing a higher level parallelization but it is still very slow for large data sets. Too much overhead because it is a general purpose not custom tailored solution.
If this happens for vectorized instructions, what to tell about multi-core and multi-processor optimizations? Try to get the new Intel or MS parallelizing compilers and see how hard is to make it work properly. Also the performance gain to get from 2 cores is not comparable from what you can get from NVIDIA CUDA capable GPU's where you can get up to 240 processors. Personally I still haven;t heard of COTS database capable of doing all of this.
Just do some benchmarks yourself on your own data on different increments, workloads, computations and don't believe to the conventional useless benchmarks.
We coded the same algorithm on both MS tools and Sybase tools - 10 different siding windows back regressions, 10 days gain computation back and after each trading day, plus proprietary factor data to be able to create screeners.
The number of data generated for each trading day is ~1,000 (columns) and it is about 200MB uncompressed for 1928-2009 DJIA. Up to 140MB data computations (half of this data) last for ~1 hour using MS tools, ~5 minutes using Sybase tools limited to 1 CPU.
Sybase is much better, has more formulas embedded, there is not exponential growth in the computational time, it is almost linear, but for maximum utilization of 2 CPUs for example you pay two times more.
Also even 5 min processing time is a lot if you want to make quick strategies testing. The screeners (queries) are slow too because it is not column oriented. Sybase IQ is column oriented but is about $40K and needs a database for storage. Sybase RAP trading solution with all options- DB, IQ, etc. comes together about $100K.
The max configurations for MS tools (Enterprise version) is about $25K (unlimited CPUs), Sybase SQL Anywhere for 2 CPUs is $40K. That is without support option. 2 CPU configured machine is a reasonable today. I don't see how they can cram more CPUs in a server without burning it to flames - too much power consumption. So that is what you get - too slow for the $'s. The ROI is questionalble.
The test machine was:
Dell PowerEdge 2850 Server, 64-bit MS Server 2008;Intel Xeon 2x3.00 Ghz; 4GB RAM
We are exploring now NVIDIA CUDA capable GPUs. The most powerful (240 processors) use a lot power too and need a computer capable to handle ~800W. There are a capable mid range professional GPUs with more memory with a good mix of memory/computation/PCIe bandwidth - for example we are testing now FX 1800 - 64 processors, 800MB memory, only 50W. Converting algorithms to it is not easy but the speedup we get for some of the algorithms is ~100 times.
They can do this because these processors are very simple and thus can be packed in much more quantity. Currently they are still using 65nm process and the most powerful of them dissipate a lot of power, but moving to 45 nm will make them better.