Python - Read and split lines from text file into indexes.

volpunter · May 8, 2015

This is one of the worst performances I have ever seen. I think even VBA can do better than that. This is what happens when you let amateurs loose on a Linux environment. Goodness... well, I am happy Python did the job for you; after all, you can always brew some fresh coffee in between.

OTM-Options said:
The 1,000,000 Line Test

I created a simple Python script that loops 20,000 times through the 59-line CSV file and outputs a 1,180,000-line file to test the efficiency of the script.
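The test file described above could be produced with something like the following. This is a minimal sketch, not the script actually used in the test: the function name replicate_csv, the file paths, and the assumption that the 59-line file has no header row are all hypothetical.

```python
from pathlib import Path


def replicate_csv(src: str, dst: str, repeats: int) -> int:
    """Write the contents of src to dst `repeats` times; return lines written.

    Assumes src has no header row (hypothetical; a header would need to be
    written once and skipped on later passes).
    """
    lines = Path(src).read_text().splitlines(keepends=True)
    # Make sure the last line ends with a newline so copies do not run together.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    with open(dst, "w", newline="") as out:
        for _ in range(repeats):
            out.writelines(lines)
    return len(lines) * repeats
```

With a 59-line source file and repeats=20000, this writes 59 x 20,000 = 1,180,000 lines, matching the figure quoted above.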

Python CSV to CSV Script

INPUT: 1,180,000 lines of raw data from a 105 MB CSV file.

OUTPUT: 1,000,000 lines of manipulated data to a 75 MB CSV file.

OUTPUT: Print to Linux terminal to show progress.

TIME: 2 minutes 53 seconds.

Computer Specs

Acer Aspire AM5641-E5651A desktop computer

PCLinuxOS 2014.12 with Mate desktop.

Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz

3 GB DDR2 Memory

640 GB SATA Hard disk

The output CSV file was then loaded into Gnumeric spreadsheet for some bold text and currency formatting. No math functions or entries were made in the spreadsheet.

Screenshot of the truncated 1,000,000 line CSV file in Gnumeric.

i960 · May 8, 2015

The fact that you seriously think all of that has to do with Python cracks me up. I don't even like Python but you're just grinding a pointless axe here.

volpunter · May 8, 2015

So, then, where is a simple Python solution that reads text-based data and parses it into columnar arrays? Because that's exactly what OP asked for. And reading in time series and parsing them into columnar arrays is a frequently needed function in general. Let's compare performance... I am happy to whip up a quick C# solution and compare figures...
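For reference, one way such a parse could look in plain Python is sketched below. It is a minimal sketch under stated assumptions, not a tuned implementation: the function name read_columns and the sample data are hypothetical, every field is assumed to be numeric, and the column count is assumed to be known in advance.

```python
import csv
import io
from array import array


def read_columns(lines, n_cols):
    """Parse comma-delimited numeric rows into one typed array per column."""
    # array('d') stores C doubles compactly, unlike a list of Python floats.
    columns = [array("d") for _ in range(n_cols)]
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        for col, field in zip(columns, row):
            col.append(float(field))
    return columns


# Hypothetical sample data standing in for a text file of time-series rows.
sample = io.StringIO("1.0,2.0,3.0\n4.0,5.0,6.0\n")
cols = read_columns(sample, 3)
print(cols[0])  # array('d', [1.0, 4.0])
```

Passing an open file object instead of the StringIO reads the file line by line, so only the typed columns are held in memory.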

i960 · May 8, 2015

WTF? If he uses the built-in csv library or the pandas library, he'll have his array-of-arrays output to deal with. We're not dictating the actual lines for that because it's cookie-cutter crap that anyone who understands basic languages will know how to deal with.
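For anyone who does want those lines spelled out, a minimal sketch with the standard-library csv module follows. The three-column sample data and the variable names are hypothetical, chosen only to show the array-of-arrays result and how to turn it into columns.

```python
import csv
import io

# Hypothetical three-column sample; a real script would pass an open file.
text = "2015-05-08,101.5,300\n2015-05-11,102.0,250\n"

rows = list(csv.reader(io.StringIO(text)))   # list of rows, each a list of strings
columns = [list(col) for col in zip(*rows)]  # transpose rows into columns

dates, prices, volumes = columns
prices = [float(p) for p in prices]
print(rows[1][0])  # 2015-05-11
print(prices)      # [101.5, 102.0]
```

Note that csv.reader returns every field as a string, so numeric columns need an explicit conversion, as done for prices here.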

When you talk about trading do you tell people how to open and close orders as part of a trade? No. That's the same reason we don't tell people how to deal with lists or arrays - it's common knowledge and not worth pointing out.

volpunter · May 8, 2015

C#
Number of columns: 10
Number of rows: 1,000,000
Delimiter: ","
Machine: i7-3930K (3.20 GHz), 64-bit Windows, 32 GB memory, SSD drive

Reading in a comma-delimited CSV file took an average (over 20 runs) of 5.01 seconds: 1.33 seconds to read the data from disk and 3.68 seconds to parse the data from string to double and arrange it in columnar arrays. Note that data parsing is involved here, so you end up with strongly typed data. I also used some LINQ, which is slower than a more optimized version, and the parsing can be parallelized for large data sets, which I have not done here.
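For a like-for-like Python measurement, the same two phases (read from disk, then parse to doubles in columnar arrays) could be timed separately and averaged over several runs. The sketch below is one way to do that; the function name timed_load is hypothetical, and it assumes purely numeric fields with no quoted commas.

```python
import time
from array import array


def timed_load(path, n_cols, runs=20):
    """Return (avg_read_seconds, avg_parse_seconds, columns) over `runs` runs."""
    read_total = parse_total = 0.0
    columns = []
    for _ in range(runs):
        t0 = time.perf_counter()
        with open(path) as f:
            lines = f.read().splitlines()   # phase 1: disk to strings
        t1 = time.perf_counter()
        columns = [array("d") for _ in range(n_cols)]
        for line in lines:                  # phase 2: strings to doubles
            if not line:
                continue
            for col, field in zip(columns, line.split(",")):
                col.append(float(field))
        t2 = time.perf_counter()
        read_total += t1 - t0
        parse_total += t2 - t1
    return read_total / runs, parse_total / runs, columns
```

Keep in mind that after the first run the operating system will likely serve the file from its cache, so the averaged read time tends to understate a cold read from disk.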

P.S.: Reading in 10 million rows and parallelizing the parsing (5 threads) cuts the time to import, parse (strongly typed), and arrange in proper arrays down to 2.2 seconds total per 1 million rows. Memory consumption is extremely conservative and can be fine-tuned (which I have not done here).
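The chunk-and-parse-in-parallel idea can be sketched in Python as well. This is an illustration of the structure only, with hypothetical names (parse_chunk, parallel_columns) and an assumed chunk size. One important caveat: in CPython, threads do not speed up pure-Python parsing because of the global interpreter lock, so a real attempt at a speedup would swap in ProcessPoolExecutor, which offers the same map interface.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines):
    """Parse a list of comma-delimited lines into a list of float rows."""
    return [[float(x) for x in line.split(",")] for line in lines if line]


def parallel_columns(lines, n_cols, workers=5, chunk_size=100_000):
    """Parse chunks concurrently, then assemble one typed array per column."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    columns = [array("d") for _ in range(n_cols)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so row order is preserved.
        for rows in pool.map(parse_chunk, chunks):
            for row in rows:
                for col, value in zip(columns, row):
                    col.append(value)
    return columns
```

The final assembly into columns is done serially here to keep the sketch short; only the string-to-float conversion runs in the worker pool.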

Let's compare numbers. Maybe Python will blow my mind and I will do all my text manipulations in Python going forward?

volpunter · May 8, 2015

Despite your talk, you would still get a failing grade. If your stats professor asked you to calculate a covariance matrix and you presented a correlation matrix, you would get a point for ink usage but not much more.

You are the one dictating to OP how to present HIS data. Do you do the same to your customers, internal or external?

So, for comparison's sake, how fast is your Python implementation? Use Pandas or whatever pleases you, but present the result in the form OP asked for. I am curious.

By the way, I think Pandas, to my knowledge, makes very inefficient use of memory when reading a CSV file. Imagine you have 10 million rows: Pandas, I think, does not read the CSV/text file line by line but all at once and then processes it. That means the memory requirement will be twice as much as the data actually warrants. I am happy to stand corrected on this last statement, but I think I heard that's how it works.
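Whether or not Pandas behaves as described (the statement above is explicitly unverified), peak memory can be kept independent of file size by processing the file in bounded chunks. The pandas read_csv function accepts a chunksize argument for this purpose; the sketch below shows the same idea with only the standard library. The function name iter_chunks and the sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunk_rows):
    """Yield lists of at most `chunk_rows` parsed rows from an iterable of lines."""
    reader = csv.reader(lines)
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk


# Hypothetical data; with a real file, pass the open file object instead.
data = io.StringIO("1,2\n3,4\n5,6\n")
total = 0.0
for chunk in iter_chunks(data, chunk_rows=2):
    total += sum(float(row[0]) for row in chunk)  # only one chunk in memory
print(total)  # 9.0
```

The trade-off is that any computation has to be expressed chunk by chunk; anything that needs all rows at once still has to hold the full columns in memory.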

i960 · May 8, 2015

First off, if I were hell-bent on speed I'd simply write it in POSIX C, as I do for most of the stuff I'm concerned about speed-wise. Otherwise I'd write it in Perl. If I wrote it in Python I'd write it the straightforward route first, and then optimize if necessary.

Stop being hard headed. You're talking to someone who's been doing this shit for over 20 years.

volpunter · May 8, 2015

i960, come on, we are not talking about a difference of a few seconds here, and you know that. For large data sets Python will choke and get down on its knees. I did not post my results to fight over milliseconds or 1-2 seconds, but 120 seconds versus 2 seconds is a difference, no? And when using Pandas you end up somewhere in the middle, but still multiple times slower than a quickly whipped up C# version.

Please try to get my point here: Python is incredibly slow for this kind of work and it should not even be the tool of choice for this. Much less should it be the tool of choice for an algorithmic trading framework. It's simply utter nonsense. (I say this to some on this thread who vehemently attack me just because I pointed out, in a friendly way, the flaws in their thinking when they presented an algorithmic architecture, written in Python, on their blog.)

jj1111 · May 8, 2015

volpunter = Troll()

if volpunter.attribute() in ["atticus", "riskarb", "convexx"]:
    volpunter.set_ignore(True)
else:
    volpunter.set_ignore_anyway()

Written on my kid's LeapFrog, 30 watts, two 9V Duracells wired in parallel, 1.8 seconds. Even Python run on a Fisher-Price is fast.

Highly efficient, elegant code, that rapidly prototypes how most can save HOURS of their day by doing...

volpunter · May 8, 2015

This is actually funny. Thanks for the laugh.

What are you doing in the Programming thread anyway? Don't you belong in the Excel and VBA thread? ;-)

http://www.elitetrader.com/et/index...trade-based-stops.290372/page-20#post-4117769
