Python - Read and split lines from text file into indexes.

volpunter · May 14, 2015

quite impressive. Aside the memory issues for large files pandas seems to be quite capable in importing text files.

It takes me 1.7s running your code. You seem to be running on a pretty good SSD drive. Btw, I tested within Anaconda/Spyder on top of Python 3.4. This comes close to the 1.3s it takes C# to import the data (the rest is taken up with parsing the data into strong types)

Quiet1 said:
I revised the test to use the sample data the OP posted. It's now actually faster.

The code is in the attached text file. It generates the test data, saves it to a file (all outside pandas), reads it into a pandas dataframe then strips out the row below the headers. It's a bit faster still if you just supply the headers and tell pandas to ignore the first 3 rows.

So revised timings:
1 million rows ~ 1.15s (edit: originally said 1.27s but that was including saving csv out)
10 million rows ~ 10.75s (edit: originally said 14.5s but that was including saving csv out)

Surprising result to me. Will try the windows/old SSD at some point soon-ish.

Q1

Quiet1 · May 16, 2015

So on Windows (8.1) on same machine but with original 3yo SSD I get 1.25s (worst of 3) to read a million lines (compared to 1.15s on Ubuntu).

Both Pythons are 3.4 via Anaconda with Jetbrains PyCharm (by far the best Python IDE imho).

Like for like, Python can read the file into a list (without any parsing) in ~ 0.56s on windows.

volpunter · May 16, 2015

Really Quite performant, no question. Though some of your numbers do not really seem to add up. You earlier said it took you 1.27s with saving the CSV out and 1.15s without on your Linux box. The performance number to export the data to CSV seems extremely unrealistic.

Quiet1 said:
So on Windows (8.1) on same machine but with original 3yo SSD I get 1.25s (worst of 3) to read a million lines (compared to 1.15s on Ubuntu).

Both Pythons are 3.4 via Anaconda with Jetbrains PyCharm (by far the best Python IDE imho).

Like for like, Python can read the file into a list (without any parsing) in ~ 0.56s on windows.

Quiet1 · May 16, 2015

double checking the time to save now on my laptop (ie not same machine) I get about 0.17s to save the data to csv - meaning time from before file-open to after file-close only. So seems ballpark ok to me given i'd expect my desktop to be faster.

volpunter · May 16, 2015

Thanks for checking. According to your performance metric and assuming a floating point number of size 8 bytes would amount to at least a required throughput of 440mb/sec which is definitely within the specs of most modern ssd drives.

Again, quite respectable hardware performance (we have by now almost moved away from actual Python performance and into hardware prrformance) but also thanks for showing that Python and Pandas is quite performant in respect to raw text ex and import (among others).

Quiet1 said:
double checking the time to save now on my laptop (ie not same machine) I get about 0.17s to save the data to csv - meaning time from before file-open to after file-close only. So seems ballpark ok to me given i'd expect my desktop to be faster.

volpunter · May 16, 2015

btw, I like Spyder and also Visual Studio as it brings everything together for me on my Windows machine.

Quiet1 said:
So on Windows (8.1) on same machine but with original 3yo SSD I get 1.25s (worst of 3) to read a million lines (compared to 1.15s on Ubuntu).

Both Pythons are 3.4 via Anaconda with Jetbrains PyCharm (by far the best Python IDE imho).

Like for like, Python can read the file into a list (without any parsing) in ~ 0.56s on windows.

Butterfly · May 16, 2015

OTM-Options said:

I have made some changes to the Python CSV to CSV script:

Added a format_currency function to change 12345.67 into $12,345.67 and -12345.67 into ($-12,345.67).
Instead of calling the write_file function with each loop I have now concatenated each loop together into a string and then write to file at the end of the script.
Only the row counter is printed to the terminal while the script is running, instead of all the data. This has made a big improvement on the performance of the script.
The 1,000,000 Line Test now comes in at 1 minute 28 seconds, down from 2 minutes 53 seconds. More info: The 1,000,000 Line Test

Updated CSV to CSV script. 133 lines and 4.7Kb in size.

Code:

#!/usr/bin/python
# Python version 2.7.6

import datetime
import time

def timer(label):
  ts = time.time()
  st = datetime.datetime.fromtimestamp(ts).strftime('%H:%M:%S')
  timer_out = open("timer.txt",'a')
  timer_out.write(label + ": " + st + "\n")
  timer_out.close()

timer("Start")

def format_currency(value):
  if value < 0:
  results = '${:.2f}'.format(value)
  value = "(" + results + ")"
  return value
  else:
  return '${:.2f}'.format(value)

csv_in = "TransactionHistory_22523594.csv"
csv_out = "transaction_history.csv"
header = "Row,Date,Buy/Sell,QTY,Security,Price,Debit,"\
  "Credit,Commission,Total Amount,Currency" + '\n'

footer_commission = 0
footer_debit = 0
footer_credit = 0
footer_total_amount = 0
counter = 1
split = ","
join = ","
row_out = ""

def write_file(row,write_to):
  f_out = open(write_to,'a')
  f_out.write(row + "\n")
  f_out.close()

write_file(header,csv_out)

for line in reversed(list(open(csv_in))):
  if len(line.strip()) != 0 :
  line = line.strip()
  column = line.split(split)
  if column[2] == "Buy" or column[2] == "Sell" or column[2] == "Expired":
  row_counter = '{0:03d}'.format(counter)
  transaction_date = column[0]
  buy_sell = column[2]
  qty = column[5]
  security = column[4]
  price = column[6]
  if price == "":
  price = "0"
  total_amount = column[8]
  currency = column[9]
  transaction_date = (datetime.datetime.strptime\
  (transaction_date, "%Y-%m-%d").strftime("%a %b %d"))
  qty = abs(int(qty))
  price = float(price)
  total_amount = float(total_amount)
  amount = qty * price * 100
  amount = int(amount)

  abs_total_amount = abs(float(total_amount))
  if column[8] >= "0" and  column[8] <= "1":
  commission = 0
  else:
  commission = abs(abs_total_amount - amount)
  if total_amount <= 1:
  debit = amount
  credit = 0
  else:
  debit = 0
  credit = amount
  if debit > abs_total_amount:
  credit = debit
  commission = debit
  debit = 0

  footer_debit = (footer_debit + debit)
  footer_credit = (footer_credit + credit)
  footer_commission = (footer_commission + commission)
  footer_total_amount = (footer_total_amount + total_amount)

  qty = str(abs(qty))
  price = format_currency(price)
  debit = format_currency(debit)
  credit = format_currency(credit)

  if debit == "$0.00":
  debit = ""
  if credit == "$0.00":
  credit = ""
  commission = format_currency(commission)
  if commission == "0":
  commission = ""
  total_amount = format_currency(total_amount)
  counter = counter + 1

  row = (row_counter + join + transaction_date + join + buy_sell + join \
  + qty + join + security + join + price + join + debit + join + credit + join \
  + commission + join + total_amount + join + currency + "\n")

  print row_counter
  row_out = row_out + row

pl = (footer_debit + footer_commission)
pl_percent = ((footer_credit - (pl)) / pl * 100)
pl_debit = format_currency(pl)
footer_debit = format_currency(footer_debit)
footer_credit = format_currency(footer_credit)
footer_commission = format_currency(footer_commission)
footer_total_amount = format_currency(footer_total_amount)
pl_percent = str(pl_percent)

join2x = (join + join)
join5x = (join + join + join + join + join)
footer = (join5x + "Subtotal" + join + footer_debit + join + footer_credit + join \
  + footer_commission + join + footer_total_amount + join + currency + "\n \n" \
  + join5x + join2x + "Total Debit" + join2x + pl_debit + join + currency + "\n" \
  + join5x + join2x + "Total Credit" + join2x + footer_credit + join + currency + "\n" \
  + join5x + join2x + "P/L" + join + pl_percent + " %" + join + footer_total_amount \
  + join + currency)

write_file(row_out,csv_out)
write_file(footer,csv_out)
timer("Finish")

input("\n\nPress the Enter key to exit.")

jesus, what a poorly written python code,

so unpython !!! you should be shot !!!

This is really a simple task, can't believe you can't even figure it out on your own. Sounds like you are a college student. That's the problem with the new Facebook generation, no programming skills.

Python - Read and split lines from text file into indexes.

volpunter

Quiet1

volpunter

Quiet1

volpunter

volpunter

Butterfly