Python - Read and split lines from text file into indexes.

Pandas is taylor made for this kind of processing. I strongly encourage you to use it.

For example you could do the first two things with one line of code.
Third thing would be:

data.NewTotal - data.Total - data.Costs

Fourth thing would be:

data[data.Type =="Buy"]

The couple of hours you might spend understanding how to use pandas for these basic things will pay dividends many times over.

Thanks for all the replies. I have included the actual CSV data I am using to give you a better idea how it is formatted. My objective is:

  • Read the file line by line.
  • Break each line at the comma into it's own column or index- index[0] index[1] index[2] ...... index[9]
  • Preform basic math on some of the columns and put the results into a new column.
  • Only use rows that match a certain criteria, such as: index[2] == Buy,Sell or Expired.
  • Output only the columns I want to a new CSV file, also comma separated.

I have completed all of the above on one line hard coded into my script. Now I'm working on reading an entire file - line by line - then writing to a new file.




:)
 
Just to show how easy it is in pandas, this took me 2 minutes to do everything you wanted.

I removed the first and third line from your csv. The second remaining line then becomes the header line.

Code:
import pandas as pd
ans=pd.read_csv("temp.csv")
ans.head(5)

Out[1]:
  Transaction Date Settlement Date Activity Description  \
0       2014-05-09      2014-05-12                 Sell
1       2014-05-09      2014-05-12                 Sell
2       2014-05-08      2014-05-09                  Buy
3       2014-05-08      2014-05-09                 Sell
4       2014-05-07      2014-05-08                  Buy

               Description                      Symbol  Quantity  Price  \
0  POWERSHARES QQQ TR SR 1    CALL QQQ 2014MAY09 86.50        -3   0.16
1                    APPLE  CALL AAPL 2014MAY09 610.00        -2   0.01
2  POWERSHARES QQQ TR SR 1    CALL QQQ 2014MAY09 86.50         3   0.30
3             TESLA MOTORS  CALL TSLA 2014MAY09 240.00        -1   0.01
4             TESLA MOTORS  CALL TSLA 2014MAY09 240.00         1   0.67

  Currency  Total Amount Currency.1
0      USD         34.29        USD
1      USD          0.00        USD
2      USD       -103.70        USD
3      USD          0.00        USD
4      USD        -78.20        USD

ans.Quantity*ans.Price

Out[1]:
0  -0.48
1  -0.02
2  0.90
3  -0.01
4  0.67
5  0.98
6  0.39
7  -0.37
8  0.68
9  NaN

subans=ans[ans['Activity Description']=="Sell"]
print subans

Out[1]:
  Transaction Date Settlement Date Activity Description  \
0  2014-05-09  2014-05-12  Sell
1  2014-05-09  2014-05-12  Sell
3  2014-05-08  2014-05-09  Sell
7  2014-05-02  2014-05-05  Sell
10  2014-04-28  2014-04-29  Sell
18  2014-04-25  2014-04-28  Sell
20  2014-04-24  2014-04-25  Sell

  Description  Symbol  Quantity  \
0  POWERSHARES QQQ TR SR 1  CALL QQQ 2014MAY09 86.50  -3
1  APPLE  CALL AAPL 2014MAY09 610.00  -2
3  TESLA MOTORS  CALL TSLA 2014MAY09 240.00  -1
7  KEURIG GREEN MOUNTAIN COM  PUT GMCR 2014MAY02 92.00  -1
10  APPLE  CALL AAPL 2014MAY02 590.00  -1
18  BAIDU SPONSORED ADR REPSTG ORD  CALL BIDU 2014APR25 170.00  -1
20  APPLE  CALL AAPL 2014APR25 560.00  -1

  Price Currency  Total Amount Currency.1
0  0.16  USD  34.29  USD
1  0.01  USD  0.00  USD
3  0.01  USD  0.00  USD
7  0.37  USD  25.79  USD
10  5.88  USD  576.78  USD
18  0.04  USD  0.00  USD
20  8.13  USD  801.78  USD

subans.to_csv("/home/rsc/temp/newfile.csv")

I bet that is fewer lines than your script, and probably more readable than say index[3]*index[4]. And it works.

Thanks for all the replies. I have included the actual CSV data I am using to give you a better idea how it is formatted. My objective is:


    • Read the file line by line.
    • Break each line at the comma into it's own column or index- index[0] index[1] index[2] ...... index[9]
    • Preform basic math on some of the columns and put the results into a new column.
    • Only use rows that match a certain criteria, such as: index[2] == Buy,Sell or Expired.
    • Output only the columns I want to a new CSV file, also comma separated.
I have completed all of the above on one line hard coded into my script. Now I'm working on reading an entire file - line by line - then writing to a new file.



CSV FILE
Code:
Transaction Type=All,Product Type=All,Symbol=,From=2014-05-09,To=2014-12-31
Transaction Date,Settlement Date,Activity Description,Description,Symbol,Quantity,Price,Currency,Total Amount,Currency
----------------,---------------,--------------------,-----------,------,--------,-----,--------,------------,--------
2014-05-09,2014-05-12,Sell,POWERSHARES QQQ TR SR 1,CALL QQQ 2014MAY09 86.50,-3,0.16,USD,34.29,USD
2014-05-09,2014-05-12,Sell,APPLE,CALL AAPL 2014MAY09 610.00,-2,0.01,USD,0.00,USD
2014-05-08,2014-05-09,Buy,POWERSHARES QQQ TR SR 1,CALL QQQ 2014MAY09 86.50,3,0.30,USD,-103.70,USD
2014-05-08,2014-05-09,Sell,TESLA MOTORS,CALL TSLA 2014MAY09 240.00,-1,0.01,USD,0.00,USD
2014-05-07,2014-05-08,Buy,TESLA MOTORS,CALL TSLA 2014MAY09 240.00,1,0.67,USD,-78.20,USD
2014-05-06,2014-05-07,Buy,APPLE,CALL AAPL 2014MAY09 610.00,1,0.98,USD,-159.40,USD
2014-05-06,2014-05-07,Buy,APPLE,CALL AAPL 2014MAY09 610.00,1,0.39,USD,0.00,USD
2014-05-02,2014-05-05,Sell,KEURIG GREEN MOUNTAIN COM,PUT GMCR 2014MAY02 92.00,-1,0.37,USD,25.79,USD
2014-04-30,2014-05-01,Buy,KEURIG GREEN MOUNTAIN COM,PUT GMCR 2014MAY02 92.00,1,0.68,USD,-79.20,USD
2014-04-29,2014-04-29,Expired,BAIDU SPONSORED ADR REPSTG ORD,CALL BIDU 2014APR25 170.00,1,,,0.00,USD
2014-04-28,2014-04-29,Sell,APPLE,CALL AAPL 2014MAY02 590.00,-1,5.88,USD,576.78,USD
2014-04-28,2014-04-28,Expired,BAIDU SPONSORED ADR REPSTG ORD,CALL BIDU 2014APR25 170.00,-1,,,0.00,USD
2014-04-28,2014-04-28,Expired,SPDR DOW JONES INDL AVERAGE ET,CALL DIA 2014APR25 165.00,-2,,,0.00,USD
2014-04-28,2014-04-28,Foreign exchange,SELL USD @ 1.0870,,,,,163.05,CAD
2014-04-28,2014-04-28,Foreign exchange,SELL USD @ 1.0870,,,,,-150.00,USD
2014-04-25,2014-04-25,Foreign exchange,SELL USD @ 1.0860,,,,,-500.00,USD
2014-04-25,2014-04-28,Buy,APPLE,CALL AAPL 2014MAY02 590.00,1,0.72,USD,-83.20,USD
2014-04-25,2014-04-25,Foreign exchange,SELL USD @ 1.0860,,,,,543.00,CAD
2014-04-25,2014-04-28,Sell,BAIDU SPONSORED ADR REPSTG ORD,CALL BIDU 2014APR25 170.00,-1,0.04,USD,0.00,USD
2014-04-24,2014-04-25,Buy,SPDR DOW JONES INDL AVERAGE ET,CALL DIA 2014APR25 165.00,2,0.24,USD,-60.45,USD
2014-04-24,2014-04-25,Sell,APPLE,CALL AAPL 2014APR25 560.00,-1,8.13,USD,801.78,USD
2014-04-24,2014-04-25,Buy,BAIDU SPONSORED ADR REPSTG ORD,CALL BIDU 2014APR25 170.00,1,1.06,USD,-117.20,USD
2014-04-22,2014-04-23,Buy,APPLE,CALL AAPL 2014APR25 560.00,1,1.09,USD,-120.20,USD



:)
 
Pandas is taylor made for this kind of processing. I strongly encourage you to use it.

For example you could do the first two things with one line of code.
Third thing would be:

data.NewTotal - data.Total - data.Costs

Fourth thing would be:

data[data.Type =="Buy"]

The couple of hours you might spend understanding how to use pandas for these basic things will pay dividends many times over.




Thanks globalarbtrader. I'm going to try and avoid any modules, including the math module if I can. Below is a what I have so far, it's 60% complete. I hard coded one line of data into the script and it writes the manipulated line to a new CSV file - now I have to replace that line with the code to open a file and loop through all the lines.


Python Code
Code:
#!/usr/bin/python

csvOut = "TransactionHistoryOutput.csv"
counter = 1
split = ","
join = ","

header = "Row,Transaction Date,Buy/Sell,QTY,Security,Price,Amount,Commission,Total Amount,Currency"
line = "2014-04-30,2014-05-01,Buy,KEURIG GREEN MOUNTAIN COM,PUT GMCR 2014MAY02 92.00,1,0.68,USD,-79.20,USD"

items = line.split(split)
if items[2] == "Buy" or items[2] == "Sell" or items[2] == "Expired":

   Row= '{0:03d}'.format(counter)
   TransactionDate = items[0]
   BuySell = items[2]
   QTY = items[5]
   Security = items[4]
   Price = items[6]
   TotalAmount = items[8]
   Currency = items[9]
   counter = counter + 33

   price = float(Price)
   total = abs(float(TotalAmount))
   com = abs(price * 100 - total)
   Commission = str(com) #Price * 100 - TotalAmount = Commissions
  
   mnt = total - com
   Amount = str(mnt) # TotalAmount - Commission = Amount
  
dataOutput = [Row + join + TransactionDate + join +  BuySell + join + QTY + join + Security + join + Price + join + Amount + join + Commission + join + TotalAmount + join + Currency]
dataOutput = ''.join(dataOutput)

def writeFile(data):
   f = open(csvOut,'a')
   f.write(data + "\n")
   f.close()

writeFile(header)
writeFile(dataOutput)
print dataOutput

input("\n\nPress the Enter key to exit.")


:)
 
Thanks globalarbtrader. I was editing my post #13 when you posted all the code in post #12. I'm going out now and will review it in a few hours.



:)
 
This is highly advanced so you might need to study theory behind whats going on here

download python https://store.continuum.io/cshop/anaconda/
and use the pandas library

Code:
import pandas

listOflists=list(pandas.read_csv("FILE.csv").to_records())
Pandas is
Thanks globalarbtrader. I'm going to try and avoid any modules, including the math module if I can. Below is a what I have so far, it's 60% complete. I hard coded one line of data into the script and it writes the manipulated line to a new CSV file - now I have to replace that line with the code to open a file and loop through all the lines.


Python Code
Code:
#!/usr/bin/python

csvOut = "TransactionHistoryOutput.csv"
counter = 1
split = ","
join = ","

header = "Row,Transaction Date,Buy/Sell,QTY,Security,Price,Amount,Commission,Total Amount,Currency"
line = "2014-04-30,2014-05-01,Buy,KEURIG GREEN MOUNTAIN COM,PUT GMCR 2014MAY02 92.00,1,0.68,USD,-79.20,USD"

items = line.split(split)
if items[2] == "Buy" or items[2] == "Sell" or items[2] == "Expired":

   Row= '{0:03d}'.format(counter)
   TransactionDate = items[0]
   BuySell = items[2]
   QTY = items[5]
   Security = items[4]
   Price = items[6]
   TotalAmount = items[8]
   Currency = items[9]
   counter = counter + 33

   price = float(Price)
   total = abs(float(TotalAmount))
   com = abs(price * 100 - total)
   Commission = str(com) #Price * 100 - TotalAmount = Commissions

   mnt = total - com
   Amount = str(mnt) # TotalAmount - Commission = Amount

dataOutput = [Row + join + TransactionDate + join +  BuySell + join + QTY + join + Security + join + Price + join + Amount + join + Commission + join + TotalAmount + join + Currency]
dataOutput = ''.join(dataOutput)

def writeFile(data):
   f = open(csvOut,'a')
   f.write(data + "\n")
   f.close()

writeFile(header)
writeFile(dataOutput)
print dataOutput

input("\n\nPress the Enter key to exit.")


:)

Pandas is the right tool for the job.
Avoiding the modules is a mistake.
Not only is the pandas DataFrame easier to use when manipulating data, it is also faster than your script since it uses Cython to call C libraries in many cases.

If you want to make your own parser implementation and control all the low level details to make things run faster than pandas, then you'll need to use C/C++
 
Code:
 dataOutput = [Row + join + TransactionDate + join +  BuySell + join + QTY + join + Security + join + Price + join + Amount + join + Commission + join + TotalAmount + join + Currency]
dataOutput = ''.join(dataOutput)

:)
You really confirm every prejudice one has about php programmers.
Unbelievable you refuse to use the proper libraries.
 
OK ...... I checked out Pandas and the Python CSV module.

Pandas
It doesn't come with the standard Python installation so I couldn't run any tests, and I couldn't find any examples on line that interested me. It is overkill for my needs so I won't bother with it.


Python CSV module
It comes with the standard Python installation and lots of examples on line. But the CSV manipulation I require can be accomplished without it and probably just as efficiently. I do plan on using the Python CSV module on my next project.


Using Python version 2.7.6 and the tasks I wanted my script to perform.
  • No editing of original CSV file. Any lines or empty data fields that can crash the script must be handled in the script.
  • Reverse the lines of the original CSV file before processing. Earliest transactions at the top, not the bottom.
  • Split each line at the commas and assign each column an index[] variable.
  • Delete lines that do not contain certain values such as "Buy", "Sell", or "Expired".
  • Delete unwanted columns and create new columns with values from some of the remaining columns.
  • Add a line counter, new column headers and footer labels.
  • Total the value of some of the columns and get P/L percentage.
  • Format the timestamp from 2014-05-06 to Tue May 06.
  • Change negative values into positive values in the quantity column.
  • Change 0 values to nothing in the debit, credit and commission columns.
  • Output the results to a new CSV file.

My completed 100% working script using only the datetime module.

Code:
#!/usr/bin/python
# Python version 2.7.6

import datetime

csv_in = "TransactionHistory_22523594.csv"
csv_reverse = "temp_reverse.csv"
csv_out = "transaction_history.csv"
header = "Row,Date,Buy/Sell,QTY,Security,Price,Debit,Credit,Commission,Total Amount,Currency" + '\n'
footer_commission = 0
footer_debit = 0
footer_credit = 0
footer_total_amount = 0
counter = 1
split = ","
join = ","

def write_file(data,write_to):
  f_out = open(write_to,'a')
  f_out.write(data + "\n")
  f_out.close()

def reverse_file(in_put):
  for line in reversed(list(open(in_put))):
  rows = (line.rstrip())
  write_file(rows,csv_reverse)

reverse_file(csv_in)
write_file(header,csv_out)

f_in = open(csv_reverse, "r")
for line in f_in.xreadlines():
  if len(line.strip()) != 0 :
  line = line.strip()
  parts = line.split(split)
  if parts[2] == "Buy" or parts[2] == "Sell" or parts[2] == "Expired":
  row= '{0:03d}'.format(counter)
  transaction_date = parts[0]
  buy_sell = parts[2]
  qty = parts[5]
  security = parts[4]
  price = parts[6]
  if price == "":
  price = "0"
  total_amount = parts[8]
  currency = parts[9]
  transaction_date = datetime.datetime.strptime(transaction_date, "%Y-%m-%d").strftime("%a %b %d")
  qty = abs(int(qty))
  price = float(price)
  total_amount = float(total_amount)
  amount = qty * price * 100
  amount = int(amount)

  t = abs(float(total_amount))
  if parts[8] >= "0" and  parts[8] <= "1":
  commission = 0
  else:
  commission = abs(t - amount)
  if total_amount <= 1:
  debit = amount
  credit = 0
  else:
  debit = 0
  credit = amount
  if debit > t:
  credit = debit
  commission = debit
  debit = 0

  footer_debit = footer_debit + debit
  footer_credit = footer_credit + credit
  footer_commission = footer_commission + commission
  footer_total_amount = footer_total_amount + total_amount

  qty = str(abs(qty))
  price = str(price)
  debit = str(debit)
  if debit == "0":
  debit = ""
  credit = str(credit)
  if credit == "0":
  credit = ""
  commission = str(commission)
  if commission == "0":
  commission = ""
  total_amount = str(total_amount)
  counter = counter + 1

  data_out = row + join + transaction_date + join + buy_sell + join + qty + join + security + join \
  + price + join + debit + join + credit + join + commission + join + total_amount + join + currency

  print data_out
  write_file(data_out,csv_out)

pl = footer_debit + footer_commission
pl_percent = (footer_credit - (pl)) / pl * 100
footer_debit = str(footer_debit)
footer_credit = str(footer_credit)
footer_commission = str(footer_commission)
footer_total_amount = str(footer_total_amount)
pl_percent = str(pl_percent)

footer = '\n' + join + join + join + join + join + "Subtotal" + join + footer_debit + join + footer_credit + join \
  + footer_commission + join + footer_total_amount + join + currency + '\n' + '\n' + join + join + join \
  + join + join + join + join + "P/L" + join + pl_percent + " %" + join + footer_total_amount + join + currency

write_file(footer,csv_out)

input("\n\nPress the Enter key to exit.")



:)
 
Are you seriously adding the 'join' string rather than using the join *function*? That is not the way to do things.


Looks like I picked a variable name that closely resembles a function. I assigned the comma to the join and split variable to make the script easier to change to handle different characters in the CSV file.

Example: I can output using ZZZZZZZ to join the lines instead of a comma. Then input using the same ZZZZZZZ or whatever character(s) I choose to split the lines.



:)
 
Last edited:
Back
Top