Using pdfminer to parse out the title of a paper

The title, author, date, etc. of a paper are obvious to the human eye, but I have been wondering how to extract them automatically for a folder of papers downloaded from [SSRN](https://www.ssrn.com/index.cfm/en/). Most of the papers don't carry proper embedded metadata, so the data has to be parsed from the title page.
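
Before falling back to the title page, it may be worth checking which files do carry a usable title in their document information dictionary. A minimal sketch, assuming pdfminer.six is installed; `decode_pdf_string` and `embedded_title` are names made up for illustration, and treating strings without a byte-order mark as Latin-1 is only an approximation of PDFDocEncoding:

```python
def decode_pdf_string(raw):
    """Decode a PDF text string (hypothetical helper).

    Strings starting with a byte-order mark are UTF-16; anything else is
    decoded as Latin-1, which only approximates PDFDocEncoding.
    """
    if isinstance(raw, str):
        return raw
    if raw.startswith(b"\xfe\xff") or raw.startswith(b"\xff\xfe"):
        return raw.decode("utf-16")
    return raw.decode("latin-1")


def embedded_title(filepath):
    """Return the embedded Title entry of a PDF, or None if it is missing or empty."""
    # Imported here so the decoding helper above works without pdfminer.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    with open(filepath, "rb") as fh:
        doc = PDFDocument(PDFParser(fh))
        for info in doc.info:  # list of info dictionaries, often just one
            raw = resolve1(info.get("Title"))
            if isinstance(raw, (bytes, str)):
                title = decode_pdf_string(raw).strip()
                if title:
                    return title
    return None
```

When `embedded_title` returns `None`, the title-page approach is still needed.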

Most titles are set in the largest font on the title page, so I look for the text blocks with the maximum font size and assume they make up the title. I tried this on a few files; it works for most of them, but not always.

Author names don't seem to follow any particular pattern, so I don't know how to parse them. Any ideas?

Code:
import sys
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

def getTitle(filepath):
    """method to extract the title of a pdf file"""
 
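    # str.strip() on a non-string raises AttributeError, not TypeError,
    # so check the type explicitly and stop instead of carrying on
    if not isinstance(filepath, str):
        sys.exit("filepath not a string")

    filepath = filepath.strip()
    if not filepath.lower().endswith(".pdf"):  # also accepts .PDF
        sys.exit("filepath not ending with .pdf")
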
    data = extractFirstPageElements(filepath)
    texts = data[0]
    fonts = data[1]

    if len(texts) == 0: #in case no text is read, either the pdf text is not readable or first page is blank
        title = "Unknown"
    else:
        maxFontPos = extractTitlePos(fonts)
        uncleanedTitle = extractTitle(texts, maxFontPos)
        title = cleanTitle(uncleanedTitle)
 
    return title


def extractFirstPageElements(filepath):
    """helper method to extract text elements and their corresponding font sizes into lists"""

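    fonts = []
    elements = []
    for page in extract_pages(filepath):
        for element in page:
            if isinstance(element, LTTextContainer):
                font_size = None  # reset so a size never leaks in from the previous element
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            font_size = character.size  # size of the first glyph in the block
                            break
                    if font_size is not None:
                        break  # stop at the first line that contains a glyph
                if font_size is None:
                    continue  # no glyphs in this block, e.g. whitespace only
                fonts.append(font_size)
                elements.append(element.get_text())
        break  # only the first page is needed
    return [elements, fonts]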


def extractTitlePos(fonts):
    """helper method to extract the positions having the largest font size"""
    font_pos = []
    maxFont = max(fonts)
    for pos, size in enumerate(fonts):
        if size == maxFont:
            font_pos.append(pos)
    return font_pos


def extractTitle(elements, positions):
    """helper method to extract those elements having the largest font size, then return as a joint string"""
    title = []
    for i in positions:
        title.append(elements[i])
    return "".join(title)


def cleanTitle(title):
    """helper method to clean the title by removing newlines and illegal filename symbols"""
    title = title.strip() #remove whitespaces
    title = title.replace("\n", " ") #remove any inline \n
    title = title.replace(":", " -") #replace invalid filename : symbol
    title = title.replace("?", " ") #replace invalid filename ? symbol
    return title
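
One possible reason for the "not always" is that `extractTitlePos` compares font sizes with `==`, so a title whose lines differ by a rounding error, or a second large-font block elsewhere on the page, can throw it off. A rough sketch of a more forgiving selection; `largest_font_positions`, `first_run` and the 1% tolerance are assumptions for illustration, not something tested on the SSRN files:

```python
import math


def largest_font_positions(fonts, rel_tol=0.01):
    """Return the positions whose font size is within rel_tol of the maximum."""
    if not fonts:
        return []
    max_font = max(fonts)
    return [
        pos
        for pos, size in enumerate(fonts)
        if math.isclose(size, max_font, rel_tol=rel_tol)
    ]


def first_run(positions):
    """Keep only the first run of consecutive positions.

    This drops a later block that happens to share the title's font size.
    """
    run = positions[:1]
    for pos in positions[1:]:
        if pos != run[-1] + 1:
            break
        run.append(pos)
    return run
```

`first_run(largest_font_positions(fonts))` could then stand in for `extractTitlePos(fonts)` in `getTitle`.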
 
Here is a quick and dirty way to parse document titles with bash, pdftotext, and perl.
Code:
for f in SSRN*.pdf
do
    echo
    echo "${f}"
    pdftotext -layout "${f}" - |            # convert pdf to text keeping original layout
    perl -n -e 'use warnings; use strict;
    our @t; # holds the title
    my $line = $_;
    if ( ($line =~ /^\f/) && (scalar(@t) > 0) )
    {
        # form feed after finding title
        print join(" ", @t), "\n"; exit(0);
    }
    $line =~ s/^\s+//; $line =~ s/\s+$//;   # remove leading and trailing white space
    if ( ($line eq "") && (scalar(@t) > 0) )
    {
        # blank line after finding title
        print join(" ", @t), "\n"; exit(0);
    }
    push (@t, $line) unless $line eq "";   # save next part of title, skipping blank lines before it
    '
done

Example run on Windows 10 with Cygwin (would probably work on Linux too):
SSRN-id1307643.pdf
Financial Astrology: Mapping the Presidential Election Cycle in US Stock Markets

SSRN-id1447443.pdf
Exercises in Advanced Risk and Portfolio Management R (ARPM) with Solutions and Code, supporting the 6-day intensive course ARPM Bootcamp

SSRN-id2140091.pdf
Demystifying Time-Series Momentum Strategies: Volatility Estimators, Trading Rules and Pairwise Correlations∗

SSRN-id264513.pdf
Spectral Analysis of Economic Time Series Behaviour

SSRN-id3184092.pdf
Dynamic Alpha: A Spectral Decomposition of Investment Performance Across Time Horizons∗

SSRN-id566882.pdf
Technical Analysis in Financial Markets

SSRN-id715301.pdf
A Simplified Approach to Understanding the Kalman Filter Technique
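
Two of the titles above end in a footnote marker (∗) that pdftotext carries over from the title page. A small sketch of stripping such markers afterwards; `strip_footnote_marks` and the set of marker characters are assumptions for illustration:

```python
# Characters commonly used as footnote markers after a title (assumed set).
FOOTNOTE_MARKS = "*∗†‡§¶"


def strip_footnote_marks(title):
    """Remove trailing footnote markers and surrounding white space from a title."""
    return title.rstrip().rstrip(FOOTNOTE_MARKS).rstrip()
```

Titles without a trailing marker pass through unchanged.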
 
Can you briefly explain the logic?
 
In pseudocode:
Code:
for each pdf file:
    Print the file name to make the output easier to follow.
    Convert the pdf file to text with the pdftotext command, keeping the layout of the document.
    Save lines from the text until, once at least one line has been saved, a line starts with a form feed or contains only white space.
    Print the saved lines joined with spaces.
    Skip to the next pdf file.
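
For anyone who would rather stay in Python, the same steps can be sketched roughly as below. This is not a line-for-line port: `title_from_text` and `pdf_title` are hypothetical names, blank lines before the title are skipped, and it assumes the `pdftotext` command is on the PATH and writes UTF-8:

```python
import subprocess


def title_from_text(text):
    """Return the first block of non-blank lines, joined with single spaces."""
    title = []
    for line in text.split("\n"):
        if line.startswith("\f") and title:
            break  # form feed (new page) after finding the title
        line = line.strip()
        if not line:
            if title:
                break  # blank line after finding the title
            continue  # blank line before the title: skip it
        title.append(line)
    return " ".join(title)


def pdf_title(path):
    """Convert a pdf to text with pdftotext -layout and extract the title."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return title_from_text(result.stdout)
```

Looping over the folder is then a matter of calling `pdf_title` on each file name.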
 