Idea in extracting the title of an article?

BMK · Mar 16, 2024

semperfrosty said:
He actually seems like a normal, intelligent guy, but every now and again:

I would say off meds.

He also mentions in this thread 'suddenly' buying a property in the UK. Then he is in a fluster about the funds.

Possibly bipolar, with some mania and impulse issues followed by a crash.

Drawdown Addict · Mar 16, 2024

BMK said:
Would not have this problem if he had properly named the files when he saved them.

We need an app you can install on your Windows box that will force you to name files in compliance with a strict, user-defined file naming convention, i.e., stops you from saving the file unless you rename it in a way that complies with your local policy. Then you'll never have trouble locating or identifying a file again.

This will be the next killer app. I'm partnering with Adam Neumann to develop it. SoftBank has agreed to fund our startup with $57 million. We are leasing space at WeWork for our team of developers. This will change how people save files forever.

I will never have to work again.

Disclosure: This post is a parody.

I can tell you are being sarcastic, but that is not far from something many companies need.
I can't stress enough how many times I've had to dig through never-ending files, Confluence pages, wiki lists, and analysts' PBIs. We waste an incredible amount of time at work just trying to understand what we are supposed to achieve, all because the documentation base is rubbish.

Something at the OS level that forces people to name files consistently would help many companies.

vanzandt · Mar 16, 2024

Baron said:
The process I'm referring to is a manual process, not automated. You open up each paper, copy the title of it, and then rename the file by pasting the title as the new filename.

So for example, you might have a paper with a filename of 2022hft.pdf and when you open that file you see that the paper is titled "HFT Activity and Overview in 2022", so you would copy that and rename the file HFT_Activity_and_Overview_in_2022.pdf
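The manual rename described above can be sketched in Python. This is only an illustration, not Baron's actual workflow: the helper name `title_to_filename` is made up, and the set of characters treated as illegal assumes the files live on Windows.

```python
import re

# Characters Windows does not allow in filenames (assumed target platform).
ILLEGAL_CHARS = r'[<>:"/\\|?*]'

def title_to_filename(title, extension=".pdf"):
    """Turn a paper title into an underscore-separated filename (hypothetical helper)."""
    cleaned = re.sub(ILLEGAL_CHARS, " ", title)  # replace illegal characters with spaces
    words = cleaned.split()                      # split() also collapses repeated whitespace
    return "_".join(words) + extension

print(title_to_filename("HFT Activity and Overview in 2022"))
# HFT_Activity_and_Overview_in_2022.pdf
```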

That has got to be the dumbest idea of all time. Wtf do you know about coding?! @Baron on ignore! I hope Bitcoin goes to $1!

Hey for real though, I was like Quanto and thought this guy has to be your buddy or something... I couldn't believe what I read. And ya know what's really funny, Baron, you are usually so cool and run this place with such class... I actually thought to myself... "Watch Baron take the extreme ultimate, bigger man win here, and not do anything," but yeah, that remark of his was so far out of left field... it's like "whaaaaat the f?!"

I don't think that one can be attributed to booze or weed. I've known some pretty nasty folks, but not one of them, no matter how stoned they were, would walk into someone's house and take a sh*t right in the middle of the living room. Piss in the pool maybe, but geez... whatever.

semperfrosty · Mar 17, 2024

blueraincap said:

I have a bunch of academic papers on my computer that need organising.
I need to extract their titles, but have not found a reliable method yet.
Any ideas?
Usually the title has the largest font on the first page, so I used Python (with the pdfminer module) to find it, but it only works 50-60% of the time.

Code:

import sys
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTLine, LTChar

#this module contains one public method to parse the title of a typical PDF academic paper, given its filepath
#the parsing often fails because of difficulty in parsing PDF files

def getTitle(filepath):
    """main method to extract the title of a pdf file"""

    if not isinstance(filepath, str): #strip() on a non-string raises AttributeError, so check the type first
        print("filepath not a string")
        return "Unknown"
    filepath = filepath.strip()

    if not filepath.lower().endswith(".pdf"):
        sys.exit("filepath not ending with .pdf in " + filepath)
 
    try:
        data = extractFirstPageElements(filepath)
        texts = data[0]
        fonts = data[1]
   
        maxFontPos = extractTitlePos(fonts)
        uncleanedTitle = extractTitle(texts, maxFontPos)
        title = cleanTitle(uncleanedTitle)

    except Exception: #mostly due to IndexError as no texts are read in due to unreadable file or blank page, or some readTexts-fontSize mismatch
        title = "Unknown"

    if len(title) > 120: #ultra-long title likely error
        title = "Unknown"

    return title


def extractFirstPageElements(filepath):
    """helper method to extract text elements and their corresponding font sizes into lists"""

    fonts = []
    elements = []
    for page in extract_pages(filepath):
        for element in page:
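            if isinstance(element, LTTextContainer):
                font_size = None #reset per element so a stale size is never reused
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            font_size = character.size
                            break #first character only, assuming others in text_line have the same size
                if font_size is not None: #skip elements without characters so both lists stay aligned
                    fonts.append(font_size)
                    elements.append(element.get_text())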
        break #read first page only
    return [elements, fonts]


def extractTitlePos(fonts):
    """helper method to extract the positions having the largest font size"""
    font_pos = []
    maxFont = max(fonts)
    for pos, size in enumerate(fonts):
        if size == maxFont:
            font_pos.append(pos)
    return font_pos


def extractTitle(elements, positions):
    """helper method to extract those elements having the largest font size, then return as a joint string"""
    title = []
    for i in positions:
        title.append(elements[i])
    return " ".join(title)


def cleanTitle(title):
    """helper method to clean the title by removing \n and illegal filename symbols"""

    englishArticles = ("a", "an", "the") #to remove if start word of title
           
    title = title.strip() #remove whitespaces
    title = title.replace("\n", " ") #remove any inline \n
    title = title.replace(":", " -") #replace invalid file : symbol
    title = title.replace("?", " ") #replace invalid file ? symbol
    title = title.replace("*", "") #replace invalid symbol
    title = title.replace("@", "") #replace invalid symbol
    title = title.replace("/", " ") #replace invalid symbol
    title = " ".join(title.split()) #collapse any run of whitespace into a single space
    title = title.title() #capitalize each word

    #remove starting article if one
    titleWords = title.split(maxsplit=1) #first word and the remainder of the title
    if len(titleWords) == 2 and titleWords[0].lower() in englishArticles:
        title = titleWords[1] #remove the starting article off title

    #perform word capitalization, some keywords should be in all lower-case or upper-case, regular words are letter-capitalized
    words = title.split() #individual words in the title, each letter-capitalized

    #var to hold words that should be all lower-case
    lowercaseWords = ("a", "an", "the", "at", "to", "from", "for", "using", "of", "among", "across", "during","what", "with", "and", "or", "between", "in", "on", "is", "are", "as", "there", "under", "toward", "towards", "through", "via", "by", "based", "vs", "versus", "its", "it", "their")

    #var to hold words that should be all upper-case
    uppercaseWords = ("us", "usa", "eu", "uk", "hk", "nyse", "ftse", "hkse", "pca", "etf", "etfs", "fx", "ipo", "hft", "spx", "vix", "vxx", "adr", "adrs")

    capAdjustedWords = [] #var to hold capitalization-appropriate words

    for word in words: #check each word one by one
        if word.lower() in lowercaseWords:
            capAdjustedWords.append(word.lower()) #capitalize to lower-case
        elif word.lower() in uppercaseWords:
            capAdjustedWords.append(word.upper()) #capitalize to upper-case
        else:
            capAdjustedWords.append(word) #no capitalization
    if capAdjustedWords[0].islower():
        capAdjustedWords[0] = capAdjustedWords[0].capitalize() #capitalize in case first word is in lowercaseWords
    title = " ".join(capAdjustedWords)

    #often the final char is some special character so should be removed
    if title and not title[-1].isalnum():
        title = title[:-1] #remove the last char
   

    return title


if __name__ == "__main__":
    filepath = input("Enter the file path (using /): ")
    print(getTitle(filepath))
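One possible contributor to the misses (a guess, not a diagnosis of the failing files) is that `extractTitlePos` compares font sizes with `==`. The sizes come back as floats, so two lines of the same title can differ by a tiny amount and only one of them gets picked. A minimal sketch of a tolerant comparison follows; the helper name `pick_largest_blocks`, the tolerance value, and the sample data are all made up for illustration.

```python
def pick_largest_blocks(elements, fonts, tolerance=0.5):
    """Return the text blocks whose font size is within `tolerance` of the largest size."""
    if not fonts:
        return []
    largest = max(fonts)
    return [text for text, size in zip(elements, fonts) if largest - size <= tolerance]

# Made-up sample data standing in for the first page of a paper.
elements = ["Journal of Examples\n", "HFT Activity and\n", "Overview in 2022\n", "Abstract ...\n"]
fonts = [9.0, 17.93, 17.95, 10.0]

print(" ".join(block.strip() for block in pick_largest_blocks(elements, fonts)))
# HFT Activity and Overview in 2022
```

With an exact `==` comparison, only "Overview in 2022" would have been selected from this sample.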

Have you tried turning it off and then back on again?
