Any ideas for extracting the title of an article?

I have a bunch of academic papers on my computer that need organising.
I need to extract their titles, but have not found a reliable method yet.
Any ideas?
Usually the title has the largest font on the first page, so I used Python (with the pdfminer module) to pick it out, but that only works 50-60% of the time.
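
A minimal sketch of that largest-font heuristic, for reference. It assumes the pdfminer.six package is installed; the function names `pick_title` and `first_page_lines` and the `tolerance` parameter are hypothetical, not taken from the original script. It may still pick the wrong text when something other than the title (a journal name, a drop cap) is set in the largest font.

```python
def pick_title(lines, tolerance=0.5):
    """Return the first run of largest-font lines, joined in reading order.

    `lines` is an iterable of (text, font_size) pairs for the first page.
    `tolerance` (in points) is an assumed allowance for rounding differences.
    """
    lines = [(text.strip(), size) for text, size in lines if text.strip()]
    if not lines:
        return None
    largest = max(size for _, size in lines)
    title_parts = []
    for text, size in lines:
        if largest - size <= tolerance:
            title_parts.append(text)
        elif title_parts:
            # The run of largest-font lines has ended; stop here.
            break
    return " ".join(title_parts)


def first_page_lines(pdf_path):
    """Yield (text, average font size) for each text line on page 1."""
    # Imported lazily so pick_title can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path, maxpages=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    yield line.get_text(), sum(sizes) / len(sizes)
```

Usage would be something like `pick_title(first_page_lines("paper.pdf"))`, where `paper.pdf` is a placeholder filename.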

Here is a crude method that works sometimes:
Code:
#!/bin/bash

if [[ $# -ne 1 ]]
then
    echo 1>&2 "Usage: $0 file.pdf
    finds the pdf file's title, and outputs it to the standard output"
    exit 1
fi

pdftotext -layout "${1}" - |
perl -n -e 'use warnings; use strict;
our $processingTitle; our @title;
my $l = $_;
$l =~ s/^\f//; # remove a leading form feed (page break)
if ( $processingTitle )
{
    # a blank line ends the title block
    if ( $l =~ /^\s*$/ )
    {
        last;
    }

    my ( $parts ) = $l =~ /^\s*(\S.*)/;
    $parts =~ s/\s+$//;
    push (@title, $parts);
}
elsif ( $l =~ /^\s*(\S.*)/ )
{
    # the first non-blank line starts the title
    $processingTitle = 1;
    my $parts = $1;
    $parts =~ s/\s+$//;
    push (@title, $parts);
}
END {
if ( scalar(@title) == 0 )
{
    print "NO TITLE FOUND\n";
}
else
{
    print join(" ", @title), "\n";
}
}'

For example, assuming the environment has bash, pdftotext, and perl:
https://assets.super.so/e46b77e7-ee...iles/2f2fc428-925c-4041-9f6d-bc387d904820.pdf
Code:
$ findpdftitle 2f2fc428-925c-4041-9f6d-bc387d904820.pdf
Commodity Option Implied Volatilities and the Expected Futures Returns
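
The same first-block heuristic can also be sketched in Python, for anyone who would rather not use bash and Perl. This is only an illustration: it still assumes the pdftotext command is installed and on the PATH, and the function names `first_text_block` and `pdf_title` are hypothetical.

```python
import subprocess


def first_text_block(text):
    """Join the first run of non-blank lines, like the Perl filter above."""
    title = []
    for raw in text.split("\n"):
        # strip() also removes form feeds, which pdftotext emits at page breaks.
        line = raw.strip()
        if line:
            title.append(line)
        elif title:
            # A blank line after the title has started ends the block.
            break
    return " ".join(title) if title else None


def pdf_title(pdf_path):
    """Run pdftotext on the file and return its first text block."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return first_text_block(result.stdout)
```

Like the shell version, it simply takes the first paragraph of extracted text, so it will be wrong whenever a journal header or similar text precedes the title.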
 
For the archives :-) :

[Attached image: blueraincap.png]
 
And what if he has no clue about Linux/Unix or bash? :)
 
After 27 years of running this site, I thought I had seen it all, but then you come along and get the award for being the most disrespectful, lowest-vibration dumbass of them all.

I actually feel sorry for you more than anything else because the universe is never going to reward you for putting out negative energy like that. As the last post you're ever going to make on this site, I wish you good luck with your search and your life moving forward because you're damn sure going to need it.

Until some new nick with 3 posts and 0 likes shows up and opens a new thread asking "so, I have a friend that needs coding help" lol
 
When saving such PDF downloads, I always also save the web page where I found them.
Web pages of course have a filename as well, and that one is mostly a real title, not something cryptic like 194567.pdf :).
Later, when I view the directory listing sorted by date and time, I can see what the accompanying HTML document was, open it in the browser, and so find out what the PDF is about, including its title, and where I downloaded it from...
 