Ideas for extracting the title of an article?

2rosy · Mar 14, 2024

Be grateful. Without threads like this there would be no ET.

ph1l · Mar 14, 2024

blueraincap said:
I have a bunch of academic papers on my computer that I need to organise.
I need to extract their titles, but have not found a reliable method yet.
Any ideas?
Usually, the title has the largest font on the first page, so I used Python (with the pdfminer module) to find it, but that only works 50-60% of the time.

Here is a crude method that works sometimes:

Code:

#!/bin/bash

if [[ $# -ne 1 ]]
then
    echo 1>&2 "Usage: $0 file.pdf
    finds the pdf file's title, and outputs it to the standard output"
    exit 1
fi

pdftotext -layout "${1}" - |
perl -n -e 'use warnings; use strict;
our $processingTitle; our @title;
my $l = $_;
$l =~ s/^\f//; # remove a leading form feed (page break)
if ( $processingTitle )
{
    if ( $l =~ /^\s+$/ )
    {
        last;
    }

    # \S.* rather than \S.+ so that a one-character line still matches
    my ( $parts ) = $l =~ /^\s*(\S.*)/;
    $parts =~ s/\s+$//;
    push (@title, $parts);
}
elsif ( $l =~ /^\s*(\S.*)/ )
{
    $processingTitle = 1;
    my $parts = $1;
    $parts =~ s/\s+$//;
    push (@title, $parts);
}
END {
if ( scalar(@title) == 0 )
{
    print "NO TITLE FOUND\n";
}
else
{
    print join(" ", @title), "\n";
}
}'

For example, assuming the environment has bash, pdftotext, and perl, and the script is saved as findpdftitle somewhere on the PATH:
https://assets.super.so/e46b77e7-ee...iles/2f2fc428-925c-4041-9f6d-bc387d904820.pdf

Code:

$ findpdftitle 2f2fc428-925c-4041-9f6d-bc387d904820.pdf
Commodity Option Implied Volatilities and the Expected Futures Returns
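
For comparison, the largest-font heuristic from the quoted question could look roughly like the sketch below. This is not the original poster's code. It assumes the pdfminer.six package is installed, and the names first_page_lines, pick_title, and the tolerance parameter are made up for illustration. It returns the first run of consecutive lines whose biggest character size matches the biggest size found on page one.

```python
def first_page_lines(pdf_path):
    """Return (text, largest_char_size) for each text line on page one.

    Assumes pdfminer.six is installed; imported here so that pick_title
    below can be used and tested without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    lines = []
    for page in extract_pages(pdf_path, maxpages=1):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue  # skip figures, rules, and other non-text objects
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    lines.append((line.get_text(), max(sizes)))
    return lines


def pick_title(lines, tolerance=0.5):
    """Join the first consecutive run of lines set in the largest font.

    lines is a list of (text, size) pairs in reading order; tolerance is
    a hypothetical slack (in points) for rounding differences.
    """
    cleaned = [(text.strip(), size) for text, size in lines if text.strip()]
    if not cleaned:
        return None
    largest = max(size for _, size in cleaned)
    title = []
    for text, size in cleaned:
        if abs(size - largest) <= tolerance:
            title.append(text)
        elif title:
            break  # the run of largest-font lines has ended
    return " ".join(title)
```

Calling pick_title(first_page_lines("paper.pdf")) would then give the guess. It will still pick the wrong text whenever something other than the title, a journal masthead for instance, is set in the biggest type.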

Quanto · Mar 14, 2024

For the archives :-)


hilmy83 · Mar 14, 2024

You either:

1. Lost money
2. Got cheated on
3. Off your meds
4. All the above..

blueraincap said:
Way more than the number of cocks you have sucked

Quanto · Mar 14, 2024

ph1l said:

Here is a crude method that works sometimes:


And what if he has no clue about Linux/Unix or bash?
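
For anyone in that position, the same first-text-block heuristic can be sketched in Python, which the original question already uses. This is a rough port under assumptions, not a drop-in replacement: the function names are hypothetical, and the text is taken from pdfminer.six (extract_text) instead of pdftotext -layout, so line breaks may differ from what the perl version sees.

```python
def first_text_block(text):
    """Return the first block of non-blank lines, joined with spaces.

    Mirrors the perl logic: skip leading blank lines, collect lines
    until the next blank line, and return None if nothing was found.
    """
    title = []
    for raw in text.split("\n"):
        line = raw.strip()  # strip() also removes form feeds (page breaks)
        if line:
            title.append(line)
        elif title:
            break  # first blank line after the title block
    return " ".join(title) if title else None


def find_pdf_title(pdf_path):
    """Guess the title of a PDF from the text of its first page.

    Assumes pdfminer.six is installed; imported lazily so that
    first_text_block can be used on any text without it.
    """
    from pdfminer.high_level import extract_text

    return first_text_block(extract_text(pdf_path, maxpages=1))
```

Running find_pdf_title on a file works on any system with Python and pdfminer.six, but like the shell version it only works when the title really is the first block of text on the page.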

Quanto · Mar 14, 2024

hilmy83 said:
You either:

1. Lost money
2. Got cheated on
3. Off your meds
4. All the above..


I think he was "high", i.e. under the influence of drugs and alcohol...

hilmy83 · Mar 14, 2024

Baron said:
After 27 years of running this site, I thought I had seen it all, but then you come along and get the award for being the most disrespectful, lowest-vibration dumbass of them all.

I actually feel sorry for you more than anything else because the universe is never going to reward you for putting out negative energy like that. As the last post you're ever going to make on this site, I wish you good luck with your search and your life moving forward because you're damn sure going to need it.

Until some new nick with 3 posts and 0 likes shows up and opens a new thread asking "so, I have a friend who needs coding help" lol

Quanto · Mar 14, 2024

When saving such PDF downloads, I always save the web page where I found the file as well.
Web pages of course have a filename too, and that one is usually a real title, not something cryptic like 194567.pdf.

Later, when I view the directory listing sorted by date and time, I can see which HTML document accompanies the PDF, open it in the browser, and so find out what the PDF is about, including its title and where I downloaded it from.
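
That pairing by timestamp could also be automated along these lines. This is just a sketch under assumptions: the function names and the max_gap_seconds cutoff are hypothetical, the pages are assumed to be saved as .html or .htm files in the same folder as the PDFs, and files are matched by modification time only.

```python
import html
import re
from pathlib import Path

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def html_title(path):
    """Return the <title> text of a saved web page, or None if absent."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    match = TITLE_RE.search(text)
    if not match:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(html.unescape(match.group(1)).split())


def pair_pdfs_with_pages(folder, max_gap_seconds=300):
    """Map each PDF name to (page filename, page title) of the saved page
    closest in modification time, or to None if none is close enough."""
    files = sorted(Path(folder).iterdir())
    pages = [(p.stat().st_mtime, p) for p in files
             if p.suffix.lower() in (".html", ".htm")]
    result = {}
    for pdf in (p for p in files if p.suffix.lower() == ".pdf"):
        saved_at = pdf.stat().st_mtime
        best = min(pages, key=lambda item: abs(item[0] - saved_at),
                   default=None)
        if best is not None and abs(best[0] - saved_at) <= max_gap_seconds:
            result[pdf.name] = (best[1].name, html_title(best[1]))
        else:
            result[pdf.name] = None
    return result
```

The page filename is usually enough on its own; the title tag is read as a second opinion for pages that were saved under an unhelpful name.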

ajacobson · Mar 14, 2024

Baron should ban the poster and at the bare minimum, we all should block him.

Zwaen · Mar 14, 2024

ajacobson said:
Baron should ban the poster and at the bare minimum, we all should block him.

It’s weird, he seemed like a normal guy before. The world is changing.