Getting HTML via Browser OK, via wget ERROR. Why?

thecoder · Nov 5, 2021

Hi,
the following URL works in the web browser:
https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05
But trying to get the html file programmatically with the command-line tool "wget" via a script file fails:

Code:

wget -O "earnings.html"  "https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05"

What follows is the output of wget:

Code:

--2021-11-06 00:16:57--  https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05
Connecting to 192.168.20.1:8118... connected.
Proxy request sent, awaiting response... 404 Not Found
2021-11-06 00:16:57 ERROR 404: Not Found.

wget normally functions well, but not with this URL. :-(
What's missing?

DaveV · Nov 5, 2021

Yahoo is probably trying to prevent webscraping. Try adding the wget option

"--user-agent=Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"

to trick Yahoo into thinking the request is coming from a browser.
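Put together with the original command, this might look like the sketch below. The user-agent string is the one suggested above; the function names `build_url` and `fetch_earnings` are made up for illustration, and whether Yahoo keeps accepting this particular string is an assumption.

Code:

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical, not part of wget.

# User-agent string from the suggestion above; any browser-like string could be tried.
UA='Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'

# Build the earnings-calendar URL from a week range and a day (all YYYY-MM-DD).
# $1 = from, $2 = to, $3 = day
build_url() {
    printf 'https://finance.yahoo.com/calendar/earnings?from=%s&to=%s&day=%s\n' "$1" "$2" "$3"
}

# Download one page while sending the browser-like user agent.
# $1 = URL, $2 = output file. Both are quoted so the shell never interprets "&".
fetch_earnings() {
    wget --user-agent="$UA" -O "$2" "$1"
}
```

A call would then be `fetch_earnings "$(build_url 2021-10-31 2021-11-06 2021-11-05)" earnings.html`.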

WiktorK · Nov 6, 2021

thecoder said:
Hi,
the following URL works in the web browser:
https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05
But trying to get the html file programmatically with the command-line tool "wget" via a script file fails:

Code:

wget -O "earnings.html" "https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05"

What follows is the output of wget:

Code:

--2021-11-06 00:16:57--  https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05
Connecting to 192.168.20.1:8118... connected.
Proxy request sent, awaiting response... 404 Not Found
2021-11-06 00:16:57 ERROR 404: Not Found.

wget normally functions well, but not with this URL. :-(
What's missing?

I may be wrong here, but you are trying to download a file when there is none. Yahoo does not provide anything "downloadable" here.

If you want to get the full content of this website, you can do so with:

Code:

curl "https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05" > output.html

This will write the full page content to a file, which can be scraped later.
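For use in a script, a slightly more defensive variant could look like the sketch below. It quotes the URL (an unquoted `&` is treated by the shell as a command separator), asks curl to follow redirects and to fail on HTTP errors, and then checks that the saved file looks like HTML. The function names `fetch_page` and `looks_like_html` are hypothetical, and the `<html` check is only a crude heuristic, not a guarantee that the page holds the data you want.

Code:

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical.

# Fetch URL $1 into file $2.
#   -f   exit non-zero on HTTP errors (e.g. 404) instead of saving the error page
#   -sS  stay quiet, but still print error messages
#   -L   follow redirects
fetch_page() {
    curl -fsSL -o "$2" "$1"
}

# Crude sanity check: does file $1 contain an "<html" tag (case-insensitive)?
looks_like_html() {
    grep -qi '<html' "$1"
}
```

A call would then be `fetch_page "https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05" output.html && looks_like_html output.html`.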

thecoder · Nov 6, 2021

Both of the above suggestions by @DaveV and @WiktorK have worked here under Linux. Thanks to all; you saved my day :-)

d08 · Nov 6, 2021

Modern websites have all sorts of scraping protection. Previously, Yahoo had a key-value pair sent and received by JavaScript; maybe they've gotten rid of that. Selenium (not headless) comes to the rescue by mimicking actual human behavior.

DaveV · Nov 6, 2021

d08 said:
Modern websites have all sorts of scraping protection. Previously, Yahoo had a key-value pair sent and received by JavaScript; maybe they've gotten rid of that. Selenium (not headless) comes to the rescue by mimicking actual human behavior.

I webscrape over 2,000 web pages a day to update my database. In the past I used Selenium, but switched two years ago to Google's Puppeteer. I have yet to find a webpage that Puppeteer cannot handle. I even have one site where I have to click 4 buttons, then simulate a Save-As to get the data.

d08 · Nov 6, 2021

DaveV said:
I webscrape over 2,000 web pages a day to update my database. In the past I used Selenium, but switched two years ago to Google's Puppeteer. I have yet to find a webpage that Puppeteer cannot handle. I even have one site where I have to click 4 buttons, then simulate a Save-As to get the data.

I'm sure Puppeteer looks fine as well. I've never seen a website that has beaten Selenium, as that's technically impossible: if a user can see it, Selenium is able to get it. I've noticed that people who claim it cannot scrape a site usually don't know how the site's JavaScript works.