Getting HTML via Browser OK, via wget ERROR. Why?

Hi,
the following URL works in the web browser:
https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05
But trying to get the html file programmatically with the command-line tool "wget" via a script file fails:
Code:
wget -O "earnings.html"  "https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05"
What follows is the output of wget:
Code:
--2021-11-06 00:16:57--  https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05
Connecting to 192.168.20.1:8118... connected.
Proxy request sent, awaiting response... 404 Not Found
2021-11-06 00:16:57 ERROR 404: Not Found.
wget normally functions well, but not with this URL. :-(
What's missing?
 
I may be wrong here, but you are trying to download a file when there is none. Yahoo does not provide anything "downloadable" here.

If you want to get the full content of this website, you can do so with:

Code:
curl "https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05" > output.html

This will write the full page content to a file, which can be scraped later :)
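
A minimal sketch along those lines, in case it helps with narrowing things down. The URL has to be quoted, otherwise the shell treats each & as the end of a command. The wget log in the question shows the request going through a proxy at 192.168.20.1:8118, and some sites answer differently depending on the User-Agent, so the sketch lets you vary both. The function name fetch_page and the User-Agent string are made up for illustration, and the idea that the proxy or the user agent is behind the 404 is only a guess, not something confirmed in this thread.

Code:
```shell
#!/bin/sh
# Quote the URL: an unquoted & would end the command and run it in the background.
URL='https://finance.yahoo.com/calendar/earnings?from=2021-10-31&to=2021-11-06&day=2021-11-05'

# Assumption: a browser-like User-Agent string; any realistic value will do.
UA='Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0'

# fetch_page OUTFILE [extra curl options...]
# Saves the response body to OUTFILE and prints only the HTTP status code.
fetch_page() {
    out=$1
    shift
    curl --silent --location --user-agent "$UA" \
         --output "$out" --write-out '%{http_code}\n' "$@" "$URL"
}

# Usage (not run here):
#   fetch_page output.html                 # through the configured proxy
#   fetch_page output.html --noproxy '*'   # bypassing the proxy
```

Comparing the status code with and without the proxy should show where the 404 comes from. If you prefer to stay with wget, the corresponding options are --no-proxy and --user-agent.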
 
Modern websites have all sorts of scraping protection. Previously Yahoo had a key-value pair sent and received by JavaScript; maybe they've gotten rid of that. Selenium (not headless) comes to the rescue by mimicking actual human behavior.
 
I webscrape over 2,000 web pages a day to update my database. In the past I used Selenium, but switched two years ago to Google's Puppeteer. I have yet to find a webpage that Puppeteer cannot handle. I even have one site where I have to click 4 buttons, then simulate a Save-As to get the data.
 
I'm sure Puppeteer looks fine as well. I have never seen a website that has beaten Selenium, since that is technically impossible: if a user can see it, Selenium can get it. I've noticed that people who claim it cannot scrape a site usually don't know how the site's JavaScript works.
 