The basic idea of the app is to monitor RSS feeds of stock market news; the feed entries usually link to web pages.
From each linked web page, collect the text and skip images and other data-heavy content.
Then save the pages to disk in a compressed format with timestamps.
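A minimal sketch of the storage step as I picture it, using only the Python standard library. The function names `save_page` and `load_page`, the file naming scheme (UTC fetch time plus a short hash of the URL), and the choice of gzip are my own assumptions, not a fixed design.

```python
import gzip
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def save_page(html: str, url: str, out_dir: Path) -> Path:
    """Write one fetched page as a gzip file named by fetch time and URL hash."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # UTC timestamp so that file names sort in fetch order.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # A short hash keeps the name filesystem-safe and tells pages apart.
    url_id = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    path = out_dir / f"{stamp}_{url_id}.html.gz"
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(html)
    return path


def load_page(path: Path) -> str:
    """Read a page back, for the viewer or for re-running the filter later."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read()
```

The original URL is not recoverable from the hash, so a real version would probably also need a small index file mapping file names to URLs.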
Later, open a viewer and mark the article text area on a few example pages from the same site. The program then finds the same area automatically in the other saved pages with similar HTML code and saves only the article content to disk, without the rest of the HTML.
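To make the "learn from a few examples" step concrete, here is a rough sketch under strong assumptions: the example snippets come from at least two different paragraphs of one article, and the site marks the article container with a stable tag and class. The names `TextNodes`, `learn_container` and `extract_article` are hypothetical, and only the standard library `html.parser` is used.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class TextNodes(HTMLParser):
    """Collect (path, text) pairs; path is the chain of open elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # open elements as (tag, class, serial number)
        self.nodes = []
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never get a closing tag
        self.count += 1
        cls = dict(attrs).get("class") or ""
        self.stack.append((tag, cls, self.count))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not any(t in SKIP for t, _, _ in self.stack):
            self.nodes.append((tuple(self.stack), text))


def text_nodes(html):
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_container(html, examples):
    """Return (tag, class) of the deepest element holding all example texts."""
    paths = [p for p, t in text_nodes(html) if any(e in t for e in examples)]
    if not paths:
        return None
    common = paths[0]
    for path in paths[1:]:
        n = 0
        while n < min(len(common), len(path)) and common[n] == path[n]:
            n += 1
        common = common[:n]
    return common[-1][:2] if common else None


def extract_article(html, container):
    """Join the text of every node that sits inside a matching container."""
    return "\n".join(
        text for path, text in text_nodes(html)
        if any((tag, cls) == container for tag, cls, _ in path)
    )
```

Usage would be `container = learn_container(example_html, ["first snippet", "second snippet"])` once per site, then `extract_article(other_html, container)` for every other saved page. If the site changes its class names, `extract_article` returns an empty string, which could be the signal to ask for new examples.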
Keeping both formats on disk (filtered articles and full HTML pages) would allow later repairs if things go wrong.
If a site's code has changed, or a problem is detected in the article filtering, the program asks me to mark the text area on a few examples again and then finds the rest automatically.
Maybe I am reinventing the wheel here and similar software is available?
Does anyone have a good idea for a method to easily filter out news articles from web pages for automatic saving?
Ideally this would be fully automatic, but that may not be possible without machine learning and enough manually configured sample data.
One idea is to search the HTML code to filter out the text.
A better idea might be to combine HTML code filtering with the visual space ratios of the rendered page.
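For the HTML-only half of that idea, the kind of heuristic I have in mind looks roughly like this: credit every piece of text to its nearest enclosing container element and pick the container with the most text that is not inside links. This is a simple text-density guess of my own, not an established algorithm; `BlockScorer`, `guess_article_block` and the `CONTAINERS` set are hypothetical names and choices. The visual space ratio part would need a rendering engine and is not covered.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
CONTAINERS = {"div", "article", "section", "main", "td"}


class BlockScorer(HTMLParser):
    """Count text and link-text characters per container element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open elements
        self.blocks = []  # every container element seen so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = {"tag": tag, "class": dict(attrs).get("class") or "",
                "text": 0, "link": 0}
        self.stack.append(node)
        if tag in CONTAINERS:
            self.blocks.append(node)

    def handle_endtag(self, tag):
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        n = len(data.strip())
        tags = [node["tag"] for node in self.stack]
        if n == 0 or "script" in tags or "style" in tags:
            return
        # Credit the text to the nearest enclosing container only.
        for node in reversed(self.stack):
            if node["tag"] in CONTAINERS:
                node["text"] += n
                if "a" in tags:
                    node["link"] += n
                break


def guess_article_block(html):
    """Return (tag, class) of the container with the most non-link text."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    best = max(parser.blocks, key=lambda b: b["text"] - b["link"], default=None)
    if best is None or best["text"] - best["link"] <= 0:
        return None
    return best["tag"], best["class"]
```

Navigation bars and link lists score near zero because almost all of their text sits inside `<a>` elements, while article bodies are mostly plain paragraphs. Comment sections and long footers would probably fool it, which is why I would still want the manual examples as a fallback.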
What would be the easiest way to connect text positions on the rendered web page to positions in the HTML code?
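For the source side at least, one partial approach I can see is that Python's `html.parser.HTMLParser` reports the current line and column through `getpos()`, so a text snippet selected in a viewer could be traced back to where it starts in the saved HTML. The sketch below assumes exactly that; `PositionedText` and `locate` are hypothetical names, and mapping to on-screen coordinates of the rendered page would still need a browser engine.

```python
from html.parser import HTMLParser


class PositionedText(HTMLParser):
    """Record where each non-empty text chunk starts in the HTML source."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hits = []  # (line, column, text)

    def handle_data(self, data):
        if data.strip():
            # getpos() gives the 1-based line and 0-based column at which
            # this chunk starts, including any leading whitespace.
            line, col = self.getpos()
            self.hits.append((line, col, data.strip()))


def locate(html, snippet):
    """Return the source positions of text chunks containing the snippet."""
    parser = PositionedText()
    parser.feed(html)
    parser.close()
    return [(line, col) for line, col, text in parser.hits if snippet in text]
```

A snippet that spans several elements (for example text broken up by `<b>` or `<a>` tags) arrives as separate chunks, so it would have to be matched piece by piece.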
I am sure someone here has done something like this, or similar software is already available.