First of all, I hope my question doesn't bother you. I really need to get an idea of how I can accomplish this, but unfortunately I am a real beginner, and I am still crawling when it comes to programming. I am struggling to understand it as best I can. I'll appreciate any help you can give.
Here's the job: I was ordered to find a way to collect some data from a website using a C# application. This has to be done every day, in order to update the data we'll use to calculate a financial index.
I know my question may seem vague; even telling me how I could be more precise would help. I know I sound desperate, but, putting aside all the personal issues, my scholarship kind of depends on it.
Thanks in advance! (Please don't mind the bad English; I am Brazilian and my English is not that good yet.)
First, your English is fine. In fact, I thought you were a native speaker until you said otherwise.
The term you are looking for is 'web scraping' (sometimes called 'site scraping' or 'screen scraping'). See this question: http://stackoverflow.com/questions/2861/options-for-html-scraping. The second answer suggests the Html Agility Pack library, which you can use.
Now, there are two possibilities here. The first is that you have to parse the HTML and scrape your data out of it. This is more computationally intensive and depends on the layout of the page. If they change how the site looks, it may break the scraper.
The second possibility is that they offer some XML or JSON web service you can consume. In this case you are not scraping anything, but are instead using a true data feed. If the layout of the site changes, your code won't break. Whether your target site offers this kind of data feed is up to the site.
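To make the data-feed option concrete, here is a minimal sketch of consuming a JSON feed with HttpClient and System.Text.Json. The feed shape (an array of objects with "symbol" and "price" fields) and the names FeedClient, ParseQuotes and FetchQuotesAsync are my own assumptions for illustration; a real site's feed will look different, so check its documentation first.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class FeedClient
{
    // Assumed (hypothetical) feed shape: [{"symbol":"ABC","price":12.5}, ...]
    public static Dictionary<string, decimal> ParseQuotes(string json)
    {
        var quotes = new Dictionary<string, decimal>();
        using JsonDocument doc = JsonDocument.Parse(json);
        foreach (JsonElement item in doc.RootElement.EnumerateArray())
        {
            string symbol = item.GetProperty("symbol").GetString() ?? "";
            quotes[symbol] = item.GetProperty("price").GetDecimal();
        }
        return quotes;
    }

    // Downloads the feed from a caller-supplied URL and parses it.
    public static async Task<Dictionary<string, decimal>> FetchQuotesAsync(string url)
    {
        using var http = new HttpClient();
        string json = await http.GetStringAsync(url);
        return ParseQuotes(json);
    }
}
```

Keeping the parsing separate from the download means you can try ParseQuotes on a saved sample of the feed before pointing the program at the live site.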
If I understand your question, you are being asked to do some web scraping, in which you 1) download the contents of a web page and 2) try to parse data out of that content.
For step #1, you should look at using a WebClient object in C# to download the HTML of the web page. You can give a WebClient object the URL you want to download the content from and get back a String containing the content (most likely HTML) at that URL.
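Step #1 could look roughly like the sketch below. The class name PageDownloader and the example.com URL are placeholders of mine, not your actual site. Note that on .NET 6 and later WebClient still works but is marked obsolete in favour of HttpClient, so you may see a compiler warning.

```csharp
using System;
using System.Net;

public static class PageDownloader
{
    // Returns the content at the given URL as a string (most likely HTML).
    public static string Download(string url)
    {
        using var client = new WebClient();
        return client.DownloadString(url);
    }

    public static void Main()
    {
        // Placeholder URL; replace it with the page you need.
        string html = Download("https://example.com/");
        Console.WriteLine($"Downloaded {html.Length} characters.");
    }
}
```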
How you go about step #2 depends on what content is present at the site. If you know of certain patterns you are looking for in the HTML, you can search the HTML string using various techniques. A more general solution for parsing HTML data is the Html Agility Pack, which lets you work with the HTML as a tree structure (DOM).
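As a rough illustration of step #2 with the Html Agility Pack (a third-party NuGet package, so it has to be installed first), the sketch below pulls the text out of every table cell on a page. I am assuming the data you want sits in an HTML table; the class name TableScraper and the XPath expressions are only examples and will need adjusting to the real page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

public static class TableScraper
{
    // Returns the text of each <td> cell, grouped by table row.
    public static List<List<string>> ExtractTableRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();
        // SelectNodes returns null when nothing matches the XPath.
        var rowNodes = doc.DocumentNode.SelectNodes("//table//tr");
        if (rowNodes == null) return rows;

        foreach (HtmlNode row in rowNodes)
        {
            var cellNodes = row.SelectNodes("./td");
            if (cellNodes == null) continue; // e.g. header rows that only contain <th>

            var cells = new List<string>();
            foreach (HtmlNode cell in cellNodes)
                cells.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
            rows.Add(cells);
        }
        return rows;
    }
}
```

Because the XPath is tied to the page layout, this is exactly the part that breaks when the site is redesigned, so it helps to keep it in one small method.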
OK, this is a pretty straightforward application design, and a lot of the code you need already exists and can be reused. Since you are a beginner, I'll break it down into the steps you need to take and recommend approaches.
1) You'll use classes from System.Net to pull the web pages (WebClient being the simplest to use). You'll want this part of the program to run on a timer if you can (using the scheduled jobs feature of the OS) and have it simply pull the web pages and drop them in a folder.
2) You have a second job that runs separately, pulling unread files from that folder, parsing them (the Html Agility Pack library is best for this) and then storing them in an index of some kind (Lucene is best for that).
3) You have a front-end application of some kind (web or desktop) which queries that index for the information you are looking for.
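The file hand-off between jobs 1 and 2 can be sketched as below. This is only one way to do it, under my own assumptions: an "inbox" folder for freshly downloaded pages, a "processed" folder for pages already handled, and a caller-supplied parseAndStore callback standing in for the parsing and indexing. The names ScrapePipeline, SavePage and ProcessUnread are made up for illustration, and the Lucene index and the front end from step 3 are left out.

```csharp
using System;
using System.IO;

public static class ScrapePipeline
{
    // Job 1: save a downloaded page into the inbox under a timestamped name.
    public static string SavePage(string inboxDir, string html, DateTime timestampUtc)
    {
        Directory.CreateDirectory(inboxDir);
        string path = Path.Combine(inboxDir, $"page-{timestampUtc:yyyyMMdd-HHmmss}.html");
        File.WriteAllText(path, html);
        return path;
    }

    // Job 2: pass each unread file to the parser, then move it out of the inbox
    // so it is not picked up again on the next run. Returns the number of files handled.
    public static int ProcessUnread(string inboxDir, string processedDir, Action<string> parseAndStore)
    {
        Directory.CreateDirectory(inboxDir);
        Directory.CreateDirectory(processedDir);
        int count = 0;
        foreach (string file in Directory.GetFiles(inboxDir, "*.html"))
        {
            parseAndStore(File.ReadAllText(file));
            string target = Path.Combine(processedDir, Path.GetFileName(file));
            if (File.Exists(target)) File.Delete(target);
            File.Move(file, target);
            count++;
        }
        return count;
    }
}
```

Each job can then be a small console program that the OS scheduler runs once a day: the first calls SavePage with whatever it downloaded, the second calls ProcessUnread with your parsing code.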