I wrote a web crawler that inserts pages and links into a database. Right now, the domain of each indexed URL is stored as an attribute in both the pages table and the links table.
I am thinking of creating a table for the domains, but I fear this will slow down insertion.
Right now, I have 1,200,000 links and 70,000 pages in the database, and this will grow.
What is the better solution? Create the domain table? Create an index on the domain attribute (it is a varchar)?
PS: Another program that I am developing will run queries against this database.
If I understood correctly, you have two tables: "links" and "pages". You say nothing about the fields in those tables. More details would be nice.
Anyhow, a fully normalized database tends to erode performance. I recommend keeping the domains as an attribute in both tables. A little redundancy might improve your performance.
One more piece of advice: instead of having one database, you might want to have two: one for inserts and updates only, and the other for read-only access (selects).
In the first DB, remove all indexes and constraints. This will give you fast insert/update operations.
In the read-only DB, design the indexes properly to make retrieval operations faster.
Of course, you have to synchronize the two databases somehow. This may require additional coding.
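To give an idea of what that extra coding might look like, here is a minimal sketch using Python's sqlite3 module. The table layout (id, url, domain), the index name and the sync() helper are all assumptions made for illustration, and the sketch only copies newly appended rows; updates and deletes would need more work.

```python
import sqlite3

# Two separate databases: one tuned for writes, one tuned for reads.
# The pages(id, url, domain) layout is an assumption for this sketch.
write_db = sqlite3.connect(":memory:")
read_db = sqlite3.connect(":memory:")

# Write side: no secondary indexes, so inserts stay cheap.
write_db.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, domain TEXT)")

# Read side: same table plus the index the query program needs.
read_db.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, domain TEXT)")
read_db.execute("CREATE INDEX idx_pages_domain ON pages (domain)")


def sync():
    """Copy rows the read DB has not seen yet (append-only; hypothetical helper)."""
    last_id = read_db.execute("SELECT COALESCE(MAX(id), 0) FROM pages").fetchone()[0]
    new_rows = write_db.execute(
        "SELECT id, url, domain FROM pages WHERE id > ?", (last_id,)
    ).fetchall()
    read_db.executemany("INSERT INTO pages (id, url, domain) VALUES (?, ?, ?)", new_rows)
    read_db.commit()
    return len(new_rows)


# The crawler writes to the first DB only.
write_db.executemany(
    "INSERT INTO pages (url, domain) VALUES (?, ?)",
    [("http://example.com/", "example.com"), ("http://example.org/a", "example.org")],
)
write_db.commit()

first_sync = sync()   # copies the two new rows
second_sync = sync()  # nothing new to copy
```

How often sync() runs is a trade-off between how fresh the read-only DB has to be and how much load the copy adds.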
You'll probably need to do some experimenting to see what kind of results you get with the different techniques. How many different domains do you have?
Keep in mind that if you create an index on the domain attribute, it will actually slow down your inserts. Indexes are good for improving select performance, but they slow down update/delete/insert operations because they are one more thing that has to be kept up to date.
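As a starting point for that experimenting, here is a small sketch in Python with sqlite3 that shows the read-side effect of such an index. The pages(id, url, domain) layout, the index name and the query_plan() helper are assumptions for illustration; your DBMS will have its own way of showing query plans, and you would still need to time your real inserts to see the write-side cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: the domain is stored as a plain text attribute on each page.
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, domain TEXT)")

rows = [
    (f"http://site{i % 50}.example/page{i}", f"site{i % 50}.example")
    for i in range(5000)
]
conn.executemany("INSERT INTO pages (url, domain) VALUES (?, ?)", rows)


def query_plan(sql, params=()):
    """Return SQLite's plan description for a statement (hypothetical helper)."""
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    # The last column of each plan row holds the human-readable detail.
    return " ".join(row[-1] for row in plan_rows)


sql = "SELECT url FROM pages WHERE domain = ?"

# Without an index, the lookup has to scan the whole table.
plan_before = query_plan(sql, ("site7.example",))

# With the index, the lookup can seek directly, but every later
# insert, update or delete also has to maintain the index.
conn.execute("CREATE INDEX idx_pages_domain ON pages (domain)")
plan_after = query_plan(sql, ("site7.example",))

print(plan_before)
print(plan_after)
```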
I would personally put the domains in a separate table if there are relatively few of them.
I don't see why you wouldn't normalize.
Sure, this will affect insertion performance slightly, but I would hope that the bottleneck (and the throttling) is at the level of the page downloads. If that weren't the case, it would indicate that you are whacking the h' out of the Internet! :-)
Typical spiders [outside of those used by large SEs, of course], even when run on multiple threads and on several machines, only produce, in total and in sustained fashion, a few dozen pages per second, which is well below the capacity of most DBMS servers, even with a bit of contention.
Also, you can expect the domains table to be relatively small, accessed frequently, mostly for reading, and therefore generally cached.
I would only consider denormalization and other tricks in the case of:
- much greater sustained insertion rate
- bigger database (say, if it's expected to grow above 100 million rows).
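For the normalized case, a rough sketch of what the insert path could look like, in Python with sqlite3. The table and column names (domains, pages, domain_id) and the two helper functions are assumptions, not your actual schema. Since the domains table is small and mostly read, the crawler can also keep its own in-memory map of domain to id, so most page inserts never touch the domains table at all.

```python
import sqlite3
from urllib.parse import urlsplit

conn = sqlite3.connect(":memory:")
# Assumed normalized layout: each domain is stored once, pages point to it.
conn.executescript("""
CREATE TABLE domains (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE pages (
    id        INTEGER PRIMARY KEY,
    url       TEXT NOT NULL,
    domain_id INTEGER NOT NULL REFERENCES domains(id)
);
""")

# In-process cache: domain name -> id.
domain_cache = {}


def domain_id_for(url):
    """Return the id of the URL's domain, creating the row on first sight."""
    name = urlsplit(url).netloc.lower()
    if name not in domain_cache:
        conn.execute("INSERT OR IGNORE INTO domains (name) VALUES (?)", (name,))
        row = conn.execute("SELECT id FROM domains WHERE name = ?", (name,)).fetchone()
        domain_cache[name] = row[0]
    return domain_cache[name]


def insert_page(url):
    conn.execute(
        "INSERT INTO pages (url, domain_id) VALUES (?, ?)",
        (url, domain_id_for(url)),
    )


for url in [
    "http://example.com/",
    "http://example.com/about",
    "http://example.org/index.html",
]:
    insert_page(url)
conn.commit()
```

With this shape, the extra cost of normalizing is one dictionary lookup per page, plus one insert and one select the first time a domain shows up.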
Assuming your database design is like so:
Page: Id | URL
Link: Id | Page_Id | URL
If there is a lot of reuse of URLs (as with TVTropes), I would definitely change the design to:
Domain: Id | URL
Page: Id | URL_Id
Link: Id | Page_Id | URL_Id
When you go to do your data mining, I would then recommend an index on URL, in addition to all the usual ones.
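Here is a sketch of that second design in Python with sqlite3, just to make it concrete. The table and column names follow the outline above; the column types, the choice to make the URL index unique, and the url_id(), add_page() and add_link() helpers are my assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Domain (
    Id  INTEGER PRIMARY KEY,
    URL TEXT NOT NULL
);
-- Index on URL; making it UNIQUE is an assumption so each URL is stored once.
CREATE UNIQUE INDEX idx_domain_url ON Domain (URL);

CREATE TABLE Page (
    Id     INTEGER PRIMARY KEY,
    URL_Id INTEGER NOT NULL REFERENCES Domain(Id)
);
CREATE TABLE Link (
    Id      INTEGER PRIMARY KEY,
    Page_Id INTEGER NOT NULL REFERENCES Page(Id),
    URL_Id  INTEGER NOT NULL REFERENCES Domain(Id)
);
""")


def url_id(url):
    """Return the Id for a URL, storing the URL only the first time it is seen."""
    conn.execute("INSERT OR IGNORE INTO Domain (URL) VALUES (?)", (url,))
    return conn.execute("SELECT Id FROM Domain WHERE URL = ?", (url,)).fetchone()[0]


def add_page(url):
    return conn.execute("INSERT INTO Page (URL_Id) VALUES (?)", (url_id(url),)).lastrowid


def add_link(page_id, target_url):
    conn.execute(
        "INSERT INTO Link (Page_Id, URL_Id) VALUES (?, ?)",
        (page_id, url_id(target_url)),
    )


home = add_page("http://example.com/")
about = add_page("http://example.com/about")
add_link(home, "http://example.com/about")
add_link(about, "http://example.com/")
add_link(about, "http://example.com/about")
conn.commit()

# Example data mining query: how many links point at each URL?
inbound = conn.execute("""
    SELECT d.URL, COUNT(*)
    FROM Link l JOIN Domain d ON d.Id = l.URL_Id
    GROUP BY d.URL
    ORDER BY d.URL
""").fetchall()
```

Every page and link row then carries only an integer, and the URL text itself is stored once no matter how often it is linked to.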
If space is an issue (more so than insert or retrieval times), and there are many levels to your URLs (deep folder structures), you could try this:
Domain: Id | Parent_Id | URL_Part
Page: Id | URL_Id
Link: Id | Page_Id | URL_Id
This will of course require a recursive query to assemble the URL, but the data mining prospects with this are immense. Without knowing more about your actual design (and your intended use), there isn't much more I can really propose, though.
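For what it's worth, the recursive query could look roughly like this, sketched in Python with sqlite3 and a recursive common table expression. The sample rows, the "/" separator and the assemble_url() helper are assumptions for illustration; other DBMSs have their own recursive query syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Domain (
    Id        INTEGER PRIMARY KEY,
    Parent_Id INTEGER REFERENCES Domain(Id),  -- NULL for the top-level part
    URL_Part  TEXT NOT NULL
)
""")
# Sample tree: example.com -> wiki -> Main, and example.com -> about
conn.executemany(
    "INSERT INTO Domain (Id, Parent_Id, URL_Part) VALUES (?, ?, ?)",
    [
        (1, None, "example.com"),
        (2, 1, "wiki"),
        (3, 2, "Main"),
        (4, 1, "about"),
    ],
)


def assemble_url(part_id):
    """Walk from a leaf part up to the root and join the parts (hypothetical helper)."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(Id, Parent_Id, URL_Part, depth) AS (
            SELECT Id, Parent_Id, URL_Part, 0 FROM Domain WHERE Id = ?
            UNION ALL
            SELECT d.Id, d.Parent_Id, d.URL_Part, c.depth + 1
            FROM Domain d JOIN chain c ON d.Id = c.Parent_Id
        )
        SELECT URL_Part FROM chain ORDER BY depth DESC
        """,
        (part_id,),
    ).fetchall()
    # Highest depth is the root, so the parts come back in left-to-right order.
    return "/".join(part for (part,) in rows)


print(assemble_url(3))
print(assemble_url(4))
```

Shared prefixes such as "example.com" are stored once, which is where the space saving comes from; the price is one recursive walk per URL you need to rebuild.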