I know this may be a very generic and subjective question, so feel free to vote to close it if it doesn't meet the StackOverflow netiquette, but for me it's worth a try :)
I have never built a high-traffic application until now, so apart from some reading on the internet I know little about scaling practices.
How do I design a database so that, when scaling is required, I do not have to refactor the database structure or the application code?
I know that development (and optimisation) should proceed step by step, fixing bottlenecks as they appear, and that it is very hard to design an ideal structure when you don't know how many users you will have or how they will use the database (e.g. the read/write ratio). I am just looking for a good base to start from.
Do you know of guidelines for making a structure almost ready to be scaled with sharding, and which hacks should absolutely be avoided?
Edit: some details about my application:
- The application will run in a multisite fashion
- I'll have a database for every application version (db___1, db___2, etc.)*
- Every 'site' will have a schema in the database* and a role that can access only its own schema
- Application code will be mostly PHP, with a few things (daemons and maintenance tasks) in Python
- The web server will probably be Nginx, with lighttpd or node.js as support for long-polling tasks (e.g. chat)
- Caching will be done with memcached (plus APC for things strictly related to the PHP code, since APC is only available inside PHP)
The question is really generic, but here are a few tips:
Don't use any session variables (pg_backend_pid(), inet_client_addr()) or per-session control (SET ROLE, SET SESSION) in application code.
Don't use explicit transaction control (BEGIN/COMMIT/SET TRANSACTION) in application code. This kind of logic should be wrapped in UDFs. That enables stateless, statement-mode pooling, which is the fastest possible DB pooling (see the pgbouncer docs and the pg wiki for more information).
Encapsulate all App<->DB communication in a well-defined DB API of UDFs, which will let you use PL/Proxy. If doing this for all SELECTs is too hard, do it at least for all data writes (INSERT/UPDATE/DELETE). Example: instead of
INSERT INTO users(name) VALUES('Joe')
you'll need a single call to a UDF that performs the insert.
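On the application side, the "DB API of UDFs" idea could look roughly like the sketch below. It is only an illustration: `create_user` is a hypothetical UDF name, `UserApi` and `RecordingExecutor` are made up for this example, and the executor stands in for whatever driver call you would really use. The point is that each operation is exactly one statement, with no BEGIN/COMMIT and no session state, so it stays compatible with statement-mode pooling.

```python
from typing import Any, Callable, List, Sequence, Tuple

# An executor runs exactly one SQL statement with its parameters and
# returns the result rows. In a real application this would wrap a
# driver call on a pooled connection (assumption, not shown here).
Executor = Callable[[str, Sequence[Any]], List[Tuple[Any, ...]]]


class UserApi:
    """Thin wrapper that talks to the database only through UDF calls."""

    def __init__(self, execute: Executor) -> None:
        self._execute = execute

    def create_user(self, name: str) -> Any:
        # create_user() is a hypothetical UDF. Validation, multi-step
        # writes and transaction logic would live inside it, not here.
        rows = self._execute("SELECT create_user(%s)", (name,))
        return rows[0][0]


class RecordingExecutor:
    """Stand-in for a real driver, used to show what would be sent."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Tuple[Any, ...]]] = []

    def __call__(self, sql: str, params: Sequence[Any]) -> List[Tuple[Any, ...]]:
        self.calls.append((sql, tuple(params)))
        # Pretend the UDF returned the id of the new row.
        return [(len(self.calls),)]


executor = RecordingExecutor()
api = UserApi(executor)
new_id = api.create_user("Joe")
```

Because the application only knows the function name and its arguments, the table layout behind it can change (or be moved behind PL/Proxy) without touching the calling code.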
Look at your DB schema: is it easy to separate all data belonging to a given user? (Most likely this is the partitioning key.) All that's left is common, shared data that will have to be replicated to all nodes.
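To make the partitioning-key idea concrete, here is a minimal sketch of routing by user id. The node names, the `SHARED_TABLES` set and both function names are assumptions for illustration only, and plain hash-modulo is just one simple routing scheme (it remaps most users when the number of nodes changes).

```python
import hashlib
from typing import List, Sequence

# Hypothetical node names and shared tables, for illustration only.
NODES = ["db_node_0", "db_node_1", "db_node_2"]
SHARED_TABLES = {"countries", "plans"}


def shard_for(user_id: int, nodes: Sequence[str] = NODES) -> str:
    """Pick the node that owns all data for one user (the partitioning key)."""
    # Use a stable hash so every process maps a user to the same node.
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(nodes)
    return nodes[bucket]


def write_targets(table: str, user_id: int, nodes: Sequence[str] = NODES) -> List[str]:
    """Return the nodes a write must reach."""
    if table in SHARED_TABLES:
        # Shared data is replicated, so a write goes to every node.
        return list(nodes)
    # User-owned data lives on exactly one node.
    return [shard_for(user_id, nodes)]
```

If a query for one user only ever needs tables that are either keyed by that user or replicated everywhere, it can be answered by a single node, which is the property you want before sharding.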
Think about caching before you need it. What will be the caching key? What will be the cache timeout? Will you use memcached?
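The two caching questions (key and timeout) can be answered early even without a cache server. The sketch below uses a small in-memory class as a stand-in for memcached; `cache_key`, `TtlCache` and `get_user` are hypothetical names, and the key layout and 60-second timeout are assumptions, not recommendations.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


def cache_key(site: str, entity: str, entity_id: int, version: int = 1) -> str:
    """Build a key that is unique per site, entity type and id."""
    # The version part lets you invalidate old entries by bumping it.
    return f"{site}:{entity}:v{version}:{entity_id}"


class TtlCache:
    """In-memory stand-in for memcached with per-entry timeouts."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is past its timeout: drop it and report a miss.
            del self._items[key]
            return None
        return value


def get_user(cache: TtlCache, load: Callable[[int], Any], site: str,
             user_id: int, ttl_seconds: float = 60.0) -> Any:
    """Read-through lookup: try the cache first, fall back to the loader."""
    key = cache_key(site, "user", user_id)
    value = cache.get(key)
    if value is None:
        value = load(user_id)
        cache.set(key, value, ttl_seconds)
    return value
```

Including the site in the key matters in a multisite setup, since two sites can have a user with the same id.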
While it doesn't answer every part of the question, I found this very informative, particularly the section on multiple authors.