We're creating a MySQL table to track the number of followers each day for tens of thousands of Twitter accounts. We have been struggling to determine the best way to store this data. The two options we're considering are:

1) OPTION 1 - Table with columns: Twitter ID, Month, Day1, Day2, Day3, etc., where each day column contains the number of followers for that account on that day of the specified month
2) OPTION 2 - Table with columns: Twitter ID, Day, Followers

Option 1 would lead to about 30x fewer rows than Option 2. What I am unsure about from a performance perspective is whether it is better to have fewer columns or fewer rows.

As for the queries we will run, we would like to be able to query the data to get the number of followers for any specific Twitter account over arbitrary time ranges.

I'd appreciate suggestions on which approach is better and why. Also, if there is a much better option than the ones I've presented, feel free to suggest it.

Thanks in advance for your help!

Option 2, no doubt.

Imagine trying to write a query using each option. Let's take the best case for option 1: we know we want the total for all 31 days of the month. Then with option 1 the query is:

select twitterid, day1+day2+day3+day4+day5+day6+day7+day8+day9+day10
 +day31 as total
from table1
where month='2010-12';

With option 2 the query is:

select twitterid, sum(followers) as total
from table2
where day between '2010-12-01' and '2010-12-31'
group by twitterid;

The second looks much simpler to me. If you don't think so, tell me whether you immediately spotted the mistake in the option 1 version, and whether you are certain that no programmer would ever make that kind of error.

Now suppose the requirements change just slightly, and someone wants the total for one week. With the second version, that's easy: supply a date range that describes that week. This is easily done when building a query on the fly: just ask for the start date and add 6 days to it for the end date. But with the first version, what do you do? You'd have to work out which days of the month fall in that week and change the list of fields retrieved. The week might span two calendar months. This would be a huge pain.
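As a rough sketch of the one-week case, assuming the Option 2 table is called table2 with columns twitterid, day and followers (the column types, the sample rows and the @start variable are made up for illustration):

```sql
-- Demo table using the Option 2 layout; column types are assumptions.
CREATE TABLE table2 (
  twitterid BIGINT UNSIGNED NOT NULL,
  day       DATE NOT NULL,
  followers INT UNSIGNED NOT NULL
);

-- Made-up sample rows for one hypothetical account.
INSERT INTO table2 (twitterid, day, followers) VALUES
  (42, '2010-12-26', 1000),
  (42, '2010-12-27', 1010),
  (42, '2010-12-31', 1040),
  (42, '2011-01-02', 1055),
  (42, '2011-01-03', 1060);

-- The caller supplies only the start date; the end date is start + 6 days.
SET @start = '2010-12-27';

SELECT twitterid, day, followers
FROM table2
WHERE day BETWEEN @start AND DATE_ADD(@start, INTERVAL 6 DAY)
ORDER BY day;
```

The week here runs from 2010-12-27 to 2011-01-02, so it crosses a month boundary, yet the query needs no special handling for that. With the sample rows, only the 12-27, 12-31 and 01-02 rows fall inside the range.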

Regarding performance: sure, more rows take longer to retrieve. But longer rows also take longer to retrieve. Lesson 1 of database design: don't give up normalization to make a micro-optimization when you have no valid reason to believe there is a problem. Build a normalized database first. Then, if it turns out that you have performance problems, tune it later. Odds are you can buy a faster hard drive for far less than the cost of one day of a programmer's time spent tracking down a mistake in an unnecessarily complex query.

Of course it depends on which queries you will run, but unless every query needs all 31 days of that month, use Option 2 for the operational data.

  • It's better from a logical perspective (say later on you no longer want queries per "30 calendar days", but "last X days")

  • It's better for writes, too (you only update 1 row with 2 fields rather than overwriting all fields).

  • You can optimize later (partitioning comes to mind)

  • Your data warehouse can still be optimized for long-term aggregate statistics.
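The write and "last X days" points can be illustrated with a small sketch. The table name follower_counts, the column types and the composite primary key are assumptions made for the example, not something the question specifies:

```sql
-- Hypothetical table; the composite primary key is an assumption that
-- allows at most one row per account per day.
CREATE TABLE IF NOT EXISTS follower_counts (
  twitterid BIGINT UNSIGNED NOT NULL,
  day       DATE NOT NULL,
  followers INT UNSIGNED NOT NULL,
  PRIMARY KEY (twitterid, day)
);

-- Daily write: touches exactly one row. Re-running it for the same
-- account and day overwrites the count instead of adding a duplicate.
-- (VALUES() in this position is deprecated in recent MySQL 8.0 releases
-- but still accepted.)
INSERT INTO follower_counts (twitterid, day, followers)
VALUES (42, CURDATE(), 1075)
ON DUPLICATE KEY UPDATE followers = VALUES(followers);

-- "Last X days" read, here with X = 7 (today plus the six days before).
SELECT day, followers
FROM follower_counts
WHERE twitterid = 42
  AND day >= CURDATE() - INTERVAL 6 DAY
ORDER BY day;
```

Changing X only means changing the interval. With Option 1 the same request would need a different column list, and possibly two rows, depending on where the month boundary falls.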

Use Option 2. Option 1 would be a nightmare for queries. MySQL has good support for date ranges in queries, so it's simplest to just have one row per day.

I'd say option 2, but you would probably want to add a field for a primary key to speed up queries. And if that primary key is an integer value, even better.

Option 2 definitely (with a two-column unique key/constraint on Twitter ID and Day).

Option 1 will just be painful.
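One way the two key suggestions (an integer primary key, and a unique constraint on Twitter ID and Day) could be combined is sketched below. The table name, index name and column types are hypothetical:

```sql
-- Hypothetical names and types; a sketch, not a tuned schema.
CREATE TABLE follower_history (
  id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  twitterid BIGINT UNSIGNED NOT NULL,
  day       DATE NOT NULL,
  followers INT UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  -- One row per account per day; this index also serves lookups of the
  -- form "twitterid = ? AND day BETWEEN ? AND ?".
  UNIQUE KEY uq_account_day (twitterid, day)
);

INSERT INTO follower_history (twitterid, day, followers)
VALUES (42, '2010-12-01', 900);

-- A second row for the same account and day would be rejected with a
-- duplicate-key error:
-- INSERT INTO follower_history (twitterid, day, followers)
-- VALUES (42, '2010-12-01', 905);
```

Whether the surrogate id is worth having is a judgment call. Making (twitterid, day) the primary key would enforce the same uniqueness without the extra column.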