I know this is yet another question on this subject, but I'm a complete beginner in the NoSQL world, so I'd like some advice. People on SO told me that MySQL may be a bad idea for this dataset, so I'm asking here. I have a lot of data in the following format:

TYPE 1

ID1: String String String ...
ID2: String String String ...
ID3: String String String ...
ID4: String String String ...

which I want to convert into something like this:

TYPE 2

ID1: String
ID1: String
ID1: String
ID1: String
ID2: String
ID2: String

This is the most inefficient way, but I need to be able to search by both the key and the value. For example, my queries would look like this:

  • I need to know all the strings a given ID contains, and then intersect that list with another list obtained for a different ID.
  • I need to know which IDs contain a given string.

I would like to achieve this without converting Type 1 into Type 2 because of the sheer space requirements, but I want to find out whether MongoDB, CouchDB, or something else would help me in this case (someone recommended NoSQL, so I started searching and found that these two are very popular). I have a 14-node cluster I can leverage, but I need advice on which database is the right one for this use case. Any suggestions?

A couple of extra things:

  • The data is static. I'll add new data but won't modify the existing data.
  • The ID is 40 bytes long, whereas the strings are about 20 bytes each.

MongoDB lets you store this data efficiently in Type 1. Depending on how you use it, it could look like one of these (data is in JSON):

Array of Strings

{ "_id" : 1, "strings" : ["a", "b", "c", "d", "e"] }

Set of KV Strings

{ "_id" : 1, "s1" : "a", "s2" : "b", "s3" : "c", "s4" : "d", "s5" : "e" }

Based on your queries, I'd probably use the Array of Strings method. Here's why:

I need to know all the strings a given ID contains, and then intersect that list with another list obtained for a different ID.

This one is easy: it's a single key-value look-up per ID. In code, it would look something like this:

db.my_collection.find({ "_id" : 1});
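The intersection step would then happen in your application code once both documents are back. Here's a minimal sketch in plain JavaScript, assuming the two look-ups returned documents shaped like the Array of Strings example; `doc1`, `doc2`, and `intersectStrings` are made-up names for illustration, not part of MongoDB.

```javascript
// Hypothetical documents standing in for the results of two _id look-ups.
const doc1 = { _id: 1, strings: ["a", "b", "c", "d", "e"] };
const doc2 = { _id: 2, strings: ["c", "e", "f"] };

// Intersect two string arrays, using a Set for fast membership checks.
function intersectStrings(first, second) {
  const lookup = new Set(second);
  // The outer Set drops duplicates; the spread turns it back into an array.
  return [...new Set(first.filter((s) => lookup.has(s)))];
}

const shared = intersectStrings(doc1.strings, doc2.strings);
console.log(shared);
```

For these two sample documents, the shared strings are "c" and "e".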

I need to know which IDs contain a given string.

Equally easy:

db.my_collection.find({ "strings" : "my_string" })

Yes, it's that simple. I know that "strings" is technically an array, but MongoDB will recognize the field as an array and will look through its elements to find the value. The docs for this are here.

As a bonus, you can index the "strings" field and you'll get an index on the array. So the find above will actually perform relatively fast (with the obvious trade-off that the index can be very large).
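To make that trade-off concrete, here's a rough in-memory sketch of the idea behind indexing an array field: each string maps to the IDs of the documents that contain it. This is only a conceptual illustration in plain JavaScript, not how MongoDB actually implements its indexes; the sample documents and the `buildStringIndex` and `idsContaining` helpers are made up for the example.

```javascript
// Hypothetical sample documents in the Array of Strings layout.
const docs = [
  { _id: 1, strings: ["a", "b", "c"] },
  { _id: 2, strings: ["b", "c", "d"] },
  { _id: 3, strings: ["c"] },
];

// Build a map from each string to the set of document IDs that contain it.
function buildStringIndex(documents) {
  const index = new Map();
  for (const doc of documents) {
    for (const s of doc.strings) {
      if (!index.has(s)) index.set(s, new Set());
      index.get(s).add(doc._id);
    }
  }
  return index;
}

const index = buildStringIndex(docs);

// Look up the IDs for a string without scanning every document.
function idsContaining(s) {
  return [...(index.get(s) || [])];
}

console.log(idsContaining("b"));
```

Notice that the map holds one ID per (string, document) pair, which is roughly your Type 2 layout turned around. That's why the index can get so large.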

As for scaling, a 14-node cluster may almost be overkill. However, Mongo does support auto-sharding and replica sets. They can work together; here is a blog post from a 10gen member to get you started (10gen makes Mongo).