What is this? This is just something to summarize the mistakes that I found from my experience as a Software Engineer.
The first one that I want to talk about is the war between SQL and NoSQL. This is one of my favorite topic. I can debate about this topic for a whole day. 😁
It doesn’t mean I hate NoSQL. I just want to use the right tool for the right job…
This may be true at this time but who knows what will change in the future? I have faced a lot of problems with the design of NoSQL, especially when it is the main backing service for the whole system. Each database has its own strength and weakness and we should use it for the most suitable job or at least, pick the one that can solve 80% of your problems and live with the other 20% pain instead of choosing the one that fits for 20% use cases and fix the problem that it brings to the other 80%.
By NoSQL, I mean Document database systems (Mongo DB for example)
A real life loop
Back in the 1970s, the most popular database used a simple data model called the hierarchical model, which is similar to JSON model used in document databases today. That model worked well for one-to-many relationships. However, when it comes to many-to-many relationships, this design exposes a whole lot of problems, from joining issue to duplicate data management. Two alternative solutions were proposed, Relational model and Network model. The fact is obvious, Relational model wins the long run.
It seems that NoSQL databases today are repeating the problems in the history, which have been solved for decades. Despite the advantages of a more flexible data schema or a better representation of data (which is more similar to objects in programming language), the 1970s’ problems still exist and developers still have to handle them manually in an awkward way.
The problem of Document Storage model
Usually in NoSQL, you are encouraged to store the object and its relations in one single document instead of following the 3 forms of database normalization. For example, in an Insurance AMS system, a Customer object also contains the list of all its contacts (spouse, parents or children), the list of all the policies this customer has purchased and a collection of all the claims this customer has made. This leads to a table consisting of multiple large documents. So what is the problem here?
- What happens if you just want to read some fields of the Customer object (
firstName
,lastName
oremailAddress
)? - What happens if you only want to update some fields of the Customer object?
- What happens if you only want to work with the Customer object itself, not the Policy, Claim or Contact object?
Unless the fields you want to work with stays on the index tree, I cannot imagine there is any solution to interact with each individual field in an optimal way. The database system has to load the whole document into memory to process, to read only some specific properties, to update only some fields and then write the whole document back to disk.
- What happens if your customer object contains a lot of policies? It can be a Commercial customer (a Company buying insurance for all its employees). The whole object will be loaded to memory just to get/set some fields.
- Even if you store small objects, what happens if you read/update just one field of millions of object? All related documents will be loaded into memory just to get some specific fields.
We had this problem in the past. We simply stored the objects without worrying about the size of each record until we realized it was a big problem. All the solutions led to SQL design! The Document Storage can cause performance problems for your application when you scale
- From my experience, except for some specific cases (which will be mentioned later), you rarely have to do CRUD on the whole object. Each feature will require only some fields to work, not always the whole.
- As your application grows, there will be more properties added into your object. That will slow down your query from time to time.
- You will soon have to fix that problem in an ugly way, either by creating a cloned version of the table with only some necessary fields or use a SQL design.
Why don’t use SQL from the beginning?
A Non-Relational world?
Relations still exist in NoSQL, but…
You can still organize your entities into different tables (or collections in Mongodb), but that’s not the design philosophy of those Document database systems. They encourage you to combine the related entities into just one single object, especially for one-to-many relationships. It best fits for the case you mostly interact with the parent objects. The real world is a lot more complex than that.
Take a look at this example. You are building an application to automate email marketing for your
insurance agency. Initially, the application targets the customers of the agency only. To support
that, you only need a table customers
to store the customer objects with some basic fields. As your
application grows, the client requests to add some policies information into the email message (a
list of policies that the customer has purchased, for example). You will then update the objects in
that table to include a new array field called policies
. That works fine and your business
keeps growing. Then you decide to add more features to work with Policy entity type. Now, you have 2
choices
The first option is to keep the current table structure, add more fields into the policies
array
property of the customer
object. This solution is backward-compatible, doesn’t require any changes
to the current application code. However, you will soon realize the troubles. It will come back
to the problem of large objects that I mentioned above. It will affect the performance of all other
read/write operations. You will pay the penalty when interacting with only some fields of the
customer object. Another problem is that it’s hard to work with the Policy objects directly. You
will always have to go through the parent customer object. To retrieve a single policy, you will
have to query the parent customer, load the whole object into memory (either database memory or
application memory), filter to the policy that you are looking for and then return it. To query the
list of policies and then sort by some conditiion, you have to query all the customer objects in the
system just to get the policy data and then aggregate them. Your application code logic will be more
and more complex, the performance of your application will keep decreasing.
Another way is to create a separate policies
table to support for policy-related features. If you
keep the existing policy data in the customers
table, you will have to deal with the duplicated data
problem. How to maintain the consistency between the 2 tables? That should be the database job,
shouldn’t it? Now, it’s likely that you have to maintain by your own on application level.
What happens if you instead decide to just store the policy data in the policies
table? You will
have to update several places in your application code to match with the new data schema. It’s not
also not recommended in Document databases. You will have to either use an ugly joining solution
from the database or do the join inefficiently yourself on application level. Why don’t you just use
SQL to get all those benefits for free?
As your business grows, you can never store everything inside one object. You will have to break down your objects into smaller ones and create the link between them. This is when you realize the importance of SQL
Is that all?
Part 2 is here
Does that mean you should not use NoSQL?
No. It has its own use cases. Check in Part 2