Data redundancy in databases occurs when same piece of data is stored multiple times in the same table object or multiple table objects. This condition causes data “undesirable” anomalies when conducting transactions and operations inside of a transactional database.
The keyword here is transactional. In analytical databases, the anomalies are welcome most of the time and encouraged. See differences between OLTP vs OLAP.
In a database, the data should always be consistent. Otherwise, the database loses its meaning of storing and accessing data efficiently. Different techniques are used to maintain data integrity. On the other hand, many factors can affect consistency due to redundancy and data anomalies.
But, what do data anomalies and data redundancy actually mean? This guide will cover everything you should know about the two topics.
Data Anomalies and Data Redundancy: A Detailed Guide
Data Anomalies
Database anomalies refer to possible errors when using data manipulation in a database. They can cause data inconsistencies. It is usually caused by operations, such as insertion, deletion, and update. It may occur when a particular record is stored (inserted) in multiple locations in the same table or different tables.
Similarly, if the intention is to update one record in a table, how do we ensure that the same piece of data is updated in all records in that particular table or other tables.
The same applies for deletion. If a piece of data is to be deleted, what records should be part of this deletion?
These anomalies are extremely evident when trying to normalize a database. If the database is not normalized, simple manipulation queries can become a horror story.
What causes Data Anomalies?
The ultimate cause behind a database anomaly is either poor planning or a flat structure. The process to remove database anomalies is known as “Normalization.” It is usually completed in several steps. Each step converts the database into a different normal form. These are written as 1NF, 2NF, 3NF (1st Normal Form, 2nd Normal Form, 3rd Normal Form), etc.
During the design and modeling steps of a database. Normalization must be taken into consideration to avoid costly queries and performance hits later on.
Anomaly Types
Insertion Anomaly
This one occurs when the user is not able to insert a record in a database without adding certain attributes that may be uncertain at the time of insertion.
For example:
Consider that in a school database there is a table called teacher. As the name says, this table should hold information about the entity teacher. However, in this scenario, the table also have a column for the name of the course taught by a teacher.
The problem here is that if I want to insert a record for another teacher called Mary, and Mary for Mary, we have not determined the course that will be taught by her, there would have an anomaly because now, or we are not able to insert here or if we insert, there will be a NULL value for it in the course.
Course should be treated as another entity, hence should have another table that would relate with the teacher table.
The other also could be that Joe, used in the table example, may tech another course. Let’s say Music!
Now we have Joe twice in the table. That caused redundancy because in reality he should only have one record in the teacher table. Those courses should have been in a different table, probably called course, and another table would have been created to create a relationship. That would be based on cardinality!
Update Anomaly
This type of anomaly occurs due to data redundancy. It is also referred to as ‘partial update.’ It disables the user to update a record without performing another action, like deletion or addition.
Another case is when a user can not update all the records related to one entity. Thus, they need to update all the places that record exists manually.
For instance, if you want to change the newly-assigned teacher’s course, you will need to modify it from other places too. Or, you might not be able to update at all without performing a database operation.
Either way, this anomaly can cause huge data inconsistency.
If we check the example above with Joe teaching Math and Music. let’s say that coincidently, I have another Joe in the same school that also teaches music.
I made it easier in the example above by defining a teacher id so that we can identify who is who. In the case above, the first 2 Joes is the same person. The third Joe is another person. Anyway, the request could have been to update the age for Joe from 38 to 39 because that was a typo mistake.
The table above is in First Normal form, so it would be easy doing the update that way, let’s now make it unnormalized by removing the
Deletion Anomaly
This type of anomaly occurs when a user is deleting a data, but that results in the deletion of some other wanted data as well. For example, if you want to delete the teacher’s course, it will remove the entire record of that particular teacher.
In the example above, I would like to remove the course math from the school. For that I would have to do something like:
DELETE FROM #Teacher WHERE TeacherCourse = 'Math'
The result would be:
I go rid of the Math course, but I also got rid of Joe as a teacher. Maybe I wanted to keep Joe for something else. In that case I will have to add him again. What if we try to remove the teacher Carol but want to keep her courses available. In this case we can’t, to remove her, I will have to remove her courses as well.
All of these issues can only be solved by the use of normalization. There is no way around it. It is part of the database modeling and design process to achieve high performant transactional databases while avoiding all these anomalies.
- Data Redundancy
Data Redundancy is a condition that occurs in a database when the same piece of same data is stored in multiple places. However, these several occurrences are for a common purpose, making them useless and harmful for the database.
Although it might seem simple, even the biggest companies face data redundancy. It is also linked to the three data anomalies we discussed earlier. Therefore, data redundancy is also removed through Normalization. Each normal step has its conditions and eliminates the portion of redundancy.
Once data redundancy is completely removed or minimized, the database is considered efficient.
There are various drawbacks of data redundancy on the database. Let’s go through them to provide a clearer picture.
- Data Inconsistency
Data Redundancy can lead to inconsistency in the database. It is a condition when the same data exists in different formats in multiple tables. So, when a user accesses that particular record, they might get misleading or incorrect information.
That is mainly because different tables are created for easier access and updating of data. However, it can be difficult to deal with when you find that the same data is not found in the same format.
In fact, a company’s biggest demand for databases is consistent data. They want their databases to maintain the integrity and provide meaningful information. Whereas, data redundancy does exactly the opposite.- Large Database Size
Data Redundancy also leads to increased database complexity. That is mainly due to the increasing size of the database. It makes it much more challenging to maintain or update data.
But that’s not it! A larger database also leads to slow loading time. Since the same data is scattered throughout a database, it can take longer to find what you are looking for.
A business wants its databases to be as quick as possible. And this can only happen once you eliminate data redundancy.
- More Cost
This drawback is especially harmful to organizations. Many businesses try to keep their budget restricted when it comes to databases. However, data redundancy can enable you to do so. That’s because it directly impacts your storage, so you will need to spend an extra cost. It can become quite expensive for small and new businesses if they do not minimize data redundancy.
- Large Database Size
With that said, there are also some benefits of data redundancy. For instance, you get better data security as well as backup measures. Moreover, you can also double-check your data due to this condition. Hence, it also improves the reliability factor. Many companies use some sort of data redundancy intentionally to enhance their databases. Conclusion
And that brings us to the end of this detailed guide on Data Anomalies and Data Redundancy. We have covered their definitions, types, drawbacks, and more!
We have also briefly discussed Normalization. It is the only way to remove both Data Anomalies and Data Redundancy. From a business point of view, it is extremely important to keep your data consistent!