The main feature in RDBMS (Relational Database management system) is the relational or the relationship part of it. If done correctly, it makes the connections among entities easier and faster to retrieve. It also eliminates data anomalies and redundancies.
Business rules will dictate that, so a clear understanding of the rules will make it easier to identify those relationships and how they should be connected.
Relationship and Objects
These are some terms and objects inside of RDBMS that should be understood in order to talk about relationships in the database world. They are:
- Primary Key
- Surrogate Key
- Foreign Key
- Natural Key
- Candidate Key
- Unique Key
- Composite Key
- Super Key
- NOT NULL Constraint
- Parent table
- Child table
To be able to make a good database design and make a database model that works, the terms above must be understood throughout.
Let’s understand those terms, before we talk about relationships.
Primary Key
Primary Key is the unique identifier of a row. It can have one or more columns. If it has more than 1 column, it will be called a composite key. Every time a primary key is created, a clustered index is also created. The clustered index holds all the column of a row. Tables with primary indexes enforced are called clustered tables. That name is simply because of the clustered index created on it.
The clustered index, in this case the primary key is ordered in the table. So, when an INSERT happens, SQL Server must make sure it will insert in the right order. The database will mange that order automatically, however, it is good to note that it will make transaction operations slower. In this case, however, the benefits of having a clustered index outweighs in a big time the drawbacks of it.
Most of the times, we can find combinations that will make the row unique. In that case, these rows are candidate keys to become a primary key. However, we may not have control of the datatype, or we don’t really know the rule behind that that column specification.
A good example is the Social Security Number, for a time, many people thought that SS numbers were unique per person. We know, however, that it is mathematically impossible to be unique for every single person, when the total number it can hold is less than the total individuals that lived, live, or will live in the U.S.
Another option to make it “unique”, could be the association of this number with the name of the person, date of birth or any other attribute of a person. If we add all these possible attributes to the social security number, we will now have a very possible unique identification in the table. Those 3 could make an almost “perfect” candidate key. However, we are not sure, if, this is what IRS really does to identify a person or if they use something internally that we are not aware of.
Since IRS, as far as we know, manages these Social Security numbers, they could tomorrow say that these numbers now would also start to have letters or that the number of characters will increase or any other type of change that they deem necessary.
The problem here is that, as developers or even business people, we do not have control of what IRS will do. IRS will also not care if any changes they make will break anyone’s system.
Surrogate Key
Surrogate Keys are the solution for this problem. This key is a non-business meaning key, that will guarantee that the primary key is unique and under our control. So instead of using Social Security Number as my primary key, which is a candidate key, and will talk more about it later, we will use a key that we created and make a constraint determining that that key is the primary key.
In other words, surrogate key is a primary key with no meaning that help us have control on the unique identifier in a table.
That helps developers, applications, and users to avoid confusion.
There is a debate among SQL developers on “Should we have surrogate keys or our primary key should always be made of attributes that make the row unique?” If we research on this topic, there are good arguments in both sides. However, I don’t think anyone holds the complete truth, so a decision has to be made by developer or the team in what approach to use.
My opinion is to use surrogate keys as much as we can to speed up development. Why would that speed up development? The reason is because JOINS in foreign keys are mostly done with primary keys. If the primary key is made of more than one attribute, the developer will always have to include all of them in the joins. That could slow development and cause debugging confusing.
Surrogate keys can be of any data type, however, the most used is a numeric integer number. There is also an explanation for it to be numeric. As I said before, a primary key is sorted. Numeric values are easier to sort. Numeric values will help particularly on data manipulation language.
You will still find database applications with strings or unique identifier types for this key, but the hard and cold truth is, it will cause performance issues eventually. So, try to avoid it if you can!
Foreign Key
Foreign Key is a primary Key or unique Key of another table. Primary key will always be an unique key, unique key will not always be a primary key. It sets, first, the relationship between two tables and second ensures data integrity. What do I mean by integrity? Imagine Table A has a Primary Key named A_ID. I can make that A_ID a foreign key in table B. If I try to insert anything in table B, but in that insert the A_ID value does not exist in table A. It will fail because SQL Server is expecting a value existing in table A. The relations must exist no matter what!
Foreign Keys do not have to be unique in a table, we can force it to be, but it is not needed. One table can have foreign keys referencing many different tables at once. A table can reference itself as well. In this case, table A would have a primary key and a foreign key that references its own primary key.
Natural Key
Natural Key is a column that by its nature could be a primary key. So, let’s bring back the SSN example again, it would fit almost perfectly as a primary key, if it was unique. Another good example is an email address. An email should be unique per person, in this case it is a natural key for sure. If wanted, they can be the primary key of that table. However, control could be an issue again.
So, if the attribute have the possibility of being unique in a table, it will by default be a natural key. Now, this is a concept. There is no way to tell the system that the attribute is a natural key. The other keys, however, we have to explicitly identify them as primary keys or foreign keys.
Candidate Keys
Candidate Keys are columns that could be a primary key. Yes, it is like a natural key, but at this point we are only looking for possibilities. So, why do we even care about them, if “we” will end up creating a surrogate key at the end of the day? The answer is simple, in order to be able to normalize this database later, we must know what each column represents.
If we decide to go with surrogate keys, we also now know who are the possible columns that will or should yield unique values. It will be easier to create constraints later on those tables. If documented, developers will also be able to create logic around it and the system will be able to enforce integrity on the data.
If we decide to not use surrogate keys, we will be able to clearly know, based on documentation the least amount of columns that makes a primary in a specific table and use it as the primary key.
A table, could have many different candidates, but only one should be promoted to be a primary key.
Unique Keys are created every time we create a primary key or any time we add an UNIQUE key constraint to the table. It ensures that that the column has a unique value thought the table. This key, just like a primary key, can also be reference by another table as a foreign key.
Why use unique keys? We can go back to the Social Security example again. Maybe our system only allows a unique social security in an specific table. Even thought we determined that the same Social Security numbers can be assigned to more than one person. We may still decide that we want unique social security values per person in a table. So, even thought the primary key is not the social security, since we decided to create a surrogate key as a primary key. We are still able to enforce only one social security per person in that table.
The odds of getting the exact same social security in a random company for 2 different people is very slim, but if it ever happens, it will be up to the business to decide to add another column to make it unique. A good place to look for would be by going back to the candidate keys.
Composite Keys
Composite Key is a combination of one or more columns, and they can also uniquely identify a row in a table. Their association will become a primary key. This almost interchangeable with candidate keys. The difference is that maybe the candidate key is made of only one attribute. In that case it would not be composite key.
Composite Key vs. Compound Key
Many people use composite key and compound key interchangeably. Even thought, they are so similar in almost every sense. That is not 100% correct. Compounded keys are made of columns that are primary keys of other tables. There would be more than one column and those columns by default would be Foreign keys in the subject table.
Composite keys can be made of any attributes regardless if they are primary keys form other tables or not.
Super Key
Super keys are candidate keys that after all possible reduction on the number of attributes, became the primary key. So, imagine that in a table we have columns Social Security Number, Date of Birth and FirstName. Those three attribute per itself could be a candidate key, considering that they will form an unique row in the table. We can squeeze a little more and see that if we remove the FirstName out of the candidate key, we now have another candidate that consists only of Social Security Number and Date of Birth. This would probably the most reduced number of columns we could use to become the table unique.
In that case, this second candidate key will be promoted to be our super key and if we have not intentions to create a surrogate key, this candidate key will also be our primary key. If we create a surrogate key, by default the surrogate key becomes the super key as well.
So all these terms, based on concepts, can be used interchangeably. However, it is very important to understand when they become synonyms to one another.
NOT NULL
NOT NULL is a constraint that will enforce that the column is not empty. It can be blank or zero or anything based on the datatype, but never empty. When creating tables, NULL will be the default value.
Most RDBMS interpret NULL value as unknown. So, if we try to do operations with null values, we may get unexpected results. For example, if we try to add a number to a NULL value, the result will be NULL. The reason for that is because the database does not know what the value is.
If we try to do aggregation in columns with NULL values, those records will not be part of the aggregation results. I see that as problems in many applications with wrong results being given, specially in the reporting world, simply because developers most of the time do not handle that scenario properly.
How do we use NULL values wrong?
The big problem with NULL values, is the culture created around it. The culture is that if a record may or may not have a value in a particular column, let’s make it NULL. Again, all developers did or will do that at some point in time. If we never do it and start working for another company, we will see that someone else did it. NULL values indicates lack of Normalization.
NULL should be used in situations where we clearly do not know what should be the value there but we clearly know that the data exists. For example, in school enrollment, maybe to enroll a person, name and date of birth must be inserted. However, for whatever reason, the person being enrolled is being enrolled by someone who does not know the person’s date of birth. In that case, we know that the person has to have date of birth but it is unknown at the time of enrollment. We may allow the enrolment to happen and later on that person will come and give us the date of birth so that we can update that record.
So, we truly don’t know the data right now, but we know that the entity (enrollee) have the data.
The culture today is if we create a table with many columns and one of those columns where an attribute may exist or not, then make it NULL. That is just wrong, but…
Parent table is the table that holds the primary key as foreign key in a related table.
Child table host the primary key of the parent table as one of the columns in its table.
When designing a database these objects must be understood to be able to make an excellent database design. They will be dictated by many of the business rules and will define relationship and how data should be manipulated and/or retrieved. If done correctly, development will be smoother and onboarding new database professionals will also be easier.
It will also make the normalization process suitable and easy going.
There are three different relationships when designing a database application. They are:
- One to one
- One to many
- Many to many
One to One Relationship
As the name implies, in this scenario, we would have two different tables connecting to each other with the first table having a record and the second table having one single record associated with it. Many database professionals believe that since these two tables is mapping one to one, why not have it in the same table and avoid a join?
There are few reasons, and let’s start with the 8kb page mentioned above. Let’s say that we have a field in user table and that that field is PersonBio taking NVARCHAR(MAX) as the datatype. In the worst-case scenario, that will fill out the whole 8kb page. If I have that field in association with other fields, we will most likely have a page split.
Also, since this field will be part of the clustered index page, all the columns including this one, may live in the cache area, which will cause the database to clean data more often making the system somewhat slow.
So, if the column in question utilizes any of those datatypes to its MAX. We should consider putting them in a different table, specially if that field is not accessed very much. I certainly do not want things in memory if they are not utilized that much.
I used that TEXT with MAX example to help illustrate the problem, but it can apply to any column that is not very utilized. Let’s say that we have a table with 30 columns but out of those 30 columns only 20 are frequently queried, why not separate those 10 into a detail table with a one to one relationship? That sounds like a good idea!
Another good reason to have a one-to-one relationship is for fields that for, private purposes, should not be queried or live in the same area as other information. Columns like username and password are good candidates to be separated in a one to one relationship. This would mitigate security risks for accessing that data without necessity.
Here is an example:
Notice that PersonID is in both tables. The parent table, in this case Person has a Primary key named PersonID. The child table has a primary key also called PersonID. This will also be the Foreign key referencing Person. In that way uniqueness will be enforced as well.
There is also another approach you may see. Check a very similar example:
In this case, we have the same parent table, Person, holding the primary key, and the child table, PersonDetail, holding a Foreign key column called PersonID, which is referencing to the Person table. To make sure this works as one to one, we need to create a unique key on PersonID in the child table.
One to many relationship
In this case we would have one record in a table connecting to many records in a different table. This concept is much easier to understand than the one to one because in this case, we can easily notice the need for this.
A person, as we know, may have many different emails. However, that email can only belong to 1 single person. Now we have a one to many relationship.
See the tables below:
In that example we see the parent table called Person with a column named PersonID. PersonId is the primary key. We then see the child table called Email with a foreign key of the PersonID in the third column.
Can you see how Joe (PersonID = 1) has more than 1 email stored? That characterizes the one to many relationship. List table are the ones to usually hold this type of data. The connection is usually between a Core or parent table with this child or list table. In tables that are normalized to the 3rd or above form, will see these table rightfully hanging all over
Many to many relationship
In this case, we would have a table that connects with another table, but both can have multiple records of each other. For example, a person from the Person table, who buys products from the Product table. The customer can buy many different products from the Product table. The same product can also be bought by different customers.
In the example above we have 2 different tables. Notice that Joe can buy, if he wants, more than one type of phones. If he buys 1 iPhone and 1 Galaxy for example, he would now have to records on the product table to be connected.
Notice also that an iPhone can be bought by any of the other persons. So, if John and Jennifer also bough an iPhone 12, those records would have to be connected show how.
That would be a many to many relationship. The question then is how to connect them? In previous examples, we saw the relation of one parent table with one child table. This will not work here, without making the 2 tables extremely redundant.
We should now use a Cross-reference table to connect those two tables. Those two tables are, now, parents, and the cross-reference table is the child.
Now, this new table called Order, has a reference to the Primary key of the Person table and a reference of the primary key of the Product table. The primary key of the order table could be all three columns forming a composite key.
These cross-reference table start to show up in our designs the moment we start eliminating partial dependencies. Which is very common when normalizing from 1st to 2nd normal form.
Summary
In this article, we talked about the main terminologies when designing a database application. We also talked about the relationships among tables and how to determine them.
Designing a database requires some deep thought to make sure that it will be easier to develop, to understand, and to use once available.