Conceptualizing Data: Simplifying the way we think about complex data structures.
Preface Conceptual models in programming are essential for being able to reason about problems. We see this through code all the time, with implementation details hidden away behind abstractions...
Preface
Conceptual models in programming are essential for being able to reason about problems. We see this through code all the time, with implementation details hidden away behind abstractions like functions and objects so that we can ignore the cumbersome details and focus only on the details that matter. Without these abstractions and conceptual models, we might find ourselves overwhelmed by the size and complexity of the problem we’re facing. Of these conceptual models, one of the most easily neglected is that of data and object structure.
Data Types Galore
Possibly one of the most overwhelming aspects of conceptualizing data and object structure is the sheer breadth of data types available. Depending on the programming language you’re working with, you may find that you have more than several dozens of object classes already defined as part of the language’s core; primitives like booleans, ints, unsigned ints, floats, doubles, longs, strings, chars, and possibly others; arrays that can contain any of the objects or primitives, and even other arrays; and several other data structures like queues, vectors, and mixed-type collections, among others.
With so many types of data, it’s incredibly easy to lose track in a sea of type declarations and find yourself confused and unsure of where to go.
Tree’s Company
Let’s start by trying to make these data types a little less overwhelming. Rather than thinking strictly of types, let’s classify them. We can group all data types into one of three basic classifications:
- Objects, which contain key/value pairs. For example, an object property that stores a string.
- Arrays, which contain some arbitrary number of values.
- Primitives, which contain nothing. They’re simply a “flat” data value.
We can also make a couple of additional notes. First, arrays and objects are very similar; both contain references to internal data, but the way that data is referenced differs. In particular, objects have named keys while arrays have numeric, zero-indexed keys. In a sense, arrays are a special case of objects where the keys are more strictly typed. From this, we can condense the classifications of objects and arrays into the more general “container” classification.
With that in mind, we now have the following classifications:
- Containers.
- Primitives.
We can now generally state that containers may contain other containers and primitives, and primitives may not contain anything. In other words, all data structures are a composition of containers and/or primitives, where containers may accept containers and/or primitives and primitives may not accept anything. More experienced programmers should notice something very familiar about this description--we’re basically describing a tree structure! Primitive types and empty containers act as the leaves in a tree, whereas objects and arrays act as the nodes.
Trees Help You Breathe
Okay, great. So what’s the big deal, anyway? We’ve now traded a bunch of concrete data types that we can actually think about and abstracted them away into this nebulous mess of containers and primitives. What do we get out of this?
A common mistake many programmers make is planning their data types out from the very beginning. Rather than planning out an abstraction for their data and object architecture, it’s easy to immediately find yourself focusing too much on the concrete implementation details.
Imagine, for example, modeling a user account for an online payment system. A common feature to include is the ability to store payment information for auto-pay, and payment methods typically take the form of some combination of credit/debit cards and bank accounts. If we focus on implementation details from the beginning, then we may find ourselves with something like this in a first iteration:
UserAccount: {
username: String,
password: String,
payment_methods: PaymentMethod[]
}
PaymentMethod: {
account_name: String,
account_type: Enum,
account_holder: String,
number: String,
routing_number: String?,
cvv: String?,
expiration_date: DateString?
}
We then find ourselves realizing that PaymentMethod
is an unnecessary mess of optional values and needing to refactor it. Odds are we would break it off immediately into separate account types and make a note that they both implement some interface. We may also find that, as a result, remodeling the PaymentMethod
could result in the need to remodel the UserAccount
. For more deeply nested data structures, a single change deeper within the structure could result in those changes cascading all the way to the top-level object. If we have multiple objects, then these changes could propagate to them as well. And what if we decide a type needs to be changed, like deciding that our expiration date needs to be some sort of date object? Or what if we decide that we want to modify our property names? We’re then stuck having to update these definitions as we go along. What if we decide that we don't want an interface for different payment method types after all and instead want separate collections for each type? Then including the interface consideration will have proven to be a waste of time. The end result is that before we’ve even touched a single line of code, we’ve already found ourselves stuck with a bunch of technical debt, and we’re only in our initial planning stages!
To alleviate these kinds of problems, it’s far better to just ignore the implementation details. By doing so, we may find ourselves with something like this:
UserAccount: {
Username,
Password,
PaymentMethods
}
PaymentMethods: // TODO: Decide on this container’s structure.
CardAccount: {
AccountName,
CardHolder,
CardNumber,
CVV,
ExpirationDate,
CardType
}
BankAccount: {
AccountName,
AccountNumber,
RoutingNumber,
AccountType
}
A few important notes about what we’ve just done here:
- We don’t specify any concrete data types.
- All fields within our models have the capacity to be either containers or primitives.
- We’re able to defer a model’s structural definition without affecting the pace of our planning.
- Any changes to a particular field type will automatically propagate in our structural definitions, making it trivial to create a definition like
ExpirationDate: String
and later change it toExpirationDate: DateObject
. - The amount of information we need to think about is reduced down to the very bare minimum.
- By deferring the definition of the
PaymentMethods
structure, we find ourselves more inclined to focus on the more concrete payment method definitions from the very beginning, rather than trying to force them to be compatible through an interface. - We focused only on data representation, ensuring that representation and implementation are both separate and can be handled differently if needed.
SOLIDifying Our Conceptual Model
In object-oriented programming (OOP), there’s a generally recommended set of principles to follow, represented by the acronym “SOLID”:
- Single responsibility.
- Open/closed.
- Liskov substitution.
- Interface segregation.
- Dependency inversion.
These “SOLID” principles were defined to help resolve common, recurring design problems and anti-patterns in OOP.
Of particular note for us is the last one, the “dependency inversion” principle. The idea behind this principle is that implementation details should depend on abstractions, not the other way around. Our new conceptual model obeys the dependency inversion principle by prioritizing a focus on abstractions while leaving implementation details to the future class definitions that are based on our abstractions. By doing so, we limit the elements involved in our planning and problem-solving stages to only what is necessary.
Final Thoughts
The consequences of such a conceptual model extend well beyond simply planning out data and object structures. For example, if implemented as an actual programming or language construct, you could make the parsing of your data fairly simple. By implementing an object parser that performs reflection on some passed object, you can extract all of the publicly accessible object properties of the target object and the data contained therein. Thus, if your language doesn’t have a built-in JSON encoding function and no library yet exists, you could recursively traverse your data structure to generate the appropriate JSON with very little effort.
Many of the most fundamental programming concepts, like data structures ultimately being nothing more than trees at their most abstract representation, are things we tend to take for granted and think very little about. By making ourselves conscious of these fundamental concepts, however, we can more effectively take advantage of them.
Additionally, successful programmers typically solve a programming problem before they’ve ever written a single line of code. Whether or not they’re conscious of it, the tools they use to solve these problems effectively consist largely of the myriad conceptual models they’ve collected and developed over time, and the experience they’ve accumulated to determine which conceptual models need to be utilized to solve a particular problem.
Even when you have a solid grasp of your programming fundamentals, you should always revisit them every now and then. Sometimes there are details that you may have missed or just couldn’t fully appreciate when you learned about them. This is something that I’m continually reminded of as I continue on in my own career growth, and I hope that I can continue passing these lessons on to others.
As always, I'm absolutely open to feedback and questions!