Conceptualizing Data: Simplifying the way we think about complex data structures.
Preface
Conceptual models in programming are essential for being able to reason about problems. We see this through code all the time, with implementation details hidden away behind abstractions like functions and objects so that we can ignore the cumbersome details and focus only on the details that matter. Without these abstractions and conceptual models, we might find ourselves overwhelmed by the size and complexity of the problem we’re facing. Of these conceptual models, one of the most easily neglected is that of data and object structure.
Data Types Galore
Possibly one of the most overwhelming aspects of conceptualizing data and object structure is the sheer breadth of data types available. Depending on the programming language you’re working with, you may have dozens of object classes already defined as part of the language’s core; primitives like booleans, ints, unsigned ints, floats, doubles, longs, strings, and chars; arrays that can contain any of those objects or primitives, and even other arrays; and several other data structures like queues, vectors, and mixed-type collections.
With so many types of data, it’s incredibly easy to lose track in a sea of type declarations and find yourself confused and unsure of where to go.
Tree’s Company
Let’s start by trying to make these data types a little less overwhelming. Rather than thinking strictly of types, let’s classify them. We can group all data types into one of three basic classifications:
- Objects, which contain key/value pairs. For example, an object property that stores a string.
- Arrays, which contain some arbitrary number of values.
- Primitives, which contain nothing. They’re simply a “flat” data value.
We can also make a couple of additional notes. First, arrays and objects are very similar; both contain references to internal data, but the way that data is referenced differs. In particular, objects have named keys while arrays have numeric, zero-indexed keys. In a sense, arrays are a special case of objects where the keys are more strictly typed. From this, we can condense the classifications of objects and arrays into the more general “container” classification.
With that in mind, we now have the following classifications:
- Containers.
- Primitives.
We can now generally state that containers may contain other containers and primitives, while primitives may not contain anything. In other words, every data structure is a composition of containers and/or primitives. More experienced programmers should notice something very familiar about this description--we’re basically describing a tree structure! Primitive values and empty containers act as the leaves of the tree, whereas non-empty objects and arrays act as its internal nodes.
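To make the container/primitive split concrete, here is a minimal Python sketch. It assumes that dicts stand in for objects and lists or tuples stand in for arrays; the helper names (`children`, `count_leaves`, `depth`) and the sample `account` data are hypothetical and exist only for illustration.

```python
from typing import Any


def children(value: Any) -> list:
    """Return (key, child) pairs for a container, or [] for a primitive.

    Objects expose named keys; arrays expose zero-based numeric keys.
    """
    if isinstance(value, dict):
        return list(value.items())
    if isinstance(value, (list, tuple)):
        return list(enumerate(value))
    return []  # primitives contain nothing


def count_leaves(value: Any) -> int:
    """Count leaves: primitives and empty containers."""
    kids = children(value)
    if not kids:
        return 1
    return sum(count_leaves(child) for _, child in kids)


def depth(value: Any) -> int:
    """Number of container levels between this node and its deepest leaf."""
    kids = children(value)
    if not kids:
        return 0
    return 1 + max(depth(child) for _, child in kids)


# Hypothetical sample data, shaped loosely like a user account.
account = {
    "username": "alice",
    "payment_methods": [
        {"account_name": "Checking", "number": "0000"},
        {"account_name": "Visa", "number": "1111"},
    ],
    "tags": [],
}

print(count_leaves(account))  # 6
print(depth(account))         # 3
```

Neither traversal needs to know what any field means. Both rely only on whether a value is a container or a primitive.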
Trees Help You Breathe
Okay, great. So what’s the big deal, anyway? We’ve now traded a bunch of concrete data types that we can actually think about and abstracted them away into this nebulous mess of containers and primitives. What do we get out of this?
A common mistake many programmers make is planning their data types out from the very beginning. Instead of first planning an abstraction for their data and object architecture, they immediately find themselves focusing too much on the concrete implementation details.
Imagine, for example, modeling a user account for an online payment system. A common feature to include is the ability to store payment information for auto-pay, and payment methods typically take the form of some combination of credit/debit cards and bank accounts. If we focus on implementation details from the beginning, then we may find ourselves with something like this in a first iteration:
UserAccount: {
    username: String,
    password: String,
    payment_methods: PaymentMethod[]
}
PaymentMethod: {
    account_name: String,
    account_type: Enum,
    account_holder: String,
    number: String,
    routing_number: String?,
    cvv: String?,
    expiration_date: DateString?
}
We then find ourselves realizing that PaymentMethod is an unnecessary mess of optional values and needing to refactor it. Odds are we would break it off immediately into separate account types and make a note that they both implement some interface. We may also find that, as a result, remodeling the PaymentMethod could result in the need to remodel the UserAccount. For more deeply nested data structures, a single change deeper within the structure could result in those changes cascading all the way to the top-level object. If we have multiple objects, then these changes could propagate to them as well. And what if we decide a type needs to be changed, like deciding that our expiration date needs to be some sort of date object? Or what if we decide that we want to modify our property names? We’re then stuck having to update these definitions as we go along. What if we decide that we don't want an interface for different payment method types after all and instead want separate collections for each type? Then including the interface consideration will have proven to be a waste of time. The end result is that before we’ve even touched a single line of code, we’ve already found ourselves stuck with a bunch of technical debt, and we’re only in our initial planning stages!
To alleviate these kinds of problems, it’s far better to just ignore the implementation details. By doing so, we may find ourselves with something like this:
UserAccount: {
    Username,
    Password,
    PaymentMethods
}
PaymentMethods: // TODO: Decide on this container’s structure.
CardAccount: {
    AccountName,
    CardHolder,
    CardNumber,
    CVV,
    ExpirationDate,
    CardType
}
BankAccount: {
    AccountName,
    AccountNumber,
    RoutingNumber,
    AccountType
}
A few important notes about what we’ve just done here:
- We don’t specify any concrete data types.
- All fields within our models have the capacity to be either containers or primitives.
- We’re able to defer a model’s structural definition without affecting the pace of our planning.
- Any changes to a particular field type will automatically propagate in our structural definitions, making it trivial to create a definition like ExpirationDate: String and later change it to ExpirationDate: DateObject.
- The amount of information we need to think about is reduced down to the very bare minimum.
- By deferring the definition of the PaymentMethods structure, we find ourselves more inclined to focus on the more concrete payment method definitions from the very beginning, rather than trying to force them to be compatible through an interface.
- We focused only on data representation, ensuring that representation and implementation are both separate and can be handled differently if needed.
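One way the "change the field type in one place" idea might carry over into real code is sketched below. The use of Python type aliases and a dataclass is an assumption made for illustration, since the models above are deliberately language-agnostic. The sample values are hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import date

# Field-level definitions: the only place a concrete type is named.
AccountName = str
CardHolder = str
CardNumber = str
CVV = str
CardType = str
# Assumed history: this alias started out as `str` and was later switched
# to `date`. The model below did not have to change when that happened.
ExpirationDate = date


@dataclass
class CardAccount:
    account_name: AccountName
    card_holder: CardHolder
    card_number: CardNumber
    cvv: CVV
    expiration_date: ExpirationDate
    card_type: CardType


# Hypothetical sample values.
card = CardAccount(
    "Personal Visa", "A. Example", "4111111111111111", "123",
    date(2030, 1, 31), "credit",
)

# The model picks up whatever type the alias currently names.
expiration_field = next(
    f for f in fields(CardAccount) if f.name == "expiration_date"
)
print(expiration_field.type)  # <class 'datetime.date'>
```

The structural definition of `CardAccount` mentions only field names and aliases, so swapping a field's concrete type touches a single line.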
SOLIDifying Our Conceptual Model
In object-oriented programming (OOP), there’s a generally recommended set of principles to follow, represented by the acronym “SOLID”:
- Single responsibility.
- Open/closed.
- Liskov substitution.
- Interface segregation.
- Dependency inversion.
These “SOLID” principles were defined to help resolve common, recurring design problems and anti-patterns in OOP.
Of particular note for us is the last one, the “dependency inversion” principle. The idea behind this principle is that implementation details should depend on abstractions, not the other way around. Our new conceptual model obeys the dependency inversion principle by prioritizing a focus on abstractions while leaving implementation details to the future class definitions that are based on our abstractions. By doing so, we limit the elements involved in our planning and problem-solving stages to only what is necessary.
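As a rough illustration of dependency inversion in code, the sketch below has high-level code depend on an abstraction rather than on any concrete account class. The `PaymentMethod` protocol, its `label` method, and the `list_payment_methods` function are hypothetical names chosen for this example. Whether the payment methods should share an interface at all is exactly the kind of decision the conceptual model lets us defer.

```python
from typing import Protocol


class PaymentMethod(Protocol):
    """Hypothetical abstraction: anything that can produce a display label."""

    def label(self) -> str:
        ...


class CardAccount:
    def __init__(self, account_name: str, card_number: str) -> None:
        self.account_name = account_name
        self.card_number = card_number

    def label(self) -> str:
        return f"{self.account_name} (card ending in {self.card_number[-4:]})"


class BankAccount:
    def __init__(self, account_name: str, account_number: str) -> None:
        self.account_name = account_name
        self.account_number = account_number

    def label(self) -> str:
        return f"{self.account_name} (account ending in {self.account_number[-4:]})"


def list_payment_methods(methods: list[PaymentMethod]) -> list[str]:
    """High-level code: knows only the abstraction, not the concrete classes."""
    return [method.label() for method in methods]


labels = list_payment_methods([
    CardAccount("Personal Visa", "4111111111111111"),
    BankAccount("Checking", "000123456789"),
])
print(labels)
# ['Personal Visa (card ending in 1111)', 'Checking (account ending in 6789)']
```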
Final Thoughts
The consequences of such a conceptual model extend well beyond simply planning out data and object structures. For example, if implemented as an actual programming or language construct, you could make the parsing of your data fairly simple. By implementing an object parser that performs reflection on some passed object, you can extract all of the publicly accessible object properties of the target object and the data contained therein. Thus, if your language doesn’t have a built-in JSON encoding function and no library yet exists, you could recursively traverse your data structure to generate the appropriate JSON with very little effort.
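The recursive traversal described above could look something like the following sketch. It assumes Python, where `vars()` plays the role of reflection and a leading underscore marks an attribute as non-public. The `to_json` function and the sample classes are hypothetical, and the encoder is intentionally minimal: it does no cycle detection and no pretty-printing.

```python
import math


def _escape(text: str) -> str:
    """Escape the characters JSON requires to be escaped inside strings."""
    out = []
    for ch in text:
        if ch == '"':
            out.append('\\"')
        elif ch == "\\":
            out.append("\\\\")
        elif ch < " ":  # control characters
            out.append(f"\\u{ord(ch):04x}")
        else:
            out.append(ch)
    return "".join(out)


def to_json(value) -> str:
    """Encode a tree of containers and primitives as JSON text."""
    # Primitives: the leaves of the tree.
    if value is None:
        return "null"
    if isinstance(value, bool):  # must be checked before int
        return "true" if value else "false"
    if isinstance(value, int):
        return repr(value)
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("JSON cannot represent NaN or infinity")
        return repr(value)
    if isinstance(value, str):
        return '"' + _escape(value) + '"'
    # Containers: recurse into each child.
    if isinstance(value, (list, tuple)):
        return "[" + ",".join(to_json(item) for item in value) + "]"
    if isinstance(value, dict):
        pairs = (to_json(str(k)) + ":" + to_json(v) for k, v in value.items())
        return "{" + ",".join(pairs) + "}"
    # Arbitrary objects: reflect over public attributes, then treat as a dict.
    if hasattr(value, "__dict__"):
        public = {k: v for k, v in vars(value).items() if not k.startswith("_")}
        return to_json(public)
    raise TypeError(f"cannot encode {type(value).__name__}")


class CardAccount:
    def __init__(self, account_name, card_type):
        self.account_name = account_name
        self.card_type = card_type
        self._cache = object()  # non-public, so the encoder skips it


class UserAccount:
    def __init__(self, username, payment_methods):
        self.username = username
        self.payment_methods = payment_methods
        self.verified = False


account = UserAccount("alice", [CardAccount("Visa", "credit")])
encoded = to_json(account)
print(encoded)
# {"username":"alice","payment_methods":[{"account_name":"Visa","card_type":"credit"}],"verified":false}
```

The encoder never names `UserAccount` or `CardAccount`. It only distinguishes containers from primitives, which is why new model classes need no changes to it.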
Many of the most fundamental programming concepts, like data structures ultimately being nothing more than trees at their most abstract representation, are things we tend to take for granted and think very little about. By making ourselves conscious of these fundamental concepts, however, we can more effectively take advantage of them.
Additionally, successful programmers typically solve a programming problem before they’ve ever written a single line of code. Whether or not they’re conscious of it, the tools they use to solve these problems effectively consist largely of the myriad conceptual models they’ve collected and developed over time, and the experience they’ve accumulated to determine which conceptual models need to be utilized to solve a particular problem.
Even when you have a solid grasp of your programming fundamentals, you should always revisit them every now and then. Sometimes there are details that you may have missed or just couldn’t fully appreciate when you learned about them. This is something that I’m continually reminded of as I continue on in my own career growth, and I hope that I can continue passing these lessons on to others.
As always, I'm absolutely open to feedback and questions!
Thanks for taking the time to write this. I've started writing python scripts and am getting more curious about these topics. I know how to do functions, simple classes, and read documentation to make libraries do what I want. I always feel like my code is inefficient and hacky. Do you have any recommendations for topics to learn more about next?
Side question, what inspires you to write about this?
Regarding recommendations, this is tricky.
Right now you're just starting out, so I wouldn't recommend concerning yourself too much with writing good code yet. Instead, continue working on nailing down your problem solving process and your understanding of programming problems as a whole. For example, if you're the kind of person who currently solves problems by process of trial and error, e.g. copy-pasting from StackOverflow and making changes until things work, then you need to stop what you're doing right now and change the way you approach solving problems. These are the fundamentals on which everything else you do will build, so it's important to strengthen them as much as you can. Solve lots of problems and lots of different kinds of problems. Keep challenging yourself with harder and harder problems over time.
Quite frankly, no programmer is going to write efficient, non-hacky code when they're just starting out. Learn to accept that right now. You're a novice. You're new to programming. Find me one novice who writes good code and I'll find you a thousand that don't. Writing good code isn't your goal right now. Learning to internalize basic programming constructs is. Writing shitty code is part of the process. It's a rite of passage. Embrace it.
That being said, it doesn't hurt to start thinking about how to fix your code to make it better. My personal recommendation would be for you to revisit old projects after you've been away from them for several months and update them. It's hard to beat the experience of having to modify an existing project that doesn't adhere to good code practices. If you really want to study refactoring code in depth, though, you might look into getting your hands on the book Refactoring: Improving the Design of Existing Code, by Martin Fowler. It could prove to be a good source of reading material for you.
Regarding what inspires me to write about these subjects in general, there are a few things that come to mind.
First, I've had to maintain legacy code. As a result, I've encountered some absolutely horrible design anti-patterns. I've also written some pretty terrible code myself. In both cases, I've had to go in and make changes to the existing code, and it's a painful, frustrating experience every time. I often found myself having to stop myself from procrastinating because I just didn't want to have to touch those parts of the code base. This is what we call "technical debt", and it's measured specifically by the amount of time and effort it takes to make changes to code. I've become very well acquainted with technical debt, so it's often on my mind whenever I write code.
Second, there are a lot of things I wish I'd known about prior to getting into professional work, things that were never taught during my CS studies. I like to take the opportunity to help other programmers--both new programmers and programmers who just haven't been fortunate enough to be exposed to these ideas--by sharing them whenever I think to do so.
Third, I've just always enjoyed teaching others.
Regarding what inspired me to write about this subject in particular, however, one thing stands out. I had some code that I'd written last year for my current work. It worked perfectly fine for a while, but as time went on I realized that I kept copy-pasting the same exact code and kept adding more and more edge case checking. As my data structures became more complicated and needed to be extended, it also became more and more of a headache to make changes. It was awful. But then I finally recognized something I'd taken for granted for so long: everything I was doing could be done in the form of a recursive tree traversal, because these data structures were really just a bunch of nodes and leaves in a tree. That led to a major overhaul of that particular part of the system and has made it trivial to further extend these structures. What was once a nightmare to maintain is now so ridiculously simple that I could get hit by a bus today and whatever programmer replaced me would have no trouble whatsoever making use of it.
In short, though, it all boils down to this: I love programming and I love taking the opportunity to teach others about it :)
Feedback: Your posts like this do an amazing job covering the fundamentals and reasons behind modern software engineering practice, and if someone were to ask "what am I missing out on in bootcamps/self taught programming", I'd point them your way.
Thank you, I really appreciate that!