
Code Quality Tip: The importance of understanding correctness vs. accuracy.
- programming.code quality tips
Text 1740 words

Preface

It's not uncommon for a written piece of code to be both brief and functionally correct, yet difficult to reason about. This is especially true of recursive algorithms, which can require some amount of simulating the algorithm mentally (or on a whiteboard) on smaller problems to try to understand the underlying logic. The more you have to perform these manual simulations, the more difficult it becomes to track what exactly is going on at any stage of computation. It's also not uncommon that these algorithms can be made easier to reason about with relatively small changes, particularly in the way you conceptualize the solution to the problem. Our goal will be to take a brief tour into what these changes might look like and why they are effective at reducing our mental overhead.

Background

We will consider the case of the subset sum problem, which is essentially a special case of the knapsack problem where you have a finite number of each item and each item's value is equal to its weight. In short, the problem is summarized as one of the following:
- Given a set of numbers, is there a subset whose sum is exactly equal to some target value?
- Given a set of numbers, what is the subset whose sum is the closest to some target value without exceeding it?
For example, given the set of numbers {1, 3, 3, 5} and a target value of 9, the answer for both of those questions is {1, 3, 5} because the sum of those numbers is 9. For a target value of 10, however, the first question has no solution because no combination of numbers in the set {1, 3, 3, 5} produces a total of 10, but the second question produces a solution of {1, 3, 5} because 9 is the closest value to 10 that those numbers can produce without going over.
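
As a rough sketch of the second question, the naive approach is to try every combination and keep the best one found. The function name `closestSubset` is made up for this illustration, and it assumes all values are positive integers:
```php
<?php

// Hypothetical helper (not from the original post): returns the subset of
// $values whose sum is as close to $target_sum as possible without
// exceeding it. Assumes every value is a positive integer.
function closestSubset($target_sum, $values) {
    $best = array();

    foreach($values as $index=>$value) {
        // An element larger than the remaining target can't be used.
        if($value > $target_sum) {
            continue;
        }

        // Best subset that starts with this element and only uses later ones.
        $candidate = array_merge(
            array($value),
            closestSubset($target_sum - $value, array_slice($values, $index + 1))
        );

        // Keep the candidate only if it gets strictly closer to the target.
        if(array_sum($candidate) > array_sum($best)) {
            $best = $candidate;
        }
    }

    return $best;
}
```
With the values 1, 3, 3, 5, a target of 9 and a target of 10 both come back as the array 1, 3, 5, matching the example above.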

A Greedy Example

We'll stick to the much simpler case of finding an exact match to our target value so we don't have to track what the highest value found so far is. To make things even simpler, we'll consider the case where all numbers are positive, non-zero integers. This problem can be solved with some naive recursion--simply try all combinations until either a solution is found or all combinations have been exhausted. While more efficient solutions exist, naive recursion is the easiest to conceptualize.

An initial assessment of the problem seems simple enough. Our solution is defined as the set of array elements whose total is equal to our target value. To achieve this, we loop through each of the elements in the array, try combinations with all of the remaining elements, and keep track of what the current total is so we can compare it to our target. If we find an exact match, we return an array containing the matching elements, otherwise we return nothing. This gives us something like the following:
```
function subsetSum($target_sum, $values, $total = 0) {
    // Base case: a total exceeding our target sum is a failure.
    if($total > $target_sum) {
        return null;
    }

    // Base case: a total matching our target sum means we've found a match.
    if($total == $target_sum) {
        return array();
    }

    foreach($values as $index=>$value) {
        // Recursive case: try combining the current array element with the remaining elements.
        $result = subsetSum($target_sum, array_slice($values, $index + 1), $total + $value);

        if(!is_null($result)) {
            return array_merge(array($value), $result);
        }
    }

    return null;
}
```
Your Scope is Leaking

This solution works. It's functionally correct and will produce a valid result every single time. From a purely functional perspective, nothing is wrong with it at all; however, it's not easy to follow what's going on despite how short the code is. If we look closely, we can tell that there are a few major problems:
- It's not obvious at first glance whether or not the programmer is expected to provide the third argument. While a default value is provided, it's not clear if this value is only a default that should be overridden or if the value should be left untouched. This ambiguity means relying on documentation to explain the intention of the third argument, which may still be ignored by an inattentive developer.
- The base case where a failure occurs, i.e. when the accumulated total exceeds the target sum, occurs one stack frame further into the recursion than when the total has been incremented. This forces us to consider not only the current iteration of recursion, but one additional iteration deeper in order to track the flow of execution. Ideally an iteration of recursion should be conceptually isolated from any other, limiting our mental scope to only the current iteration.
- We're propagating an accumulating total that starts from 0 and increments toward our target value, forcing us to track two different values simultaneously. Ideally we would only track one value if possible. If we can manage that, then the ambiguity of the third argument will be eliminated along with the argument itself.
Overall, the amount of code that the programmer needs to look at and the amount of branching they need to follow manually is excessive. The function is only 22 lines long, including whitespace and comments, and yet the amount of effort it takes to ensure you're understanding the flow of execution correctly is pretty significant. This is a pretty good indicator that we probably did something wrong. Something so simple and short shouldn't take so much effort to understand.
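
To make the first problem concrete, here is a small sketch of what happens when a caller does supply the third argument. The function body is the one from above with the comments trimmed; it is renamed `subsetSumWithTotal` purely for this illustration:
```php
<?php

// Same logic as the first version, renamed for this sketch only.
function subsetSumWithTotal($target_sum, $values, $total = 0) {
    if($total > $target_sum) {
        return null;
    }

    if($total == $target_sum) {
        return array();
    }

    foreach($values as $index=>$value) {
        $result = subsetSumWithTotal($target_sum, array_slice($values, $index + 1), $total + $value);

        if(!is_null($result)) {
            return array_merge(array($value), $result);
        }
    }

    return null;
}

// Intended use: the third argument is left at its default.
$intended = subsetSumWithTotal(9, array(1, 3, 3, 5));

// Accidental use: the caller passes a starting total of 4.
$accidental = subsetSumWithTotal(9, array(1, 3, 3, 5), 4);
```
The intended call returns the elements 1, 3, 5. The accidental call returns only the element 5, which sums to 5 rather than 9, and nothing in the signature warns the caller that this is wrong.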

Patching the Leak

Now that we've assessed the problems, we can see that our original solution isn't going to cut it. We have a couple of ways we could approach fixing our function: we can either attempt to translate the abstract problems into tangible solutions or we can modify the way we've conceptualized the solution. With that in mind, let's take a second crack at this problem by trying the latter.

We've tried taking a look at this problem from a top-down perspective: "given a target value, are there any elements that produce a sum exactly equal to it?" Clearly this perspective failed us. Instead, let's try flipping the equation: "given an array element, can it be summed with others to produce the target value?"

This fundamentally changes the way we can think about the problem. Previously we were hung up on the idea of keeping track of the current total sum of the elements we've encountered so far, but that approach is incompatible with the way we're thinking of this problem now. Rather than incrementing a total, we now find ourselves having to do something entirely different: if we want to know if a given array element is part of the solution, we need to first subtract the element from the problem and find out if the smaller problem has a solution. That is, to find out whether the element 3 is part of the solution for a target sum of 8, we're really asking whether 3 + solutionFor(5) is valid.

The new solution therefore involves looping over our array elements just as before, but this time we check if there is a solution for the target sum minus the current array element:
```
function subsetSum($target_sum, $values) {
    // Base case: the solution to the target sum of 0 is the empty set.
    if($target_sum === 0) {
        return array();
    }

    foreach($values as $index=>$value) {
        // Base case: any element larger than our target sum cannot be part of the solution.
        if($value > $target_sum) {
            continue;
        }

        // Recursive case: do the remaining elements create a solution for the sub-problem?
        $result = subsetSum($target_sum - $value, array_slice($values, $index + 1));

        if(!is_null($result)) {
            return array_merge(array($value), $result);
        }
    }

    return null;
}
```
A Brief Review

With the changes now in place, let's compare our two functions and, more importantly, compare our new function to the problems we assessed with the original. A few brief points:
- Both functions are the same exact length, being only 22 lines long with the same number of comments and an identical amount of whitespace.
- Both functions touch the same number of elements and produce the same output given the same input. Apart from a change in execution order of a base case, functionality is nearly identical.
- The new function no longer requires thinking about the scope of the next iteration of recursion to determine whether or not an array element is included in the result set. The base case for exceeding the target sum now occurs prior to recursion, keeping the scope of the value comparison nearest where those values are defined.
- The new function no longer uses a third accumulator argument, reducing the number of values to be tracked and removing the issue of ambiguity with whether or not to include the third argument in top-level calls.
- The new function is now defined in terms of finding the solutions to increasingly smaller target sums, making it easier to determine functional correctness.
Considering all of the above, we can confidently state that the second function is easier to follow, easier to verify functional correctness for, and less confusing for anyone who needs to use it. Although the two functions are nearly identical, the second version is clearly and objectively better than the original. This is because despite both being functionally correct, the first function does a poor job at accurately defining the problem it's solving while the second function is clear and accurate in its definition.
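
If you'd rather check the "same output given the same input" point than take it on faith, a quick comparison could look like the following sketch. Both functions are copied from above with comments trimmed, and the names `subsetSumAccumulating` and `subsetSumSubtracting` are invented here so the two can coexist in one file:
```php
<?php

// First version: carries a running total toward the target.
function subsetSumAccumulating($target_sum, $values, $total = 0) {
    if($total > $target_sum) {
        return null;
    }
    if($total == $target_sum) {
        return array();
    }
    foreach($values as $index=>$value) {
        $result = subsetSumAccumulating($target_sum, array_slice($values, $index + 1), $total + $value);
        if(!is_null($result)) {
            return array_merge(array($value), $result);
        }
    }
    return null;
}

// Second version: shrinks the target instead of growing a total.
function subsetSumSubtracting($target_sum, $values) {
    if($target_sum === 0) {
        return array();
    }
    foreach($values as $index=>$value) {
        if($value > $target_sum) {
            continue;
        }
        $result = subsetSumSubtracting($target_sum - $value, array_slice($values, $index + 1));
        if(!is_null($result)) {
            return array_merge(array($value), $result);
        }
    }
    return null;
}

// Compare the two on every target from 0 up to just past the total of all values.
$values = array(1, 3, 3, 5);
$mismatches = array();
for($target = 0; $target <= 13; $target++) {
    if(subsetSumAccumulating($target, $values) !== subsetSumSubtracting($target, $values)) {
        $mismatches[] = $target;
    }
}
```
For this input `$mismatches` stays empty. That is only a spot check on one set of positive integers, not a proof, but it's a cheap way to gain confidence after a rewrite like this.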

Correct code isn't necessarily accurate code. Anyone can write code that works, but writing code that accurately defines a problem can mean the difference between understanding what you're looking at, and being completely bewildered at how, or even why, your code works in the first place.

Final Thoughts

Accurately defining a problem in code isn't easy. Sometimes you'll get it right, but more often than not you'll get it wrong on the first go, and it's only after you've had some distance from your original solution that you realize that you should've done things differently. Despite that, understanding the difference between functional correctness and accuracy gives you the opportunity to watch for obvious inaccuracies and keep them to a minimum.

In the end, even functionally correct, inaccurate code is worth more than no code at all. No amount of theory is a replacement for practical experience. The only way to get better is to mess up, assess why you messed up, and make things just a little bit better the next time around. Theory just makes that a little easier.
0 comments

Emerald_Knight

June 28, 2019

17 votes

Code Quality Tip: Cyclomatic complexity in depth.

programming.code quality tips

Text 1314 words


Preface

Recently I briefly touched on the subject of cyclomatic complexity. This is an important concept for any programmer to understand and think about as they write their code. In order to provide a more solid understanding of the subject, however, I feel that I need to address the topic more thoroughly with a more practical example.

What is cyclomatic complexity?

The concept of "cyclomatic complexity" is simple: the more conditional branching and looping in your code, the more complex--and therefore the more difficult to maintain--that code is. We can visualize this complexity by drawing a diagram that illustrates the flow of logic in our program. For example, let's take the following toy example of a user login attempt:

<?php

$login_data = getLoginCredentialsFromInput();

$login_succeeded = false;
$error = '';
if(usernameExists($login_data['username'])) {
    $user = getUser($login_data['username']);
    
    if(!isDeleted($user)) {
        if(!isBanned($user)) {
            if(!loginRateLimitReached($user)) {
                if(passwordMatches($user, $login_data['password'])) {
                    loginUser($user);
                    $login_succeeded = true;
                } else {
                    $error = getBadPasswordError();
                    logBadLoginAttempt($user);
                }
            } else {
                $error = getLoginRateLimitError($user);
            }
        } else {
            $error = getUserBannedError($user);
        }
    } else {
        $error = getUserDeletedError($user);
    }
} else {
    $error = getBadUsernameError($login_data['username']);
}

if($login_succeeded) {
    sendSuccessResponse();
} else {
    sendErrorResponse($error);
}

?>

A diagram for this logic might look something like this:

+-----------------+
|                 |
|  Program Start  |
|                 |
+--------+--------+
         |
         |
         v
+--------+--------+    +-----------------+
|                 |    |                 |
|    Username     +--->+    Set Error    +--+
|    Exists?      | No |                 |  |
|                 |    +-----------------+  |
+--------+--------+                         |
         |                                  |
     Yes |                                  |
         v                                  |
+--------+--------+    +-----------------+  |
|                 |    |                 |  |
|  User Deleted?  +--->+    Set Error    +->+
|                 | Yes|                 |  |
+--------+--------+    +-----------------+  |
         |                                  |
      No |                                  |
         v                                  |
+--------+--------+    +-----------------+  |
|                 |    |                 |  |
|  User Banned?   +--->+    Set Error    +->+
|                 | Yes|                 |  |
+--------+--------+    +-----------------+  |
         |                                  |
      No |                                  |
         v                                  |
+--------+--------+    +-----------------+  |
|                 |    |                 |  |
|   Login Rate    +--->+    Set Error    +->+
| Limit Reached?  | Yes|                 |  |
|                 |    +-----------------+  |
+--------+--------+                         |
         |                                  |
      No |                                  |
         v                                  |
+--------+--------+    +-----------------+  |
|                 |    |                 |  |
|Password Matches?+--->+    Set Error    +->+
|                 | No |                 |  |
+--------+--------+    +-----------------+  |
         |                                  |
     Yes |                                  |
         v                                  |
+--------+--------+    +----------+         |
|                 |    |          |         |
|   Login User    +--->+ Converge +<--------+
|                 |    |          |
+-----------------+    +---+------+
                           |
                           |
         +-----------------+
         |
         v
+--------+--------+
|                 |
|   Succeeded?    +-------------+
|                 | No          |
+--------+--------+             |
         |                      |
     Yes |                      |
         v                      v
+--------+--------+    +--------+--------+
|                 |    |                 |
|  Send Success   |    |   Send Error    |
|    Message      |    |    Message      |
|                 |    |                 |
+-----------------+    +-----------------+

It's important to note that between nodes in this directed graph, you can find certain enclosed regions being formed. Specifically, each conditional branch that converges back into the main line of execution generates an additional region. The number of these distinct enclosed regions rises and falls with the cyclomatic complexity of the system (for a planar flow graph with a single exit point, the formal metric is the number of enclosed regions plus one)--that is, more regions means more complicated code.

Clocking out early.

There's an important piece of information I noted when describing the above example:

. . . each conditional branch that converges back into the main line of execution generates an additional region.

The above example is made complex largely due to an attempt to create a single exit point at the end of the program logic, causing these conditional branches to converge and thus generate the additional enclosed regions within our diagram.

But what if we stopped trying to converge back into the main line of execution? What if, instead, we decided to interrupt the program execution as soon as we encountered an error? Our code might look something like this:

<?php

$login_data = getLoginCredentialsFromInput();

if(!usernameExists($login_data['username'])) {
    sendErrorResponse(getBadUsernameError($login_data['username']));
    return;
}

$user = getUser($login_data['username']);
if(isDeleted($user)) {
    sendErrorResponse(getUserDeletedError($user));
    return;
}

if(isBanned($user)) {
    sendErrorResponse(getUserBannedError($user));
    return;
}

if(loginRateLimitReached($user)) {
    logBadLoginAttempt($user);
    sendErrorResponse(getLoginRateLimitError($user));
    return;
}

if(!passwordMatches($user, $login_data['password'])) {
    logBadLoginAttempt($user);
    sendErrorResponse(getBadPasswordError());
    return;
}

loginUser($user);
sendSuccessResponse();

?>

Before we've even constructed a diagram for this logic, we can already see just how much simpler this logic is. We don't need to traverse a tree of if statements to determine which error message has priority to be sent out, we don't need to attempt to follow indentation levels, and our behavior on success is right at the very end and at the lowest level of indentation, where it's easily and obviously located at a glance.

Now, however, let's verify this reduction in complexity by examining the associated diagram:

+-----------------+
|                 |
|  Program Start  |
|                 |
+--------+--------+
         |
         |
         v
+--------+--------+    +-----------------+
|                 |    |                 |
|    Username     +--->+   Send Error    |
|    Exists?      | No |    Message      |
|                 |    |                 |
+--------+--------+    +-----------------+
         |
     Yes |
         v
+--------+--------+    +-----------------+
|                 |    |                 |
|  User Deleted?  +--->+   Send Error    |
|                 | Yes|    Message      |
+--------+--------+    |                 |
         |             +-----------------+
      No |
         v
+--------+--------+    +-----------------+
|                 |    |                 |
|  User Banned?   +--->+   Send Error    |
|                 | Yes|    Message      |
+--------+--------+    |                 |
         |             +-----------------+
      No |
         v
+--------+--------+    +-----------------+
|                 |    |                 |
|   Login Rate    +--->+   Send Error    |
| Limit Reached?  | Yes|    Message      |
|                 |    |                 |
+--------+--------+    +-----------------+
         |
      No |
         v
+--------+--------+    +-----------------+
|                 |    |                 |
|Password Matches?+--->+   Send Error    |
|                 | No |    Message      |
+--------+--------+    |                 |
         |             +-----------------+
     Yes |
         v
+--------+--------+
|                 |
|   Login User    |
|                 |
+--------+--------+
         |
         |
         v
+--------+--------+
|                 |
|  Send Success   |
|    Message      |
|                 |
+-----------------+

Something should immediately stand out here: there are no enclosed regions in this diagram! Furthermore, even our new diagram is much simpler to follow than the old one was.

Reality is rarely simple.

The above is a really forgiving example. It has no loops, and loops are going to create enclosed regions that can't be broken apart so easily; it has no conditional branches that are so tightly coupled with the main path of execution that they can't be broken up; and the scope of functionality and side effects are minimal. Sometimes you can't break those regions up. So what do we do when we inevitably encounter these cases?

High cyclomatic complexity in your program as a whole is inevitable for sufficiently large projects, especially in a production environment, and your efforts to reduce it can only go so far. In fact, I don't recommend trying to remove all or even most instances of cyclomatic complexity at all--instead, you should just be keeping the concept in mind to determine whether or not a function, method, class, module, or other component of your system is accumulating technical debt and therefore in need of refactoring.

At this point, astute readers might ask, "How does refactoring help if the cyclomatic complexity doesn't actually go away?", and this is a valid concern. The answer to that is simple, however: we're hiding complexity behind abstractions.

To test this, let's forget about cyclomatic complexity for a moment and instead focus on simplifying the refactored version of our toy example using abstraction:

<?php

function handleLoginAttempt($login_data) {
    if(!usernameExists($login_data['username'])) {
        sendErrorResponse(getBadUsernameError($login_data['username']));
        return;
    }

    $user = getUser($login_data['username']);
    if(isDeleted($user)) {
        sendErrorResponse(getUserDeletedError($user));
        return;
    }

    if(isBanned($user)) {
        sendErrorResponse(getUserBannedError($user));
        return;
    }

    if(loginRateLimitReached($user)) {
        logBadLoginAttempt($user);
        sendErrorResponse(getLoginRateLimitError($user));
        return;
    }

    if(!passwordMatches($user, $login_data['password'])) {
        logBadLoginAttempt($user);
        sendErrorResponse(getBadPasswordError());
        return;
    }

    loginUser($user);
    sendSuccessResponse();
}

$login_data = getLoginCredentialsFromInput();

handleLoginAttempt($login_data);

?>

The code above is functionally identical to our refactored example from earlier, but has an additional abstraction via a function. Now we can diagram this higher-level abstraction as follows:

+-----------------+
|                 |
|  Program Start  |
|                 |
+--------+--------+
         |
         |
         v
+--------+--------+
|                 |
|  Attempt Login  |
|                 |
+-----------------+

This is, of course, a pretty extreme example, but this is how we handle thinking about complex program logic. We abstract it down to the barest basics so that we can visualize, in its simplest form, what the program is supposed to do. We don't actually care about the implementation unless we're digging into that specific part of the system, because otherwise we would be so bogged down by the details that we wouldn't be able to reason about what our program is supposed to do.

Likewise, we can use these abstractions to hide away the cyclomatic complexity underlying different components of our software. This keeps everything clean and clutter-free in our head. And the more we do to keep our smaller components simple and easy to think about, the easier the larger components are to deal with, no matter how much cyclomatic complexity all of those components share as a collective.
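
As a rough sketch of pushing that idea one level further, the checks themselves can be pulled into their own function so that the handler reads as two steps. Everything here is a stand-in invented for illustration: users are plain arrays keyed by username, the limit of 5 failed attempts is arbitrary, and the error strings are placeholders for the helper functions used earlier.

```php
<?php

// Hypothetical helper: returns the first applicable error message, or null
// if the login may proceed. All of the branching lives here.
function findLoginError($user, $password) {
    if(is_null($user)) {
        return 'Unknown username.';
    }
    if($user['deleted']) {
        return 'This account has been deleted.';
    }
    if($user['banned']) {
        return 'This account has been banned.';
    }
    if($user['failed_attempts'] >= 5) {
        return 'Too many login attempts.';
    }
    if(!password_verify($password, $user['password_hash'])) {
        return 'Incorrect password.';
    }

    return null;
}

// The handler only sees two outcomes: there is an error, or there isn't.
function handleLoginAttempt($users, $login_data) {
    $username = $login_data['username'];
    $user = isset($users[$username]) ? $users[$username] : null;

    $error = findLoginError($user, $login_data['password']);

    return is_null($error) ? 'success' : 'error: ' . $error;
}

// Hypothetical in-memory user store for the sketch.
$users = array(
    'alice' => array(
        'deleted' => false,
        'banned' => false,
        'failed_attempts' => 0,
        'password_hash' => password_hash('correct horse', PASSWORD_DEFAULT),
    ),
);
```

The total amount of branching hasn't changed, but someone reading `handleLoginAttempt` no longer has to look at any of it unless they specifically care about why a login was rejected.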

Final Thoughts

Cyclomatic complexity isn't a bad thing to have in your code. The concept itself is only intended to be used as one of many tools to assess when your code is accumulating too much technical debt. It's a warning sign that you may need to change something, nothing more. But it's an incredibly useful tool to have available to you and you should get comfortable using it.

As a general rule of thumb, you can usually just take a glance at your code and assess whether or not there's too much cyclomatic complexity in a component by looking for either of the following:

- Too many loops and/or conditional statements nested within each other, i.e. you have a lot of indentation.
- Many loops in the same function/method.

It's not a perfect rule of thumb, but it covers the large majority of day-to-day development needs. There will also inevitably be cases where you will prefer to accept some greater cyclomatic complexity because there is some benefit that makes it a better trade-off. Making that judgment is up to you as a developer.
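
If you want something slightly more repeatable than a glance, the "lot of indentation" check can be approximated in a few lines. This is only a crude sketch: `maxIndentDepth` is a made-up helper, it assumes a consistent indent width of 4 spaces (or one tab per level), and it measures nesting depth rather than cyclomatic complexity itself.

```php
<?php

// Hypothetical helper: reports the deepest indentation level found in a
// string of source code, assuming a fixed indent width.
function maxIndentDepth($source, $indent_width = 4) {
    $max_depth = 0;

    foreach(preg_split('/\R/', $source) as $line) {
        // Blank lines say nothing about nesting.
        if(trim($line) === '') {
            continue;
        }

        // Treat each tab as one full indent, then count leading spaces.
        $expanded = str_replace("\t", str_repeat(' ', $indent_width), $line);
        $leading_spaces = strlen($expanded) - strlen(ltrim($expanded, ' '));

        $max_depth = max($max_depth, intdiv($leading_spaces, $indent_width));
    }

    return $max_depth;
}

// A nested snippet and an early-return snippet to compare.
$nested = "if(a) {\n    if(b) {\n        if(c) {\n            run();\n        }\n    }\n}";
$flat = "if(!a) {\n    return;\n}\nrun();";

$nested_depth = maxIndentDepth($nested);
$flat_depth = maxIndentDepth($flat);
```

Here the nested snippet reports a depth of 3 and the early-return snippet a depth of 1. What counts as "too deep" is still a judgment call for your own code base.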

As always, I'm more than willing to listen to feedback and answer any questions!

25 votes

Staticcheck 2019.1 released: a static analysis tool for Go programs

Article 937 words

0 comments

staticcheck.io

January 2, 2019

4 votes
Code Quality Tip: Wrapping external libraries.
- programming.code quality tips
Text 1053 words

Preface

Occasionally I feel the need to touch on the subject of code quality, particularly because of the importance of its impact on technical debt, especially as I continue to encounter the effects of technical debt in my own work and do my best to manage it. It's a subject that is unfortunately not emphasized nearly enough in academia.

Background

As a refresher, technical debt is the long-term cost of the design decisions in your code. These costs can manifest in different ways, such as greater difficulty in understanding what your code is doing or making non-breaking changes to it. More generally, these costs manifest as additional time and resources being spent to make some kind of change.

Sometimes these costs aren't things you think to consider. One such consideration is how difficult it might be to upgrade a specific technology in your stack. For example, what if you've built a back-end system that integrates with AWS and you suddenly need to upgrade your SDK? In a small project this might be easy, but what if you've built a system that you've been maintaining for years and it relies heavily on AWS integrations? If the method names, namespaces, argument orders, or anything else has changed between versions, then suddenly you'll need to update every single reference to an AWS-related tool in your code to reflect those changes. In larger software projects, this could be a daunting and incredibly expensive task, spanning potentially weeks or even months of work and testing.

That is, unless you keep those references to a minimum.

A Toy Example

This is where "wrapping" your external libraries comes into play. The concept of "wrapping" basically means to create some other function or object that takes care of operating the functions or object methods that you really want to target. One example might look like this:
```
<?php

class ImportedClass {
    public function methodThatMightBecomeModified($arg1, $arg2) {
        // Do something.
    }
}

class ImportedClassWrapper {
    private $class_instance = null;

    private function getInstance() {
        if(is_null($this->class_instance)) {
            $this->class_instance = new ImportedClass();
        }

        return $this->class_instance;
    }

    public function wrappedMethod($arg1, $arg2) {
        return $this->getInstance()->methodThatMightBecomeModified($arg1, $arg2);
    }
}

?>
```
Updating Tools Doesn't Have to Suck

Imagine that our ImportedClass has some important new features that we need to make use of that are only available in the most recent version, and we're several versions behind. The problem, of course, is that there were a lot of changes that ended up being made between our current version and the new version. For example, ImportedClass is now called NewImportedClass. On top of that, methodThatMightBecomeModified is now called methodThatWasModified, and the argument order ended up getting switched around!

Now imagine that we were directly calling new ImportedClass() in many different places in our code, as well as directly invoking methodThatMightBecomeModified:
```
<?php

$imported_class_instance = new ImportedClass();
$imported_class_instance->methodThatMightBecomeModified($val1, $val2);

?>
```
For every single instance in our code, we need to perform a replacement. The cost of making these replacements is linear or--in terms of Big-O notation--O(n) in the number of call sites. If we assume that we only ever used this one method, and we used it 100 times, then there are 100 instances of new ImportedClass() to update and another 100 instances of the method invocation, equaling 200 lines of code to change. Furthermore, we need to remember each of the replacements that need to be made and carefully avoid making any errors in the process. This is clearly non-ideal.

Now imagine that we chose instead to use the wrapper object:
```
<?php

$imported_class_wrapper = new ImportedClassWrapper();
$imported_class_wrapper->wrappedMethod($val1, $val2);

?>
```
Our updates are now limited only to the wrapper class:
```
<?php

class ImportedClassWrapper {
    private $class_instance = null;

    private function getInstance() {
        if(is_null($this->class_instance)) {
            $this->class_instance = new NewImportedClass();
        }

        return $this->class_instance;
    }

    public function wrappedMethod($arg1, $arg2) {
        return $this->getInstance()->methodThatWasModified($arg2, $arg1);
    }
}

?>
```
Rather than making changes to 200 lines of code, we've now made changes to only 2. What was once an O(n) complexity change has now turned into an O(1) complexity change to make this upgrade. Not bad for a few extra lines of code!

A Practical Example

Toy problems are all well and good, but how does this translate to reality?

Well, I ran into such a problem myself once. Running MongoDB with PHP requires the use of an external driver, and this driver provides an object representing a MongoDB ObjectId. I needed to perform a migration from one hosting provider over to a new cloud hosting provider, with the application and database services, which were originally hosted on the same physical machine, hosted on separate servers. For security reasons, this required an upgrade to a newer version of MongoDB, which in turn required an upgrade to a newer version of the driver.

This upgrade resulted in many of the calls to new MongoId() failing, because the old version of the driver would accept empty strings and other invalid ID strings and default to generating a new ObjectId, whereas the new version of the driver treated invalid ID strings as failing errors. And there were many, many cases where invalid strings were being passed into the constructor.

Even after spending hours replacing the (literally) several dozen instances of the constructor calls, there were still some places in the code where invalid strings managed to get passed in. This made for a very costly upgrade.

The bugs were easy to fix after the initial replacements, though. After wrapping new MongoId() inside of a wrapper function, a few additional conditional statements inside of the new function resolved the bugs without having to dig around the rest of the code base.
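
A minimal sketch of what such a wrapper function might look like is below. It is not the code from that project: `StrictId` is a hypothetical stand-in for a driver class that rejects invalid ID strings (it is not the real driver API), `createId` is an invented name, and the 24-character hex check is an assumption about what counts as a valid ID string.

```php
<?php

// Hypothetical stand-in for a strict driver class. Requires PHP 7+ for
// random_bytes().
class StrictId {
    private $id;

    public function __construct($id = null) {
        if(is_null($id)) {
            // Generate 12 random bytes, rendered as 24 hex characters.
            $id = bin2hex(random_bytes(12));
        } elseif(!is_string($id) || !preg_match('/^[0-9a-f]{24}$/i', $id)) {
            throw new InvalidArgumentException('Invalid ID string.');
        }

        $this->id = strtolower($id);
    }

    public function __toString() {
        return $this->id;
    }
}

// Wrapper: the one place where the old lenient behavior is restored.
// Anything that isn't a valid ID string falls back to a freshly generated ID.
function createId($id = null) {
    if(!is_string($id) || !preg_match('/^[0-9a-f]{24}$/i', $id)) {
        return new StrictId();
    }

    return new StrictId($id);
}
```

With every call site going through `createId()`, a later change in how the underlying class treats bad input only needs to be handled inside this one function.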

Final Thoughts

This is one of those lessons that you don't fully appreciate until you've experienced the technical debt of an unwrapped external library first-hand. Code quality is an active effort, but a worthwhile one. It requires you to be willing to throw away potentially hours or even days of work when you realize that something needs to change, because you're thinking about how to keep yourself from banging your head against a wall later down the line instead of thinking only about how to finish up your current task.

"Work smarter, not harder" means putting in some hard work upfront to keep your technical debt under control.

That's all for now, and remember: don't be fools, wrap your external tools.
6 comments

Emerald_Knight

November 15, 2018

23 votes