9 votes

Discussion: The pros and cons to different approaches to solving a problem.

Posted August 27, 2018 by Emerald_Knight

Tags: discussion, maintainability, efficiency

It's often the case that in academic and self-teaching environments, you don't really have the opportunity to grasp and fully understand situations in which a problem has multiple valid solutions and what the implications are in choosing among them. Among those considerations are two in particular: runtime efficiency and maintainability. When these subjects are discussed, the example solutions are often comical at best, or the problems themselves too complex to fully grasp the situation at hand. Sometimes the problems are also so simple as to be completely worthless, e.g. comparing bubble sort to bogo sort.

As such, I would like to take this opportunity to discuss practical but conceptually simple problems and the implications of the different solutions that are available. Conceptual simplicity is an absolute requirement because we want these problems to be accessible to a wider variety of readers. Problems don't necessarily need to be code-related (you could e.g. discuss something related to server administration). Bonus points for problems that include solutions with an efficiency/maintainability trade-off!

4 comments

[2]
Emerald_Knight (OP)
August 27, 2018
Link
For myself, I'm going to illustrate with a toy problem, specifically this Project Euler problem in which you write consecutive numbers in a square spiral starting from the center beginning at the...

For myself, I'm going to illustrate with a toy problem, specifically this Project Euler problem in which you write consecutive numbers in a square spiral starting from the center beginning at the number 1.

For those of you who want to try solving this problem on your own first, stop reading now and come back later.

To begin, let's consider which solutions exist:

Most, if not all of you, should be able to discover the simplest solution of all, which is to construct a square matrix of side length n, fill it in with the expected values, and traverse the diagonals. This is an O(n^2) solution because of the n * n square that has to be constructed and filled in.

The clever among you should notice that constructing such a matrix is completely unnecessary and that for each layer, you can add 2 * (layer_number - 1) four times to jump between the four corners of each layer. This is an O(n) solution because you're only ever visiting approximately 2 * n values in a loop, skipping all of the values in between.

The more mathematically inclined among you, however, should notice that the diagonal extending from the center to the upper-right corner is a sequence of consecutive odd squares and that the remaining three corners of each layer can be derived using that square number as the starting point, allowing you to construct a formula to compute the sum using the side length n. This is an O(1) solution because you're using the same formula and thus the same number of operations regardless of the size of n.

Now, we can clearly tell that in terms of efficiency, solution #1 is terrible, solution #2 is acceptable, and solution #3 is the holy grail. In the real world, however, there is one thing that this superficial overview doesn't account for: changing project requirements.

Imagine that this spiral square concept was constructed at the request of a client, and that now the client wants us to be able to handle different kinds of numerical sequences other than natural numbers starting at 1, such as the sequence of prime numbers. That throws a wrench right into solution #3 because although it's incredibly efficient, it's not very flexible. We have to write an entirely new mathematical proof for each and every sequence that's requested, and some sequences can't be computed with a simple formula at all.

This even throws a wrench in solution #2 because it depended on even spacing between numbers at each layer--our only hope for using it is if our sequence is defined by a continuous function!

The only reasonable solution now is to have matrix constructing logic for solution #1 that can take some form of number generator for our sequence and fill in the matrix using those generated numbers.

To summarize: Solution #1 has terrible runtime efficiency, but it's easily the simplest solution to implement and the most flexible solution available for handling sudden changes to project requirements which makes it the most maintainable. On the other hand, solution #3 is terrible for handling sudden changes to project requirements and thus the least maintainable, but can't possibly be beat when you need the most optimal solution available for the specific problem it's designed for. Solution #2 is somewhere in the middle of both of those.

With all of that being said, you should always keep one small piece of wisdom in mind: "premature optimization is the root of all evil". In other words, implement solution #1 first, and if it survives all of the inevitable changes in project requirements and it ends up bottlenecking the application, go ahead and build out solution #3. If all else fails, you can revert back to solution #1 and still have something that works at the end of the day!

As an aside, if anyone wants a proof of solution #3, I would be happy to write something up somewhere. It's a fun little problem to solve :)

5 votes
1. Emerald_Knight (OP)
  August 27, 2018
  Link Parent
  An explanation of the corner jumping, in case there's confusion: Start at 1. Add 2 four times to get 3, 5, 7, and 9 which are the corners of layer 2. Now add 4 four times to get 13, 17, 21, and 25...
  
  An explanation of the corner jumping, in case there's confusion:
  
  Start at 1. Add 2 four times to get 3, 5, 7, and 9 which are the corners of layer 2. Now add 4 four times to get 13, 17, 21, and 25 which are the corners of layer 3. Now add 6 four times to get 31, 37, 43, and 49 which are the corners of layer 4. Repeat as many times as is necessary. No matrix required!
  
  3 votes
Celeo
August 28, 2018
Link
I can't think of an example from work that'd be sufficiently interesting without having to set up a large amount of technical background first, so that leaves the programming I do at home. Of the...
I can't think of an example from work that'd be sufficiently interesting without having to set up a large amount of technical background first, so that leaves the programming I do at home. Of the projects I've been working on, some community additions to this site are probably the most interesting to others. Note that this is a post more on the design choices to a new feature rather than a computational tradeoff, but I hope it's still in line with this topic.

Background

For anyone who hasn't come to Tildes from Reddit or similar link aggregation site, reposting is when someone submits a link to an external site that someone else has already submitted previously. The user is resubmitting the post, so they're "reposting." On Reddit, there are a few mechanics to avoid this, the most common being the croud of people that will call the poster a bunch of names for doing so. On Tildes, people so far are much more forgiving of this practice. Now that we have search, we have the start of tooling that should cut down on this event. Search is still in its young stage though, and we can't realistically expect every submission to be done by a user who searched the entire site to make sure the article hasn't been posted before. Thus, it'd be nice to have an additional built-in mechanic for detecting and deterring reposting. Enter: issue 181. Note that the breakdown below partially comes from my ideas, and partially from Deimos. If you want to find out who came up with what, have a peak at that link.

Approaching the problem

Unless I'm certain that a requested change involves only a very small amount of work, it's best to start with planning. When writing code for myself, I can more often get away without a dedicated planning phase, as sometimes I'll be happy with something when I see it. When submitting code to an open-source project, it's best to have a planning phase and then pass the results of that on to the code owner to see what they say. Deimos has been wonderful in responding quickly with great ideas. Contributing to an open-source project is very different than working at a job, but this post will be just looking at this open source project.

So, to this problem: lowering the occurance of reposting. First off, from the description, we have a couple of different requests:
- When the user attempts to make a repost, they "should be warned"
- If the repost is made, it should receive a "repost" tag
- The warning should only happen if it's a recent duplicate
- No warning should be given if it was posted a month ago
- A warning should only be given if the link is being posted in the same group
Those are several good ideas, and it's important to note that it is a collection of ideas, not hard requirements. I could implement all of those requests into a single flow, but when working iteratively, it's good to start from the ground up. And, as I'm only planning right now, I have to start at the base: what happens when they click the submit button.

Options

Now onto the part that actually satisfies this wall of text being posted in this topic: pros/cons of different approaches.

First, we break down the page into its components: "what is there that I can use to implement this feature?" Ignoring the other parts, there is the link input, and the submit button. Additionally, not everyone has seen it, but there is a space for validation errors.

I could:
1. watch for the user typing/pasting in the link and check it once they have, or
2. intercept the submit button and bring up an alert popup
3. use the existing errors feature
4. let them submit, but redirect them to a page that says "that's already been posted"
Option 1 won't work because of how onchange jQuery events are sent - the user could paste their link and then click the submit button without clicking anything else and the code wouldn't be called. Additionally, I'd have to block, somehow, the submission of the form, until that callback resolved and cleared the link as not being a repost. As JavaScript is all asynchronous, this could quickly become complicated.

Option 2 is bad because alerts are bad.

Option 3 is bad because there's no way for the user to "submit anyway," ignoring the errors. It is, however, programatically the simplest. I could resolve the entire issue in less than 10 lines of code. It's a bad user experience, though.

Option 4, then, has to work. It has a number of options in itself, depending on what we want to show to the user:
1. page that has text saying that the link has been posted before, with a link to the topic(s), and a button that says "post anyway"
2. like option 1, but has a button that shows up if the user is also submitting a comment that says "post my comment there"
3. like option 1, but shows the warning about the submission being a repost and link(s) to the topic(s) above the existing form
This set of options is a better example of actually looking into pros and cons of each, as the list above was simpler to break down into what's the best.

Option 1 is the simplest for the user and probably the easiest to code. It's also, however, the most restrictive.

Option 2 is an intersting one. At first glance, it can seem to be better than option 1. After all, it has everything that option 1 does, plus an option to "bring along" the comment. I don't have any data to back this up (I could get the data by crawling the site or by asking for it), but option 2 is only good if the user is also submitting a comment. If most users are submitting links as-is, without a comment, then the additional work to make option 2 work is moot. Additionally, in the use case of there being multiple instances of the link having been posted before, to which topic should the user's comment be posted? And finally, would there comment actually work on that topic. This option has a lot of assumptions and a very small use case.

Option 3 has several tradeoffs that can be identified. First, it'll be more visually complicated than option 1, as there's more on the screen. The potential for users being confused by what they're looking at will be (I hope) reduced by the fact that most of what they're seeing is the screen they were just at, just with some more text and a few links at the top. This option gives the user the ability to submit anyway (option 1 couldn't do this), and allows the user to optionally copy the comment they wanted to make so they can post it on the topic (like option 2, but without making assumptions).Done right, this'll likely be the

Plan done

Thus, the plan is to allow the user to submit the form, redirect them to a page that is that form, but with some text at the top that says the link has already been posted, with a link to the topic(s). The user will have the option to "submit anyway," or they can just go to one of the linked topics and comment there. This solves the originally problem statement that users don't know when they're submitting reposted links, and does so in a manner that gives the most options to the user.

Each option had pros and cons. Some had enough cons to be eliminited immediately, which is always nice, but some really had to be deliberated on (both internally and with Deimos, thanks!).
4 votes
Gyrfalcon
August 28, 2018 (edited August 29, 2018)
Link
This is more implementation oriented, and certainly a language quirk thing, but I think it might belong here. I spent much of my summer working on image processing in MATLAB. Now having done only...
This is more implementation oriented, and certainly a language quirk thing, but I think it might belong here.

I spent much of my summer working on image processing in MATLAB. Now having done only a bit of programming in MATLAB prior to this for homework assignments, I was mostly approaching things from C/Python perspective. Say for a test set of 100, 300x300 images, with 10 points of data at each pixel, I might have written something like this:
```
for i = 1:100
  for j = 1:300
    for k = 1:300
      maxSet(i,j,k) = max(testSet(i,j,k,:));
    endfor
  endfor
endfor
```
At first glance, this seems like a perfectly reasonable way of doing things, and would be pretty simple to understand or port to another language. And you might think you would have to port it to another language, because when I tested that snippet it took 154.3 seconds to run. I'm testing in Octave at the moment but if I was in MATLAB I could maybe use a parfor loop to get a ~4x speed improvement, at the cost of memory from having several copies of all the arrays.

However, it's possible to take advantage of how MATLAB processes data to get a speed improvement of ~580x, for no cost in memory. Originally I was going to show off the wonders of reshape to get the 4d data into a 2d array and then process it, but I realized while writing that even that isn't necessary. MATLAB processing is centered around vectorization, so doing an operation on large vectors and matrices is automagically translated into well optimized, multithreaded code. Because of that, I can replace the code above with:
```
maxSet = max(testSet, [], 4);
```
This snippet gives me identical results to the previous one, but it runs in ~0.27 seconds. This is a huge performance gain. The tradeoff is that your code can be confusing to someone who hasn't done much MATLAB before. Also, if you want to port your code, you need figure out what the implicit logic is before you can craft the appropriate loop(s).

If you'd like to download Octave and test this for yourself, the code I used is below:
```
# Prep workspace
clear, clc, close all;

testSet = rand(100, 300, 300, 10);
maxSet = zeros(100, 300, 300);

# Looping implemenation
tic
for i = 1:100
  for j = 1:300
    for k = 1:300
      maxSet(i,j,k) = max(testSet(i,j,k,:));
    endfor
  endfor
endfor
toc

# Vectorized implemenation
tic
maxSet2 = max(testSet, [], 4);
toc

# Accuracy check
sum(sum(sum(maxSet == maxSet2))) == numel(maxSet)
```
1 vote