Programming Challenge: Merge an arbitrary number of arrays in sorted order.
It looks like it's been over a week and a half since our last coding challenge, so let's get one going. This challenge is a relatively simple one, but it's complex enough that you can take a variety of different approaches to it.
As the title suggests, write a program that accepts an arbitrary number of arrays, in whatever form or manner you see fit (if you want to, e.g., parse a potentially massive CSV file, then go nuts!), and returns a single array containing all of the elements of the input arrays in sorted order. That's it!
Bonus points for creative, efficient, or generalized solutions!
Here is my sed solution:
Inputs are comma-separated lists of numbers. On each newline the currently sorted list is output, so if you only want the final list, pipe it into `tail -n1`. Internally it uses radix sort (`@0@1@2@3@4@5@6@7@8@9@` are the marks for the buckets; it adds padding zeros in front of the numbers as needed and removes them at the end).
Edit: Edited it so now it only outputs the list at EOF.
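The sed script itself isn't reproduced here, but the buckets-and-padding idea it describes maps onto a short Python sketch of LSD radix sort (the padding and bucket handling below are my reconstruction, not the original script):

```python
def radix_sort(numbers):
    # Pad every number to a common width, mirroring the leading
    # zeros the sed script adds and later strips.
    width = len(str(max(numbers)))
    padded = [str(n).zfill(width) for n in numbers]
    # Least-significant-digit radix sort with ten buckets,
    # one per digit mark @0@ .. @9@.
    for pos in range(width - 1, -1, -1):
        buckets = [[] for _ in range(10)]
        for s in padded:
            buckets[int(s[pos])].append(s)
        padded = [s for bucket in buckets for s in bucket]
    return [int(s) for s in padded]  # strip the padding again
```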
You monster.
That is amazing.
This is pretty sexy too. 3 lines and done. 2 if you don't count the import. Nice.
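That snippet is also missing from the export; a three-line shape matching the description (one import plus two lines; the `sort_arrays` name is borrowed from elsewhere in the thread, so this is a guess, not the original) might be:

```python
from itertools import chain

def sort_arrays(*arrays):
    return sorted(chain(*arrays))
```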
This is the least efficient I can make it (Python 3), without adding irrelevant code in between:
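The original listing didn't survive the thread export; a deliberately wasteful sketch in that spirit, repeatedly scanning the whole pool for its minimum (so O(n²) on purpose), is my reconstruction rather than the author's code:

```python
def sort_arrays(*arrays):
    # Dump everything into one pool, then repeatedly scan the
    # entire pool for the smallest remaining element.
    pool = [x for arr in arrays for x in arr]
    result = []
    while pool:
        smallest = min(pool)
        pool.remove(smallest)  # another full scan, for good measure
        result.append(smallest)
    return result
```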
Just call it like this:
sort_arrays([5,7,1], [2,8,3], [9,5,7])
Now this is elegant, and mostly readable.
You sicken me. Well done!
Hmm fun.
Naive (straightforward):
It's got a time complexity of O(n log m), where n is the total number of elements in all arrays and m is the number of arrays. I'll try to come up with some more fun ones when I get home (maybe a generator? multithreaded?)
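The snippet itself is missing from the export; one way to get O(n log m) is an m-way heap merge, which assumes each input array has been sorted first. A sketch of that idea (the function name is mine):

```python
import heapq

def merge_sorted(arrays):
    # m-way merge with a heap of size m: O(n log m),
    # assuming each input array is already sorted.
    heap = [(arr[0], i, 0) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, j = heapq.heappop(heap)
        out.append(value)
        if j + 1 < len(arrays[i]):
            heapq.heappush(heap, (arrays[i][j + 1], i, j + 1))
    return out
```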
Super simple Python. Read a line, create a new list of those values as ints, throw it onto a master list, then unpack everything and sort it. The easiest thing to do is cast to ints when reading off stdin, but `sorted` supports a `key` parameter to specify how to sort the data as well.
Similar thing in Go, but only keeping a single slice of ints and appending to it each time. Not likely a performant usage of `append`.
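The Python listing is missing from the export; a minimal sketch of the described read-parse-sort flow, assuming one comma-separated array per stdin line:

```python
import sys

def main():
    master = []
    for line in sys.stdin:  # one comma-separated array per line
        if line.strip():
            master.append([int(v) for v in line.split(",")])
    # Unpack every sub-list and sort the combined result.
    print(sorted(x for arr in master for x in arr))

if __name__ == "__main__":
    main()
```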
Python 2:
That's just the function; I leave the I/O to the front-end people.
Just because of all the PHP umbrage around here lately, I wrote my solution in it. `combineAndSortArrays()` is a variadic function that accepts as many arrays as you can throw at it. It combines them into a single array and then sorts it by value.
You can actually do `$arr = array_merge(...$params);`
Good catch!
I'll use my own sorting algorithm (but I'm sure it already exists):
Theory
Let's have a method that takes an unlimited number of arrays (`params int[][]` in C#). We build some collection, probably a binary tree. For each element of each array, we insert the number into the tree (or a custom collection that behaves very much like a binary tree). Because the arrays are not ordered, we can safely choose the first element of the first array as the tree root, since it's equivalent to choosing a random element. I don't know if it's worth balancing the tree in the process. After we have all the elements, we return the tree (or convert it to an array and return that).
Time complexity should be O(N log N), where N is the total number of elements, plus O(N) if we convert the tree to an array afterwards. It would be interesting to compare this to the simplest approach: add everything to a single array and then call `.Sort()`.
Code
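The original C# listing is missing here; a rough Python sketch of the approach described above (unbalanced binary tree, first element as the root, in-order traversal at the end), with names of my own choosing:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

    def insert(self, value):
        # Walk down until a free slot is found; no balancing.
        if value < self.value:
            if self.left:
                self.left.insert(value)
            else:
                self.left = Node(value)
        else:
            if self.right:
                self.right.insert(value)
            else:
                self.right = Node(value)

    def in_order(self, out):
        # In-order traversal yields the values in sorted order.
        if self.left:
            self.left.in_order(out)
        out.append(self.value)
        if self.right:
            self.right.in_order(out)

def tree_sort(arrays):
    values = [x for arr in arrays for x in arr]
    if not values:
        return []
    root = Node(values[0])  # first element of the first array as root
    for v in values[1:]:
        root.insert(v)
    out = []
    root.in_order(out)
    return out
```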
Results
For small inputs, like 10*10, the results were surprising: the simple method averaged 0.07 ms (over 100 runs), the complex one 0.01 ms. For an input of 100*50, the simple one averaged 1.62 ms and the complex one 2.13 ms. For an input of 100*1000 (this time averaged over 10 runs), the simple one averaged 53.9 ms and the complex one 1274.2 ms.
Edit: Added syntax highlighting
Those results aren't surprising at all, actually! Let's say we assume O(n log n) for both approaches and completely disregard any additional algorithmic factors. With a simple array, the allocated memory is sequential, so your array accesses benefit from locality of reference, whereas your tree isn't constructed from sequentially allocated memory and doesn't get that same benefit. Temporal and spatial locality of reference are concepts used to optimize caching behavior, reducing the impact of a cache miss and the subsequent retrieval from slower RAM. It would absolutely have an expected performance impact :)
I was surprised about the result for 10*10. I expected the simple solution to be faster, so I was quite surprised when the complex one won. I tested it multiple times, but the complex one is always faster for 10*10. I'm not sure why, but I think the LINQ methods I use in the simple solution have some overhead, and the tree isn't that much slower with just 100 elements.
I thought the tree approach would be slower just because it has to visit log(n) nodes when inserting a new element; I didn't know about the sequentially allocated memory :)
If I created something like a linked list instead of a tree, allocated all the required nodes at once, and connected them afterwards, the slowing effect of non-sequentially allocated memory would disappear, right?
Regarding the tree being more efficient with 10*10, that could easily be due to the amortized time complexity of your inserts. That amortized cost likely makes a large difference at the beginning, but ends up being completely overshadowed by the impacts of locality of reference with larger arrays.
Regarding the linked list structure, I wouldn't expect the sequential nature of the nodes to necessarily translate into a sequential allocation of memory. It's possible, for example, that a sequence of nodes `a -> b -> c -> d -> e -> f` is allocated in memory in the order `e -> b -> d -> a -> c -> f`. In that case, in an array you would expect that accessing `e` and then `f` would be efficient, but in this example linked list you would find that accessing `e` and then `f` is inefficient, since they're so far apart in memory.
Locality of reference has nothing to do with how we structure the data representationally, and everything to do with where that data lives in actual, physical memory, on the hardware.
This is really interesting (wiki). However, it looks a bit like black magic to me. When you're writing algorithms that need to run fast, do you use these techniques? Is it really worth diving into?
I personally don't need to handle these kinds of optimizations. But yes, these are used if needed. They're only worth diving into if your applications need to squeeze out every bit of performance possible, particularly in processing-intensive programs.
It may seem like black magic, but once you know how data is structured and how it's loaded into cache, it makes perfect sense. It's all just a matter of taking advantage of the architecture you're working with.
Good on you for doing some additional research on the subject!
Now for Python:
fn = lambda x: sorted(sum(x, []))
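For example, `fn([[5, 7, 1], [2, 8, 3], [9, 5, 7]])` returns `[1, 2, 3, 5, 5, 7, 7, 8, 9]`. Worth noting: `sum(x, [])` copies the growing accumulator on every addition, so the flattening step is quadratic; terse, not fast.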
Don't have time to test this quite yet but it should work:
Passing the argument by value allows the caller to choose whether to pass an lvalue to be copied or an rvalue to be consumed.
Flattening is O(n), where n is the total number of elements in all the arrays, and sorting is O(n log n).
I decided to take a database approach to this problem using MongoDB.
Given a collection `arrays` where each document has a field `values` containing an array of numbers, the following aggregation pipeline will produce a document with the desired result contained in a final `merged` field:
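The pipeline itself is missing from the export; a pipeline matching that description, shown through pymongo (the connection details and `test` database name are assumptions, and this is a reconstruction rather than the original), could be:

```python
from pymongo import MongoClient

db = MongoClient().test  # assumed database name

result = db.arrays.aggregate([
    {"$unwind": "$values"},                        # one document per array element
    {"$sort": {"values": 1}},                      # order the elements globally
    {"$group": {"_id": None,
                "merged": {"$push": "$values"}}},  # collect back into one array
])
print(next(result)["merged"])
```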
Not sure if using Python's `itertools.chain` is cheating…
I learned that `heapq.merge` exists, and its docs describe it as similar to `sorted(itertools.chain(*iterables))` but returning an iterator, not pulling the data into memory all at once, and assuming each input stream is already sorted. This has the disadvantage that the inputs must be pre-sorted. The advantage, though, is that it does not require all inputs to fit in memory (whereas my solution above does). Nice to know that the tradeoff of space for time can be made here.
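For reference, a minimal sketch of that tradeoff, pre-sorting each array so `heapq.merge` can stream the result lazily:

```python
import heapq

arrays = [[5, 7, 1], [2, 8, 3], [9, 5, 7]]
# heapq.merge needs pre-sorted inputs, so sort each array first;
# the merge itself then streams lazily instead of holding
# everything in memory at once.
merged = list(heapq.merge(*(sorted(a) for a in arrays)))
print(merged)  # [1, 2, 3, 5, 5, 7, 7, 8, 9]
```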
Not really cheating, no. Otherwise we would have to decide where to draw the line at abstractions, which would be a large and fruitless debate not worth getting into.
Yeah, it's just a matter of whether it's part of the challenge to re-implement something like `itertools.chain`, which is probably good practice in its own right.
The magic that is JavaScript ES6... When working through this, I realized I had a similar real-world problem this week, so it's still rather fresh: get a messy array into a new array of arrays with groups of 3 objects. Fun times.
No hate here. Despite all of the flak it gets, I work primarily with PHP, and occasionally with JS.
The only thing I would hate on, if this were in a merge request, would be some minor naming issues. For a programming challenge like this, though, it's not really an important point. Also, just a heads up: the `new Set()` call would be undesired for a simple merging of array elements, as it strips the duplicates that you would expect to see. Great for getting unique values, though! :)