24 votes

Programming Challenge: Merge an arbitrary number of arrays in sorted order.

It looks like it's been over a week and a half since our last coding challenge, so let's get one going. This challenge is a relatively simple one, but it's complex enough that you can take a variety of different approaches to it.

As the title suggests, write a program that accepts an arbitrary number of arrays, in whatever form or manner you see fit (if you want to e.g. parse a potentially massive CSV file, then go nuts!), and returns a single array containing all of the elements of the other arrays in sorted order. That's it!

Bonus points for creative, efficient, or generalized solutions!

26 comments

  1. [2]
    mironimous
    (edited )
    Link

    Here is my sed solution:

    :a
    s/[^0-9,]//g
    s/(,|$)+/%,/g
    :b
    s/$/@0@1@2@3@4@5@6@7@8@9@/
    :c
    s/^([0-9]*)([0-9])%([0-9]*,)(.*@\2[^@]*)@/\4\1%\2\3@/
    tc
    s/@(.|$)//g
    /(^|,)[^%]/{s/(,|^)%/\10%/g;
    bb}
    s/%0*([^,])|,$/\1/g
    s/^,//
    N
    s/\n/,/
    ba
    

    Inputs are comma-separated lists of numbers. After each newline the currently sorted list is output, so if you want only the final list, pipe it into tail -n1.

    Internally it uses radix sort (@0@1@2@3@4@5@6@7@8@9@ are the marks for the buckets and it adds padding zeros in front of the numbers as needed and removes them at the end).
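    The bucket trick is easier to see in ordinary code. Here is a minimal LSD radix sort sketch in Python (my own illustration of the same idea, not a translation of the sed program):

```python
def radix_sort(nums):
    """LSD radix sort for non-negative integers using ten digit buckets."""
    if not nums:
        return []
    # Widest number decides how many digit passes we need,
    # analogous to the padding zeros in the sed version.
    digits = len(str(max(nums)))
    for place in range(digits):
        buckets = [[] for _ in range(10)]  # the @0@...@9@ marks
        for n in nums:
            buckets[(n // 10 ** place) % 10].append(n)
        # Re-concatenate buckets in order; stability makes later passes work.
        nums = [n for bucket in buckets for n in bucket]
    return nums

radix_sort([170, 45, 75, 90, 2])  # -> [2, 45, 75, 90, 170]
```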

    Edit: Edited it so now it only outputs the list at EOF.

    22 votes
    1. balooga
      Link Parent

      You monster.

      That is amazing.

      6 votes
  2. [2]
    Comment deleted by author
    Link
    1. jacebot
      Link Parent

      This is pretty sexy too. 3 lines and done. 2 if you don't count the import. Nice.

  3. [3]
    666
    Link

    This is the least efficient I can make it (Python 3), without adding irrelevant code in between:

    def sort_arrays(*a):
      return list(sorted(__import__('functools').reduce(__import__('operator').concat, a)))
    

    Just call it like this: sort_arrays([5,7,1], [2,8,3], [9,5,7])

    7 votes
    1. jacebot
      Link Parent
      Now this is elegant, and mostly readable.

      Now this is elegant, and mostly readable.

      3 votes
  4. jonluca
    Link

    Hmm fun.

    Naive (straight forward):

    def sort_arrays(*arrays):
    	compiled = list()
    	for array in arrays:
    		compiled += array
    	compiled.sort()  # sort() works in place and returns None
    	print(compiled)
    

    It's got a time complexity of O(n log n), where n is the total number of elements in all arrays. I'll try to come up with some more fun ones when I get home (maybe a generator? multithreaded?)
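    The generator idea floated above can be sketched with a heap, so elements come out lazily in sorted order (a rough illustration; `merge_generator` is a name of my choosing, and it assumes the elements are comparable, non-None values):

```python
import heapq

def merge_generator(*arrays):
    """Sort each input up front, then lazily yield the global sorted order."""
    iters = [iter(sorted(a)) for a in arrays]
    heap = []
    # Seed the heap with the first element of each non-empty input.
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    while heap:
        value, i = heapq.heappop(heap)
        yield value
        # Refill from whichever iterator we just consumed from.
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))

list(merge_generator([5, 7, 1], [2, 8, 3], [9, 5, 7]))
# -> [1, 2, 3, 5, 5, 7, 7, 8, 9]
```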

    5 votes
  5. Celeo
    (edited )
    Link

    Super simple Python. Read a line, create a new list of those values as _int_s, throw it onto a master list, then flatten everything and sort it.

    import sys
    
    
    arrays = []
    for line in sys.stdin:
        line = line.strip()
        if not line:
            break
        arrays.append([int(e) for e in line.split(' ') if e])
    print(sorted(i for j in arrays for i in j))
    

    The easiest thing to do is cast to ints when reading off stdin, but sorted supports a key parameter to specify how to sort the data as well.
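    For instance, `key` lets the values stay as strings while still sorting numerically (a small illustration):

```python
# Tokens read off stdin arrive as strings; key=int sorts them numerically
# without converting the list itself.
values = "3 1 2 10 5".split()
in_numeric_order = sorted(values, key=int)
# -> ['1', '2', '3', '5', '10']  (plain sorted() would put '10' before '2')
```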


    Similar thing in Go, but only keeping a single slice of ints and appending to it each time. Not likely a performant usage of append.

    package main
    
    import (
    	"bufio"
    	"fmt"
    	"io"
    	"os"
    	"sort"
    	"strconv"
    	"strings"
    )
    
    func main() {
    	var ints []int
    	reader := bufio.NewReader(os.Stdin)
    	for {
    		text, err := reader.ReadString('\n')
    		if err != nil && err != io.EOF {
    			panic(err)
    		}
    		text = strings.TrimSpace(text)
    		if text == "" {
    			// A blank line (or EOF) ends input.
    			break
    		}
    		for _, part := range strings.Split(text, " ") {
    			partVal, err := strconv.Atoi(strings.TrimSpace(part))
    			if err != nil {
    				panic(err)
    			}
    			ints = append(ints, partVal)
    		}
    	}
    	sort.Ints(ints)
    	fmt.Println(ints)
    }
    
    
    5 votes
  6. RiderOfGiraffes
    Link

    Python2

    def sort_list_of_lists( lol ):
        combined = sum( lol, [] )
        combined.sort()
        return combined
    

    That's just the function, I leave the I/O to the front-end people.

    4 votes
  7. [3]
    balooga
    Link

    Just because of all the PHP umbrage around here lately, I wrote my solution in it.

    function combineAndSortArrays(...$params){
            $arr = array_reduce($params, 'array_merge', []);
            sort($arr);
            return $arr;
    }
    

    combineAndSortArrays() is a variadic function that accepts as many arrays as you can throw at it. It combines them into a single array and then sorts it by value.

    4 votes
    1. [2]
      Klayy
      Link Parent

      You can actually do
      $arr = array_merge(...$params);

      1 vote
  8. [6]
    Soptik
    (edited )
    Link

    I'll use my own sorting algorithm (but I'm sure it already exists):

    Theory

    Let's have a method that takes an unlimited number of arrays (params int[][] in C#). We build some collection, probably a binary tree. For each element in each of the arrays, I'll append the number to the tree (or a custom collection that works very similarly to a binary tree). Because the arrays are not ordered, we can safely choose the first element of the first array as the tree root, as it's equivalent to choosing a random element. I don't know if it's worth balancing the tree in the process.

    After we have all elements, we return the tree (or convert it to array and return it).

    Time complexity should be N*log(N), where N is the total number of elements, plus an extra N if we convert the tree to an array afterwards.

    It would be interesting to compare it to the simplest way to do it: just add everything to a single array and then call .Sort().

    Code

            static void Main(string[] args)
            {
                Random rnd = new Random();
                int[][] input = new int[10][];
                for(int i = 0; i < input.Length; i++)
                {
                    input[i] = new int[10];
                    for(int j = 0; j < input[i].Length; j++)
                    {
                        input[i][j] = rnd.Next(101);
                    }
                }
                Stopwatch sw = new Stopwatch();
                const int tries = 100;
                List<long> t1 = new List<long>(tries);
                List<long> t2 = new List<long>(tries);
                for (int i = 0; i < tries; i++)
                {
                    sw.Start();
                    SortArraysSimpleMethod(input).ToArray(); // Convert result to array, otherwise the code will never be executed, as LINQ uses lazy loading
                    sw.Stop();
                    t1.Add(sw.ElapsedMilliseconds);
                    sw.Reset();
    
                    sw.Start();
                    SortArraysComplexMethod(input);
                    sw.Stop();
                    t2.Add(sw.ElapsedMilliseconds);
                    sw.Reset();
    
                    Console.Write(".");
                }
                Console.Write("\n");
    
                Console.WriteLine(t1.Average());
                Console.WriteLine(t2.Average());
                Console.ReadKey();
                return;
            }
    
    
            private static IEnumerable<int> SortArraysSimpleMethod(params int[][] arrays)
            {
                int[] result = new int[arrays.Select(array => array.Length).Sum()];
                int i = 0;
                foreach(int[] array in arrays)
                {
                    foreach(int element in array)
                    {
                        result[i] = element;
                        i++;
                    }
                }
                return result.OrderBy(element => element);
            }
    
            private static IEnumerable<int> SortArraysComplexMethod(params int[][] arrays)
            {
                Node tree = null;
                foreach(int[] array in arrays)
                {
                    foreach(int element in array)
                    {
                        if(tree == null)
                        {
                            tree = new Node() { value = element };
                        }
                        else
                        {
                            tree.AddChild(element);
                        }
                    }
                }
    
                // Read child
                List<int> values = new List<int>(arrays.Select(array => array.Length).Sum());
                tree.ReadValues(values);
                return values;
            }
    
    
            class Node
            {
                public Node leftChild;
                public Node rightChild;
                public Node parent;
                public int value;
    
                public void AddChild(int value)
                {
                    if(value >= this.value)
                    {
                        if(rightChild == null)
                        {
                            rightChild = new Node() { parent = this, value = value };
                        }
                        else
                        {
                            rightChild.AddChild(value);
                        }
                    }
                    else
                    {
                        if(leftChild == null)
                        {
                            leftChild = new Node() { parent = this, value = value };
                        }
                        else
                        {
                            leftChild.AddChild(value);
                        }
                    }
                }
    
                public void ReadValues(List<int> collection)
                {
                    if(leftChild != null)
                    {
                        leftChild.ReadValues(collection);
                    }
                    collection.Add(value);
                    if(rightChild != null)
                    {
                        rightChild.ReadValues(collection);
                    }
                }
            }
    

    Results

    For small inputs, like 10*10, the results were surprising - the simple method averaged (over 100 tries) 0.07 ms, the complex one 0.01 ms. For input 100*50, the simple one averaged 1.62 ms, and the complex one 2.13 ms. For input 100*1000 (this time an average of 10 tries), the simple one averaged 53.9 ms and the complex one 1274.2 ms.

    Edit: Added syntax highlighting

    4 votes
    1. [5]
      Emerald_Knight
      Link Parent

      Those results aren't surprising at all, actually! Let's say we assume O(n log n) for both approaches and completely disregard any additional algorithmic factors. With a simple array, the allocated memory is sequential and thus your array accesses benefit from locality of reference, whereas your tree isn't constructed of sequentially-allocated memory and thus doesn't have that same benefit. Temporal and spatial locality of reference are concepts that are used to optimize caching behavior, reducing the impact of a cache miss and the subsequent retrieval from slower RAM. It would absolutely have an expected performance impact :)

      3 votes
      1. [4]
        Soptik
        Link Parent

        I was surprised about the result for 10*10. I expected the simple solution to be faster, so I was quite surprised when the complex one was faster. I tested it multiple times, but the complex one is always faster for 10*10. I'm not sure why the complex solution is faster, but I think the LINQ methods I use in the simple solution have some overhead, and the tree isn't that much slower with just 100 elements.

        I thought that the tree approach would be slower just because it has to visit log(n) children when inserting a new element; I didn't know about the sequentially-allocated memory :)

        If I created something like a linked list instead of a tree, allocated all required nodes at once, and connected the nodes afterwards - the slowing effect of non-sequentially-allocated memory would disappear, right?

        2 votes
        1. [3]
          Emerald_Knight
          Link Parent

          Regarding the tree being more efficient with 10*10, that could easily be due to amortized time complexity of your inserts. That amortized time complexity likely makes a large difference at the beginning, but ends up being completely overshadowed by the impacts of locality of reference with larger arrays.

          Regarding the linked list structure, I wouldn't expect the sequential nature of the nodes to necessarily translate directly into a sequential allocation of memory. It's possible, for example, that a sequence of nodes a -> b -> c -> d -> e -> f is allocated in memory in the order e -> b -> d -> a -> c -> f. In such a case, in an array you would expect that accessing e then accessing f would be efficient, but in this example scenario of a linked list you would find that accessing e then accessing f would be inefficient since they're so far apart in memory.

          Locality of reference has nothing to do with how we structure the data representationally, and everything to do with where that data lives in actual, physical memory, on the hardware.

          1 vote
          1. [2]
            Soptik
            (edited )
            Link Parent

            This is really interesting.

            A common example is matrix multiplication:

            for i in 0..n
              for j in 0..m
                for k in 0..p
                  C[i][j] = C[i][j] + A[i][k] * B[k][j];
            

            By switching the looping order for j and k, the speedup in large matrix multiplications becomes dramatic, at least for languages that put contiguous array elements in the last dimension.

            On a year 2014 processor, the second case is approximately five times faster than the first case

            wiki
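            The loop-order swap the quote describes looks like this in code (a sketch in Python for clarity; pure Python won't show the cache effect the way C does, since interpreter overhead dominates, but the access pattern is the same):

```python
def matmul_ikj(A, B):
    """Matrix multiply with the cache-friendly i-k-j loop order.

    In the inner loop, C[i] and B[k] are both traversed left to right,
    so in a language with contiguous rows every access is sequential.
    """
    n, p, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for k in range(p):
            a = A[i][k]      # constant for the whole inner loop
            row_b = B[k]
            row_c = C[i]
            for j in range(m):
                row_c[j] += a * row_b[j]
    return C

matmul_ikj([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # -> [[19, 22], [43, 50]]
```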

            However it looks a bit like black magic to me. When you're writing algorithms that need to run fast, do you use these techniques? Is it really worth diving into?

            1 vote
            1. Emerald_Knight
              Link Parent

              I personally don't need to handle these kinds of optimizations. But yes, these are used if needed. They're only worth diving into if your applications need to squeeze out every bit of performance possible, particularly in processing-intensive programs.

              It may seem like black magic, but once you know how data is structured and how it's loaded into cache, it makes perfect sense. It's all just a matter of taking advantage of the architecture you're working with.

              Good on you for doing some additional research on the subject!

              1 vote
  9. Leo
    Link

    Now for Python:

    fn = lambda x: sorted(sum(x, []))

    3 votes
  10. Leo
    (edited )
    Link

    Don't have time to test this quite yet but it should work:

    #include <algorithm>
    #include <iterator>
    #include <vector>
    
    template <typename T>
    std::vector<T> flatten_and_sort(std::vector<std::vector<T>> arrays) {
    	std::vector<T> result;
    	auto it = std::back_inserter(result);
    	
    	for (auto &array : arrays) {
    		std::move(array.begin(), array.end(), it);
    	}
    	
    	std::sort(result.begin(), result.end());
    	
    	return result;
    }
    

    Passing the argument by value allows the caller to choose whether to pass an lvalue to be copied or an rvalue to be consumed.

    Flattening is O(n) where n is the total number of elements in all the arrays, and sorting is O(n log n).

    2 votes
  11. Emerald_Knight
    (edited )
    Link

    I decided to take a database approach to this problem using MongoDB.

    Given a collection arrays where each document has a field values containing an array of numbers, the following aggregation pipeline will produce a document with the desired result contained in a final merged field:

    db.arrays.aggregate([
        {$unwind: "$values"},
        {$sort: {
            values: 1
        }},
        {$group: {
            _id: null,
            merged: {$push: "$values"}
        }}
    ]);
    
    2 votes
  12. [3]
    onyxleopard
    Link

    Not sure if using Python's itertools.chain is cheating…

    Python 3.7.0 (default, Jul 15 2018, 14:01:53) 
    Type 'copyright', 'credits' or 'license' for more information
    IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
    
    In [1]: from operator import itemgetter
    
    In [2]: from itertools import chain
    
    In [3]: def multi_sort_gen(*iterables, **kwargs):
       ...:     yield from sorted(chain(*iterables), **kwargs)
       ...:     
       
    In [4]: list(multi_sort_gen([(1,3), (1,2), (1,1)], [(2,10), (2,9), (2,8)], key=itemgetter(1)))
    Out[4]: [(1, 1), (1, 2), (1, 3), (2, 8), (2, 9), (2, 10)]
    
    In [5]: list(multi_sort_gen([(1,3), (1,2), (1,1)], [(2,10), (2,9), (2,8)], key=itemgetter(1), reverse=True))
    Out[5]: [(2, 10), (2, 9), (2, 8), (1, 3), (1, 2), (1, 1)]
    
    In [6]: from random import randint
    
    In [7]: arrays = [[randint(0,9) for _ in range(10 ** 3)] for _ in range(10 ** 3)]
    
    In [8]: %timeit result = list(multi_sort_gen(*arrays))
    263 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [9]: arrays = [[(i, randint(0,9)) for _ in range(10 ** 3)] for i in range(10 ** 3)]
    
    In [10]: %timeit result = list(multi_sort_gen(*arrays, key=itemgetter(1)))
    570 ms ± 46.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    I learned that heapq.merge exists and it claims:

    Help on function merge in module heapq:
    
    merge(*iterables, key=None, reverse=False)
        Merge multiple sorted inputs into a single sorted output.
        
        Similar to sorted(itertools.chain(*iterables)) but returns a generator,
        does not pull the data into memory all at once, and assumes that each of
        the input streams is already sorted (smallest to largest).
        
        >>> list(merge([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
        [0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25]
        
        If *key* is not None, applies a key function to each element to determine
        its sort order.
        
        >>> list(merge(['dog', 'horse'], ['cat', 'fish', 'kangaroo'], key=len))
        ['dog', 'cat', 'fish', 'horse', 'kangaroo']
    

    This has the disadvantage that the inputs must be pre-sorted. The advantage, though, is that it does not require all inputs to fit in memory (whereas my solution above does). Nice to know that the tradeoff of space for time can be made here.
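    A quick sketch of that streaming property: `heapq.merge` can consume lazy iterators without materializing them (ranges stand in for pre-sorted streams such as large files here):

```python
import heapq

# Each input could be a sorted file or network stream; only one element
# per input needs to be in memory at a time.
streams = (iter(range(0, 30, 3)), iter(range(1, 30, 5)))
merged = heapq.merge(*streams)

# Pull just the first few results without exhausting the inputs.
first_five = [next(merged) for _ in range(5)]
# -> [0, 1, 3, 6, 6]
```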

    In [12]: import heapq
    
    In [14]: %timeit result = list(heapq.merge(*(sorted(a, key=itemgetter(1)) for a in arrays), key=itemgetter(1)))      
    3.24 s ± 252 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    2 votes
    1. [2]
      Emerald_Knight
      Link Parent

      Not really cheating, no. Otherwise we would have to decide where to draw the line at abstractions, which would be a large and fruitless debate not worth getting into.

      1. onyxleopard
        Link Parent

        Yeah, it's just a matter of whether it's part of the challenge to re-implement something like itertools.chain, which is probably good practice in its own right.

        1 vote
  13. [2]
    jacebot
    Link

    The magic that is JavaScript ES6... While working through this, I remembered I had a similar real-world problem this week, so it's still rather fresh: get a messy array into a new array of arrays with groups of 3 objects. Fun times.

    *Don't hate on the web dev... we program too. Sometimes.*
    let overallArr = [];
    const arr = [[1,2,3], [3,4,5], [5,6,7,8]];
    
    function sortArr (current, arr) {
      let tempArr = [];
    
      for (let i = 0; i < arr.length; i++) {
        tempArr = [...tempArr, ...arr[i]];
      }
    
    // Not the most efficient in large sets; note the numeric comparator,
    // since the default sort() compares elements as strings.
    // https://gist.github.com/telekosmos/3b62a31a5c43f40849bb#gistcomment-1739870
      return overallArr = new Set(tempArr.sort((a, b) => a - b));
    }
    
    sortArr(overallArr, arr);
    console.log(sortArr(overallArr, arr));
    
    2 votes
    1. Emerald_Knight
      Link Parent

      No hate here. Despite all of the flak it gets, I work primarily with PHP, and occasionally with JS.

      The only thing I would hate on, if this were in a merge request, would be some minor naming issues. For a programming challenge like this, though, it's not really an important point. Also, just a heads up, the new Set() call would be undesired for a simple merging of array elements as it strips the duplicates that you would expect to see. Great for getting unique values, though! :)
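      For what it's worth, the deduplication pitfall is easy to demonstrate (a quick Python illustration of the same issue):

```python
a, b = [1, 2, 3], [3, 4, 5]
merged = sorted(a + b)        # [1, 2, 3, 3, 4, 5] - both 3s survive
deduped = sorted(set(a + b))  # [1, 2, 3, 4, 5]    - one 3 silently dropped
```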