15 votes

Command-line Tools can be 235x Faster than your Hadoop Cluster

9 comments

  1. [6]
    pseudolobster

    Just stumbled across this article from 2014 and thought it was interesting food for thought. Title's a bit clickbaity, since it only applies in this one suboptimal example, but it's worth considering how a lot of things people consider "big data" really aren't that big to a single modern computer, and throwing large scale solutions at the problem won't always make it faster.
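    For context, the pipeline the article benchmarks is plain Unix text processing over chess game records. A minimal sketch of that style (the sample file here is made up; the article runs over real multi-GB PGN dumps):

    ```shell
    # Generate a tiny stand-in for the article's chess data (PGN result
    # lines), then aggregate the result counts with a plain pipeline.
    printf '[Result "1-0"]\n[Result "0-1"]\n[Result "1-0"]\n' > games.pgn

    # Count each distinct result, most frequent first.
    grep Result games.pgn | sort | uniq -c | sort -rn
    ```

    On a single machine this streams at disk speed, which is the article's whole point: no cluster startup, no shuffle, just pipes.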

    8 votes
    1. [5]
      biox

      Paraphrasing from twitter:

      my life as a consultant:

      Client: tell us how to optimize our big data platform, it's costing us millions

      Me: your dataset fits in memory

      Me: that'll be $100,000

      12 votes
      1. [2]
        crius

        This is the "nice" way of saying "have a decent IT team", and unfortunately it is often the only answer that really makes sense.

        Just here on Tildes, some time ago, someone posted an article about how a developer reduced the memory usage of a very memory-intensive processing script from several gigabytes to kilobytes.

        Unfortunately this is just the most obvious symptom of the disconnect between management and the IT department, in which the latter can easily bullshit its way around until an external audit is requested.

        5 votes
        1. [2]
          Comment deleted by author
          1. crius
            (edited)

            It wasn't that one but interesting nonetheless.

            It was about a script that was processing lots of financial stuff for the guy's company's clients. The script had to run overnight due to the high amount of data, but this guy decided to dig into it and managed to optimize it to the point that it went from using GBs of RAM to KBs.

            I couldn't manage to find the website anymore; it was on some kind of blog, if I remember right.

            There it is: Strings are evil - Reducing memory allocations from 7.5GB to 32KB
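            That post is about C# string allocations specifically, but the underlying principle is generic: stream the data instead of materializing all of it. A shell illustration of the same idea (not the post's actual code):

            ```shell
            # Stream a million numbers through awk. Memory use stays in the
            # KB range no matter how large the input is, because only one
            # line is resident at a time.
            seq 1000000 | awk '{ sum += $1 } END { printf "%d\n", sum }'
            # prints 500000500000
            ```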

            1 vote
      2. [2]
        GoingMerry

        I think a lot of people (technical people included) like to jump to a potential solution before understanding the problem and detailing alternatives.

        4 votes
        1. biox

          Absolutely - I run into this at work often. We technical people like fun solutions over boring, arduous ones. You mean you don't want to distribute your computing? Ugh, give it to the junior.

  2. wise

    I live for this shit lmao.

    I was in a project where they wanted to do a regression model with "Deep Learning and Neural Networks" (they didn't know what either of those things mean). The dataset was 1GB of text. With numbers. Like literally it was a logistic regression problem, with very little noise.

    Ended up doing a random forest and bagging it with the logistic regression, because they wanted to publish something and logistic regression was already being used... It didn't make much of a difference, but of course they gave talks and published papers. I don't know about you, but I consider this unethical.

    6 votes
  3. [2]
    est

    It's not that command-line tools are fast, it's just that Hadoop is slow.

    Try something like kdb+ or ClickHouse, it will blow away your average command-line tool.

    1. spit-evil-olive-tips

      That's not an either/or. Yes, Hadoop is slow (or rather, much like the JVM, it has a large start-up time, which is best amortized over a long runtime).

      However:

      the data volume was only about 1.75GB

      Using any "big data" solution for data that small is overkill and unnecessary.

      4 votes