15 votes

Command-line Tools can be 235x Faster than your Hadoop Cluster

9 comments

  1. [6]
    pseudolobster

    Just stumbled across this article from 2014 and thought it was interesting food for thought. Title's a bit clickbaity, since it only applies in this one suboptimal example, but it's worth considering how a lot of things people consider "big data" really aren't that big to a single modern computer, and throwing large scale solutions at the problem won't always make it faster.
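    For context, the pipeline the article benchmarks is plain Unix text processing over chess game records. A minimal sketch of that style (the sample file here is made up; the article runs over real multi-GB PGN dumps):

    ```shell
    # Generate a tiny stand-in for the article's chess data (PGN result
    # lines), then aggregate the result counts with a plain pipeline.
    printf '[Result "1-0"]\n[Result "0-1"]\n[Result "1-0"]\n' > games.pgn

    # Count each distinct result, most frequent first.
    grep Result games.pgn | sort | uniq -c | sort -rn
    ```

    On a single machine this streams at disk speed, which is the article's whole point: no cluster startup, no shuffle, just pipes.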

    8 votes
    1. [5]
      biox

      Paraphrasing from twitter:

      my life as a consultant:

      Client: tell us how to optimize our big data platform, it's costing us millions

      Me: your dataset fits in memory

      Me: that'll be $100,000

      12 votes
      1. [2]
        crius

        This is the "nice" way of saying "have a decent IT team", and unfortunately it is often the only answer that really makes sense.

        Just here on Tildes, some time ago, someone posted an article about how a developer reduced the memory usage of a very memory-intensive processing script from several gigabytes to kilobytes.

        Unfortunately this is just the most obvious symptom of the disconnect between management and the IT department, in which the latter can easily bullshit its way around until an external audit is requested.

        5 votes
        1. [2]
          Comment deleted by author
          1. crius
            (edited)

            It wasn't that one but interesting nonetheless.

            It was about a script that was processing lots of financial stuff for the guy's company's clients. The script had to run overnight due to the high amount of data, but this guy decided to dig into it and managed to optimize it to the point that it went from using GBs of RAM to KBs.

            I couldn't manage to find the website anymore; it was on some kind of blog, if I remember right.

            There it is: Strings are evil - Reducing memory allocations from 7.5GB to 32KB
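            That post is about C# string allocations specifically, but the underlying principle is generic: stream the data instead of materializing all of it. A shell illustration of the same idea (not the post's actual code):

            ```shell
            # Stream a million numbers through awk. Memory use stays in the
            # KB range no matter how large the input is, because only one
            # line is resident at a time.
            seq 1000000 | awk '{ sum += $1 } END { printf "%d\n", sum }'
            # prints 500000500000
            ```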

            1 vote
      2. [2]
        GoingMerry

        I think a lot of people (technical people included) like to jump to a potential solution before understanding the problem and detailing alternatives.

        4 votes
        1. biox

          Absolutely - I run into this at work often. We technical people like fun solutions over boring, arduous ones. You mean you don't want to distribute your computing? Ugh, give it to the junior.

  2. wise

    I live for this shit lmao.

    I was in a project where they wanted to do a regression model with "Deep Learning and Neural Networks" (they didn't know what either of those things mean). The dataset was 1GB of text. With numbers. Like literally it was a logistic regression problem, with very little noise.

    Ended up doing a random forest and bagging it with the logistic regression, because they wanted to publish something and logistic regression was already being used... It didn't make much of a difference, but of course they gave talks and published papers. I don't know about you, but I consider this unethical.

    6 votes
  3. [2]
    est

    It's not that command-line tools are fast, it's just that Hadoop is slow.

    Try something like kdb+ or ClickHouse, it will blow away your average command-line tool.

    1. spit-evil-olive-tips

      That's not an either/or. Yes, Hadoop is slow (or rather, much like the JVM, it has a large start-up time, which is best amortized over a long runtime).

      However:

      the data volume was only about 1.75GB

      Using any "big data" solution for data that small is overkill and unnecessary.

      4 votes