Just stumbled across this article from 2014 and thought it was interesting food for thought. The title's a bit clickbaity, since it only applies to this one suboptimal example, but it's worth considering how a lot of things people consider "big data" really aren't that big to a single modern computer, and throwing large-scale solutions at the problem won't always make it faster.
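For a sense of scale, here's a rough Python sketch of the kind of single-machine streaming pass the article is getting at (the original uses a plain shell pipeline; the file name and line format below are invented for illustration):

```python
from collections import Counter

# Hypothetical input: a couple of GB of game results, one result per line
# (e.g. "1-0", "0-1", "1/2-1/2"). Streaming the file line by line keeps the
# working set tiny no matter how large the file grows, and a single core
# gets through it in minutes, no cluster required.
counts = Counter()
with open("results.txt") as f:   # placeholder path
    for line in f:
        counts[line.strip()] += 1

print(counts.most_common())
```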
Paraphrasing from twitter:
my life as a consultant:
Client: tell us how to optimize our big data platform, it's costing us millions
Me: your dataset fits in memory
Me: that'll be $100,000
This is the "nice" answer for "have a decent IT team" and unfortunately it is often the only answer that really make sense.
Just here on Tildes, some time ago, someone posted an article about how a developer reduced the memory usage of a very memory-intensive processing script from several gigabytes to kilobytes.
Unfortunately this is just the most obvious symptom of the disconnect between management and the IT department, in which the latter can easily bullshit its way along until an external audit is requested.
It wasn't that one but interesting nonetheless.
It was about a script that processed lots of financial stuff for the guy's company's clients. The script had to run overnight due to the high amount of data, but this guy decided to dig into it and managed to optimize it to the point that it went from using GBs of RAM to KBs.
I can't find the website anymore; it was on some kind of blog, if I remember right.
There it is: Strings are evil - Reducing memory allocations from 7.5GB to 32KB
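For anyone who doesn't want to click through: the article is about a C# file parser, but the general idea carries over to any language. Here's a minimal sketch of it in Python, with an invented comma-separated record format rather than the article's actual code:

```python
def sum_third_column_naive(path):
    # Naive version: materialises every line and every split field at once,
    # so peak memory scales with the size of the file.
    rows = [line.split(",") for line in open(path).read().splitlines()]
    return sum(float(row[2]) for row in rows)

def sum_third_column_streaming(path):
    # Streaming version: only one line and its fields are alive at a time,
    # so peak memory stays roughly constant regardless of file size.
    total = 0.0
    with open(path) as f:
        for line in f:
            total += float(line.split(",")[2])
    return total
```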
I think a lot of people (technical people included) like to jump to a potential solution before understanding the problem and detailing alternatives.
Absolutely - I run into this at work often. We technical people like fun solutions over boring, arduous ones. You mean you don't want to distribute your computing? Ugh, give it to the junior.
I live for this shit lmao.
I was on a project where they wanted to do a regression model with "Deep Learning and Neural Networks" (they didn't know what either of those things meant). The dataset was 1GB of text. With numbers. Like literally it was a logistic regression problem, with very little noise.
Ended up doing a random forest and bagging it with the logistic regression, because they wanted to publish something and logistic regression was already being used... It didn't make much of a difference, but of course they gave talks and published papers. I don't know about you, but I consider this unethical.
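For anyone curious, something in the spirit of what I mean, sketched with scikit-learn; the data loading, column names, and soft-voting combination (rather than literal bagging) are my own illustration, not the project's actual code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical tabular dataset: numeric feature columns plus a binary label.
df = pd.read_csv("data.csv")  # placeholder path
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Plain logistic regression is the honest baseline for this kind of data.
logit = LogisticRegression(max_iter=1000)

# Combining it with a random forest via soft voting is roughly the kind of
# "ensemble" that got published; it barely moved the metrics.
ensemble = VotingClassifier(
    estimators=[("logit", logit), ("rf", RandomForestClassifier(n_estimators=200))],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```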
It's not that command-line tools are fast, it's just that Hadoop is slow.
Try something like kdb+ or ClickHouse; it will blow away your average command-line tool.
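If you want to try the ClickHouse route from Python, here's a minimal sketch, assuming a ClickHouse server running locally, a hypothetical "events" table, and the clickhouse-driver package:

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

# Assumes a local ClickHouse server and a hypothetical "events" table.
client = Client(host="localhost")

# A columnar engine will chew through this kind of aggregation over hundreds
# of millions of rows far faster than grep/awk over the raw text files.
rows = client.execute(
    "SELECT event_type, count() AS n FROM events GROUP BY event_type ORDER BY n DESC"
)
print(rows)
```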