14 votes

The great data integration schlep

4 comments

  1. creesch
    Interesting read with a lot of very valid points. One thing I'd like to add is that companies often don't even know where all their data is, or even what data they have. This became painfully clear for a lot of EU companies when they tried to comply with the GDPR. In fact, I am willing to bet that there are still companies out there really hoping they will not be audited, because they still don't have a handle on their data.

    And this is just one type of data they are trying to get a handle on, as GDPR mainly deals with personally identifiable information.

    Then there is the fact that sometimes a large chunk of their data is actually total and utter crap. Data hoarding is a thing; I've seen companies keep records going back decades. That might be interesting to historians, but most of that data has no real relevance to the market or the company in the present day.

    Finally, what a lot of people might not realize is that the LLMs currently out there were not trained by just throwing all the data in and hoping for the best either. Training them has been incredibly labor-intensive too, as the input needed to be vetted by humans. Because so many people were needed, a lot of that work was done by cheap workers in African countries, which according to that article resulted in some interesting side effects on the models' output.

    10 votes
  2. [3]
    skybrian
    From the article:

    Obtaining the data is a hard human problem.

    That is, people don’t want to give it to you.

    When you’re a software vendor to a large company, it’s not at all unusual for it to be easier to make a multi-million dollar sale than to get the data access necessary to actually deliver the finished software tool.

    ...

    Dealing with the “human problem” of negotiating for data access is a huge, labor-intensive headache, and the effort scales pretty much linearly in the amount of data you’re trying to collect.

    Palantir Technologies is now embracing the “AI” label, but back when I worked there, in 2016-2017, they billed themselves as a “data integration” company, because this is fundamentally what they do. Palantir builds its own software tools for managing and analyzing data — databases, ETL pipelines, analytics dashboards — and those tools work fine, but they are not, as far as I know, unique or exceptional in the tech industry. What is remarkable is that they have invested in a large number of people — at least a third of the company by headcount — to descend en masse to the customer’s facilities, set up and customize their database and associated tools, teach the users to work the UI, and, crucially, negotiate for data access.

    ...

    The Palantir Way is labor-intensive and virtually impossible to systematize, let alone automate away. This is why there aren’t a hundred Palantirs. You have to throw humans at the persuasion problem — well-paid, cognitively flexible, emotionally intelligent humans, who can cope with corporate dysfunction.

    In that way, they’re a lot like management consultants…and in fact, data integration at large scale is inherently a little like management consulting.

    ...

    I would expect that LLMs could make substantial, if not total, improvements in automating data cleaning, but my preliminary experiments with commercial LLMs (like ChatGPT & Claude) have generally been disappointing; it takes me longer to ask the LLM repeatedly to edit my file to the appropriate format than to just use regular expressions or other scripting methods myself. I may be missing something simple here in terms of prompting, though, or maybe LLMs need more surrounding “software scaffolding” or specialized fine-tuning before they can make a dent in data cleaning tasks.

    ...

    This is why I disagree with a lot of people who imagine an “AI transformation” in the economic productivity sense happening instantaneously once the models are sufficiently advanced.

    For AI to make really serious economic impact, after we’ve exploited the low-hanging fruit around public Internet data, it needs to start learning from business data and making substantial improvements in the productivity of large companies.

    ...

    You’d need to get enough access to private R&D data to train the AI, and build enough credibility through pilot programs to gradually convince companies to give the AI free rein, and you’d need to start virtually from scratch with each new client. This takes time, trial-and-error, gradual demonstration of capabilities, and lots and lots of high-paid labor, and it is barely being done yet at all.

    I’m not saying “AI is overrated”, at all — all of this work can be done and ultimately can be extremely high ROI. But it moves at the speed of human adaptation.

    6 votes
    1. [2]
      Weldawadyathink
      I don’t have much to add to the conversation here, but Palantir is a fantastic name for this sort of company.

      4 votes
      1. creesch
        That completely flew by me; it is a very fitting name. A palantír (IPA: [paˈlanˌtiːr]; pl. palantíri) is one of several indestructible crystal balls from J. R. R. Tolkien's epic-fantasy novel The...
        1 vote