What's our thoughts on Perplexity.ai for search? - ~tech

[8]

DefiantEmbassy

June 28, 2024

Link

Perplexity engage in outright plagiarism, and don’t respect robots.txt. If that’s something you care about, stay away.

16 votes

[4]
Wes
June 29, 2024
Link Parent
That's a little misleading. Perplexity uses snippets with attribution in the same way search engines do. This has historically been considered a fair use application. It doesn't fit the definition...

That's a little misleading. Perplexity uses snippets with attribution in the same way search engines do. This has historically been considered a fair use application. It doesn't fit the definition of plagiarism as they are not claiming this content as their own.

Additionally, Perplexity does respect robots.txt for training their AI model. They only do not respect it when following a user's request to scan a page, which is the correct behaviour. robots.txt is specifically for automated web crawlers or spiders. User agents, meaning tools that follow a user's commands, are not subject to robots.txt. Your web browser and command line applications like wget do not follow robots.txt either, because they are acting as user agents, not as robots.

15 votes
1. [3]
  DefiantEmbassy
  June 29, 2024
  Link Parent
  I reject your characterization. Maybe I should have linked the Forbes article directly, but what we're talking about has absolutely nothing to do with what you are on about. This is just theft....
  
  That's a little misleading. Perplexity uses snippets with attribution in the same way search engines do. This has historically been considered a fair use application. It doesn't fit the definition of plagiarism as they are not claiming this content as their own.
  
  I reject your characterization. Maybe I should have linked the Forbes article directly, but what we're talking about has absolutely nothing to do with what you are on about.
  
  In case you’ve missed the brouhaha, here’s a quick (human-generated) summary: For most of this year, two of our best journalists, Sarah Emerson and Rich Nieva, have been reporting on former Google CEO Eric Schmidt’s secretive drone project, including a June 6 story detailing the company’s ongoing testing in Silicon Valley suburb Menlo Park as well as the frontlines of Ukraine. The next day, Perplexity published its own “story,” utilizing a new tool they’ve developed that was extremely similar to Forbes’ proprietary article. Not just summarizing (lots of people do that), but with eerily similar wording, some entirely lifted fragments — and even an illustration from one of Forbes’ previous stories on Schmidt. More egregiously, the post, which looked and read like a piece of journalism, didn’t mention Forbes at all, other than a line at the bottom of every few paragraphs that mentioned “sources,” and a very small icon that looked to be the “F” from the Forbes logo – if you squinted. It also gave similar weight to a “second source” — which was just a summary of the Forbes story from another publication.
  
  Perplexity then sent this knockoff story to its subscribers via a mobile push notification. It created an AI-generated podcast using the same (Forbes) reporting — without any credit to Forbes, and that became a YouTube video that outranks all Forbes content on this topic within Google search. Perplexity had taken our work, without our permission, and republished it across multiple platforms — web, video, mobile — as though it were itself a media outlet. As we dug, we found a similar rip-off of a second story at Forbes. And other stolen scoops — all the information, negligible citation — from Bloomberg and CNBC.
  
  It gets worse. When Forbes Executive Editor John Paczkowski called out Srinivas on X for what the company had done, Srinivas responded that this new “product feature” had some “rough edges.” (“Product feature sound nice, but us media types call it plagiarism,” longtime tech journalist Kara Swisher adroitly responded on X.) And that was that: the story wasn’t removed, nor was there an apology, nor was the story corrected to provide more transparent attribution within the text.
  
  This is just theft. The WIRED article also mentions what they would consider /plagarism/, not a cute little search engine snippet like you suggest.
  
  After we published the story, I prompted three leading chatbots to tell me about the story. OpenAI’s ChatGPT and Anthropic’s Claude generated text offering hypotheses about the story’s subject but noted that they had no access to the article. The Perplexity chatbot produced a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them. (According to WIRED's server logs, the same bot observed in our and Knight’s findings, which is almost certainly linked to Perplexity but is not in its publicly listed IP range, attempted to access the article the day it was published, but was met with a 404 response. The company doesn't retain all its traffic logs, so this is not necessarily a complete picture of the bot's activity, or that of other Perplexity agents.) The original story is linked at the top of the generated text, and a small gray circle links out to the original following each of the last five paragraphs. The last third of the fifth paragraph exactly reproduces a sentence from the original: “Instead, it invented a story about a young girl named Amelia who follows a trail of glowing mushrooms in a magical forest called Whisper Woods.”
  
  This struck me and my colleagues as plagiarism. It certainly appears to satisfy the criteria set out by Poynter Institute—including, perhaps most stringently, the seven-to-10 word test, which proposes that it’s “hard to incidentally replicate seven consecutive words that appear in another author’s work.” (Kelly McBride, a Poynter SVP who has described this test as being useful in identifying plagiarism, did not reply to an email.)
  
  [...]
  
  “In terms of the copyright, this is a tough call,” says James Grimmelmann, professor of digital and information law at Cornell University. On one hand, he argues, the summary is reporting facts, which cannot be copyrighted; but on the other, it does partially duplicate the original and summarize the details found in it. “It’s not a slam dunk copyright case, but it’s not trivial, either. It’s not frivolous.”
  
  6 votes
  1. [2]
    unkz
    June 29, 2024 (edited June 29, 2024)
    Link Parent
    So I had to dig a bit to find the actual material but it's here: https://www.perplexity.ai/search/perplexity-is-a-41uH2h6JT0qazoM87BO.kw From my perspective, this complaint is inane. The user...
    
    After we published the story, I prompted three leading chatbots to tell me about the story. OpenAI’s ChatGPT and Anthropic’s Claude generated text offering hypotheses about the story’s subject but noted that they had no access to the article. The Perplexity chatbot produced a six-paragraph, 287-word text...
    
    So I had to dig a bit to find the actual material but it's here:
    
    https://www.perplexity.ai/search/perplexity-is-a-41uH2h6JT0qazoM87BO.kw
    
    From my perspective, this complaint is inane. The user asked this question: "tell me about the wired article "perplexity is a bullshit machine." What perplexity did, is read the wired article, and then produce a summary, filled with hyperlinks to the article. In particular, the fifth paragraph where the content was quoted is specifically annotated with a link to the source where the content came from. What more are people expecting here?
    
    In addition, the article presents an experiment where WIRED created a test website and asked Perplexity to summarize it. Despite monitoring the website's server logs, no evidence was found that Perplexity attempted to visit the page. Instead, it invented a story about a young girl named Amelia who follows a trail of glowing mushrooms in a magical forest called Whisper Woods 1.
    
    Here's the perplexity article that set this all off:
    
    https://www.perplexity.ai/page/Eric-Schmidts-AI-boKJzWQcRFmCLk5XjgKJEQ
    
    Now this is post-complaint, and apparently the links to Forbes are more prominently placed at the top of the page than they originally were but as I understand it, the rest of the content below is the same. I honestly don't see an issue with this. It's extensively hyperlinked to the source material -- like these pages, which use Forbes as a source, yet they aren't being characterized as plagiarism. What's the difference?
    
    https://tech.hindustantimes.com/tech/news/former-google-ceo-eric-schmidt-jumps-into-ai-attack-drones-space-looks-to-transform-military-tech-71706178365198.html
    
    https://www.businessinsider.com/eric-schmidt-poaches-apple-spacex-google-ai-drones-report-2024-6
    
    I assume the reason they aren't being criticized is because they clearly state that they used Forbes as a source -- exactly like perplexity's article, in the second paragraph:
    
    Eric Schmidt, the former Google CEO, has been actively recruiting top talent from companies like Apple, SpaceX, Google, and the federal government for his secretive drone venture over the past few months. Approximately a dozen employees have joined the initiative, which was previously known as White Stork but is now rumored to be named Project Eagle. Schmidt's nonprofit organization, Schmidt Futures, has also been a key source of personnel for the venture, as reported by Forbes.
    
    and fourth paragraph:
    
    Based on the latest Forbes investigative report, White Stork, the company founded by Eric Schmidt in August 2023, has been operating under the radar through a network of LLCs. Initially named Swift Beat Holdings, it rebranded to White Stork Group LLC in September.
    
    Keep in mind that this is a summary with only 6 paragraphs -- two inline mentions of Forbes, and 4 hyperlinks. The Hindustan Times article in comparison only mentions Forbes once.
    
    Finally, I think it's pretty funny that in all of this content criticizing perplexity for plagiarism, I couldn't find a single link to perplexity's actual article. I had to go to twitter and forum posts to actually see what they were talking about.
    
    3 votes
    
    DefiantEmbassy
    July 17, 2024
    Link Parent
    Apologies for the late reply, I don't browse Tildes often: (This is entirely wrong, as shown by this Wayback Machine archive prior to the Forbes article. In addition, this fails to cover the...
    
    Apologies for the late reply, I don't browse Tildes often:
    
    Now this is post-complaint, and apparently the links to Forbes are more prominently placed at the top of the page than they originally were but as I understand it, the rest of the content below is the same.
    
    (This is entirely wrong, as shown by this Wayback Machine archive prior to the Forbes article. In addition, this fails to cover the podcast that Perplexity autonomously generated using this plagarized article).
    
    1 vote
[2]
unkz
June 29, 2024
Link Parent
Hold on here, this is absolutely not plagiarism. Their entire raison d'etre is explicit attribution of all material.

Hold on here, this is absolutely not plagiarism. Their entire raison d'etre is explicit attribution of all material.

11 votes
1. DefiantEmbassy
  June 29, 2024
  Link Parent
  Please see my above comment for why I wholeheartedly reject your characterization.
  
  this is absolutely not plagiarism
  
  Please see my above comment for why I wholeheartedly reject your characterization.
  
  3 votes
fxgn
June 29, 2024
Link Parent
Is that a bad thing? https://wiki.archiveteam.org/index.php/Robots.txt

don't respect robots.txt

Is that a bad thing?

https://wiki.archiveteam.org/index.php/Robots.txt

3 votes

[2]

macleod

June 28, 2024

Link

I am always hesitant to try different AI models in any real usage, and that's been the case for years as I've worked in robotics and ML predictions has always been uh... distant from reality, and...

I am always hesitant to try different AI models in any real usage, and that's been the case for years as I've worked in robotics and ML predictions has always been uh... distant from reality, and the new types have bolstered that.

But, Perplexity has been surprisingly solid, it's a summarizer first and foremost, and so it tends to not have the same issues of hallucinations as others.

Is it perfect? Absolutely not. But is it useful? Absolutely, especially since its one of the few new types (calling it that gives me gundam vibes tbh), that cites its sources incredibly well. I've actually been enjoying playing with it for basic requests. Nothing serious, but just anything I might want to quickly grok Wikipedia for an answer.

Kagi has a very similar summarizer for webpages (Universal Summarizer) that works rather well for single sources, and on their search results they have a "Quick Answers" that afaic tell, uses their Universal Summarizer to summarize their own search results page and then links into an LLM (Claude 3 Haiku) to give it a bit more readable personality and style that I've been enjoying, but it doesn't cite its sources as nicely as Perplexity.

Additional:

I'm also a sucker for anything that formats things in bullet points, so Perplexity gets major points for that as well.

15 votes

Plik
June 29, 2024
Link Parent
The Kagi summarize and quick answers are pretty nice. Been using them fairly regularly since I subscribed. The quick answer appears to take key points from the first page of results and put them...

The Kagi summarize and quick answers are pretty nice. Been using them fairly regularly since I subscribed. The quick answer appears to take key points from the first page of results and put them into a short list of bullet points. Sometimes you can see the exact sentences used in the summary in the descriptions of some of the results.

1 vote

[2]

Ganymede

June 28, 2024

Link

https://www.theverge.com/2024/6/27/24187405/perplexity-ai-twitter-lie-plagiarism

7 votes

unkz
June 28, 2024
Link Parent
The author clearly has no idea what the term “rent-seeking” means. Perplexity is quite obviously adding value here. I don’t want to personally wade through pages and pages of fluff to get the...

That means that Perplexity is basically a rent-seeking middleman on high-quality sources.

The author clearly has no idea what the term “rent-seeking” means. Perplexity is quite obviously adding value here. I don’t want to personally wade through pages and pages of fluff to get the actual information that I want.

9 votes

Baeocystin

June 28, 2024

Link

I personally prefer Phind, but they are very similar, and it isn't uncommon that I'll use both on a query. They are far from perfect, but as the other poster said, they're already well in the...

I personally prefer Phind, but they are very similar, and it isn't uncommon that I'll use both on a query.

They are far from perfect, but as the other poster said, they're already well in the Useful range. Just be extra careful when you're working outside of your areas of expertise, as it becomes much harder to catch inaccurate information.

4 votes

[3]

fxgn

June 28, 2024

Link

Haven't used Perplexity, but as @drannex mentioned, Kagi has a similar feature, and I use it for queries where I trust LLMs to give me accurate information. For example, just yesterday, I searched...

Haven't used Perplexity, but as @drannex mentioned, Kagi has a similar feature, and I use it for queries where I trust LLMs to give me accurate information. For example, just yesterday, I searched for "is azelaic acid poisonous?" since I use it as a skincare product and was worried about accidentally getting it on my lips, and the summarizer responded that it is not considered toxic and cited a few studies.

This saved me a bunch of time, since most of the results were pages with general information about azelaic acid or broad scientific papers about the topic, so I would've had to do a bit of digging to find the actual answer.

On some queries, I usually don't trust the AI and prefer to check the results myself. In most cases, those are the searches where I want to find the sentiment or experiences of other people about a particular thing, rather than specific facts.

3 votes

[2]
Moah
June 29, 2024
Link Parent
Considering AI had been known to invent sources in the past, I would follow up on their citations before trusting any answers

Considering AI had been known to invent sources in the past, I would follow up on their citations before trusting any answers

1 vote
1. fxgn
  June 29, 2024
  Link Parent
  The point of those search summary AIs is that they cite the sources with specific links https://i.ibb.co/TcRptMF/Screenshot-20240629-140308.png
  
  The point of those search summary AIs is that they cite the sources with specific links
  
  https://i.ibb.co/TcRptMF/Screenshot-20240629-140308.png
  
  1 vote