8 votes

Facial recognition's 'dirty little secret': Millions of online photos scraped without consent

3 comments

  1. cykhic

    I don't personally see a huge legal or ethical issue here. The article states that all of these photos were previously published by users under Creative Commons, which (correct me if I'm wrong) allows anyone to use them for commercial purposes, and to modify and distribute them. It seems that Flickr was well within its rights to release the photos, and IBM to subsequently use them as well, for whatever purpose they wish, so long as the photos were credited (I'm unable to verify whether they actually were).

    The article's objections seem to boil down to,

    1. It "feels wrong" that people's personal photos can be scraped without their consent, despite the legal okay; and

    2. Being able to accurately identify minorities may lead to systemic profiling in the future.

    Regarding (1), I think once people put the Creative Commons license on their photos and uploaded them to Flickr, they lost the right to complain if others use them. There was a case in the article where a photographer had uploaded 700 clients' photos online which were used in the dataset, and the photographer was worried about consent from the clients. I don't have much sympathy for him, because he was the one who uploaded them under that license.

    Regarding identification of individuals, I don't think names are included in the dataset. (Not the same as photographer credit.) So there is no chance of identifying someone, unless their name can be cross-referenced with another dataset, in which case that other dataset is the problem. On the other hand, if the names were included when uploading the photo to Flickr, then anyone browsing Flickr or googling the person's name would have been able to link the name and photo anyway.

    Regarding (2), I recall reading an article a while back arguing that minorities are discriminated against because facial-recognition AI couldn't recognise them, due to the training data being predominantly white and male. Now that a diverse dataset is available, the opposite is being argued. It can't be simultaneously true that not-identifying and correctly-identifying minorities are BOTH racist or sexist, can it? I think this resolves to the same issue as technology in general -- this is a powerful tool that can be used for both good and bad. Trying to block it is futile and counterproductive.

    I'm annoyed that a move by IBM that I believe is aimed at promoting diversity and scientific open access, and that carries minimal additional privacy loss (over what users have already exposed themselves), has been spun into a sensationalist anti-technology piece.

    A disclaimer that the above is all my own not-particularly-well-researched opinion. I also don't feel strongly on privacy issues, which from the Tildes survey seems to place me in the minority. I'm happy to discuss further or have my mind changed if I'm wrong on anything.

    6 votes
    1. Luna

      published by users under Creative Commons, which (correct me if I'm wrong) allows anyone to use it for commercial purposes, and to modify and distribute it

      This is a common misconception. It all depends on which types (or "layers" as CC calls it) of CC are used, and you can view a quick rundown on the different types here. The licenses Flickr lets you choose from can be found here, and assuming IBM used public domain or any of CC's non-NonCommercial licenses, they are (probably) perfectly fine.

      I looked into the original blog post, and they are using the YFCC-100M dataset. Some Googling for it led me to this article, which states:

      Licenses. The licenses themselves vary by CC type, with approximately 31.8% of the dataset marked appropriate for commercial use and 17.3% assigned the most liberal license requiring attribution for only the photographer who took the photo (see Table 3).

      They have a (really low-res) chart of the license breakdown, and the majority are NonCommercial, but that still leaves over 30 million images and 250k videos which can be used commercially. From the total dataset, they have another (really low-res) chart of the things portrayed, and 11 million of the photos have people in them. Of course, we don't know how many of these are not NonCommercial, but I'd bet at least 2-3 million are not NonCommercial.
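      As a back-of-the-envelope illustration, filtering a YFCC-100M-style metadata dump down to commercially usable items mostly comes down to checking the license URL for a NonCommercial element. This is only a sketch under my own assumptions; the field layout and sample rows below are illustrative, not the dataset's actual schema:

```python
# Rough sketch: filter YFCC-100M-style metadata rows down to items whose
# CC license permits commercial use. Sample rows and field layout are
# illustrative assumptions, not the dataset's real schema.

def allows_commercial_use(license_url: str) -> bool:
    """True for CC license URLs without a NonCommercial ("-nc") element.

    Note: public-domain URLs (e.g. CC0, PD Mark) don't match the
    "/licenses/" pattern and would need separate handling.
    """
    return "/licenses/" in license_url and "-nc" not in license_url

# Toy metadata rows: (photo_id, license_url)
sample = [
    ("1", "https://creativecommons.org/licenses/by/2.0/"),        # CC BY: ok
    ("2", "https://creativecommons.org/licenses/by-nc/2.0/"),     # NC: excluded
    ("3", "https://creativecommons.org/licenses/by-sa/2.0/"),     # CC BY-SA: ok
    ("4", "https://creativecommons.org/licenses/by-nc-nd/2.0/"),  # NC-ND: excluded
]

commercial = [pid for pid, url in sample if allows_commercial_use(url)]
print(f"{len(commercial)} of {len(sample)} allow commercial use")  # → 2 of 4
```

      Run over the full 100M-row metadata file, a filter like this is roughly how you'd arrive at the ~30 million commercially usable items mentioned above.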

      If someone is annoyed their photo/video was used when they didn't bother picking the correct license, they have nobody to blame but themselves. However, if IBM did use NonCommercial items, they could be in trouble depending on the circumstances, but I'm sure their army of lawyers went over all this beforehand. In addition, the IBM blog post states:

      DiF provides a dataset of annotations of 1 million human facial images

      I'd bet none of their images are NonCommercial, since 1 million is nowhere close to the 11 million total images of people, much less the 30 million items that are not NonCommercial.

      TL;DR - CC is often over-simplified as public-domain-but-with-attribution, but it has a lot more flexibility in restricting use. In this case, I believe IBM only used images that are not NonCommercial, so anyone annoyed that their images were used has only themselves to blame for not choosing the appropriate license type on Flickr.

      3 votes
    2. PineappleMan

      Thanks for bringing facts to the table. Creative Commons licenses are there for a reason, and photographers agree to them when they upload their pictures.

      Also, I love what IBM did with this dataset; hopefully it will lead to innovations that are helpful to humankind.

      2 votes