Much of the innovation in natural language processing comes from the US, resulting in an English-language bias – Finland decided to change the game with a collective approach
Title: Europe's fastest supercomputer trains large language models in Finland | Computer Weekly
Natural language processing is absolutely biased towards English, but I think attributing this solely to the innovation coming from the US is not entirely accurate. As a rule, institutions in other countries doing NLP still focus principally on English, with even other major languages taking a back seat. A lot of this ends up being an awful feedback loop: all the good data and benchmarks are in English, so you limit your research to English, so any good data or benchmarks you come up with are in English, and so on. I applaud research institutions working to improve coverage of non-English languages across the board. The article itself rightly points out the major difficulty of getting sufficient data in a language like Finnish, but even large global languages like Spanish and Arabic are woefully underrepresented in NLP.
Of course, the computing costs are also a huge issue, but they're an issue within English NLP too, since they make it very difficult for anyone but the biggest for-profit companies to develop these very large models.