BLOOM: Inside a Radical New Project to Democratize AI

However, Meta’s model is available only on request, and its license limits its use to research purposes. Hugging Face goes one step further. The meetings detailing its work over the past year are recorded and uploaded online, and anyone can download the model free of charge and use it for research or to build commercial applications.

A big focus for BigScience was to build ethical considerations into the model from the start, rather than treating them as an afterthought. LLMs are trained on large amounts of data collected by scraping the internet. This can be problematic, because these data sets contain a lot of personal information and often reflect dangerous biases. The group developed data governance structures specifically for LLMs. These make clear what data was used and who it belongs to. The group also sourced data sets from around the world that were not readily available online.

The group is also launching a new Responsible AI License, which works something like a terms-of-service agreement. It is designed to deter people from using BLOOM in high-risk areas such as law enforcement and health care, and from using it to harm, deceive, exploit, or impersonate people. The license is an experiment in self-regulating LLMs before the law catches up, says Danish Contractor, an AI researcher who volunteered for the project and co-created the license. But ultimately, nothing stops anyone from abusing BLOOM.

Hugging Face ethicist Giada Pistilli, who drafted BLOOM’s ethical charter, says that from the beginning the project had its own ethical guidelines, which guided the model’s development. For example, it recruited volunteers from a variety of backgrounds and locations, made it easy for outsiders to reproduce the project’s findings, and emphasized publishing its results.

All aboard

This philosophy translates into one of the major differences between BLOOM and other LLMs available today: the large number of human languages the model can understand. It can handle 46 languages, including French, Vietnamese, Mandarin, Indonesian, Catalan, 13 Indian languages (including Hindi), and 20 African languages. Just over 30% of its training data was in English. The model also understands 13 programming languages.

This is very unusual in the world of large language models, where English dominates. That dominance is another consequence of the fact that LLMs are built by collecting data from the internet, where English is the most commonly used language.

BLOOM was able to remedy this situation because the team recruited volunteers from around the world to build suitable data sets in other languages, even if those languages were not well represented online. For example, Hugging Face held workshops with AI researchers in Africa to find data sets, such as records from local governments and universities, that could be used to train models in African languages, says Chris Emezue, a Hugging Face intern and a researcher at Masakhane, an organization working on natural language processing for African languages.