EleutherAI, an AI research organization, has released what it says is one of the largest collections of licensed and open-domain text for training AI models.
The dataset, called Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes, Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.
AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web, including copyrighted material such as books and research journals, to build model training datasets. While some AI companies have licensing agreements with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.
EleutherAI argues that these lawsuits have "drastically decreased" transparency from AI companies, which the organization says has harmed the broader AI research field by making it harder to understand how models work and what their flaws might be.
"[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in," Stella Biderman, EleutherAI's executive director, wrote in a blog post on Hugging Face on Friday. "Researchers at some companies we spoke with have also been unable to release the research they are doing in highly data-centric areas."
Common Pile v0.1, which can be downloaded from Hugging Face's platform and GitHub, was created in consultation with legal experts and draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open source speech-to-text model, to transcribe audio content.
EleutherAI says Comma v0.1-1T and Comma v0.1-2T are proof that Common Pile v0.1 was curated carefully enough to allow developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which are 7 billion parameters in size and were trained on only a fraction of Common Pile v0.1, rival models such as Meta's first Llama AI model on benchmarks for coding, image understanding, and math.
Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and responses.
"In general, we think that the common idea that unlicensed text drives performance is unjustified," Biderman wrote in her post. "As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve."
Common Pile v0.1 appears to be, in part, an effort to right EleutherAI's historical wrongs. Years ago, the company released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.
EleutherAI is committing to releasing open datasets more frequently going forward, in collaboration with its research and infrastructure partners.