"AI is like electricity. Just as electricity transformed every major industry a century ago, AI is now poised to do the same."
A great quote by Andrew Ng, former Vice President of Baidu and co-founder of Google Brain.
The quote still holds today, and interestingly the parallel extends to inputs: just as electricity is generated from primary energy sources, AI draws its transformative power from data as its primary resource.
On July 19, The New York Times released an article detailing the end of an era for data accessibility among researchers, practitioners, and AI companies. New restrictions have been introduced, limiting data extraction and availability.
A research group led by MIT, the Data Provenance Initiative, has found that AI companies relying on data from websites, open sources, and other organizations are beginning to encounter significant drops in available content and new barriers to data extraction.
In 2020, companies building AI models for advanced services used web crawlers to gather data from a wide range of sources, feeding the resulting data sets into model training to achieve exceptional results. This practice contributed to the rapid rise of AI technologies such as OpenAI’s ChatGPT, Google Gemini, and Microsoft’s AI Copilot. As the AI trend expanded, more companies started developing their own models, recognizing data as a crucial resource for training.
Among the notable data sets are C4, RefinedWeb, and Dolma, as well as other secondary sources. However, the extensive use of data has led to legal challenges. Last year, online publishers, including The New York Times, sued OpenAI and Microsoft for copyright infringement, alleging that these companies were using news articles to train their models without permission. Data from YouTube videos and other sources was also being transcribed and used.
This has prompted the creation of paywalls for AI companies, with platforms like Reddit and Stack Overflow charging for data extraction. Other companies and websites have implemented the Robots Exclusion Protocol, using files like robots.txt to prevent automated bots from crawling their sites. Moreover, they have strengthened their security measures by updating terms of service and confidentiality agreements.
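To make the Robots Exclusion Protocol concrete, here is a minimal sketch in Python showing how a robots.txt file is parsed and how a well-behaved crawler checks whether it may fetch a page. The site URL and the user-agent names ("ExampleAIBot", "GenericBrowserBot") are hypothetical examples, not taken from any particular publisher; the parsing uses Python's standard urllib.robotparser module.

```python
# Minimal sketch of the Robots Exclusion Protocol in practice.
# The robots.txt rules, site URL, and user-agent names are hypothetical.
from urllib.robotparser import RobotFileParser

# A robots.txt like this one blocks a specific AI crawler
# while allowing all other agents.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# A well-behaved crawler checks permission before fetching a URL.
for agent in ("ExampleAIBot", "GenericBrowserBot"):
    allowed = parser.can_fetch(agent, "https://example.com/articles/some-page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that robots.txt is advisory: it only works if crawlers voluntarily honor it, which is why many publishers pair it with updated terms of service and stronger technical controls.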
The research found that in data sets such as C4, RefinedWeb, and Dolma, roughly 5% of all data, and 25% of data from the highest-quality sources, is now restricted. These limitations apply to individuals, practitioners, researchers, and non-profit research firms alike.
The decision by websites and companies to restrict data access has become a double-edged sword. On one hand, it forces AI companies to reconsider their data usage practices, promoting fair use. On the other hand, it poses significant challenges for individuals, practitioners, and research firms conducting analyses and surveys. The continued evolution of AI models and the growth of AI companies may be hindered by these new paywalls and restrictions.
We hope you enjoyed this piece on how data restrictions are becoming a challenge for AI startups and researchers. Feel free to share your valuable suggestions in the comments.
_Team AjursContent.