The non-profit Wikimedia has announced a new database initiative intended to make it easier for artificial intelligence systems to access a vast pool of information. Named the Wikidata Embedding Project, the effort aims to convert approximately 120 million open data points in Wikidata into a format that machines can use more readily. Although the data was already machine-readable, it had not been available in a form fully compatible with generative AI. The project converts the data into concept vectors, which represent the relationships between concepts as numerical coordinates.
Because the representation is based on the distance between concepts, closely related concepts like “dog” and “puppy” sit near each other, while unrelated concepts like “dog” and “bank account” end up far apart. This lets AI systems capture context better and process natural language more intelligently.
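The distance idea can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration and are not taken from the Wikidata Embedding Project; real embeddings typically have hundreds of dimensions. Cosine similarity is assumed here as the closeness measure, which the announcement does not specify.

```python
import math

# Hypothetical toy vectors, made up for this example only.
TOY_VECTORS = {
    "dog": [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.75, 0.20],
    "bank account": [0.10, 0.05, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(concept, vectors):
    """Return the other concept whose vector is closest to `concept`."""
    target = vectors[concept]
    others = (name for name in vectors if name != concept)
    return max(others, key=lambda name: cosine_similarity(target, vectors[name]))


if __name__ == "__main__":
    for other in ("puppy", "bank account"):
        score = cosine_similarity(TOY_VECTORS["dog"], TOY_VECTORS[other])
        print(f"dog vs {other}: {score:.3f}")
    print("closest to dog:", most_similar("dog", TOY_VECTORS))
```

With these toy values, “dog” scores much higher against “puppy” than against “bank account”, which is the clustering behavior described above.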
The goal is simple: to provide higher-quality and more trustworthy information to AI models. Some AI systems currently rely on data sets that lack neutrality, so presenting Wikidata in an open and accessible format offers both greater reliability and a competitive advantage to players of every size. The project is meant to ensure that not only large tech companies but also smaller AI firms can access this rich resource.
Wikimedia Deutschland is leading and supporting the project, while Jina AI contributed the data creation and IBM’s DataStax system powers the data storage. The collaboration supports both converting the data into vectors and storing them.
In a notable development that followed the initiative, Elon Musk announced a project called Grokipedia on X (formerly Twitter). Musk argues that this step is necessary for xAI to understand the universe, but whether the venture will rely on open data, as the Wikidata Embedding Project does, remains an open question. Musk’s previous criticisms of and posts about Wikipedia have further fueled the rivalry.
The quality of data plays a decisive role here. An AI system generates responses based on the data it is fed, so whether that data is high-quality and unbiased directly shapes users’ perception of reality. Openly accessible and well-organized data can raise the standards of trustworthiness and transparency in AI systems.