Demand for shared data has grown due to machine learning, artificial intelligence (AI), blockchain and geolocation technologies. The CDLA licenses can help organizations open up and share data, with the goal of creating communities that curate and share data openly.
For instance, if automakers, suppliers and civil infrastructure services can share data, they may be able to improve safety, decrease energy consumption and improve predictive maintenance. Self-driving cars are heavily dependent on AI systems for navigation, and need massive volumes of data to function properly. Once on the road, they can generate nearly a gigabyte of data every second. For the average car, that means two petabytes of sensor, audio, video and other data each year.
Similarly, climate modeling can integrate measurements captured by government agencies with simulation data from other organizations and then use machine learning systems to look for patterns in the information. It’s estimated that a single model can yield a petabyte of data, a volume that challenges standard computer algorithms, but is useful for machine learning systems.
“An open data license is essential for the frictionless sharing of the data that powers critical technologies,” said Jim Zemlin, Executive Director of The Linux Foundation. “The success of open source software provides a powerful example of what can be accomplished when people come together around a resource and advance it for the common good. The CDLA is a step in that direction and will encourage the continued growth of applications and infrastructure.”
CDLA Licenses Promote Sharing While Reducing Risk
The Linux Foundation, in collaboration with a broad set of participating organizations, drafted the CDLA licenses with the needs of companies, organizations and communities that have valuable data assets such as these to share. The intention of the licenses is for contributors and consumers of open datasets to actively use and support the contribution of data in a uniform fashion, while clarifying the terms and reducing risk.
There are two CDLA licenses: a Sharing license that encourages contributions of data back to the data community, and a Permissive license that puts no additional sharing requirements on recipients or contributors of open data. A few commercial and community implications of the licenses include:
? Data producers can share with greater clarity about what recipients may do with it. Data producers can also choose between Sharing and Permissive licenses and select the model that best aligns with their interests. In either case, data producers should enjoy the clarity of recognized terms and disclaimers of liabilities and warranties.
? Data communities can standardize on a license or set of licenses that provide the ability to share data on known, equal terms that balance the needs of data producers and data users. Data communities have a high degree of flexibility to add their own governance and requirements for curating data as a community, particularly around areas such as personally identifiable information.
? Data users who are looking for datasets to help kick off training an AI system or for any other use will have the ability to find data shared under a known license model with terms that clarify their rights and responsibilities.
The CDLA is data privacy agnostic and relies on the publisher and curators of the data to create their own governance structure around what data they curate and how. Each producer or curator of data will have to work through various jurisdictional requirements and legal issues.