Big data sharing systems face critical security and trust gaps
A new global analysis of big data sharing has identified critical weaknesses that threaten the reliability, security and long-term viability of the world’s expanding data ecosystems. The research, authored by Shan Jiang, maps out the technical, economic and governance challenges that continue to hinder effective data exchange across borders, sectors and platforms.
The study, titled “Big Data Sharing: A Comprehensive Survey” and published in the journal Data, responds to growing pressure on governments, industries and scientific institutions to handle ever larger volumes of data while protecting privacy, meeting security requirements and ensuring equitable access.
According to the study, the rapid pace of data growth has created a gap between what users expect from modern platforms and what current systems are able to deliver.
What counts as big data sharing in 2025?
The author defines big data sharing as the process through which organizations make large, diverse datasets available so that others can discover, access and use them under clearly defined conditions. This definition separates big data sharing from several related concepts that are often confused with it.
Open data remains focused on government and research information released for the public good, often in simplified formats and at smaller scales. Data exchange refers to reciprocal transfers of rights, typically with equal privileges for both sides, and does not always deal with high volume or high diversity. Data trading treats datasets as commercial assets and involves buying and selling data under market rules. Big data sharing is positioned as the broadest category, able to include all of the above while remaining open to any type of data provider or recipient.
The study asserts that big data sharing must now support growing heterogeneity in formats, structures, use cases and technical demands. This complexity makes it difficult to standardize data or automate the integration of very different datasets. The study notes that any attempt to unify large and varied information requires advanced cleaning, metadata design and quality control processes that preserve the usefulness of the original datasets without destroying essential information.
A three-stage workflow for sharing data at scale
The research outlines a three-stage workflow that is now common across global platforms. The first stage, known as publishing, involves the data owner posting metadata to describe a dataset without necessarily sharing the underlying files. The metadata helps platforms index datasets and assists potential users in identifying relevant resources.
The second stage involves search and discovery. Users query platform repositories through semantic search engines or keyword search functions. Their ability to locate relevant datasets depends heavily on the quality and structure of the metadata supplied at the publishing stage.
The final stage is the sharing process itself, where owners decide whether to grant access. Once access is granted, the platform logs the transaction. This audit trail supports transparency, governance and regulatory compliance. The author notes that this structured approach has become essential for building trust among organizations that are often hesitant to share sensitive or high-value datasets.
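The workflow can be made concrete with a short sketch. The Python code below is an illustrative model of the three stages over a simple in-memory catalog; the names (DatasetMetadata, SharingPlatform, request_access) are hypothetical and do not come from the survey or any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative sketch of the three-stage workflow described above.
# All names are hypothetical, not taken from the survey.

@dataclass
class DatasetMetadata:
    dataset_id: str
    title: str
    description: str
    keywords: list[str]
    owner: str

class SharingPlatform:
    def __init__(self) -> None:
        self.catalog: dict[str, DatasetMetadata] = {}
        self.audit_log: list[dict] = []

    # Stage 1: publishing -- the owner registers metadata only;
    # the underlying files stay with the owner.
    def publish(self, meta: DatasetMetadata) -> None:
        self.catalog[meta.dataset_id] = meta

    # Stage 2: search and discovery -- a simple keyword match over
    # the published metadata (real platforms add semantic search).
    def search(self, query: str) -> list[DatasetMetadata]:
        q = query.lower()
        return [m for m in self.catalog.values()
                if q in m.title.lower()
                or q in m.description.lower()
                or any(q in k.lower() for k in m.keywords)]

    # Stage 3: sharing -- the owner decides whether to grant access,
    # and every decision is logged to form an audit trail.
    def request_access(self, dataset_id: str, requester: str,
                       owner_approves: bool) -> bool:
        self.audit_log.append({
            "dataset_id": dataset_id,
            "requester": requester,
            "granted": owner_approves,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return owner_approves

# Example usage
platform = SharingPlatform()
platform.publish(DatasetMetadata("ds-001", "City air quality 2024",
                                 "Hourly PM2.5 readings", ["air", "pm2.5"], "env-lab"))
hits = platform.search("air")
granted = platform.request_access("ds-001", "research-group-a", owner_approves=True)
```

The design point mirrored here is the one the survey emphasizes: the platform holds only metadata and an audit log, while the data itself remains with the owner until access is explicitly granted.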
Why organizations share data at all
The study identifies several motivations for participating in big data sharing ecosystems. For providers, gaining visibility, increasing reputation and attracting collaboration opportunities are strong incentives. Research institutions in particular benefit when their datasets gain wider use, which increases impact and leads to more citations and partnerships. In commercial settings, monetization remains an important driver as more industries treat data as a strategic resource that can be sold, licensed or synthesized into value-added services.
Society benefits when data sharing expands. The study highlights improvements in research integrity, public policy, transparency and scientific progress. With access to larger datasets, researchers can validate findings more effectively. Public institutions can improve planning, infrastructure development and policy implementation. Businesses can create new products and services by combining diverse datasets in innovative ways.
Yet the author also stresses that these benefits cannot be realized without trust. Weak governance, inconsistent standards and unclear risk management strategies continue to discourage many organizations from sharing data. These issues remain structural challenges that the current generation of platforms must confront.
Security, access control and reliability remain core requirements
The survey identifies a list of essential requirements that every modern big data platform must meet. Security and privacy remain the most urgent concerns. Platforms must secure stored data, protect user identities, and maintain the confidentiality of sensitive information. At the same time, organizations expect fine-grained access control systems that can restrict data use in flexible ways. The study outlines several access modes now in common use, including search-only discovery, preview access, nearline computation and full data transfer.
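As a rough illustration of how such tiered access modes might be encoded, the sketch below maps dataset and role pairs to one of the four modes named above; the policy table, dataset names and roles are invented for the example and are not from the survey.

```python
from enum import Enum, auto

# Hypothetical sketch of tiered access modes. The mode names mirror the
# categories mentioned above; the policy logic is illustrative only.

class AccessMode(Enum):
    SEARCH_ONLY = auto()       # dataset is discoverable, contents stay hidden
    PREVIEW = auto()           # a small sample or summary is visible
    NEARLINE_COMPUTE = auto()  # analysis runs where the data lives; raw data never leaves
    FULL_TRANSFER = auto()     # a complete copy is delivered to the recipient

# Per-dataset policy: which mode each class of user is allowed (invented values).
POLICY = {
    ("clinical-trial-records", "external-researcher"): AccessMode.NEARLINE_COMPUTE,
    ("clinical-trial-records", "internal-analyst"): AccessMode.FULL_TRANSFER,
    ("public-transport-logs", "anyone"): AccessMode.PREVIEW,
}

def allowed_mode(dataset: str, role: str) -> AccessMode:
    """Return the mode granted for this pair, defaulting to search-only discovery."""
    return POLICY.get((dataset, role), AccessMode.SEARCH_ONLY)

print(allowed_mode("clinical-trial-records", "external-researcher"))
# AccessMode.NEARLINE_COMPUTE
```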
Reliability is another major requirement. Many traditional cloud-based systems exhibit unpredictable behavior when handling cold data, meaning files that are accessed infrequently. This can result in silent failures, corruption or loss. To protect users from these risks, researchers have developed verification methods such as proofs of retrievability, proofs of storage and provable data possession schemes. These schemes confirm that data remains intact without requiring full downloads or frequent access.
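The sketch below conveys the general idea in a deliberately simplified form: the owner precomputes a handful of challenge-response pairs before uploading, then later spot-checks the provider without downloading the file. Real proofs of retrievability and provable data possession schemes use homomorphic authenticators and support unbounded challenges, so this is only an intuition aid, not any scheme from the survey.

```python
import hashlib
import os
import random

BLOCK_SIZE = 4096

def blocks(data: bytes) -> list[bytes]:
    """Split a file into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def make_challenges(data: bytes, count: int = 8) -> list[tuple[int, bytes, str]]:
    """Owner side, before upload: precompute (block index, nonce, expected digest) tuples."""
    blks = blocks(data)
    out = []
    for _ in range(count):
        idx = random.randrange(len(blks))
        nonce = os.urandom(16)
        digest = hashlib.sha256(nonce + blks[idx]).hexdigest()
        out.append((idx, nonce, digest))
    return out

def prover_response(stored_data: bytes, idx: int, nonce: bytes) -> str:
    """Provider side: must actually hold the challenged block to answer correctly."""
    return hashlib.sha256(nonce + blocks(stored_data)[idx]).hexdigest()

# Example: the owner verifies one challenge against the provider's copy.
original = os.urandom(5 * BLOCK_SIZE)
challenges = make_challenges(original)
idx, nonce, expected = challenges[0]
assert prover_response(original, idx, nonce) == expected  # data is intact
```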
Performance and scalability also remain essential. As the number of datasets grows, platforms must serve many simultaneous users while handling large requests without delays. This includes supporting high-speed analytics for fields like healthcare, finance and scientific computing, where timely access is essential.
Healthcare and data markets lead the adoption curve
The study identifies two major application domains driving global interest in big data sharing. The first is healthcare, where hospitals, laboratories and research institutions require secure mechanisms to exchange electronic health records, imaging data, clinical trial results and smart home health information. Effective sharing improves diagnosis, treatment, public health surveillance and the development of precision medicine.
The second major domain is data trading. Large corporations and government agencies now treat data as an economic asset that can be priced, licensed or sold on specialized exchanges. The survey highlights several commercial platforms, including major cloud providers and national data exchanges that enable organizations to trade datasets under regulated frameworks. The rise of these marketplaces has increased demand for models that can measure data quality and assign accurate economic value. The author notes that data quality assessment remains a significant challenge and requires multi-dimensional evaluation criteria such as accuracy, completeness, timeliness and provenance.
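One simple way to picture a multi-dimensional quality assessment is a weighted score over the criteria the survey lists. The weights and the linear aggregation below are assumptions made for illustration, not a valuation model proposed by the paper; real marketplaces would calibrate such scores against actual pricing and usage data.

```python
# Illustrative quality score over the dimensions named above
# (accuracy, completeness, timeliness, provenance).
# The weights are assumed values for the sketch only.

QUALITY_WEIGHTS = {
    "accuracy": 0.35,
    "completeness": 0.30,
    "timeliness": 0.20,
    "provenance": 0.15,
}

def quality_score(scores: dict[str, float]) -> float:
    """Each dimension is scored in [0, 1]; returns a weighted aggregate."""
    return sum(QUALITY_WEIGHTS[dim] * scores.get(dim, 0.0)
               for dim in QUALITY_WEIGHTS)

# Example assessment of a hypothetical dataset
print(round(quality_score({
    "accuracy": 0.92,      # validated against a reference sample
    "completeness": 0.80,  # share of non-missing fields
    "timeliness": 0.60,    # age relative to the agreed refresh window
    "provenance": 1.00,    # full lineage documented
}), 3))
```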
Different platforms for different needs
The study reviews five representative platforms, each designed for different types of users and data.
- The Epimorphics Linked Data Platform focuses on semantic richness and strong metadata control. It is built to manage high-variety environments where the meaning and structure of data are more complex than the volume.
- The HKSTP Data Studio, now expanded into a Digital Service Hub, supports smart city initiatives. It offers application programming interfaces for data integration, high-performance computing for analytics and a secure data clean room environment where sensitive information can be analyzed without exposing raw data.
- SEEK, an open-source platform used in systems biology, supports rich scientific datasets and collaborative workflows. It integrates experiments, models and data under a unified structure to help laboratories coordinate large-scale research activities.
- IPFS offers a decentralized model built around peer-to-peer networks. It uses content addressing to locate files by cryptographic hashes, which strengthens authenticity and lowers dependence on single servers (a minimal content-addressing sketch follows this list). The system faces challenges due to variable user incentives and less consistent hosting of large datasets.
- AWS Data Exchange represents a commercial cloud-based marketplace designed for large-volume data distribution. It is integrated with major cloud storage systems, making it a popular option for enterprises that need to share and sell large datasets at scale.
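To illustrate the content-addressing idea behind IPFS in the list above, the following sketch keys a toy peer-to-peer style store by a SHA-256 digest of the content and re-verifies that digest on retrieval. IPFS itself uses multihash-based content identifiers and chunked Merkle DAGs, so this is a strong simplification rather than a description of its actual API.

```python
import hashlib

# Simplified content addressing: a file is located by a digest of its bytes
# rather than by a server location, so any copy can be verified on arrival.

def content_address(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# A toy store keyed by content address (stand-in for a peer-to-peer network).
store: dict[str, bytes] = {}

def put(data: bytes) -> str:
    cid = content_address(data)
    store[cid] = data
    return cid

def get(cid: str) -> bytes:
    data = store[cid]
    # Authenticity check: the retrieved bytes must hash back to the address.
    assert content_address(data) == cid, "retrieved content was tampered with"
    return data

cid = put(b"hourly PM2.5 readings, station 12")
assert get(cid) == b"hourly PM2.5 readings, station 12"
```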
Architectural models and their tradeoffs
The study groups these platforms into three architectural categories.
- Data Hosting Centers are centralized repositories that store and distribute datasets.
- Data Aggregation Centers host only metadata while leaving data in the hands of the original owners.
- Decentralized platforms rely on peer-to-peer networks without a central authority.
Each architecture has strengths and weaknesses. For instance, centralized systems offer strong efficiency and usability. Aggregation centers reduce storage burdens and support privacy. Decentralized systems improve resilience and authenticity. However, no single model solves all key challenges, and organizations must choose based on their risk profile, use case and governance model.

