Random Sample Integrity Tests in Big, Segmented Databases
Keywords:
Segmented databases, sample integrity, statistical bias, Bootstrap, KS-Test, Data Skew
Abstract
As data volumes grow and analyses increasingly rely on distributed databases, preserving the integrity of random samples has become a significant methodological challenge. Data skew, the uneven distribution of records across segments, can introduce statistical bias that degrades the accuracy of estimates and the reliability of analytical results. This study analyzes the impact of data skew on sample representativeness in distributed environments and proposes a practical framework for testing sampling integrity in large, segmented databases. The framework, called the Reliable Sampling Framework (RSF), combines stratified sampling and consistent segmentation with a statistical validation mechanism based on the Kolmogorov–Smirnov test and the bootstrap, used to assess the variance and accuracy of estimates. The framework was evaluated in a statistical simulation of a large database divided into several unequal parts to represent data skew. The results show that simple random sampling in segmented environments yields samples that deviate significantly from the distribution of the entire population, whereas the proposed framework substantially reduces this deviation and improves the stability of mean estimates. Statistical tests also show that combining stratified sampling with consistent segmentation represents the population distribution more accurately than traditional methods. These findings indicate that integrating statistical sampling principles with the characteristics of distributed architectures can improve the reliability of analyses in big data environments, and that the framework is applicable to cloud analytics and distributed, data-driven decision support systems.
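To make the validation idea concrete, the following is a minimal Python sketch of a sample integrity check on a skewed, segmented dataset. It is not the RSF implementation. The segment sizes and means, the `equal_draw_sample` baseline (the same number of rows drawn from every segment regardless of its size), and all function names are assumptions chosen for illustration. Consistent segmentation is not modelled here. The sketch compares each sample with the full population using a two-sample Kolmogorov–Smirnov statistic and estimates the standard error of the sample mean with a bootstrap.

```python
import bisect
import random
import statistics


def ks_statistic(sample, population):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample), sorted(population)
    gap = 0.0
    # The largest gap is always attained at one of the observed values.
    for x in set(a) | set(b):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def equal_draw_sample(segments, n, rng):
    """Assumed naive baseline: equal draws per segment, ignoring segment size."""
    per_segment = n // len(segments)
    return [x for seg in segments for x in rng.sample(seg, per_segment)]


def stratified_sample(segments, n, rng):
    """Proportional allocation: each segment contributes according to its size."""
    total = sum(len(seg) for seg in segments)
    out = []
    for seg in segments:
        out.extend(rng.sample(seg, round(n * len(seg) / total)))
    return out


def bootstrap_se_of_mean(sample, rng, reps=200):
    """Bootstrap standard error of the mean (resampling with replacement)."""
    means = [
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(reps)
    ]
    return statistics.stdev(means)


def run_demo(seed=0, n=600):
    rng = random.Random(seed)
    # Hypothetical skewed layout: (segment size, segment mean), illustrative only.
    layout = [(6000, 10.0), (3000, 20.0), (1000, 50.0)]
    segments = [[rng.gauss(mu, 2.0) for _ in range(size)] for size, mu in layout]
    population = [x for seg in segments for x in seg]

    report = {"population_mean": statistics.fmean(population)}
    samplers = (("equal_draw", equal_draw_sample), ("stratified", stratified_sample))
    for name, sampler in samplers:
        sample = sampler(segments, n, rng)
        report[name] = {
            "ks": ks_statistic(sample, population),
            "mean": statistics.fmean(sample),
            "bootstrap_se": bootstrap_se_of_mean(sample, rng),
        }
    return report


if __name__ == "__main__":
    for key, value in run_demo().items():
        print(key, value)
```

Under these assumptions, the equal-draw sample over-represents the small segment, so its KS statistic against the population is large and its mean is pulled toward that segment, while the proportionally stratified sample tracks the population closely. In practice the KS statistic would be converted to a p-value or compared with a chosen threshold, for example with `scipy.stats.ks_2samp` if SciPy is available.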
