Prediction model approaches for data transfer throughput estimation and optimization
All areas of science and industry have been generating increasingly complex big data at the scale of petabytes and beyond. Despite the trend of moving the application to the data rather than the data to the application, large datasets still need to be moved for increased availability, performance, and recovery purposes. Sharing, disseminating, and analyzing these large datasets have become a major challenge, despite the deployment of petascale computing systems and optical networking speeds reaching up to 100 Gbps. The majority of users fail to obtain even a fraction of the theoretical speeds promised by these high-bandwidth networks due to issues such as sub-optimal protocol tuning, inefficient end-to-end routing, disk performance bottlenecks on the sending and/or receiving ends, and server processor limitations. Protocol parameters such as pipelining, parallelism, and concurrency levels play a significant role in the achievable network throughput. However, setting optimal values for these parameters is a challenging problem, since poorly tuned parameters can either underutilize the network or overburden it and degrade performance through increased packet loss, end-system overhead, and other factors. In this dissertation, we develop application-level models to predict the best combination of protocol parameters for optimal network performance. The tuned parameters include the number of parallel data streams per file (for large file optimization), the level of control and data channel pipelining (for small file optimization), and the level of concurrent file transfers to fill the long fat network pipes (for all files). We start by presenting a model that decides the optimal sampling size for data transfer optimization based on the dataset size and the estimated capacity of the network.
This model helps us generate the smallest possible sampling size with the highest accuracy in any given data transfer setting. Using this sampling size model, we develop a parallel stream prediction model, called "Full C-order", for data transfer throughput optimization. Full C-order outperforms all existing parallel stream prediction models by achieving higher accuracy with much lower sampling overhead. Extending these two models, we develop a combined parameter optimization model, called "PCP", which optimizes the pipelining and concurrency parameters in addition to parallelism. Two variations of the combined PCP model are developed: i) PCP-realtime, which assumes no historical data is available and is based purely on real-time sampling; and ii) PCP-historical, which assumes some historical data is available and uses both this data and some real-time sampling. We test and evaluate our throughput optimization models on a variety of testbeds, including emulated environments (such as Emulab and CRON), production environments (such as XSEDE, FutureGrid, and LONI), and our local distributed computing system (DIDCLab), using a wide variety of dataset size, round-trip time (RTT), and bandwidth combinations. Our comprehensive experiments confirm the superiority of our models over the existing models in this area.