Metadata Management Service for a Cloud-hosted Data Scheduling System
The Stork data scheduler addresses the constant need to relocate "big data" across widely distributed computational and storage resources. However, providing this service makes it harder to guarantee data access rates, because locality is largely lost. This is seen with traditional protocols such as FTP and SFTP, which follow a centralized model involving two separate data movement sessions. The transfer model adopted by Globus addresses this through its GridFTP protocol, which uses third-party transfers so that data takes the shortest available path. While each of these protocols has its drawbacks, together they account for the bulk of wide-area operations. Stork provides the ability to schedule these multi-protocol transfers between different storage nodes. However, users who need to access the data profiles of in-transfer jobs still suffer from WAN latencies. The crux of this thesis is to address that issue, taking cues from some of the best metadata management techniques known to wide-area file system research. Metadata management for large storage systems comprises, among others, two distinct challenges: handling large sets of structured data, and improving read performance that is degraded by a voluminous search space. Concurrency is also an issue with standard solutions, but it may be dealt with differently according to user or job requirements. In light of this, we have added flexibility to metadata pull requests so that stale information may be returned, conserving network and processing bandwidth. Using the NoSQL storage paradigm with a strong caching policy, we demonstrate the ability of the Stork DLS to handle varied requests with minimum latency. On top of the current server-side caching mechanism, we proactively support client-side caching with a data prefetching policy.
In this thesis, we present the performance improvements brought to Stork by these policies, provide an analysis of the current architecture, and contrast it with existing systems such as Globus Online and other file transfer tools. We also discuss future work that can be undertaken to improve both performance and user experience.
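
The idea of letting a metadata pull request accept stale information can be illustrated with a small client-side cache. The following is only a minimal sketch under stated assumptions, not Stork's implementation: the class name `StaleTolerantCache`, the `fetch` callback, and the `max_staleness` parameter are all hypothetical names introduced here for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleTolerantCache:
    """Hypothetical client-side cache for job metadata.

    Each read states how stale an answer the caller will accept, so a
    remote fetch over the WAN happens only when the cached copy is older
    than that tolerance.
    """

    def __init__(self, fetch: Callable[[str], Any],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # assumed remote lookup, e.g. a call to the scheduler
        self._clock = clock    # injectable so the behavior can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, job_id: str, max_staleness: float) -> Any:
        """Return metadata for job_id that is at most max_staleness seconds old."""
        now = self._clock()
        entry = self._entries.get(job_id)
        if entry is not None and now - entry[0] <= max_staleness:
            return entry[1]    # cached copy is fresh enough; no network round trip
        value = self._fetch(job_id)
        self._entries[job_id] = (now, value)
        return value
```

A caller that only needs a rough progress figure might pass a tolerance of several seconds, while a caller that needs current state can pass `0.0`, which forces a refetch unless the entry was stored at the same clock reading.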