Efficient and Scalable Metadata Access for Distributed Applications from Edge to the Cloud
We are witnessing a new era that offers fresh opportunities to conduct data-intensive scientific research, enabled by recent advancements in computational, storage, and network technologies. With the rapid deployment of distributed infrastructures and growing collaboration between organizations, it is now feasible and promising to run scientific applications on large-scale, geo-distributed infrastructures. In many application domains, including environmental and coastal hazard prediction, climate modeling, high-energy physics, astronomy, and genome mapping, the volume of data generated already exceeds petabytes, while the corresponding metadata amounts to terabytes or more.

Although most prior studies have focused on physical data transfer, little work has addressed remotely accessing and transferring large-scale metadata over wide-area networks. Considering wide-area network latency, the frequency of metadata revalidation, and the rapid growth of the Internet of Things (IoT), a novel metadata access and transfer mechanism is needed and is becoming a cornerstone of modern distributed IT infrastructures.

In this dissertation, we propose a novel solution for efficient and scalable metadata access for distributed applications across wide-area networks. Our solution combines novel pipelining and concurrent transfer mechanisms with reliability, provides distributed continuum caching and prefetching strategies to sidestep fetching latency, and achieves scalable, high-performance stateless fetch/prefetch services in the Cloud. Beyond optimizing metadata transfer performance, we also study the phenomenon of semantic locality in real trace logs, which has not been well exploited in metadata access prediction. We implement our predictor based on this observation and compare it with three existing state-of-the-art prefetch schemes (NEXUS, AMP, FARMER) on Yahoo! Hadoop audit traces.

By effectively caching and prefetching metadata based on observed access patterns, our continuum caching and prefetching mechanism greatly improves the local cache hit rate and reduces the average fetching latency. We replayed approximately 20 million metadata access operations from real audit traces, on which our system achieved 80% accuracy during prefetch prediction and reduced the average fetch latency by 50% compared to state-of-the-art mechanisms.
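To make the caching-and-prefetching idea concrete, here is a minimal, illustrative sketch: an LRU metadata cache paired with a simple successor-frequency predictor that prefetches the keys most often accessed after the current one. This is not the dissertation's actual mechanism; the class name, capacity, prefetch depth, and the successor-frequency heuristic are all assumptions chosen for clarity.

```python
from collections import OrderedDict, defaultdict

class PrefetchCache:
    """Toy LRU cache with a successor-frequency prefetcher (illustrative sketch,
    not the dissertation's continuum caching mechanism)."""

    def __init__(self, capacity=8, prefetch_depth=1):
        self.capacity = capacity
        self.prefetch_depth = prefetch_depth
        self.cache = OrderedDict()                       # key -> metadata payload
        self.successors = defaultdict(lambda: defaultdict(int))  # key -> {next_key: count}
        self.last_key = None
        self.hits = 0
        self.misses = 0

    def _insert(self, key, value):
        # Insert as most-recently-used; evict least-recently-used on overflow.
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

    def access(self, key, fetch):
        # Learn the observed successor relationship from the access stream.
        if self.last_key is not None:
            self.successors[self.last_key][key] += 1
        self.last_key = key

        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)
        else:
            self.misses += 1
            self._insert(key, fetch(key))   # remote fetch on a miss

        # Prefetch the most frequently observed successors of the current key.
        ranked = sorted(self.successors[key].items(), key=lambda kv: -kv[1])
        for nxt, _count in ranked[: self.prefetch_depth]:
            if nxt not in self.cache:
                self._insert(nxt, fetch(nxt))
        return self.cache[key]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Replaying a short repeating trace shows the intended effect: after the first pass trains the predictor, later accesses are served from the cache, so the hit rate climbs well above what a demand-only cache would achieve on the first pass.

```python
cache = PrefetchCache(capacity=8, prefetch_depth=1)
fetch = lambda k: f"meta:{k}"          # stands in for a remote metadata fetch
for key in ["a", "b", "c"] * 5:        # only the first pass misses
    cache.access(key, fetch)
print(cache.hit_rate())                # 12 hits / 15 accesses = 0.8
```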