Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving

San Juan, Justin David Quitalig

dc.contributor.author	San Juan, Justin David Quitalig
dc.date.accessioned	2023-08-24 12:50:56 (GMT)
dc.date.available	2023-08-24 12:50:56 (GMT)
dc.date.issued	2023-08-24
dc.date.submitted	2023-08-18
dc.identifier.uri	http://hdl.handle.net/10012/19748
dc.description.abstract	Recent breakthroughs in Deep Learning (DL) have led to high demand for executing inferences in interactive services such as ChatGPT and GitHub Copilot. However, these interactive services require low-latency inferences, which can only be met with GPUs and result in exorbitant operating costs. For instance, ChatGPT reportedly requires millions of U.S. dollars in cloud GPUs to serve its 1+ million users. A potential solution to meet low-latency requirements with acceptable costs is to use serverless platforms. These platforms automatically scale resources to meet user demands. However, current serverless systems have long cold starts which worsen with larger DL models and lead to poor performance during bursts of requests. Meanwhile, the demand for larger and larger DL models makes it more challenging to deliver an acceptable user experience cost-effectively. While current systems over-provision GPUs to address this issue, they incur high costs in idle resources which greatly reduces the benefit of using a serverless platform. In this thesis, we introduce Flashpoint, a GPU-based serverless platform that serves DL inferences with low latencies. Flashpoint achieves this by reducing cold start durations, especially for large DL models, making serverless computing feasible for latency-sensitive DL workloads. To reduce cold start durations, Flashpoint reduces download times by sourcing the DL model data from within the compute cluster rather than slow cloud storage. Additionally, Flashpoint minimizes in-cluster network congestion from redundant packet transfers of the same DL model to multiple machines with multicasting. Finally, Flashpoint also reduces cold start durations by automatically partitioning models and deploying them in parallel on multiple machines.
The reduced cold start durations achieved by Flashpoint enable the platform to scale resource allocations elastically and complete requests with low latencies without over-provisioning expensive GPU resources. We perform large-scale data center simulations that were parameterized with measurements from our prototype implementations. We evaluate the system using six state-of-the-art DL models ranging from 499 MB to 11 GB in size. We also measure the performance of the system in representative real-world traces from Twitter and Microsoft Azure. Our results in the full-scale simulations show that Flashpoint achieves an arithmetic mean of 93.51% shorter average cold start durations, leading to 75.42% and 66.90% respective reductions in average and 99th percentile end-to-end request latencies across the DL models with the same amount of resources. These results show that Flashpoint boosts the performance of serving DL inferences on a serverless platform without increasing costs.	en
dc.language.iso	en	en
dc.publisher	University of Waterloo	en
dc.subject	elastic scaling	en
dc.subject	serverless	en
dc.subject	deep learning	en
dc.subject	inference serving	en
dc.subject	GPU	en
dc.subject	cold start	en
dc.subject	performance	en
dc.subject	PyTorch	en
dc.subject	cloud systems	en
dc.subject	model partitioning	en
dc.subject	locality	en
dc.subject	multicast	en
dc.title	Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving	en
dc.type	Master Thesis	en
dc.pending	false
uws-etd.degree.department	David R. Cheriton School of Computer Science	en
uws-etd.degree.discipline	Computer Science	en
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.degree	Master of Mathematics	en
uws-etd.embargo.terms	0	en
uws.contributor.advisor	Wong, Bernard
uws.contributor.affiliation1	Faculty of Mathematics	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en

Files in this item

Name: SanJuan_JustinDavidQuitalig.pdf
Size: 5.316Mb
Format: PDF
Description: Main article
