{"id":491426,"date":"2018-06-17T23:53:43","date_gmt":"2018-06-18T06:53:43","guid":{"rendered":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/?post_type=msr-research-item&#038;p=491426"},"modified":"2018-10-16T22:24:48","modified_gmt":"2018-10-17T05:24:48","slug":"crossbow-scaling-deep-learning-on-multi-gpu-servers","status":"publish","type":"msr-research-item","link":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/publication\/crossbow-scaling-deep-learning-on-multi-gpu-servers\/","title":{"rendered":"CrossBow: Scaling Deep Learning on Multi-GPU Servers"},"content":{"rendered":"<p>With the widespread availability of servers with 4 or more GPUs,\u00a0scalability in terms of the number of GPUs in a server when training\u00a0deep learning models becomes a paramount concern. Systems such\u00a0as TensorFlow and MXNet train using synchronous stochastic\u00a0gradient descent\u2014an input batch is partitioned across the GPUs,\u00a0each computing a partial gradient. The gradients are then combined\u00a0to update the model parameters before proceeding to the next\u00a0batch. For many deep learning models, this introduces a scalability challenge: to keep multiple GPUs fully utilised, the batch size must\u00a0be sufficiently large, but a large batch size slows down model convergence\u00a0due to the less frequent model updates, thus prolonging\u00a0the time to reach a desired level of accuracy.\u00a0This paper introduces CrossBow, a new single-server multi-GPU\u00a0deep learning system that avoids the above trade-off. CrossBow\u00a0trains multiple model replicas concurrently on each GPU,\u00a0thereby avoiding under-utilisation of GPUs even when the preferred\u00a0batch size is small. For this, CrossBow must (i) decide on an\u00a0appropriate number of model replicas per GPU and (ii) employ\u00a0an efficient and scalable synchronisation scheme within and across\u00a0GPUs. 
CrossBow automatically tunes the number of replicas per\u00a0GPU at runtime to maximise training throughput for a given batch\u00a0size. We designed a novel synchronisation scheme that eliminates\u00a0dependencies among model replicas, enabling high throughput and\u00a0scalability. Our experiments show that CrossBow outperforms\u00a0TensorFlow on a 4-GPU server by 2.5\u00d7 with ResNet-32.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>With the widespread availability of servers with 4 or more GPUs,\u00a0scalability in terms of the number of GPUs in a server when training\u00a0deep learning models becomes a paramount concern. Systems such\u00a0as TensorFlow and MXNet train using synchronous stochastic\u00a0gradient descent\u2014an input batch is partitioned across the GPUs,\u00a0each computing a partial gradient. The gradients are then combined\u00a0to [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"Inaugural SysML Conference","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"Inaugural SysML 
Conference","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2018-02-15","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13547],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-491426","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_publishername":"","msr_edition":"Inaugural SysML 
Conference","msr_affiliation":"","msr_published_date":"2018-02-15","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"491429","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","title":"koliousis18crossbow","viewUrl":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-content\/uploads\/2018\/06\/koliousis18crossbow.pdf","id":491429,"label_id":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[],"msr-author-ordering":[{"type":"text","value":"Alexandros Koliousis","user_id":0,"rest_url":false},{"type":"text","value":"Pijika Watcharapichat","user_id":0,"rest_url":false},{"type":"text","value":"Matthias Weidlich","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Paolo Costa","user_id":33218,"rest_url":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Paolo Costa"},{"type":"text","value":"Peter 
Pietzuch","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[],"msr_project":[],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":[],"_links":{"self":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/491426","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/491426\/revisions"}],"predecessor-version":[{"id":491432,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/491426\/revisions\/491432"}],"wp:attachment":[{"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=491426"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=491426"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=491426"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=491426"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=491426"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=491426"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us
\/research\/wp-json\/wp\/v2\/msr-locale?post=491426"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=491426"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=491426"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=491426"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=491426"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=491426"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/new-cm-edgedigital.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=491426"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}