Data Parallelism: How to Train Deep Learning Models on Multiple GPUs

The NVIDIA AI Technology Center at the University of Florida is offering an instructor-led Deep Learning Institute workshop in April: Data Parallelism: How to Train Deep Learning Models on Multiple GPUs.

Workshop Dates: April 11-12, 2024 (Thursday and Friday), from 1:00-5:00 p.m.

Registration Link: https://forms.gle/KiNxdjqxJ7AZCZFk6

The workshop will be held over two days (four hours each day) in Malachowsky Hall’s NVIDIA Auditorium. Its focus is on techniques for data-parallel deep learning training on multiple GPUs to shorten the training time required for data-intensive applications. Working with deep learning tools, frameworks, and workflows to perform neural network training, attendees will learn how to decrease model training time by distributing data to multiple GPUs, while retaining the accuracy of training on a single GPU. The full course outline may be found on this NVIDIA website page.

The course is FREE and open to the university community, but pre-registration is required. Also required is experience with Python. Technologies used in the workshop are PyTorch, PyTorch Distributed Data Parallel, and NCCL.
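
For readers curious about the core idea before attending, here is a minimal sketch of data parallelism. It is a pure-Python simulation, not workshop material: it uses no GPUs, PyTorch, or NCCL, and every function name in it is hypothetical. Each simulated worker computes gradients on its own shard of the batch, the contributions are combined (a stand-in for the all-reduce step that a library such as NCCL would perform), and the resulting update matches the single-device update.

```python
# Pure-Python simulation of data-parallel training (no GPUs or PyTorch needed).
# Model: y = w * x with a mean-squared-error loss. All names are illustrative.

def grad_sum(w, shard):
    """Sum of per-example gradients d/dw (w*x - y)^2 over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)

def split_into_shards(data, num_workers):
    """Give each simulated worker every num_workers-th example."""
    return [data[rank::num_workers] for rank in range(num_workers)]

def single_device_step(w, data, lr):
    """One gradient-descent step using the whole batch on one device."""
    return w - lr * grad_sum(w, data) / len(data)

def data_parallel_step(w, data, lr, num_workers):
    """One step where each worker handles a shard and gradients are combined."""
    shards = split_into_shards(data, num_workers)
    # Each worker computes its gradient contribution independently.
    partial = [grad_sum(w, shard) for shard in shards]
    # Stand-in for an all-reduce: combine the contributions from every worker.
    total = sum(partial)
    return w - lr * total / len(data)

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # the true weight is 3.0
w_single = w_parallel = 0.0
for _ in range(200):
    w_single = single_device_step(w_single, data, lr=0.01)
    w_parallel = data_parallel_step(w_parallel, data, lr=0.01, num_workers=4)

print(w_single, w_parallel)
```

Both runs should end at essentially the same weight, which is the sense in which data-parallel training can retain the accuracy of single-GPU training; the speedup on real hardware comes from the workers processing their shards at the same time.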

If you have any questions about this workshop, please email the instructor, NVIDIA Data Scientist Yunchao Yang (yunchaoyang@ufl.edu).

 

‘Hero’ Calculation Capability Yields Significant Achievement

Basic biology textbooks will tell you that all life on Earth is built from four types of molecules: proteins, carbohydrates, lipids, and nucleic acids. But what if we could actually show that these “molecules of life,” such as amino acids and DNA bases, can be formed naturally in the right environment? Researchers at the University of Florida are using HiPerGator – the fastest supercomputer in U.S. higher education – to test this idea.

“Our previous success enabled us to use Machine Learning and AI to calculate energies and forces on molecular systems, with results that are identical to those of high-level quantum chemistry but around 1 million times faster,” said Adrian Roitberg, Ph.D., a professor in UF’s Department of Chemistry who has been using Machine Learning to study chemical reactions for six years. “These questions have been asked before but, due to computational limitations, previous calculations used small numbers of atoms and could not explore the range of time needed to obtain results. But with HiPerGator, we can do it.” 

HiPerGator – with its AI models and vast capacity for Graphics Processing Units, or GPUs (specialized processors designed to accelerate graphics renderings) – is transforming the molecular research game. Until a decade ago, conducting research on the evolution and interactions of large collections of atoms and molecules could only be done using simple computer simulation experiments; the computing power needed to handle the datasets just wasn’t available.  Read the full press release here.

UFIT Senior Director Erik Deumens explained how this full takeover of HiPerGator was possible: 

“HiPerGator has the unique capability to run very large ‘hero’ calculations that use the entire machine, with the potential to lead to breakthroughs in science and scholarship,” Deumens said. “When we found out about the work Dr. Roitberg’s group was doing, we approached him to try a ‘hero’ run with the code he developed.” 

Researchers interested in discussing using HiPerGator for hero calculations are welcome to contact Dr. Deumens.

UFIT Announces Spring Research Computing Training Schedule

This semester’s Research Computing training schedule is packed with a variety of HiPerGator and Practicum AI workshops, all free for faculty, lab staff, postdoctoral candidates, and students.

Traditional, single-session Research Computing training options will be held on Thursdays in person and online from 10:40 a.m. – 12:00 p.m. Sessions include Introduction to Research Computing & HiPerGator, SLURM Submission Scripts, and Jupyter Notebook and Managing Conda Environments. A three-day Git and GitHub workshop in March, developed by Drs. Catia Silva and Matt Gitzendanner, will feature hands-on activities with no coding background or prerequisites required.

Practicum AI is returning this Spring with two beginner course series: Deep Learning Foundations and Python for AI. Both training series are intended for participants with limited experience who want to explore using applied AI. All Practicum AI sessions will be available via Zoom or in person at Malachowsky Hall’s NVIDIA Auditorium (room 1000).

Visit https://rc.ufl.edu/calendar/ to view the full training schedule and register for any of the workshops. Anyone with questions about Research Computing training, or who is interested in setting up a custom training for their lab team or a class, is welcome to contact Training Team Lead Dr. Matt Gitzendanner.

Powering and Cooling HiPerGator: The UF Data Center

HiPerGator, the University of Florida supercomputer, is housed in the UF Data Center (UFDC). While its power and ranking as the most powerful supercomputer in U.S. higher education are well known, not many people know about the components at the UFDC that help keep HiPerGator online and cooled.

Backup Batteries

HiPerGator and the other computers housed in the UFDC, along with the chilled water pumps and air handlers, are powered through high-power batteries. These batteries ensure that the computers get clean power without spikes or brown-outs. They hold enough power to keep all systems operating for about 10 minutes after an external utilities power failure. During those 10 minutes, UFDC diesel generators begin providing continuing power. The diesel generators also keep the chillers running; the chillers cool their water to 55F and send it to the air handlers, which then cool the air that is used to cool the computers.

Air Exchange

To get fresh air throughout the UFDC and avoid sick-building syndrome, 10% of the air inside the data hall is constantly replaced with outside air, which is cleaned by removing particles and living mold and spores.

UF Data Center Generators

The UFDC has two generators. One has a diesel engine with a capacity of 2.25 MW and produces 1 MW of electricity if the utilities’ power becomes unavailable. A second, similar 4 MW diesel produces the remaining 2.2 MW of electricity to provide the full 3.2 MW that the UFDC is rated for.

Transparent Floor Tiles

The HiPerGator room has a raised floor about three feet high. The mostly empty space beneath it allows cold air to be delivered to the front of the computers. The fans inside the computers blow the cold air past the hot CPUs, and the hot air returns through the ceiling to the air handlers in hallways outside the 5000 sq. ft. HiPerGator room.

Air Handlers

Speaking of the air handlers, they blow hot air past the radiators that have 55F water flowing through them. All 125,000 cubic feet of air in the HiPerGator data hall must be replaced twice every minute to avoid HiPerGator overheating! The ideal temperature for the HiPerGator room? It is 60F.
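
As a rough back-of-the-envelope illustration of what those figures imply, the short calculation below converts "125,000 cubic feet, twice every minute" into an airflow rate. It uses only the numbers quoted above; the variable names are illustrative, and the result is a simple product, not an engineering specification of the air handlers.

```python
# Airflow implied by the figures quoted in the text (illustrative names only).
hall_volume_cubic_feet = 125_000   # air volume of the HiPerGator data hall
air_changes_per_minute = 2         # "replaced twice every minute"

# Cubic feet of air the handlers must move each minute.
airflow_cubic_feet_per_minute = hall_volume_cubic_feet * air_changes_per_minute

# Time available to move one full hall's worth of air.
seconds_per_air_change = 60 / air_changes_per_minute

print(airflow_cubic_feet_per_minute, seconds_per_air_change)
```

In other words, the stated figures work out to 250,000 cubic feet of air per minute, or one complete change of air every 30 seconds.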

Even with the cooling requirements for a supercomputer, HiPerGator ranks high on the worldwide Green500 computing list, and the UF Data Center is a certified LEED® building. Learn more about HiPerGator here.

First Event in Malachowsky Hall’s NVIDIA Auditorium

The Malachowsky Hall for Data Science and Information Technology (DSIT) is a 263,000 sq. ft. academic and research collaboration building for AI and machine learning innovation. The building is named for UF alumnus and NVIDIA co-founder Chris Malachowsky, so it seems very appropriate that the first event in DSIT’s NVIDIA Auditorium is an NVIDIA workshop:

Title: Synthetic Data Generation for Training Computer Vision Models
Date: Friday, Oct. 20 │ 9:00 a.m. – 12:30 p.m.
Location: NVIDIA Auditorium, Malachowsky Hall Rm. 1000

To register, email UFIT Communications with your name, UFID number, and home department or lab affiliation. The workshop is part of NVIDIA’s Deep Learning Institute and will be taught by an NVIDIA instructor. The full synopsis, including links to review prior to the workshop, is available here.

NOTE: Registrants must complete additional NVIDIA steps to be fully registered for the Oct. 20 workshop. Be sure to read the synopsis and take the appropriate steps provided to ensure your NVIDIA Developer Program account is activated and your DLI cloud space is ready for you to fully engage in the workshop. Anyone with questions about this workshop is welcome to contact UFIT’s AI Support Manager Ying Zhang.

Getting Started with HiPerGator

To assist researchers and instructors in getting started with HiPerGator, UFIT produced a series of videos that explain the processes for setting up a HiPerGator account, training and support for UF’s high-performance computing environment, and using HiPerGator in undergraduate courses:

Getting Started with HiPerGator

Teaching with HiPerGator

UFIT also has a video explaining ResVault, a system that can be used for computing on highly regulated data, such as export-controlled data. HiPerGator is also certified for work with PHI, provided the proper procedure is followed.

Our Research Computing staff look forward to meeting you and enabling your line of inquiry. You’ll find many additional resources on the https://rc.ufl.edu/ website to help you begin your journey in UF’s high-performance computing environment, and staff are available for in-person and online consultations as needed to fit your schedule. Please contact Senior Director Erik Deumens if you have any questions about getting started with HiPerGator and our campus’s research computing ecosystem.

Multiple Storage Options for Research

Storage provided by UFIT’s Research Computing department is for research and educational data, code, and documents used on HiPerGator and its ecosystem. Registered HiPerGator users are allotted 40GB of storage in their home directory, but depending on the research project, more storage may be needed. To support research and discovery, UFIT manages three DDN EXAScaler filesystems and offers three tiers of additional storage–blue, orange, and red.

Blue for job input/output
Orange for “warm” storage
Red for NVIDIA DGX A100 SuperPOD workflows

Faculty with long-term projects should become familiar with the service levels included with each tier. The storage offerings are described on the storage use policy page.
Access to additional storage resources is obtained either as a hardware investment or a service investment. Learn more about HiPerGator hardware and service investments here: https://rc.ufl.edu/get-started/purchase-allocation/. Typically, the turnaround time for provisioning additional storage resources is two to three business days.

Researchers from UF, SUS institutions, or SEC universities who would like a consult about their project’s storage needs are welcome to contact the Research Computing staff.

HiPerGator Achieves HITRUST Certification

HiPerGator joins an elite group of university supercomputers that have earned the HITRUST r2 certification. HITRUST certification confirms that UF meets all international security and compliance requirements for data protection and can process large amounts of sensitive data and personal information, including protected health information (PHI). To set up a project on HiPerGator that works with PHI, researchers must still adhere to all policies and procedures listed on this webpage.

HITRUST certification is a way for universities, scientific organizations, and others to demonstrate that specific systems within their environment meet the framework’s rigorous standards and requirements. For certification, independent assessors perform extensive testing and verification of hardware systems, networks, software, procedures, and processes to ensure that the system operates as described in documentation and policies. The HITRUST r2 assessment level is the most strenuous review available and provides the highest level of assurance for organizations to manage their risk.

HiPerGator went online in 2013 for research on open data. The HiPerGator-RV enclave earned NIST 800-171 and NIST 800-53 compliance in 2017. Anyone with questions about UF’s HITRUST certification may contact Research Computing Director Dr. Erik Deumens.

UF-NVIDIA Hackathon: May 17-25

“Attending the 2023 Hackathon will help our team optimize our models to run on HiPerGator and increase their efficiency and performance,” wrote Warrington College of Business Assistant Professor Ivy Munoko. “We use a large dataset with tens of millions of data points.”

Partnering with NVIDIA and OpenACC, the second annual UF-NVIDIA GPU Hackathon began this week. Ten teams of computational researchers and developers are participating, including three external teams representing the National Oceanic and Atmospheric Administration, the University of Alabama, and Arizona State University. Each team is receiving mentorship in GPU programming, high-performance computing, and data applications from NVIDIA and UFIT staff. Professor Munoko’s team includes Karla Saldaña Ochoa, assistant professor, College of Design, Construction, and Planning, and Maxim Terekhov, Ph.D. candidate, Department of Information Systems and Operations Management.

The hackathon is an opportunity to port, accelerate, and optimize scientific applications with programming models and tools hosted through HiPerGator. Participants are also developing a deeper understanding of HiPerGator’s computational capabilities while utilizing applications on the latest supercomputing hardware. Researchers with questions about the hackathon or who would like to schedule a consult about UF-AI computing support may contact Applications Specialist and AI Support Team Lead Ms. Ying Zhang.

Practicum AI Offered This Summer

The Practicum AI program will be offered this summer, from June 7–July 12. Practicum AI is led by Training and Biocomputing Specialist Dr. Matt Gitzendanner.

Practicum AI is a hands-on, applied AI curriculum developed for participants with a limited coding and math background. Using hands-on exercises and graphically based, conceptual content, learners without extensive computational skills can begin exploring applied AI. While all sessions will be available via Zoom, registrants are encouraged to attend in person for the best opportunity to learn, interact with instructors and fellow students, and ask questions. Practicum AI will be held in the UF Informatics Institute. Registration closes May 31. Visit this link to register.

Getting Started with AI │ June 7, 1-5pm
Introduction to artificial intelligence, how it can be applied in diverse disciplines, and some key ethical considerations.
Computing for AI │ June 14, 12:30-5pm
Getting started with the foundational tools used in AI research, including Jupyter Notebooks, Git and GitHub.com, and computer clusters, like HiPerGator.
Python for AI │ July 6 and July 7, 1-5pm
Introduction to the basics of Python programming, which is the predominant language used in AI. The course assumes no prior programming experience. Participants will learn the basics of Python to begin using AI frameworks for AI research.
Deep Learning Foundations (DLF) │ July 10, July 12, and July 13, 2-3:30pm
Introduction to neural networks–how they work and how to train them. Students must attend the July 6-7 Python course to participate in the three-day DLF course.