Posts

Inside Bento: Jupyter Notebooks at Meta

This episode of the Meta Tech Podcast is all about Bento, Meta’s internal distribution of Jupyter Notebooks, an open-source web-based computing platform. Bento allows our engineers to mix code, text, and multimedia in a single document and serves a wide range of use cases at Meta from prototyping to complex machine learning workflows. Pascal Hartig [...] Read More... The post Inside Bento: Jupyter Notebooks at Meta appeared first on Engineering at Meta. http://dlvr.it/TDMB5y

Simulator-based reinforcement learning for data center cooling optimization

We’re sharing more about the role that reinforcement learning plays in helping us optimize our data centers’ environmental controls. Our reinforcement learning-based approach has helped us reduce energy consumption and water usage across various weather conditions. Meta is revamping its new data center design to optimize for artificial intelligence, and the same methodology will be [...] Read More... The post Simulator-based reinforcement learning for data center cooling optimization appeared first on Engineering at Meta. http://dlvr.it/TD4BJ2

Read Meta’s 2024 Sustainability Report

[...] Read More... The post Read Meta’s 2024 Sustainability Report appeared first on Engineering at Meta. http://dlvr.it/TCqSsP

Meta is getting ready for post-quantum cryptography

The Quantum Apocalypse is coming. The advent of quantum computers has raised real questions about the future of data privacy over the internet.  Someday, advances in quantum computing will make it possible to decrypt sensitive data that was encrypted using today’s complex cryptography systems. In the latest episode of the Meta Tech Podcast you’ll meet Sheran [...] Read More... The post Meta is getting ready for post-quantum cryptography appeared first on Engineering at Meta. http://dlvr.it/TCVChg

How Meta enforces purpose limitation via Privacy Aware Infrastructure at scale

At Meta, we’ve been diligently working to incorporate privacy into different systems of our software stack over the past few years. Today, we’re excited to share some cutting-edge technologies that are part of our Privacy Aware Infrastructure (PAI) initiative. These innovations mark a major milestone in our ongoing commitment to honoring user privacy.  PAI offers [...] Read More... The post How Meta enforces purpose limitation via Privacy Aware Infrastructure at scale appeared first on Engineering at Meta. http://dlvr.it/TCRbJ9

RETINAS: Real-Time Infrastructure Accounting for Sustainability

We are introducing a new metric, real-time server fleet utilization effectiveness, as part of the RETINAS initiative to help reduce emissions and achieve net zero emissions across our value chain in 2030. This new metric allows us to measure server resource usage (e.g., compute, storage) and efficiency in our large-scale data center server fleet in [...] Read More... The post RETINAS: Real-Time Infrastructure Accounting for Sustainability appeared first on Engineering at Meta. http://dlvr.it/TCPLW6

How PyTorch powers AI training and inference

Learn about new PyTorch advancements for LLMs and how PyTorch is enhancing every aspect of the LLM lifecycle. In this talk from AI Infra @ Scale 2024, software engineers Wanchao Liang and Evan Smothers are joined by Meta research scientist Kimish Patel to discuss our newest features and tools that enable large-scale training, memory efficient [...] Read More... The post How PyTorch powers AI training and inference appeared first on Engineering at Meta. http://dlvr.it/TCHqp0

Inside the hardware and co-design of MTIA

In this talk from AI Infra @ Scale 2024, Joel Colburn, a software engineer at Meta, technical lead Junqiang Lan, and software engineer Jack Montgomery discuss the second generation of MTIA, Meta’s in-house training and inference accelerator. They cover the co-design process behind building the second generation of Meta’s first-ever custom silicon for AI workloads, [...] Read More... The post Inside the hardware and co-design of MTIA appeared first on Engineering at Meta. http://dlvr.it/TCFYPv

Bringing Llama 3 to life

Llama 3 is Meta’s most capable openly available LLM to date, and the recently released Llama 3.1 will enable new workflows, such as synthetic data generation and model distillation, with unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed-source models. At AI Infra @ Scale 2024, Meta engineers discussed every step of how we [...] Read More... The post Bringing Llama 3 to life appeared first on Engineering at Meta. http://dlvr.it/TCCC1D

Aparna Ramani discusses the future of AI infrastructure

Delivering new AI technologies at scale also means rethinking every layer of our infrastructure – from silicon and software systems to our data center designs. For the second year in a row, Meta’s engineering and infrastructure teams returned for the AI Infra @ Scale conference, where they discussed the challenges of scaling up an [...] Read More... The post Aparna Ramani discusses the future of AI infrastructure appeared first on Engineering at Meta. http://dlvr.it/TC8WPd

How Meta animates AI-generated images at scale

We launched Meta AI with the goal of giving people new ways to be more productive and unlock their creativity with generative AI (GenAI). But GenAI also comes with challenges of scale. As we deploy new GenAI technologies at Meta, we also focus on delivering these services to people as quickly and efficiently as possible. [...] Read More... The post How Meta animates AI-generated images at scale appeared first on Engineering at Meta. http://dlvr.it/TBwnL0

A RoCE network for distributed AI training at scale

AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training and enabling large models with hundreds of billions of parameters, such as Llama 3.1 405B. This week at ACM SIGCOMM 2024 in Sydney, Australia, we are sharing details on the network we have built at Meta [...] Read More... The post A RoCE network for distributed AI training at scale appeared first on Engineering at Meta. http://dlvr.it/TBXCRm

DCPerf: An open source benchmark suite for hyperscale compute applications

We are open-sourcing DCPerf, a collection of benchmarks that represents the diverse categories of workloads that run in data center cloud deployments. We hope that DCPerf can be used more broadly by academia, the hardware industry, and internet companies to design and evaluate future products. DCPerf is available now on GitHub. Hyperscale and cloud datacenter [...] Read More... The post DCPerf: An open source benchmark suite for hyperscale compute applications appeared first on Engineering at Meta. http://dlvr.it/TBXC3t

Meet Caddy – Meta’s next-gen mixed reality CAD software

What happens when a team of mechanical engineers gets tired of looking at flat images of 3D models over Zoom? Meet the team behind Caddy, a new CAD app for mixed reality. They join Pascal Hartig (@passy) on the Meta Tech Podcast to talk about teaching themselves to code, disrupting the CAD software space, and [...] Read More... The post Meet Caddy – Meta’s next-gen mixed reality CAD software appeared first on Engineering at Meta. http://dlvr.it/T9mJpz

AI Lab: The secrets to keeping machine learning engineers moving fast

The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. It allows us to continuously A/B test common ML workflows – enabling proactive improvements and automatically preventing regressions on TTFB.  AI Lab prevents TTFB regressions [...] Read More... The post AI Lab: The secrets to keeping machine learning engineers moving fast appeared first on Engineering at Meta. http://dlvr.it/T9gRb1

Taming the tail utilization of ads inference at Meta scale

Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization. The tail utilization optimizations at Meta have had a profound impact on model serving capacity footprint and reliability.  Failure rates, which are mostly timeout errors, were reduced by two-thirds; the compute footprint delivered 35% more work for [...] Read More... The post Taming the tail utilization of ads inference at Meta scale appeared first on Engineering at Meta. http://dlvr.it/T9QxWF

Meta’s approach to machine learning prediction robustness

Meta’s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta’s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupted benefit delivery to our users and advertisers. To minimize disruptions and ensure our ML systems are intrinsically [...] Read More... The post Meta’s approach to machine learning prediction robustness appeared first on Engineering at Meta. http://dlvr.it/T9PxWg

The key to a happy Rust/C++ relationship

The history of Rust at Meta goes all the way back to 2016, when we first started using it for source control. Today, it has been widely embraced at Meta and is one of our primary supported server-side languages (along with C++, Python, and Hack). But that doesn’t mean there weren’t any growing pains. Aida [...] Read More... The post The key to a happy Rust/C++ relationship appeared first on Engineering at Meta. http://dlvr.it/T8lmgy

Leveraging AI for efficient incident response

We’re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system. The system uses a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations. Our testing has shown this new system achieves 42% accuracy in identifying root causes for investigations at their [...] Read More... The post Leveraging AI for efficient incident response appeared first on Engineering at Meta. http://dlvr.it/T8jLB2

PVF: A novel metric for understanding AI systems’ vulnerability against SDCs in model parameters

We’re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems’ vulnerability against silent data corruptions (SDCs) in model parameters. PVF can be tailored to different AI models and tasks, adapted to different hardware faults, and even extended to the training phase of AI models. We’re sharing results of our own [...] Read More... The post PVF: A novel metric for understanding AI systems’ vulnerability against SDCs in model parameters appeared first on Engineering at Meta. http://dlvr.it/T8VMWQ