Discover an index of datasets, SDKs, APIs, and open-source tools developed by Microsoft researchers and shared with the global academic community. These experimental technologies, available through Azure AI Foundry Labs, offer a glimpse into the future of AI innovation.
LIDA
LIDA is a library for generating grammar-agnostic visualizations and infographics (it works with any programming language and visualization library, e.g. matplotlib, seaborn, altair, d3). LIDA comprises 4 modules: a SUMMARIZER that converts…
Global Static Analysis With CodeQL
Resource Leak Checker (RLC#) for C# code using CodeQL. RLC# is a lightweight and modular resource leak checker for C# code. It is inspired by Checker Framework’s resource leak checker (RLC) for Java. RLC# is developed…
ReinMax
Bridging Discrete and Backpropagation: Straight-Through and Beyond. Guided by our findings, we propose a novel method called ReinMax, which integrates Heun’s Method, a second-order numerical method for solving ODEs, to approximate the gradient. Our method, ReinMax,…
VISOR
Benchmarking Spatial Relationships in Text-to-Image Generation. Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent large-scale text-to-image synthesis…
Vaccine Search Study
This repository contains code and data for “Accurate Measures of Vaccination and Concerns of Vaccine Holdouts from Web Search Logs” (2023) by Serina Chang, Adam Fourney, and Eric Horvitz.
CO-BED
CO-BED: Information-Theoretic Contextual Optimization via Bayesian Experimental Design. We formalize the problem of contextual optimization through the lens of Bayesian experimental design and propose CO-BED—a general, model-agnostic framework for designing contextual experiments using information-theoretic principles.
Continental United States Distributed Energy Resources (DER) Dataset
We present a dataset of distributed energy resources (DERs) for the contiguous U.S. using only publicly available data. The primary focus of the dataset is on distribution-level utility-scale and distributed solar and storage, given their…
Deep Language Networks
We view Large Language Models as stochastic language layers in a network, where the learnable parameters are the natural language prompts at each layer. We stack two such layers, feeding the output of one layer…