Sarah Alnegheimish’s research interests reside at the intersection of machine learning and systems engineering. Her objective: to make machine learning systems more accessible, transparent, and trustworthy.
Alnegheimish is a PhD student in Principal Research Scientist Kalyan Veeramachaneni’s Data-to-AI group in MIT’s Laboratory for Information and Decision Systems (LIDS). Here, she commits most of her energy to developing Orion, an open-source, user-friendly machine learning framework and time series library that is capable of detecting anomalies without supervision in large-scale industrial and operational settings.
Early influence
The daughter of a university professor and a teacher educator, Alnegheimish learned from an early age that knowledge was meant to be shared freely. “I think growing up in a home where education was highly valued is part of why I want to make machine learning tools accessible.” Her own experience with open-source resources only increased her motivation. “I learned to view accessibility as the key to adoption. To strive for impact, new technology needs to be accessed and assessed by those who need it. That’s the whole purpose of doing open-source development.”
Alnegheimish earned her bachelor’s degree at King Saud University (KSU). “I was in the first cohort of computer science majors. Before this program was created, the only other available major in computing was IT [information technology].” Being a part of the first cohort was exciting, but it brought its own unique challenges. “All of the faculty were teaching new material. Succeeding required an independent learning experience. That’s when I first came across MIT OpenCourseWare: as a resource to teach myself.”
Shortly after graduating, Alnegheimish became a researcher at the King Abdulaziz City for Science and Technology (KACST), Saudi Arabia’s national lab. Through the Center for Complex Engineering Systems (CCES) at KACST and MIT, she began conducting research with Veeramachaneni. When she applied to MIT for graduate school, his research group was her top choice.
Creating Orion
Alnegheimish’s master’s thesis focused on time series anomaly detection: the identification of unexpected behaviors or patterns in data, which can provide users with crucial information. For example, unusual patterns in network traffic data can be a sign of cybersecurity threats, abnormal sensor readings in heavy machinery can predict potential future failures, and monitoring patient vital signs can help reduce health complications. It was through her master’s research that Alnegheimish first began designing Orion.
Orion uses statistical and machine learning-based models that are continuously logged and maintained. Users do not need to be machine learning experts to utilize the code. They can analyze signals, compare anomaly detection methods, and investigate anomalies in an end-to-end program. The framework, code, and datasets are all open-sourced.
“With open source, accessibility and transparency are directly achieved. You have unrestricted access to the code, where you can investigate how the model works through understanding the code. We have increased transparency with Orion: We label every step in the model and present it to the user.” Alnegheimish says that this transparency helps users begin to trust the model before they see for themselves how reliable it is.
“We’re trying to take all these machine learning algorithms and put them in one place so anyone can use our models off-the-shelf,” she says. “It’s not just for the sponsors that we work with at MIT. It’s being used by a lot of public users. They come to the library, install it, and run it on their data. It’s proving itself to be a great source for people to find some of the latest methods for anomaly detection.”
Repurposing models for anomaly detection
In her PhD, Alnegheimish is further exploring innovative ways to do anomaly detection using Orion. “When I first started my research, all machine-learning models needed to be trained from scratch on your data. Now we’re in a time where we can use pre-trained models,” she says. Working with pre-trained models saves time and computational costs. The challenge, though, is that time series anomaly detection is a brand-new task for them. “In their original sense, these models have been trained to forecast, but not to find anomalies,” Alnegheimish says. “We’re pushing their boundaries through prompt-engineering, without any additional training.”
Because these models already capture the patterns of time-series data, Alnegheimish believes they have everything they need to detect anomalies. Her results so far support this theory. The pre-trained models don’t yet surpass the success rate of models that are independently trained on specific data, but she believes they will one day.
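One common way to turn a forecaster into an anomaly detector is to flag the points where the actual value departs sharply from the forecast. The Python sketch below illustrates that general idea only; the article does not describe how Alnegheimish’s prompts or models work. The rolling-median forecaster is a simple stand-in for a pre-trained model, and the function names, window size, and tolerance are all illustrative assumptions.

```python
from statistics import median
from typing import Callable, List, Sequence

# Hypothetical type: a forecaster maps recent history to a next-value guess.
Forecaster = Callable[[Sequence[float]], float]


def rolling_median_forecast(history: Sequence[float]) -> float:
    """Stand-in forecaster (assumption): predict the median of the window."""
    return median(history)


def detect_by_forecast_error(
    values: Sequence[float],
    forecaster: Forecaster,
    window: int = 3,
    tolerance: float = 2.0,
) -> List[int]:
    """Return indices where |actual - forecast| exceeds the tolerance."""
    flagged = []
    for i in range(window, len(values)):
        forecast = forecaster(values[i - window:i])
        if abs(values[i] - forecast) > tolerance:
            flagged.append(i)
    return flagged


series = [5.0, 5.1, 4.9, 5.0, 12.0, 5.1, 5.0, 4.9]
flagged = detect_by_forecast_error(series, rolling_median_forecast)
print(flagged)
```

Because the forecaster is passed in as a plain function, any model that produces a next-value forecast could be swapped in without changing the detection logic.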
Accessible design
Alnegheimish talks at length about the efforts she’s gone through to make Orion more accessible. “Before I came to MIT, I used to think that the crucial part of research was to develop the machine learning model itself or improve on its current state. With time, I realized that the only way you can make your research accessible and adaptable for others is to develop systems that make them accessible. During my graduate studies, I’ve taken the approach of developing my models and systems in tandem.”
The key element of her system development was finding the right abstractions to work with her models. These abstractions provide a universal representation for all models with simplified components. “Any model will have a sequence of steps to go from raw input to desired output. We’ve standardized the input and output, which allows the middle to be flexible and fluid. So far, all the models we’ve run have been able to retrofit into our abstractions.” The abstractions she uses have been stable and reliable for the last six years.
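The idea of a fixed input and output with a flexible middle can be illustrated with a short Python sketch. This is not Orion’s code: the types, function names, and threshold are hypothetical, chosen only to show how interchangeable steps can sit between a standardized signal and a standardized list of anomalous intervals.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types, not Orion's actual interfaces.
Signal = List[Tuple[int, float]]   # standardized input: (timestamp, value)
Interval = Tuple[int, int]         # standardized output: (start, end)
Step = Callable[[list], list]


def run_pipeline(signal: Signal, steps: Sequence[Step]) -> list:
    """Feed the signal through each step in order."""
    data = signal
    for step in steps:
        data = step(data)
    return data


def score_by_deviation(signal: Signal) -> List[Tuple[int, float]]:
    """Middle step: score each point by its distance from the middle value."""
    values = sorted(v for _, v in signal)
    middle = values[len(values) // 2]  # upper median for even lengths
    return [(t, abs(v - middle)) for t, v in signal]


def make_threshold_step(threshold: float) -> Step:
    """Final step: merge consecutive high-scoring points into intervals."""
    def to_intervals(scores: list) -> List[Interval]:
        intervals = []
        start = prev = None
        for t, s in scores:
            if s > threshold:
                if start is None:
                    start = t
                prev = t
            elif start is not None:
                intervals.append((start, prev))
                start = None
        if start is not None:
            intervals.append((start, prev))
        return intervals
    return to_intervals


signal = [(0, 1.0), (1, 1.1), (2, 9.0), (3, 9.5), (4, 1.0), (5, 0.9)]
anomalies = run_pipeline(signal, [score_by_deviation, make_threshold_step(3.0)])
print(anomalies)
```

In this sketch, replacing score_by_deviation with a different scoring step changes the model while the caller still passes in timestamped values and receives intervals back.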
The value of simultaneously building systems and models can be seen in Alnegheimish’s work as a mentor. She had the opportunity to work with two master’s students earning their engineering degrees. “All I showed them was the system itself and the documentation of how to use it. Both students were able to develop their own models with the abstractions we’re conforming to. It reaffirmed that we’re taking the right path.”
Alnegheimish also investigated whether a large language model (LLM) could be used as a mediator between users and a system. The LLM agent she has implemented is able to connect to Orion without users needing to know the small details of how Orion works. “Think of ChatGPT. You have no idea what the model is behind it, but it’s very accessible to everyone.” For her software, users need to know only two commands: Fit and Detect. Fit allows users to train their model, while Detect enables them to detect anomalies.
“The ultimate goal of what I’ve tried to do is make AI more accessible to everyone,” she says. So far, Orion has reached over 120,000 downloads, and over a thousand users have marked the repository as one of their favorites on GitHub. “Traditionally, you used to measure the impact of research through citations and paper publications. Now you get real-time adoption through open source.”
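A two-command interface of this kind might look like the Python sketch below. It is a toy stand-in, not Orion’s implementation: the class name, the z-score rule, and the default threshold are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import List, Tuple


class ToyDetector:
    """Hypothetical two-method detector; a stand-in, not Orion's code."""

    def __init__(self, z_threshold: float = 3.0):
        self.z_threshold = z_threshold
        self._mean = None
        self._std = None

    def fit(self, values: List[float]) -> "ToyDetector":
        """Learn what normal looks like from training values."""
        self._mean = mean(values)
        self._std = stdev(values) or 1.0  # avoid dividing by zero on flat data
        return self

    def detect(self, signal: List[Tuple[int, float]]) -> List[int]:
        """Return timestamps whose values sit far from the learned mean."""
        if self._mean is None:
            raise RuntimeError("call fit() before detect()")
        return [
            t for t, v in signal
            if abs(v - self._mean) / self._std > self.z_threshold
        ]


detector = ToyDetector().fit([10.0, 10.2, 9.8, 10.1, 9.9])
flagged = detector.detect([(100, 10.0), (101, 14.5), (102, 9.9)])
print(flagged)
```

The user calls fit once on historical data and detect on new data; everything about how the scores are computed stays behind those two methods.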