October 17, 2024
Learn more about ActiveFence's powerful, precise detection models
When it comes to evaluating the performance of detection models for content moderation, precision and recall are often the most cited metrics. However, in the complex and evolving world of trust and safety, relying solely on these metrics can provide an incomplete picture, especially when considering operational needs and user safety. This blog post will explore different ways of evaluating detection models and how other trust and safety solution components can help optimize performance, efficiency, and ultimately, user safety.
Core Findings from Our Health Assessments
While precision and recall are essential for understanding how well a model performs in a lab setting, they are only the starting point for measuring operational efficiency.
- Precision measures how many of the items flagged as harmful are actually harmful (i.e., the percentage of true positives among all flagged items).
- Recall measures how many of the actual harmful items were successfully detected by the model (i.e., the percentage of true positives among all real harmful instances).
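To make these definitions concrete, here is a minimal sketch in Python that computes both metrics from the outcome counts of a hypothetical evaluation run; the numbers are illustrative, not real platform data.

```python
# Minimal sketch: computing precision and recall from outcome counts.
# The counts below are illustrative, not real platform data.
true_positives = 40    # harmful items the model correctly flagged
false_positives = 10   # benign items the model flagged by mistake
false_negatives = 15   # harmful items the model missed

# Precision: share of flagged items that are actually harmful.
precision = true_positives / (true_positives + false_positives)

# Recall: share of all harmful items that the model caught.
recall = true_positives / (true_positives + false_negatives)

print(f"precision = {precision:.2f}")  # 0.80
print(f"recall    = {recall:.2f}")     # 0.73
```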
One way we evaluate and address efficiency beyond precision and recall is through our Health Assessments, which provide a tailored analysis of a potential partner platform’s moderation needs by applying ActiveFence policies to a representative sample of their data. This helps uncover the greatest areas of potential efficiency gains and confirms the prevalence of different types of harm.
Having conducted dozens of these health assessments over the years, we have aggregated some of the collected data to gain a deeper understanding of core metrics for operational efficiency.
A key insight from our Health Assessments is that only 5% of platform data is actually harmful. This underscores the importance of operational efficiency in trust and safety: because most content isn’t harmful, optimizing the way the small problematic percentage is handled can lead to significant improvements in both workload and resource allocation.
Different Approaches to High-Risk vs. High-Prevalence Harms
One of the key challenges in trust and safety operations is dealing with both high-risk harms (e.g., child exploitation, terrorist content) and high-prevalence harms (e.g., spam, misinformation). Evaluating detection models in this context means balancing the need to catch harmful content with the potential risk of false positives.
- High-risk harms are those that, even at low prevalence, can cause severe damage. For these types of harms, high recall is crucial—we want to ensure we detect as much of this content as possible, even if it means increasing the rate of false positives. For example, in detecting white nationalism content, the focus is on identifying a higher volume of true positives by recognizing user patterns associated with such content, even at the risk of additional manual reviews.
- High-prevalence harms, like profanity or spam, can often be managed more effectively through automation. A key finding from our Health Assessments is that about 88% of harmful content falls into this category, making it a good fit for automation.
Spam is an example of a high-prevalence harm that can be difficult to detect and action manually. It is often disguised with coded terminology and relies on a high volume of messages, each of which may not be violative on its own, making it impractical for moderators to review at scale. This is where automation can step in.
In one example, a client we worked with found that 50% of the spam shared on its platform was generated by just 1% of its flagged users. One spammer sent the same message 70,000 times in less than two weeks. By understanding these user patterns, that client was able to use automation to reduce the burden of manually handling repetitive and disruptive spam by 80%.
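As a rough illustration of this kind of user-level pattern, the sketch below flags users who repeat the same message at high frequency within a time window. The threshold, look-back window, and function name are hypothetical, not ActiveFence’s actual detection logic.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sketch: flag users who send the same message repeatedly,
# a common spam pattern. Thresholds are illustrative and need tuning.
REPEAT_THRESHOLD = 50          # identical messages before a user is flagged
WINDOW = timedelta(days=14)    # look-back window

def find_spam_suspects(messages):
    """messages: iterable of (user_id, text, timestamp) tuples, in time order."""
    recent = defaultdict(list)  # (user_id, normalized text) -> timestamps
    suspects = set()
    for user_id, text, ts in messages:
        key = (user_id, text.strip().lower())
        # keep only timestamps that fall inside the look-back window
        recent[key] = [t for t in recent[key] if ts - t <= WINDOW]
        recent[key].append(ts)
        if len(recent[key]) >= REPEAT_THRESHOLD:
            suspects.add(user_id)
    return suspects

# Example with synthetic data: one user repeating the same message 60 times
msgs = [("u1", "BUY CHEAP C0INS NOW", datetime(2024, 10, 1) + timedelta(minutes=i))
        for i in range(60)]
print(find_spam_suspects(msgs))  # {'u1'}
```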
Going Beyond Model Metrics: Efficiency and Operational Impact
When looking at both standard efficiency metrics and different violation types, it becomes clear that precision and recall don’t fully capture the operational efficiency that trust and safety teams need to achieve.
Instead, operational efficiency can be evaluated by looking at how effectively a moderation system handles content at scale and supports moderators in managing harms. For instance:
- User-Level Actions: Evaluating the effectiveness of a model should also consider its ability to drive actions at the user level. Models can identify not just harmful content but also harmful users. For example, we found that for many platforms, removing a small number of users responsible for the majority of spam can reduce overall harmful content by over 50%. This highlights the importance of metrics beyond detection, such as reduction in harmful behaviors and user-level impact.
- Automation and Human Review Balance: Another important aspect of model evaluation is the interplay between automation and human moderation. Detection models should strive to push clearly harmful content to automated actioning, while reserving the edge cases and complex decisions for human moderators. A good metric here might be moderator workload reduction or increased actionability—essentially, how much content is efficiently processed without human intervention, and how well automation is freeing up human moderators for nuanced tasks.
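One way to reason about this balance is to route each item by model risk score: auto-action near-certain violations, auto-approve near-certain benign content, and queue only the ambiguous middle for human review, then measure what share of volume never needs a moderator. The sketch below is a simplified, hypothetical illustration; the thresholds and labels are assumptions rather than ActiveFence’s actual routing logic.

```python
# Hypothetical sketch: routing content by model risk score and measuring
# how much of the workload is resolved without human review.
AUTO_REMOVE_THRESHOLD = 0.95   # near-certain violations: actioned automatically
AUTO_ALLOW_THRESHOLD = 0.05    # near-certain benign: approved automatically

def route(risk_score):
    if risk_score >= AUTO_REMOVE_THRESHOLD:
        return "auto_remove"
    if risk_score <= AUTO_ALLOW_THRESHOLD:
        return "auto_allow"
    return "human_review"       # ambiguous cases go to moderators

def automated_share(risk_scores):
    """Share of items resolved without a human moderator."""
    decisions = [route(s) for s in risk_scores]
    return sum(d != "human_review" for d in decisions) / len(decisions)

# Example: a batch where most content is clearly benign or clearly violative
scores = [0.01, 0.02, 0.03, 0.40, 0.97, 0.99, 0.02, 0.60]
print(f"automated share: {automated_share(scores):.0%}")  # 75%
```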
How T&S Solution Components Improve Efficiency
It’s clear from the above that detection model metrics should not be evaluated in isolation. The broader trust and safety ecosystem, which includes automation, moderator tooling, and user-level risk scoring, plays a crucial role in determining the overall effectiveness of harmful content mitigation strategies.
- Automation-Driven Efficiency: When considering efficiency, it’s crucial to assess how well detection models integrate with automation workflows. For example, ActiveOS provides automation paths that are based on risk scores, such as automatically removing content identified as spam or applying a user-level warning or ban. These workflows can lead to immediate operational gains, reducing the volume of manual reviews required by up to 80-90% for specific types of content.
- User-Level Risk Scoring: Incorporating user-level risk scores into evaluation frameworks can also provide valuable insights. By using detection models to identify high-risk users—such as those with a history of repeated harmful behavior—platforms can take proactive actions, such as issuing warnings or applying stricter content monitoring to that specific user. This helps in not only reducing the volume of harmful content but also in identifying behavioral patterns that indicate a high likelihood of future harm.
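To illustrate the idea of user-level risk scoring, the sketch below aggregates a user’s recent detections into a single score and maps it to an escalating action. The severity weights, thresholds, and action names are hypothetical; a real implementation would calibrate them per platform and policy.

```python
# Hypothetical sketch: turning per-item detections into a user-level risk
# score and an escalating action. Weights and thresholds are illustrative.
SEVERITY_WEIGHTS = {"spam": 1.0, "profanity": 1.0, "hate_speech": 5.0}

def user_risk_score(violations):
    """violations: list of (violation_type, model_confidence) for one user."""
    return sum(SEVERITY_WEIGHTS.get(vtype, 1.0) * confidence
               for vtype, confidence in violations)

def user_action(score):
    if score >= 20:
        return "ban"
    if score >= 8:
        return "restrict_and_monitor"
    if score >= 3:
        return "warn"
    return "no_action"

history = [("spam", 0.9), ("spam", 0.95), ("hate_speech", 0.8)]
score = user_risk_score(history)   # 0.9 + 0.95 + 4.0 = 5.85
print(user_action(score))          # "warn"
```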
Evaluating Success Holistically
Ultimately, evaluating detection models should be about understanding their impact on both safety outcomes and operational efficiency. Key performance indicators might include:
- Reduction in Harmful Content: Measuring the percentage reduction in harmful content after the deployment of automated workflows.
- Moderator Efficiency Gains: Evaluating how well models help reduce the workload of moderators by automating predictable tasks.
- User-Level Impact: Assessing how interventions at the user level (e.g., bans, warnings) lead to a reduction in overall platform harm.
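As a hedged illustration, these KPIs can be computed from simple before-and-after measurements; the figures and field names below are hypothetical, not drawn from a real deployment.

```python
# Hypothetical before/after measurements around an automation rollout.
before = {"harmful_items": 12_000, "moderator_reviews": 50_000}
after = {"harmful_items": 5_000, "moderator_reviews": 18_000}

def pct_reduction(before_value, after_value):
    """Fractional reduction from a baseline value."""
    return (before_value - after_value) / before_value

print(f"harmful content reduction:    {pct_reduction(before['harmful_items'], after['harmful_items']):.0%}")        # 58%
print(f"moderator workload reduction: {pct_reduction(before['moderator_reviews'], after['moderator_reviews']):.0%}")  # 64%
```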
The integration of models into a larger trust and safety strategy means we move beyond a narrow focus on model precision and recall. By including metrics that focus on impact, efficiency, and holistic safety, we gain a clearer picture of how well a model performs in the real world—ensuring that we not only detect harm but also mitigate it effectively and sustainably.
Conclusion
Precision and recall are important metrics for evaluating detection models, but they are only part of the story. Trust and safety teams must also consider the operational efficiency of their solutions, the balance between automation and human intervention, and the user-level actions enabled by their detection models. By expanding the evaluation criteria, platforms can ensure that their trust and safety operations are strategic, scalable, and capable of keeping their communities safe in a constantly evolving landscape.