Author: Steve Reeves, Director, Social Intelligence 

Data analysts across all mediums tend to have at least one thing in common. They’re really good at separating the proverbial wheat from the chaff – or, if you’re gluten-free like me, the rice from the stalk. The bottom line is that the quality of an analyst’s output is directly related to the quality of the dataset being used.

Depending on who you ask, the ratio of noise to quality in data gathered from social networks varies quite a bit, but analysts who work with it daily would likely say that 3-5% is useful, while the rest is garbage -- spam, duplicates, promotional material and other off-topic posts. Why is it important to quantify the size of the fully cleansed and normalized datasets? Because even though 95% or more is classified as unusable, the remaining balance often represents a sample size far greater than what traditional research channels provide. For example, in Alzheimer’s, over 1.5 million relevant discussions have occurred in the past 12 months alone. That figure excludes the data that has already been filtered out (promotional material, spam, duplicates and general off-topic posts), so it is a fresh database on Alzheimer’s, ready for rich contextual analysis.
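
To make the cleansing step concrete, here is a minimal Python sketch of that kind of filtering: dropping duplicates, promotional posts and off-topic posts. It is an illustration only, not DRG's actual pipeline. The function names, the promotional patterns and the topic terms are all hypothetical, and a production system would rely on far richer rules or models.

```python
import re

# Hypothetical promotional phrases; a real filter would use a much larger,
# curated list or a trained classifier.
PROMO_PATTERNS = [r"\bbuy now\b", r"\bdiscount code\b", r"\bclick here\b"]


def normalize(text):
    """Lowercase and collapse whitespace so near-identical posts compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def cleanse(posts, topic_terms):
    """Return posts that are unique, non-promotional and on-topic."""
    seen = set()
    kept = []
    for text in posts:
        norm = normalize(text)
        if norm in seen:
            continue  # duplicate of a post already seen
        seen.add(norm)
        if any(re.search(pattern, norm) for pattern in PROMO_PATTERNS):
            continue  # promotional material or spam
        if not any(term in norm for term in topic_terms):
            continue  # off-topic
        kept.append(text)
    return kept
```

Calling `cleanse(raw_posts, ["alzheimer"])` on a list of raw post strings would return only the unique, on-topic, non-promotional ones, which is the kind of cleansed dataset the figures above describe.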

Since data quality is the number one focus for analysts, we can both understand and appreciate the “proceed with caution” mentality traditional market researchers have towards social media data. So given that there is enough data on social media channels to perform quality analysis, there are two other questions that traditional market researchers tend to ask:

1. What is the acceptable base number of data points to validate trends? It varies with the size of the audience (rare vs. common diseases, for example). We don’t need 1.5 million relevant discussions on a rare disease to quantify and validate trends. We do, however, need a significantly larger base in common diseases like diabetes or breast cancer to statistically validate what we’re seeing.

2. Is the data biased? There are biases in all datasets, including those from traditional market research. We can greatly reduce the likelihood of bias within social media by using inference and weighting techniques, as well as randomization, to get equal and representative samples of the various segments. It also helps to use custom segmentations, where we’re homing in on self-identifiers: filtering by people who disclose that they live a certain type of lifestyle or identify with a particular group -- moms, for example.
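
As a rough illustration of the randomization and weighting ideas in the second point, the Python sketch below draws an equal-sized random sample from each segment and then computes simple post-stratification weights. It is a sketch under stated assumptions, not a description of DRG's methodology. The function names, the `segment_of` callback and the population shares passed in are all hypothetical.

```python
import random
from collections import defaultdict


def equal_stratified_sample(posts, segment_of, per_segment, seed=0):
    """Randomly draw up to `per_segment` posts from every segment."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_segment = defaultdict(list)
    for post in posts:
        by_segment[segment_of(post)].append(post)
    sample = {}
    for segment, members in by_segment.items():
        k = min(per_segment, len(members))  # small segments contribute all they have
        sample[segment] = rng.sample(members, k)
    return sample


def poststratification_weights(sample_counts, population_shares):
    """Weight each segment so weighted sample shares match assumed population shares."""
    total = sum(sample_counts.values())
    return {
        segment: population_shares[segment] / (count / total)
        for segment, count in sample_counts.items()
    }
```

For example, if two segments are sampled equally (10 posts each) but one is assumed to make up 25% of the real audience and the other 75%, `poststratification_weights({"moms": 10, "other": 10}, {"moms": 0.25, "other": 0.75})` gives weights of 0.5 and 1.5, so the smaller group is not over-represented when results are aggregated.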


As a starting point, though, it’s important to know how much data we have to work with.

For our clients’ convenience, DRG has built a Social Data Index, which outlines the general size of the usable data universe across a variety of conditions for which we’ve built repositories that are ready for analysis. Note that these numbers represent disease- or condition-level discussions and do not necessarily take into account mentions of specific brands or products, which would increase the size of the datasets below significantly.


Here is a snapshot of the size of cleansed datasets across a number of conditions we find our clients asking about:


Many of the above conditions (to which we will add in the coming weeks) can be further segmented to focus on audiences, product attributes, behaviors, attitudes, emotions and other valuable segments.



Social media data contains answers to questions we as marketers and researchers didn’t even know we should be asking. That’s really the value of social: the richness of insight it provides into the mindsets of your audience. With a sound process for culling the data, and a methodology that moves far beyond traditional social listening, you have the ability to create new value for your organization and, ultimately, better serve your customers.

For more info about the DRG Social Data Index and our social intelligence offering, contact
