“We were able to form a model to predict the personality of every single adult in the United States of America.” — Alexander Nix, former CEO of Cambridge Analytica
Clicking ‘Like’ is something most of us do without thinking. It’s a form of social currency that gives us a momentary jolt and revs up our dopamine centers. Yet who knew all our ‘Likes’ could predict our personality and be turned into a tool of political persuasion against us?
What follows next is:
- A cautionary tale of how big data can be abused in the hands of bad actors.
- A step-by-step spreadsheet tutorial that explains a personality-predicting algorithm: LASSO regression.
- And a warning about the digital footprints you leave across the web.
Nearly 2 years after the election that shocked the world, the Facebook scandal with Cambridge Analytica (“CA”) remains in the eye of the media firestorm. And it seems like every juicy angle of the story’s been covered.
But 1 question continues to go unaddressed:
How do these models work??
Not on some surface level (a correlation of Facebook Pages you like to personality quiz scores). But under the hood. In a language most people can understand…good ol’ Mr. Excel.
As a self-confessed data nerd and spreadsheet activist, I set out to satisfy my curiosity and was surprised at the simplicity of what I found.
With a regularized linear regression model called LASSO, I’ll use spreadsheets to show you how machine learning can predict your personality better than your family can with only 125 of your Facebook Likes. Sounds a bit dystopian, but true.
Your data is valuable and you deserve to know how it’s being used.
Parts 1–3 (~10 minutes) cover WHAT was done and give you background on the scandal. You’ll learn the practice of micro-targeting (delivering personalized ads to you based on your personality profile) and how 87 million Facebook profiles were harvested.
Parts 4–5 (~15 minutes) cover HOW it was done — the data science. Skip to here if you’re familiar with the scandal and just want the tutorial. Part 5 is where we get under the hood and look at the details.
The model is best viewed in Excel, but you can also view online in Google Sheets (calcs are slower) and there’s a PDF with the math (part 5 of post) for easy offline reference.
Like any super power, machine learning will be used by villains (Cambridge Analytica) and heroes alike (Crisis Text Line). Proceed with caution.
If you enjoy this post and want to get other free spreadsheet tutorials that help you understand things like facial recognition or how your Netflix recommendations are generated, sign up for my email list here:
Cambridge Analytica, the data-driven analytics firm that specialized in psychographic profiling or “PSYOPS,” was hired by the Trump campaign in June 2016 to lead its digital operations through the November election. CA has boasted of having up to 5,000 data points on every adult in the US.
In fact, they list this as an achievement on their website (along with bankruptcy notices as of May 2018).
In the months leading up to the election, CA poured its treasure trove of data into its personality-prediction machine, building 10,000 highly personalized Facebook ads targeted at different audiences in an effort to sway undecided voters and suppress voter turnout.
Ken Bone anyone?
They call this advertising practice ‘micro-targeting.’ Christopher Wylie, the former CA whiz-kid data scientist and whistleblower who helped build the tools, calls it:
Your voter records, where you shop, your tweets, your geo-location phone data, what church you go to, the TV shows you watch, your Facebook Likes. They have it.
Especially your Likes.
And with enough of your Likes, researchers at the forefront of the field of psychographic profiling proved in 2014 that computers are more accurate at predicting your personality than the people closest to you.
- With just 9 Facebook Likes, a computer can predict your personality as well as a colleague.
- With 65 Likes, as well as a friend.
- And with 125 Likes, your family.
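Under the hood, studies like these start by turning Likes into numbers: each Facebook Page becomes a yes/no feature for every user. Here’s a minimal sketch in Python with pandas — the users and Pages are made up for illustration; the point is the shape of the data, a user-by-Page matrix of 1s and 0s:

```python
import pandas as pd

# Hypothetical Likes log: one row per (user, Page) Like.
likes = pd.DataFrame({
    "user": ["ann", "ann", "bob", "bob", "bob", "cat"],
    "page": ["Curly Fries", "Thunderstorms", "Hello Kitty",
             "Curly Fries", "NASCAR", "Thunderstorms"],
})

# Pivot into a binary user x Page feature matrix: 1 = Liked, 0 = not.
# (Each user-Page pair appears at most once, so the counts are 0 or 1.)
X = pd.crosstab(likes["user"], likes["page"])

print(X)
```

Each row of `X` is then paired with that user’s personality quiz scores, and a model learns which Pages predict which traits.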
According to the Cambridge researcher who ended up harvesting the Facebook data used in CA’s micro-targeting operation, it was possible to go from scratch to building such a machine learning model in just 1 week:
Data’s far more valuable than models because if you have the data it’s very easy to build models — because models use just a few well understood statistical techniques to make them. I was able to go from not doing machine learning to knowing what I need to know in one week. That’s all it took. — Aleksandr Kogan
He was right.
In the grand scheme of things, the model (a sort of correlation between your Likes and personality quiz scores) is the easy part.
Before discussing how psychological profiling was used and how 87 million Facebook profiles were harvested, there are a few things I want to get off my chest…
1. Data science is not a panacea. It alone can’t rig an election or make a good candidate bad. However, when elections are decided by razor-thin margins, there’s no doubt that technology can play a role.
2. The effectiveness of micro-targeting in politics is widely debated, and we don’t know how well it worked for the Trump campaign. In the aftermath of the election, key players from all sides of the scandal (the Trump campaign, CA, Ted Cruz’s campaign, Aleksandr Kogan) have publicly claimed CA’s models were worthless snake oil, the best thing since sliced bread, or not used at all. Other skeptics question the effectiveness of micro-targeting altogether, while researchers in the field say otherwise. Given that CA’s models aren’t in the public domain (thankfully), I won’t speculate on how much they did or didn’t help Trump win. Just keep in mind that most sides are incentivized to either distance themselves or take credit.
3. The root of this scandal was consent. Prior to 2014, Facebook’s developer policy allowed 3rd-party apps to collect data from Facebook users’ friends without those friends’ consent. So even though only ~300,000 people took the online personality quiz and gave consent, it exposed the data of 87 million friends who never explicitly consented. The data was subsequently shared with CA (a violation of Facebook’s policy). There was no hack, and the data is forever out in the wild.
4. Your digital footprint is growing, machines are getting smarter, and politics is a high-stakes game. This isn’t the last time you’ll hear about bad actors using your data for nefarious ends (sigh), but being informed lets you make your own choices about data privacy. Data brokers don’t discriminate by political party, and in some cases you can see the data these orgs have on you. Firms like CA are already gearing up for the 2020 US election.
Now that I got that off my chest, let’s look at how psychological profiling works.
Psychometrics is the scientific attempt to measure someone’s personality, and the standard “Big Five” model it relies on has been around since the 1980s.
Simply put, people take a Q&A personality quiz and are scored on the standard “Big Five” personality traits, known as OCEAN: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism.
And once you know someone’s personality, you can sort them into different advertising segments and deliver tailor-made persuasive ads that cater to their individual tastes.
Even though psychographic profiling has been around for decades, the problem has always been data collection.
Since most personality quizzes involve 100+ questions (example below) and take 20–30 minutes to complete, it’s never been easy to collect data on a wide swath of people across the globe.
And then Facebook emerged in 2004.
The term ‘Big Data’ was coined in 2005.
The first iPhone was released in 2007.
And our digital footprint started to grow exponentially.
Suddenly, researchers saw an opportunity.
In 2007, 2 psychology researchers at Cambridge’s Psychometrics Centre created a Facebook personality quiz app (myPersonality), and it went viral. Over 6 million people used the app, and nearly half of them allowed the researchers to use their Facebook data.
Social scientists now had a goldmine of personality data they could mine for insight.
In 2013, they published groundbreaking research that showed your Facebook Likes could predict your personality and political views (among other things):
Interestingly, it showed that the best predictors of high intelligence were likes of “thunderstorms,” “The Colbert Report,” and “curly fries.” (note: correlation isn’t causation…eating curly fries won’t make you smarter Stephen haha!).
Similarly, people who liked “Hello Kitty” tended to have high ‘O — Openness’ scores.
In their research, they sounded the warning bell on the implications for data privacy and ownership, but people with other agendas soon took notice.
In the summer of 2013, before he joined CA, Christopher Wylie discovered the paper and said:
“And then I came across a paper about how personality traits could be a precursor to political behaviour, and it suddenly made sense. Liberalism is correlated with high openness and low conscientiousness, and when you think of Lib Dems they’re absent-minded professors and hippies. They’re the early adopters… they’re highly open to new ideas. And it just clicked all of a sudden.”
Later in 2013, Wylie was introduced to Alexander Nix, who offered him a job as research director at the British behavioral research and communications company Strategic Communications Laboratories, or “SCL,” which subsequently created the US subsidiary Cambridge Analytica.
“We’ll give you total freedom. Experiment. Come and test out all your crazy ideas.”
That wasn’t a good idea.
While doing research for SCL, Wylie got connected with Steve Bannon. Given Bannon’s interest in politics as executive chairman of Breitbart and his support of Nigel Farage’s pro-Brexit Leave.EU campaign, Wylie’s ideas on political messaging intrigued him.
In December 2013, CA was formed, with Bannon as its vice president and financial backing from Republican megadonor and AI pioneer Robert Mercer.
With their piggy bank fully stocked, CA needed to find a way to get the Facebook data and deliver on their promises of political persuasion.
Enter Aleksandr Kogan.
Kogan worked as a Cambridge University researcher in the Psychometrics Centre. The same research center where his colleagues published the 2013 paper showing the predictive power of Facebook Likes.
In early 2014, Wiley and Kogan got introduced to each other.
Wiley and CA were interested in the Facebook data Kogan’s research lab had access to from their app.
CA was willing to pay the researchers for the data; however, negotiations broke down between Kogan and his university colleagues (who had both ethical concerns and monetary disputes with Kogan) so Kogan and CA devised another plan.
Kogan offered to build his own app and harvest the data himself. CA agreed and helped him set up a separate entity called Global Science Research, or “GSR.” GSR would acquire the data and share it with CA, and CA would reimburse GSR for the costs.
To expedite the harvesting of data, GSR used a 3rd party online survey firm, Qualtrics, and crowd-sourced respondents through Amazon’s Mechanical Turk to pay users between $2 and $4 to take the personality survey.
The respondents were asked to authorize access to their Facebook profiles, and Kogan’s app performed its sole function — collecting their Facebook data and their friends’ data.
In total, over 300,000 Facebook users took the quiz (at a cost of roughly $1 million to CA), which amounted to harvesting around 87 million people’s Facebook data (your Likes plus other profile data like your location and name).
Here’s a timeline I put together to show how some of the key players are linked:
So what went wrong?
- Facebook failed to read all of GSR’s app’s terms and conditions (see above pic of terms & conditions) during its app review process. Facebook admitted this in court, and GSR’s terms stated it could sell people’s data.
- Regardless of GSR’s own terms, GSR ignored Facebook’s App Developer policy (which Kogan claimed not to have read), which prohibited it from sharing the data with CA.
- Friends Didn’t Give Explicit Consent — Facebook’s App Developer policy (at the time) allowed 1 person to grant access to all of their friends’ data.
Now that CA had their data, it was time for them to use, errr…“abuse” it.
“We exploited Facebook to harvest millions of people’s profiles. And built models to exploit what we knew about them and target their inner demons. That was the basis that the entire company was built on.” — Christopher Wylie, former data scientist at CA
Here’s a simple example of how micro-targeting works:
- ID Voters — Using voter records and other data, identify undecided voters
- ID Hot Topic — Choose a ‘hot button’ topic that’s important to voters like the 2nd amendment (gun rights)
- Tweak Ad to Fit Personality — Based on the individual’s personality profile, nuance the messaging to resonate better with that person.
- Anonymously Deliver Ads — Through a Facebook practice known as ‘dark posts’ (now banned), purchase ads anonymously and deliver to people who fit your criteria
As explained by Alexander Nix, a person with high neuroticism (‘N’) and conscientiousness (‘C’), for instance, would need a message that is rational and fear-based. They’d be shown a picture of a burglar breaking into their home, because the threat of burglary, paired with the “insurance policy” of owning a gun, is persuasive to them.
Conversely, a person with a low level of Openness (‘O’) who is highly agreeable (‘A’) would need a message rooted in tradition and family. They’d be shown a picture of a father passing values down to his son.
According to an internal PowerPoint leaked by a former CA employee, their ads were seen 1.5 billion times during the Trump campaign.
With the rise of big data, it’s no surprise that advertising has become increasingly personalized. The issue, though, wasn’t personalized advertising. The issues were data privacy, a lack of transparency, and consent.
Now I’ll explain the data science behind the scandal using step-by-step spreadsheets. Yes, there’s math. But you can follow all the formulas…no coding needed.
In this section, I’ll discuss:
- An overview of the algorithms used in the personality prediction research
- An intro to LASSO regression
- How to choose the best lambda in LASSO regression
Think of Part 4 as the “bird’s-eye” view of LASSO regression and Part 5 as the “worm’s-eye” view.
In Part 5, you’ll see all the step-by-step math derivations and references to the formulas in the Excel file. It’s very detailed, but quite frankly, I can’t stand tutorials that hide these details so we’ll get under the hood!
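If you’d rather see the same idea in code than in cells, here’s a minimal scikit-learn sketch of LASSO regression with cross-validated lambda selection. This is my own illustration on synthetic data — not CA’s model and not the spreadsheet’s exact setup — and note that scikit-learn calls lambda `alpha`:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic data: 500 users x 200 Facebook Pages, 1 = user Liked the Page.
X = rng.integers(0, 2, size=(500, 200)).astype(float)

# Hypothetical "true" model: only 10 Pages actually drive the trait score,
# so a good model should push most coefficients to exactly zero.
true_coefs = np.zeros(200)
true_coefs[:10] = rng.normal(0, 1, 10)
y = X @ true_coefs + rng.normal(0, 0.5, 500)  # trait score + noise

# LassoCV fits the model over a grid of lambda (alpha) values and keeps the
# one with the lowest cross-validated error -- the "best lambda" step.
model = LassoCV(cv=5).fit(X, y)

print("chosen lambda:", model.alpha_)
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```

The key property on display is LASSO’s feature selection: the penalty zeroes out coefficients for Pages that don’t help predict the trait, which is exactly why it suits data with thousands of possible Likes per person.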