Using Data Science to help Women make Contraceptive Choices
By Krittika Krishnan
Krittika Krishnan was an Insight Fellow in Summer 2018 and is now a data scientist at CVS. Before Insight, she received her PhD in Behavioral Neuroscience from the University of Texas at Austin, where she studied the impact of endocrine disrupting chemicals (found in most plastics & cosmetics) and how their effects are inherited from one generation to the next.
Have you, or someone you know, spent hours worrying about what contraceptive would work best for you, and wondered what their side effects are? Have you scrolled through countless online forums trying to understand what other women experience on their contraceptives? I’ve experienced this myself, so during the 4-week Insight Data Science program, I decided to build a tool that could help women make these informed decisions! I learnt so much from my fellow Boston Insight Fellows and Program Directors, and could not have built this app without them!
The tool would:
1. Suggest a contraceptive to women based on what other women like them use
2. Analyze how other women feel about the side effects associated with each contraceptive
3. Display the most talked-about (popular) topics related to that contraceptive
Let’s break down what went into these three parts.
My goal was to be able to suggest a contraceptive to women based on what other women like them were using. To do this, I used several years’ worth of data from the National Survey of Family Growth. The survey was conducted in batches, with each batch spanning 3–4 years’ worth of collected data (2006 to 2015). This national survey contains reported information from women, including information on their demographics and the contraceptive that they were using at the time. Once the data from each survey was combined into one dataframe in pandas, I took a few steps to clean it up.
1. Changed the names of contraceptives to match across batches of collection. For example, “Implant” and “Nexplanon and Implanon” were both recorded simply as “Implant”.
2. Dropped rows with NAs in various fields.
After cleaning, I was left with data from ~12,000 women.
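The combine-and-clean steps can be sketched in pandas; the column names and the toy batches below are illustrative, not the actual NSFG variables:

```python
import pandas as pd

# Toy stand-in for the survey batches; real NSFG columns and values differ
batches = [
    pd.DataFrame({"age": [24, 31], "marital_status": ["Married", None],
                  "contraceptive": ["Pill", "Nexplanon and Implanon"]}),
    pd.DataFrame({"age": [28, None], "marital_status": ["Single", "Widowed"],
                  "contraceptive": ["Nexplanon and Implanon", "IUD"]}),
]

# Combine all batches into one dataframe
df = pd.concat(batches, ignore_index=True)

# 1. Standardize contraceptive names across batches
df["contraceptive"] = df["contraceptive"].replace({"Nexplanon and Implanon": "Implant"})

# 2. Drop rows with NAs
df = df.dropna()
```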
I decided to start with some exploratory data analysis (EDA). Below are some plots looking at the number of users of each type of contraceptive, and histograms of each contraceptive type by age group.
Clearly, some of the contraceptives are more popular and used more often than others, leading to imbalance between classes. We can try to address this in the modeling portion of the project.
Some of the features included age, ethnicity, marital status, number of partners, etc. I thought long and hard about the tool that I wanted to build, and knew that I would not be able to recommend a contraceptive unless the woman using my tool entered all the necessary information, and asking for number of sexual partners would be too invasive. For this reason, I chose to include only three features in my model: Age, Ethnicity* and Marital Status. Since there were several types of marital status (“Separated”, “Married”, “Living together”, “Widowed”, etc.), I grouped them into three levels that encapsulated all these types of relationships — Single, Married or In a Relationship. In addition, Ethnicity was one-hot encoded using sklearn preprocessing.
We’re ready for some modeling!
*One unfortunate limitation from using this data set was that there were only three ethnicities — White, Black and Hispanic. Either only women of these ethnicities were included, or women only had these three ethnicities to choose from in the original data collection from the surveys.
Initially, I tried to create a model that would predict the most likely contraceptive for a woman, regardless of type. Then, with some valuable feedback from my Insight Technical Advisor, I split the contraceptives into two subgroups — semi-permanent or impermanent — and created a separate model for each subgroup. Each subgroup had four contraceptives as predictive classes, and the data was randomly split into an 80% training and 20% test set using sklearn’s model selection.
Since this is a multi-class classification problem, I began with a Gaussian Naïve Bayes model rather than a logistic regression, which tends to do better in binary classification scenarios. I decided to start without any over- or under-sampling to see how a baseline model would do.
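A minimal version of this baseline, using synthetic stand-in features since the NSFG data isn’t reproduced here:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the encoded features (age + one-hot columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 4, size=1000)  # four contraceptive classes per subgroup

# 80/20 split, as in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Gaussian Naive Bayes baseline, no over- or under-sampling
gnb = GaussianNB().fit(X_train, y_train)
baseline_acc = gnb.score(X_test, y_test)
```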
For both subgroups, the GNB model had a slightly higher accuracy (0.60) than the baseline of predicting the majority class every time (0.57). However, the GNB model assumes total independence between features — here, we know that age and marital status would surely be correlated. To take the feature correlations into account, the next model I used was a Random Forest (RF), which performed slightly better than the GNB (0.65). I chose to implement the RF in the product as it would also scale easily as we add more features. Below, you’ll see an example of the confusion matrix for the semi-permanent group.
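The Random Forest step might look like this, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and four contraceptive classes
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 4, size=1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Random Forest handles correlated features and scales as features are added
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_test, rf.predict(X_test))
```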
Clearly, the classes are imbalanced. To try to mitigate this issue, I used the Synthetic Minority Over-sampling Technique (SMOTE, from the imbalanced-learn API imblearn; image below) to generate some data points in the minority classes to help train the model better. SMOTE works by generating data based on k-neighbors in the minority class. I considered under-sampling the majority class, but realized that with my already limited data set, I would be training my model on even less data. As expected, the accuracy on the test set dropped, with no improvement in recall.
You’re probably asking, why did I use accuracy as a metric to measure the performance of my model? I wanted the model’s predictions to reflect the true nature of the distribution of contraceptive use among women. Unlike binary classification scenarios such as fraud detection and identifying breast cancer, I didn’t want my model optimized for its precision or recall on a particular class. In this case, the model’s accuracy best captured its overall performance across all classes. Because this was my goal, I chose to keep my original Random Forest model without any oversampling.
In addition, I noticed that the accuracy of the test set (0.65) was almost exactly that of the training set (0.66), indicating that I needed more predictive features for the model to improve in accuracy.
In the future, with more time and resources at hand, I would look for other sources of data that could potentially be more predictive of contraceptive choice, such as geographic proximity to clinics, income level, etc. I received feedback that it might be useful to incorporate exclusions for those with certain medical conditions — this would be great to incorporate, but I would have to be careful with striking a balance between a suggestive tool and a medical recommendation. I could also spend time tuning additional hyperparameters of the random forest such as tree depth, minimum sample at a leaf, etc, which would be more effective with more predictive features.
To understand how women felt about the various types of birth control, I scraped r/BirthControl for two years’ worth of posts, using the pushshift.io Reddit API. This resulted in ~21,000 posts. From the body of each post, I used a variety of tools to clean up the text.
1. Removed stop words. To ensure that the sentiment around a contraceptive remained accurate, I kept the “negation” words that are in the set of English stopwords from the Natural Language Toolkit (NLTK) corpus.
2. Standardized text by removing URLs and other unnecessary characters, but kept punctuation.
3. Tokenized (separated) sentences from one another using sent_tokenize from NLTK.
4. Stemmed words to remove their suffixes so that the same words do not get counted multiple times (i.e. pills becomes pill).
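The four cleaning steps can be sketched as follows. To keep the sketch self-contained, it uses a tiny inline stopword list and a regex sentence splitter in place of NLTK’s stopword corpus and sent_tokenize (which need downloaded data), but it keeps NLTK’s PorterStemmer:

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stopword list; the project used NLTK's English stopwords
# minus negation words like "not" and "no", which carry sentiment
stopwords = {"the", "a", "is", "i", "on", "my", "and"}

stemmer = PorterStemmer()

def clean(post):
    post = re.sub(r"https?://\S+", "", post)       # 2. strip URLs
    sentences = re.split(r"(?<=[.!?])\s+", post)   # 3. split into sentences
    cleaned = []
    for sent in sentences:
        # 1. lowercase, tokenize, drop stopwords (negations were kept out of the list)
        tokens = [w for w in re.findall(r"[a-z']+", sent.lower())
                  if w not in stopwords]
        # 4. stem, so "pills" and "pill" count as the same word
        cleaned.append(" ".join(stemmer.stem(w) for w in tokens))
    return cleaned

out = clean("I am not happy on my pills. See https://example.com for details!")
```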
Once I had all the sentences separated out, I specifically wanted to look at sentences that mentioned the eight forms of birth control used in the predictive model above. I extracted the sentences that mentioned the above contraceptives, and subsetted them into their respective pandas dataframes (i.e. all sentences mentioning “pill” go into the “pill” dataframe). To keep the sentiment analysis as accurate as possible, I only kept sentences that mentioned a single contraceptive. So, a sentence mentioning both the pill and the IUD wouldn’t be considered in the pill OR the IUD dataframe. Although this reduces the amount of data I have to work with, it keeps the analysis of the sentiment of each sentence as accurate as possible.
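A sketch of the single-mention filter, with hypothetical sentences:

```python
import pandas as pd

contraceptives = ["pill", "iud", "implant", "patch",
                  "ring", "shot", "condom", "withdrawal"]

sentences = [
    "the pill gave me headaches",
    "i love my iud",
    "switched from the pill to an iud",  # mentions two -> excluded everywhere
]

# Keep a sentence only if it mentions exactly one contraceptive,
# and route it to that contraceptive's dataframe
frames = {c: [] for c in contraceptives}
for sent in sentences:
    mentioned = [c for c in contraceptives if c in sent]
    if len(mentioned) == 1:
        frames[mentioned[0]].append(sent)
frames = {c: pd.DataFrame({"sentence": s}) for c, s in frames.items()}
```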
I created a list of potential side effects based on listed ones from the FDA website, and others that women had mentioned to me. Using this list, I created a new pandas dataframe with each sentence and whether or not it mentioned a particular side effect (1 or 0). You can think of this as a customized “bag of words” model, counting the number of times a side effect was mentioned in particular sentence about a contraceptive.
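The customized bag-of-words indicator might be built like this, with an illustrative (not FDA-sourced) side-effect list:

```python
import pandas as pd

# Illustrative side-effect list; the project's list came from the FDA
# website plus side effects women had mentioned
side_effects = ["cramps", "acne", "headache", "nausea", "spotting"]

sentences = ["the pill cleared my acne but gave me headaches",
             "no cramps since i started the pill"]

# One indicator column per side effect: 1 if mentioned in the sentence, else 0
rows = [{se: int(se in s) for se in side_effects} for s in sentences]
bow = pd.DataFrame(rows, index=sentences)
```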
I applied the NLTK Sentiment Intensity Analyzer (SIA) from the VADER (Valence Aware Dictionary and sEntiment Reasoner) package to each sentence. This sentiment analyzer has been pre-trained on social media data, and uses a lexicon-based approach to take into account not only the type of sentiment (negative or positive) but also the intensity of the sentiment expressed, via the “polarity score”. There’s a great article on how this SIA was built and how it works here.
Here’s an example of the code and the resulting data set.
I was able to:
Calculate the frequency of side effect mentions relative to all sentences mentioning a particular contraceptive
Multiply the “polarity score” by the specific side effect that was mentioned in the sentence, for each contraceptive.
This resulted in my final dataframe, below:
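The two computations above can be sketched on a toy per-sentence dataframe (the column names are illustrative):

```python
import pandas as pd

# Toy per-sentence dataframe for one contraceptive: a polarity score plus
# 0/1 indicators for each side effect
df = pd.DataFrame({
    "polarity": [0.6, -0.4, -0.8, 0.2],
    "cramps":   [0,    1,    1,   0],
    "acne":     [1,    0,    0,   0],
})

side_effects = ["cramps", "acne"]

# 1. Frequency of each side-effect mention relative to all sentences
frequency = df[side_effects].sum() / len(df)

# 2. Average polarity of the sentences mentioning each side effect
#    (polarity * indicator, summed, divided by the number of mentions)
weighted = (df[side_effects].mul(df["polarity"], axis=0).sum()
            / df[side_effects].sum())
```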
From this, I was able to build the second part of my app using Plotly in Dash.
One of the Insight Program Directors gave me valuable feedback that my sentiment analyzer would be useless without validation. He suggested that I ask other Insight fellows to look at the same sentences and rate them as either positive, negative or neutral. To do this, I first grouped VADER’s polarity scores into three subsets:
Polarity score > 0.1 = 1 (positive)
Polarity score < -0.1 = -1 (negative)
Polarity score between -0.1 and 0.1 = 0 (neutral)
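In code, the bucketing is just a threshold rule:

```python
def label(score, threshold=0.1):
    """Bucket a VADER polarity score into positive (1), negative (-1), or neutral (0)."""
    if score > threshold:
        return 1
    if score < -threshold:
        return -1
    return 0
```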
Based on ~200 sentences rated by both VADER and other Insight fellows, I was able to develop the confusion matrix below:
Clearly, VADER is doing slightly better than chance at correctly classifying a sentence as positive, negative or neutral, which is still valuable for women who have previously had no information on how other women feel about each contraceptive and their side effects.
There are several ways I can think of to improve this section of the product:
Use a form of named-entity recognition to identify side-effects to get a comprehensive look at all the side-effects experienced.
Build a sentiment intensity analyzer from scratch. This would have to start with asking women to label sentences as positive, negative or neutral. With a sufficient number of these labeled sentences, I could train a basic sentiment analyzer using a classification model.
While the first two aspects of the product give women an idea of what contraceptive might work for them and how other women feel, they might still spend hours on online forums trying to get a better sense of what other women are saying about these contraceptives. To help avoid this, I decided to use topic modeling to provide some relevant Reddit posts for women to read while in the app itself.
Topic modeling is a statistical method used to discover abstract “topics” underlying corpora of text. In this case, specific topic modeling techniques can use the distribution of words within a reddit post and across all reddit posts to create topics, and attribute each reddit post to a particular topic. There are some great articles describing topic modeling techniques (here, here and here), but I’ll briefly describe the one that I chose — Latent Dirichlet Allocation (LDA).
LDA is a probabilistic model, which essentially means it gives a probability distribution for an event. Conversely, a deterministic model would give a single outcome for an event. In the case of topic modeling, LDA consists of two matrices (shown below) — the probability of selecting a word in a topic, and the probability of selecting a topic in a reddit post. It uses the frequency of a word’s appearance within and across posts to create topics.
For example, let’s say I had three very simplified posts:
“Dog cat bananas”
“Bananas bananas bananas”
“Cat cat cat”
The LDA would note the probability of word occurrences within and across the three posts, and create topics based on these distributions. Here, Topic 1 could be “dogs”, Topic 2 could be “cats” and Topic 3 could be “bananas”. Then, LDA would assign Post 1 as having equal probability of coming from Topic 1, 2 or 3, while Post 2 and Post 3 almost certainly come from Topic 3 and Topic 2, respectively.
For this portion of the webapp, I needed to understand what the most popular topic of discussion was among women, for each contraceptive. In addition, I needed to display the posts most relevant to this particular topic. Using LDA helped me extract the most popular topic for each contraceptive, and the top 3 posts associated with that topic. The steps are below.
1. Create a bag of words model using sklearn CountVectorizer
2. Determine the number of topics (since LDA can’t determine this automatically), and run the algorithm from sklearn.decomposition
3. Display the resulting topics, specifying the number of posts
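The three steps can be sketched end to end on toy posts (the project used 10 topics; this toy uses 3):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = ["dog cat bananas",
         "bananas bananas bananas",
         "cat cat cat",
         "my iud cramps stopped after a month",
         "period cramps on the iud were bad"]

# 1. Bag-of-words counts
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(posts)

# 2. Fit LDA with a chosen number of topics
lda = LatentDirichletAllocation(n_components=3, random_state=42)
doc_topics = lda.fit_transform(counts)  # post-by-topic probabilities

# 3. Most popular topic overall, and the top 3 posts for that topic
popular_topic = doc_topics.sum(axis=0).argmax()
top_posts = doc_topics[:, popular_topic].argsort()[::-1][:3]
```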
This allowed me to display relevant posts from Reddit on my app:
Although Non-negative Matrix Factorization is also a great topic modeling tool, when I tried implementing it, it didn’t produce comprehensible topics in this particular context.
Unfortunately, there are a few downsides to using topic modeling. First, you have to specify the number of topics, k, up front. How did I choose 10 as the number of topics? To be quite honest, I tried a variety of topic numbers, and 10 yielded topics that were neither too broad nor too specific. Second, you have to label the topics — an example of the most popular topic that I extracted from posts related to IUDs was “just period days cramps got”. It makes very little semantic sense, but we can tell it’s related to the time frame of periods and the cramping side effect. Labels would help us verify whether the topic model built sensible topics and grouped posts into them correctly.
I would love to build a feedback feature in my app that would allow women to label topics, or rate topics on whether or not they made sense to them.
In short, my hope was to (in 4 weeks) create an app that would not only suggest a contraceptive to women, but also analyze other women’s sentiments around their side effects, and provide them with relevant information on what other women were saying. You can check out the whole webapp here. I am really excited to use data science and machine learning to create a tool like this, at scale, that can hopefully help women make important decisions. I learnt a lot creating this webapp, and it wouldn’t have been possible without the help of the amazing Boston Insight Fellows and Program Directors!!
Feel free to comment if you have any questions or suggestions — I would love to hear your feedback!