Learn Web Scraping To Get Insight — How Does Sun Exposure Affect Our Mental Health?

Intent

5 min read · May 18, 2021


A few days ago, a colleague of mine asked me about scraping data from Wikipedia. I remember doing something similar once with Puppeteer in Node.js, collecting data from people’s online profiles and checking whether it matched a set of predefined criteria. But I had never actually done it with Python.

That nudged me to get hands-on with the scraping process, and along the way I realized I could also try to draw a conclusion from the resulting data. So, shall we?

I will be scraping data from Wikipedia and WHO, using BeautifulSoup as the scraper. This article covers the whole scraping process and a little of the data analysis (since I am not very fond of that part).

Hypothesis

Countries with more sunshine hours will have fewer mental health problems. (Side note: measured here by schizophrenia and suicide occurrence.)

Photo by Steve Johnson on Unsplash

Code & Scraping Process

As usual, we will install these packages in our virtualenv. A small disclaimer: I am not 100% familiar with all of the packages’ features, so feel free to simplify the setup if some of them overlap in function. Then import them in your .py file.

We will initialize bs4 as per the docs.

Line 1–3: We first fetch the webpage with urlopen, then pass the returned HTML to BeautifulSoup so that we can call the find_all method on it.

Line 4: We will be using find_all a lot in this tutorial. What we want to find is the table below, which is marked up with the HTML table tag. Thus we use soup.find_all('table').

We can try printing the result; it will give us the HTML of our designated table.
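The original code embed is not reproduced here, so the following is only a minimal sketch of the fetch-and-find step. The helper names `fetch_html` and `find_tables` and the inline `SAMPLE_HTML` (with made-up values) are assumptions for illustration; the sample string stands in for the real Wikipedia page so the snippet runs without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw HTML (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def find_tables(html):
    """Parse the HTML and return every <table> element found in it."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("table")


# Stand-in for fetch_html("<the Wikipedia URL>"); the values are made up.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Country</th><th>DALY rate</th></tr>
  <tr><td>Exampleland</td><td>180.5</td></tr>
  <tr><td>Samplestan</td><td>1,020.0</td></tr>
</table>
"""

tables = find_tables(SAMPLE_HTML)
print(len(tables))           # how many tables the page contains
print(tables[0].prettify())  # the HTML of the first table
```

On a real page, find_all usually returns several tables, so you would print a few of them to work out which index holds the data you want.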

Line 11: We will try to get only the content using a loop on the table.

Line 12: We find_all the <tr> tags, i.e. the rows (refer to the image below if you are not familiar with HTML tags).

Line 15: On each row, we find_all the td tags, i.e. the cells.

Line 19, 21, 25: We append each value to a list so that we can use it to create DataFrames with pandas.

HTML tags for table
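The row-and-cell loop described above can be sketched as follows. This is an illustration under assumptions, not the original code: the two-column layout, the list names `countries` and `daly_rates`, and the sample values are all hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample table; the real page has more columns and rows.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>DALY rate</th></tr>
  <tr><td>Exampleland</td><td>180.5</td></tr>
  <tr><td>Samplestan</td><td>1,020.0</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find_all("table")[0]

countries = []
daly_rates = []

for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # header rows use <th>, so they have no <td> cells
    countries.append(cells[0].get_text(strip=True))
    # Remove thousands separators before converting to a number.
    daly_rates.append(float(cells[1].get_text(strip=True).replace(",", "")))

print(countries)
print(daly_rates)
```

Skipping rows without td cells is a simple way to ignore the header row without hard-coding its position.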

Now, let’s create our first dataframe like this:
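The original snippet is not reproduced here; a minimal sketch of building the DataFrame from the collected lists might look like this. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Lists as they might look after the scraping loop (made-up values).
countries = ["Exampleland", "Samplestan"]
daly_rates = [180.5, 1020.0]

df_schizophrenia = pd.DataFrame(
    {"country": countries, "daly_rate": daly_rates}
).set_index("country")  # index by country so the later join lines up

print(df_schizophrenia)
```

Using the country as the index is a design choice that makes it easy to join this table with other per-country data later.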

df result from scraping the schizophrenia table

OK, we successfully scraped the schizophrenia data from Wikipedia! 🎉

Next, we will take another dataset from Wikipedia: Sunshine Duration (links in the references). We repeat the same steps: find_all the tables, loop over the rows, and get the cells.

Line 14–27: The only difference is in the data processing. Since the Sunshine Duration data is given per city, we must group the rows by country, count them up (Line 25), and find the average for each country.

Then we will combine both datasets into a new DataFrame. Since we already have two DataFrames, we should now be able to see the plot and gain insights (fingers crossed! 🤞🏻)
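One way to do the per-country aggregation, assuming the city rows have already been scraped into a DataFrame, is a pandas groupby. This is a sketch rather than the original loop; the column names and the figures are hypothetical.

```python
import pandas as pd

# Made-up city-level rows: (country, city, sunshine hours per year).
city_rows = [
    ("Exampleland", "Alpha City", 2000.0),
    ("Exampleland", "Beta City", 2400.0),
    ("Samplestan", "Gamma Town", 1500.0),
]
df_cities = pd.DataFrame(city_rows, columns=["country", "city", "sunshine_hours"])

# Count the cities per country and average their sunshine hours.
df_sunshine = (
    df_cities.groupby("country")["sunshine_hours"]
    .agg(["count", "mean"])
    .rename(columns={"count": "n_cities", "mean": "avg_sunshine_hours"})
)

print(df_sunshine)
```

Keep in mind that a plain average treats every listed city equally, so a country with only one listed city is represented by that city alone.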

Line 8: Join the data frame together with pandas join

Line 9: df.dropna(inplace=True) removes all rows with null data, as we only need complete rows this time

Line 11: Print the correlation matrix for our conclusions

Line 12: I extracted the combined data to .csv first
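The join, dropna, correlation, and CSV steps can be sketched like this. The frames, column names, and numbers are made up for illustration, and `to_csv()` is called without a path here so that nothing is written to disk.

```python
import pandas as pd

# Two small per-country frames with made-up values.
df_schizophrenia = pd.DataFrame(
    {"daly_rate": [180.0, 170.0, 160.0, 150.0]},
    index=pd.Index(["A", "B", "C", "D"], name="country"),
)
df_sunshine = pd.DataFrame(
    {"avg_sunshine_hours": [1500.0, 2000.0, 2500.0]},
    index=pd.Index(["A", "B", "C"], name="country"),
)

# join() matches rows on the index (a left join by default),
# so country "D" gets NaN for the sunshine column.
combined = df_schizophrenia.join(df_sunshine)
combined.dropna(inplace=True)  # keep only countries present in both tables

print(combined.corr())  # Pearson correlation matrix by default

# With no path, to_csv() returns the CSV text; pass a filename to save it.
csv_text = combined.to_csv()
```

Because country names are the join key, spelling differences between the two sources will silently drop rows, so it is worth checking how many countries survive the dropna.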

Line 21–26: We can try to find the linear regression

Line 27–29: Visualize the data with seaborn scatter plot, and draw the regression line with matplotlib

Line 31: I wanted to see the Least Square plot, but you can skip this one.
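The article does not show which library was used for the regression, so the sketch below swaps in `numpy.polyfit`, which fits a straight line by ordinary least squares. The data points are made up.

```python
import numpy as np

# Made-up points: sunshine hours per year (x) and DALY rate (y).
x = np.array([1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([180.0, 172.0, 161.0, 155.0])

# Degree-1 polynomial fit = least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)
y_fit = slope * x + intercept

# Residuals and R^2 show how well the line describes the points.
residuals = y - y_fit
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(slope, intercept, r_squared)
```

For the plot, the same arrays can be passed to seaborn's scatterplot for the points and to matplotlib's plot for the fitted line.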

That’s it! We have successfully scraped Wikipedia and learned to make plots from the data we scraped! 🎊 🥳

Results and Discussion

Schizophrenia

From the correlation matrix, we see that there is a correlation between total sunshine hrs/year and the DALY rate of Schizophrenia.

correlation matrix for schizophrenia

But as we can see from the regression, although the trend is indeed downward, the data is very scattered, so we cannot conclude that places with longer sun exposure have a lower DALY rate.

Suicide Occurrence

I also ran the same process on the 2019 suicide data from WHO (links in the references), using a heatmap to plot the correlation matrix. As we can see, the correlation coefficient is -0.34, meaning that there is only a weak negative correlation between sun exposure and suicide rate, too weak to support the hypothesis.
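One way to read a coefficient of -0.34, assuming it is a Pearson correlation (the pandas default, though the article does not say which method was used), is to square it. For a simple linear fit, the square is the share of variance in one variable that the other accounts for.

```python
# Correlation coefficient reported in the article.
r = -0.34

# For a simple linear fit, r squared is the fraction of variance explained.
r_squared = r ** 2

print(f"{r_squared:.4f}")  # 0.1156
```

So under that assumption, sunshine hours would account for roughly 12% of the variation in suicide rate, which leaves most of it unexplained.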

Suggestions and Thoughts

I enjoyed the whole learning process, although the results were not as expected. One other thing to note is that many factors contribute to schizophrenia and suicide, such as the availability of health centers, alcohol, stress (traffic, family), personal wealth, etc. So we need more variables rather than relying on sun exposure alone.

If you are interested in gaining further insights, here are my suggestions:

  1. Group the data per continent (and delete the outliers), and regress each continent to find interesting trends
  2. As diseases may grow exponentially, maybe we could try plotting log y on the y axis and rank on the x axis to see how the trend looks?
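Both suggestions can be sketched with pandas and numpy. Everything here is hypothetical: the continent labels, the column names, and the numbers are made up, and outlier removal is left out.

```python
import numpy as np
import pandas as pd

# Made-up per-country data with a hypothetical continent label.
df = pd.DataFrame({
    "continent": ["X", "X", "X", "Y", "Y", "Y"],
    "avg_sunshine_hours": [1500.0, 2000.0, 2500.0, 1800.0, 2300.0, 2800.0],
    "rate": [180.0, 170.0, 160.0, 120.0, 125.0, 130.0],
})

# Suggestion 1: fit one least-squares line per continent.
slopes = {}
for continent, group in df.groupby("continent"):
    slope, _intercept = np.polyfit(group["avg_sunshine_hours"], group["rate"], deg=1)
    slopes[continent] = slope
print(slopes)

# Suggestion 2: rank countries by rate and look at the log of the rate.
ranked = df.sort_values("rate", ascending=False).reset_index(drop=True)
ranked["rank"] = ranked.index + 1
ranked["log_rate"] = np.log(ranked["rate"])
print(ranked[["rank", "rate", "log_rate"]])
```

Plotting log_rate against rank would then show whether the rates fall off roughly linearly on a log scale.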

☀️ But hey, at least we learned something new today! ☀️


Written by Paulina
