Spring 2018 | Course Link
Offered by the Human Computer Interaction Institute, the course covers techniques and technologies for creating data-driven interfaces. Students learn about the entire data pipeline, from sensing and cleaning data to different forms of analysis and computation. Various topics in data visualization are addressed, such as perceptual issues, design language, and narrative.
Group of 2 people | 3 weeks
My role: visualization, Tableau, and d3.js
We used a scatter plot on a map, a Sankey diagram, and a stream diagram to explore wildfires in the US. We also built a Naive Bayes prediction model based on selected features. With this story, we want to raise awareness of wildfires and further education about these kinds of disasters. The final deliverable is a website, available here.
The data we use here originally comes from the Forest Service Research Data Archive. We first examined which wildfire-related features are included in the dataset, then proposed several ways to explore it visually:
To get an overall picture of wildfires in the US, we plotted the location, size, and season of the wildfires that occurred in the U.S. from 1992 to 2015 on the map below:
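The original map was built in Tableau and d3.js; the sketch below is only a Python approximation of the same encoding with matplotlib, using randomly generated stand-in data (the column names and value ranges are hypothetical): fire location as position, fire size as marker area, and season as color.

```python
# A hedged sketch: scatter fires on lon/lat axes, encoding size and season.
# The data here is synthetic; the real project used the Forest Service dataset.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n = 200
lon = rng.uniform(-125, -67, n)    # contiguous-US longitude range
lat = rng.uniform(25, 49, n)       # contiguous-US latitude range
acres = rng.lognormal(3, 2, n)     # fire sizes are heavy-tailed
season = rng.integers(0, 4, n)     # 0..3 for winter..fall

fig, ax = plt.subplots()
sc = ax.scatter(lon, lat, s=np.sqrt(acres), c=season,
                cmap="viridis", alpha=0.5)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
fig.colorbar(sc, label="season")
fig.savefig("wildfires.png")
```

Taking the square root of the acreage keeps the largest fires from swallowing the map, since marker area (not radius) is what the eye compares.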
We found many interesting facts about location, season, scale, and even missing data. With this overview in hand, we dived into more specific questions:
The total number of occurrences stays stable. If we don't count miscellaneous or undocumented causes, lightning is one of the primary natural causes of wildfires in the United States. Among human causes, arson and debris burning account for a large share of occurrences compared to other human-induced wildfires.
The sudden increases in wildfire occurrences for certain causes are also worth noting. For example, equipment use caused a relatively consistent number of fires up until 2013, with roughly 1,000-2,000 occurrences a year. In 2014, however, the number of reported wildfires doubled, reaching 4,472 occurrences. We suspect this is due to a classification or reporting issue. Interestingly, the number of wildfires attributed to miscellaneous and undefined causes decreased.
From this chart, it is not easy to see patterns in how the size of a fire relates to the location of the occurrence. However, it is interesting that many small fires occur in Georgia, California, and Texas, all of which are generally dry and hot. The data used for this graph covers all the years combined.
Overall, the three charts above give us a more intuitive sense of wildfires in general. To continue the story, we built a Naive Bayes predictive model to see how well we could classify fire size given a set of features (mainly my teammate's contribution, as I had not yet learned machine learning that semester).
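The modeling step could be sketched as below. This is only an illustration with synthetic data, not my teammate's actual code: the features (latitude, longitude, day of year) and the binary size label are stand-ins, and scikit-learn's Gaussian Naive Bayes is one plausible choice of implementation.

```python
# Minimal Naive Bayes sketch with hypothetical features and synthetic data.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in dataset: columns are latitude, longitude, day-of-year.
X = rng.random((500, 3)) * [50.0, 120.0, 365.0]
# Toy label: "large fire" correlated with day-of-year (fire season).
y = (X[:, 2] > 180).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GaussianNB().fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

Naive Bayes fits a per-class Gaussian to each feature independently, which makes it a fast baseline for checking whether a feature set carries any signal about fire size at all.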
Individual Work | 2 Weeks
Tasks: scraping, exploration, preparing, visualization
A Wikipedia article may have many different language versions, which raised my interest in the relationships among the language versions of Wikipedia articles. This is an individual assignment, and the full report can be found here.
The inspiration for this question comes from the Global Language Network (GLN) by the MIT Media Lab. As the screenshot below shows, languages are connected according to the frequency of book translations. Node size represents the number of native and non-native speakers of a language, and edge thickness represents the number of translations from one language to another.
What I want to build, then, is a mini version of the GLN that shows the relationships among language versions based only on Wikipedia articles.
The chord diagram on the left below shows the relationships among the top 13 languages used in Wikipedia, based on their presence in the 1,000 most essential articles. The multilingual completeness of essential articles in Wikipedia turns out to be quite good.
The next step should be to go beyond the 1,000 most essential articles and enlarge the sample to more and more articles. However, I eventually realized that finding the translate-to/from relationships among more articles on Wikipedia could not be done within 2 weeks as a Python and d3.js beginner. Instead, I found it would also be interesting to apply this visualization to the original GLN dataset, in which the translate-to/from relationships are well documented.
The chord diagram on the top right depicts the relationships among the top 52 languages (by population) in the GLN. Compared with the original visualization, it presents the quantitative relationships in more noticeable proportions. The drawback of this visualization is that the more languages you put into the comparison, the less legible the diagram becomes. This problem can be solved by adding interactive filters so the reader can choose the languages they want to compare.
There were three resources I tried: the API, downloaded XML dumps, and web scraping. There was no API directly returning the language versions of an article. Extracting from the downloaded XML was ineffective: it took too much time to download, and most of the article content is not useful for my study. Web scraping was my baseline, but first I needed to set a range of articles to scrape, because scraping all articles was not feasible.
Luckily, I found a "List of articles every Wikipedia should have". I therefore chose to focus my study on this range (1,000 articles) and started my data collection from there.
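The scraping step could look like the sketch below: given the HTML of a Wikipedia article, collect the language codes of its other language versions. Wikipedia's sidebar marks these with `<a>` tags carrying a `lang` attribute inside list items whose class includes `interlanguage-link`; treating that class name as stable is an assumption, since page markup can change.

```python
# Hedged sketch: parse interlanguage links out of a Wikipedia article's HTML.
# Uses only the stdlib HTMLParser; the real project could equally use
# requests + BeautifulSoup.
from html.parser import HTMLParser

class LangLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.langs = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "interlanguage-link" in attrs.get("class", ""):
            self._in_link = True          # entering a language-version item
        elif tag == "a" and self._in_link and "lang" in attrs:
            self.langs.append(attrs["lang"])
            self._in_link = False

def language_versions(html):
    parser = LangLinkParser()
    parser.feed(html)
    return parser.langs

sample = (
    '<ul><li class="interlanguage-link"><a lang="de" href="#">Deutsch</a></li>'
    '<li class="interlanguage-link"><a lang="fr" href="#">Français</a></li></ul>'
)
print(language_versions(sample))  # → ['de', 'fr']
```

Running this over each of the 1,000 essential articles yields, per article, the set of languages it exists in, which is the raw material for all the charts that follow.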
I started checking the data by counting how many of these 1,000 articles each language has. For example, English has all 1,000, while Tibetan has only 360.
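That count is a one-liner once the scrape result is in hand. The sketch below uses a tiny hypothetical stand-in for the real mapping from article title to available language codes:

```python
# Sketch of the completeness count: how many of the essential articles
# each language covers. The `articles` dict is toy stand-in data.
from collections import Counter

articles = {
    "Mathematics": {"en", "bo", "zh"},
    "Music": {"en", "zh"},
    "Philosophy": {"en"},
}

coverage = Counter(lang for langs in articles.values() for lang in langs)
for lang, n in coverage.most_common():
    print(lang, n)  # e.g. en 3, zh 2, bo 1
```

Sorting by count immediately surfaces the gap between fully covered languages like English and sparsely covered ones like Tibetan.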
From the plot shown above, it is obvious that the most frequently used languages have fulfilled the goal. When we examine some less frequently used languages, we find a trend of missing articles.
From the chord diagram of the top 13 languages (left), we can intuitively see that multilingual translation among these languages is quite good. When we expand the data source to the top 14-46 languages (right), the same pattern is identified. However, the chord diagram is no longer suitable, because the rim of the circle is cut into too many tiny, barely legible segments.
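For context on how these diagrams are fed: d3.js's chord layout consumes a square matrix M where M[i][j] is the flow from category i to category j. For the Wikipedia data, one simple symmetric choice (an assumption on my part, not necessarily the exact definition used in the project) is the number of essential articles two languages share:

```python
# Sketch: build a chord-layout co-occurrence matrix from toy article data.
from itertools import combinations

languages = ["en", "zh", "bo"]
articles = {
    "Mathematics": {"en", "bo", "zh"},
    "Music": {"en", "zh"},
}

index = {lang: i for i, lang in enumerate(languages)}
matrix = [[0] * len(languages) for _ in languages]
for langs in articles.values():
    # Count each pair of co-occurring languages, symmetrically.
    for a, b in combinations(sorted(langs & set(languages)), 2):
        matrix[index[a]][index[b]] += 1
        matrix[index[b]][index[a]] += 1

print(matrix)  # en-zh share 2 articles; en-bo and zh-bo share 1 each
```

The matrix can then be serialized to JSON and handed to d3's chord layout on the front end; the legibility problem above is exactly what happens when this matrix grows past a few dozen rows.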
The major limitation of the current analysis is the coverage of the data sample. Essential articles are meant to be essential across languages, so they are unlikely to be incomplete in terms of language versions. If I enlarge my sample to cover many more articles, the relationships between languages will become more interesting.
Therefore, I applied this visualization to redraw the GLN. Even though we lose some information about minor languages, we get a much more intuitive impression of the relationships among the 52 most used languages in the world. The color of a ribbon connecting two sectors depends on which end represents the language that serves more often as the translation source.
Fall 2018, Audit, Individual Assignments | Course Link
A course from the School of Computer Science: an introduction to basic concepts such as information theory, decision trees, regression, neural networks, Bayesian learning, reinforcement learning, the EM algorithm, SVMs, etc.
I followed around 80% of the course content and assignments. The following are some examples of my homework:
Fall 2018, Individual Assignments | Course Link
R is another popular language for data-related tasks. I took this course as a starting point for getting familiar with the concepts of data mining and with powerful visualization in R through packages like ggplot2.
Spring 2018 | Course Link
This course invites students to create connected products. Topics explored include awareness, real-time sensing and communication, embedded intelligence, and designing experiences for the Internet of Things. Students apply this learning to realize a prototype connected device.
Different products sense data and together create an ambient IoT system that could be given to a child to help them enjoy their bedtime routine and fall asleep. The website is available here.
Spring 2018, Audit
Data and participant observation are two complementary approaches to user research in a digital scenario. By combining insights from both sources, we can form a better understanding of customer behavior and online community culture.
Rental fashion has been emerging in recent years as part of the sharing economy. I audited the course, partially participated in a group focusing on the rental fashion experience, and helped with some data scraping and processing work. The website is available here.