Coursework for TIE-22306 Data-Intensive Programming
MySportShop is a sports gear retailer. All sales happen online in their webstore. Examples of their
products are game jerseys and sports watches.
The webstore has an Apache web server for the incoming
HTTP requests. The web server logs all traffic to a log file.
Using these logs, one can study the browsing behavior of the customers.
The sales data of MySportShop is in PostgreSQL, which is
a relational database. Among other things, the database
has a table order_items containing data of all sales
events of the shop.
Based on the data, answer the following questions:
- What are the top-10 best selling products in terms of total sales?
- What are the top-10 browsed products?
- What anomaly is there between these two?
- What are the most popular browsing hours?
Since the managers of the company don't use Hadoop but an RDBMS,
all the data must be transferred to PostgreSQL. Therefore, the detailed tasks are:
- Transfer Apache logs (with Apache Flume) to the HDFS
- Compute the frequencies of viewing of different products
using MapReduce (Question 2)
- Compute the viewing hour data with MapReduce (Q4)
- Transfer the results (with Apache Sqoop) to PostgreSQL
- Find the answers to the questions in PostgreSQL using SQL (Q1-4)
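The two counting tasks (Q2 and Q4) can be sketched in plain Java before any Hadoop classes are involved. The sketch below is only an illustration under stated assumptions, not the expected solution: it assumes the log lines follow the Apache Common Log Format, it assumes product pages have paths of the form /product/<id> (a hypothetical URL scheme; check the real logs), and it counts every request when computing hours. The class and method names are made up for this example. In an actual MapReduce job, the two extraction methods would correspond to the map step and the summing to the reduce step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java sketch of the two counting jobs; no Hadoop classes are used. */
class AccessLogCounts {

    // Common Log Format prefix: host ident user [dd/MMM/yyyy:HH:mm:ss zone] "METHOD path ...
    // Group 1 = hour of day, group 2 = requested path.
    private static final Pattern LOG_LINE = Pattern.compile(
            "^\\S+ \\S+ \\S+ \\[\\d{2}/\\w{3}/\\d{4}:(\\d{2}):\\d{2}:\\d{2} [^\\]]+\\] \"\\S+ (\\S+)");

    // ASSUMPTION: product pages look like /product/<id>. Adjust to the real URL scheme.
    private static final Pattern PRODUCT_PATH = Pattern.compile("^/product/(\\d+)");

    /** "Map" step for Q2: the product id viewed in this line, or null if there is none. */
    static String productIdOf(String line) {
        Matcher log = LOG_LINE.matcher(line);
        if (!log.find()) {
            return null; // malformed line, skip it
        }
        Matcher product = PRODUCT_PATH.matcher(log.group(2));
        return product.find() ? product.group(1) : null;
    }

    /** "Map" step for Q4: the hour ("00".."23") of the request, or null if malformed. */
    static String hourOf(String line) {
        Matcher log = LOG_LINE.matcher(line);
        return log.find() ? log.group(1) : null;
    }

    /** "Reduce" step for Q2: adds 1 per product view, keyed by product id. */
    static Map<String, Integer> countProductViews(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String id = productIdOf(line);
            if (id != null) {
                counts.merge(id, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** "Reduce" step for Q4: adds 1 per request, keyed by hour of day. */
    static Map<String, Integer> countRequestsByHour(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String hour = hourOf(line);
            if (hour != null) {
                counts.merge(hour, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, only for trying the methods out.
        List<String> sample = Arrays.asList(
                "10.0.0.1 - - [14/Jun/2014:10:30:13 -0400] \"GET /product/42 HTTP/1.1\" 200 1234",
                "10.0.0.2 - - [14/Jun/2014:21:05:59 -0400] \"GET /index.html HTTP/1.1\" 200 55");
        System.out.println(countProductViews(sample));
        System.out.println(countRequestsByHour(sample));
    }
}
```

Keeping the parsing in small static methods like these makes it possible to unit test the regular expressions on a few log lines before running the job on the whole data set in HDFS.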
Environment: three options
- Recommended option: You can use your own computer by installing
  - We offer you a virtual machine on which all the
    required software and data have been installed
  - Help is available in the weekly exercises
- We offer you a virtual machine from TUT cloud
  - All required software and data are installed
  - No graphical user interface
  - Guidance is available in the weekly exercises
  - Send an email to the lecturer if you choose this option
- Own installation/cloud service can be used
The work is done in groups of three.
Instructions for returning the work are the following (it is enough if only one person from each group returns the work).
I assume that all the Java code and the Flume configuration file are part of the IntelliJ IDEA project, that is, they are
in the ~/IdeaProjects directory. If you have used some other IDE, just return the working directory packed with tar (I'll contact you if
I encounter problems). The deadline for the work is Oct 14.
- Open a new terminal in your DIP virtual machine
- Run the following command in the terminal (change NNNNNN to your student number)
% tar zcvf NNNNNN.tar.gz IdeaProjects
- Open the web browser and navigate to http://webmail.student.tut.fi/
- Log in with your credentials
- Send an email to firstname.lastname@example.org with subject "DIP COURSE WORK"
- The body of the email must contain the answers to the questions with the SQL statements you used.
- Attach NNNNNN.tar.gz to the email.
This page was last updated on August 31, 2016.