Coursework for TIE-22306 Data-Intensive Programming

Background Story

MySportShop is a sports gear retailer. All sales happen online in its webstore. Examples of its products are game jerseys and sport watches.

The webstore has an Apache web server for the incoming HTTP requests. The web server logs all traffic to a log file. Using these logs, one can study the browsing behavior of the users.
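
As a rough illustration of how such a log can be read programmatically, the following Java sketch parses a single access-log line. It assumes the server writes the Common Log Format (or the combined format, which extends it); the actual format depends on the server configuration. The class name AccessLogLine, the sample IP address and the sample path are hypothetical and not part of the course material.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One parsed request from an access log, assuming Common Log Format. */
public final class AccessLogLine {
    // Assumed layout: host ident user [timestamp] "METHOD path protocol" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) (\\S+)");

    public final String host;
    public final String timestamp;
    public final String method;
    public final String path;
    public final int status;

    private AccessLogLine(String host, String timestamp, String method,
                          String path, int status) {
        this.host = host;
        this.timestamp = timestamp;
        this.method = method;
        this.path = path;
        this.status = status;
    }

    /** Returns the parsed line, or an empty Optional if the line does not match. */
    public static Optional<AccessLogLine> parse(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new AccessLogLine(
            m.group(1), m.group(2), m.group(3), m.group(4),
            Integer.parseInt(m.group(5))));
    }

    public static void main(String[] args) {
        // Hypothetical sample line; real log lines may differ.
        String sample = "192.0.2.10 - - [14/Jun/2014:10:30:13 -0400] "
            + "\"GET /department/apparel/products HTTP/1.1\" 200 1322";
        parse(sample).ifPresent(r ->
            System.out.println(r.host + " " + r.path + " " + r.status));
    }
}
```

Lines that do not match the assumed format yield an empty Optional rather than an exception, so malformed entries can be skipped or counted separately when studying browsing behavior.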

The sales data of MySportShop is stored in PostgreSQL, which is a relational database. Among other things, the database has a table order_items containing data on all sales events of the shop.


Questions

Based on the data, answer the following questions:

Guidance

Since the managers of the company do not use Hadoop but an RDBMS, all the data must be transferred to PostgreSQL. Therefore, the detailed tasks are

Environment: three options


Groups

The work is done in groups of three.


Material


Returning

Instructions for returning the work are the following (it is enough if only one person from each group returns the work). I assume that all the Java code and the Flume configuration file are part of the IntelliJ IDEA project, that is, they are in the ~/IdeaProjects directory. If you have used some other IDE, just return the working directory packed with tar (I'll contact you if I encounter problems). The deadline for the work is Oct 14.
  1. Open a new terminal in your DIP virtual machine
  2. Run the following command in the terminal (change NNNNNN to your student number):
  3. % tar zcvf NNNNNN.tar.gz IdeaProjects
  4. Open the web browser and navigate to http://webmail.student.tut.fi/
  5. Log in with your credentials
  6. Send an email to timo.aaltonen@tut.fi with the subject "DIP COURSE WORK"
  7. The body of the email must contain the answers to the questions with the SQL statements you used.
  8. Attach NNNNNN.tar.gz to the email.


This page was last updated on August 31, 2016.