Welcome to the hipages Data Engineer Tech Challenge! The purpose of this challenge is to help us assess your technical skills. We know that you have limited time to devote to this task and may not be able to provide as complete a solution as you would if given more time, so we suggest you focus on the core requirements first, then any additional features if you have time left over.
Your task is to write an ETL program that reads JSON data and performs a series of transformations on it to build structured tables as the result.
- The various hipages applications send their JSON events to a receiver, which dumps these events to disk in the form of JSON lines. An example of such a file is ./source_event_data.json in this repository, with one event per line.
- These events are fired on the hipages website when certain “Actions” are performed by “Users”. The file also contains metadata about the URL the user was visiting, the time, and other related information, as described in the JSON schema file ./source_data_schema.json.
- The team is interested in the approach you follow for solving this problem
- Approach this problem as you would when building an end-to-end ETL pipeline that reads from a source such as a distributed file system or a message queue, and then deposits this data into a data lake for analytics.
- Keep in mind that data in production almost always has anomalies, such as schema violations, incomplete rows, corrupted data, etc. A robust pipeline should account for and gracefully handle such situations (a minimal ingestion sketch follows this list)
- Most of all, this is your time to shine: show us your engineering prowess and demonstrate that you approach this problem like a Data Engineer working with large amounts of data
- Think about how you would test the system, and what you think would be required before it could be moved into production
- You can use Java / Python / Scala.
- Think about future iterations of your solution and how easily it can evolve and scale
- Think about the execution environment and how portable your solution is, e.g. whether you choose to build a distributed Docker container or use a framework like Spark / Apache Beam. Document your design choices
- We have provided some scope, but if you find it too broad you are welcome to limit it; just let us know the thought process and assumptions behind your decisions
- Do not forget to add instructions for the evaluation team on how to read and execute your code
- You can use a publicly hosted version control service like github.com to house the new project you are building, and send us this repository as your submission
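As promised in the anomaly-handling point above, here is a minimal sketch of reading the JSON-lines file while skipping corrupt or incomplete rows. The required field names are assumptions inferred from the output tables below, not taken from ./source_data_schema.json, and should be adjusted to match the real schema.

```python
import json
import logging

# Field names assumed from the output tables below; the authoritative set
# lives in ./source_data_schema.json.
REQUIRED_FIELDS = {"user_id", "time_stamp", "url", "activity"}

def read_events(path):
    """Yield parsed events from a JSON-lines file, skipping bad rows."""
    with open(path, encoding="utf-8") as handle:
        for line_no, raw in enumerate(handle, start=1):
            raw = raw.strip()
            if not raw:
                continue
            try:
                event = json.loads(raw)
            except json.JSONDecodeError:
                logging.warning("Skipping corrupt JSON on line %d", line_no)
                continue
            if not REQUIRED_FIELDS.issubset(event):
                logging.warning("Skipping incomplete event on line %d", line_no)
                continue
            yield event
```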
Your ETL program needs to consume this data and produce output in the form of two CSV tables. The first is an event-level detail table:
Field | Description |
---|---|
user_id | identifier for the user |
time_stamp | UTC timestamp of when the event was fired |
url_level1 | first piece of the URL (i.e. the domain where this action took place) |
url_level2 | second level of the url path |
url_level3 | third level of the url path |
activity | activity that took place |

The second is an hourly aggregate table:

Field | Description |
---|---|
time_bucket | hourly time bucket (e.g. 2017020220 = 2017-02-02, hour 20, UTC) |
url_level1 | first piece of the URL (i.e. the domain where this action took place) |
activity | activity that took place |
activity_count | count of events for the activity |
user_count | number of unique users that performed the activity |
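As a minimal sketch of how the three url_level fields could be derived, the snippet below uses only the Python standard library; the exact splitting rules and the shape of the raw URLs in the events are assumptions to be confirmed against the source data.

```python
from urllib.parse import urlparse

def split_url(raw_url):
    """Return (url_level1, url_level2, url_level3) from a full URL string."""
    parsed = urlparse(raw_url)
    domain = parsed.netloc.lower()
    if domain.startswith("www."):
        domain = domain[len("www."):]                        # level 1 is the bare domain
    segments = [s for s in parsed.path.split("/") if s]
    level2 = segments[0] if len(segments) > 0 else None      # e.g. "connect"
    level3 = segments[1] if len(segments) > 1 else None      # e.g. "sfobusiness"
    return domain, level2, level3
```

For example, `split_url("https://www.hipages.com.au/connect/sfobusiness")` returns `("hipages.com.au", "connect", "sfobusiness")`; missing levels come back as `None` and can be written out as NULL.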
Example results:
user_id | time_stamp | url_level1 | url_level2 | url_level3 | activity |
---|---|---|---|---|---|
564561 | 02/02/2017 20:22:00 | hipages.com | articles | NULL | page_view |
564561 | 02/02/2017 20:23:00 | hipages.com | connect | sfobusiness | button_click |

time_bucket | url_level1 | url_level2 | activity | activity_count | user_count |
---|---|---|---|---|---|
2017020220 | hipages.com | articles | page_view | 4 | 1 |
2017020220 | hipages.com | connect | button_click | 2 | 1 |
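A minimal sketch of how the second table could be derived from the first, assuming the detail rows are held in a pandas DataFrame with parseable UTC timestamps (pandas is an assumption here; Spark or plain Python would work equally well):

```python
import pandas as pd

def hourly_activity(detail: pd.DataFrame) -> pd.DataFrame:
    """Aggregate event-level rows into hourly activity and user counts."""
    out = detail.copy()
    # time_bucket such as 2017020220 means 2017-02-02, hour 20, in UTC.
    out["time_bucket"] = (
        pd.to_datetime(out["time_stamp"], utc=True).dt.strftime("%Y%m%d%H")
    )
    grouped = out.groupby(
        ["time_bucket", "url_level1", "url_level2", "activity"],
        dropna=False,  # keep groups where url_level2 is missing (NULL)
    )
    return grouped.agg(
        activity_count=("user_id", "size"),     # total events per group
        user_count=("user_id", "nunique"),      # distinct users per group
    ).reset_index()

# Writing the result to CSV:
# hourly_activity(detail).to_csv("activity_by_hour.csv", index=False)
```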
Bonus: Think of ways you can transform the source data into another format to expose other types of information. Explain why you chose that transformation and what types of analyses it could be used for.