
Visualizing the delta in the world's knowledge on a daily basis

Visualizing Google Knowledge Graph

System Designer & Developer of a system that collects and visualizes aggregate data to understand Google’s Knowledge Graph

Team

Individual Project

Mentor

Peter Dornbach

My Role

Software Engineer
System Designer

Deliverable

A MapReduce-based distributed system that collects insights on the Google Knowledge Graph and displays aggregate data to many customer-facing engineering teams.

Time & Skillset

Summer 2017

System Design

User Research

Big Data

Software Development

Distributed Systems

Data Visualizations


Creating a daily analysis of the Knowledge Graph, one of the largest data lakes ever built, is a resource-intensive and time-consuming process with more than 10 engineering teams as stakeholders. To address this, I built a Flume (MapReduce) pipeline on top of the daily traffic flowing into the graph, producing summary tables and dashboard visualizations for engineers to query.

My team was responsible for reconciling new data entering the graph with already existing nodes in order to create relations and avoid duplicates; my project sat on top of that daily ingestion traffic and turned it into queryable summary tables.
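Google's internal Flume API is not public, so as a rough illustration only, here is what an analogous daily summary job might look like in Apache Beam, whose programming model descends from FlumeJava. The file paths and field names (vertical, predicate) are hypothetical placeholders, not the actual schema of the ingestion traffic.

```python
import json
import apache_beam as beam

# Hypothetical paths and field names, for illustration only.
INPUT = "daily_triples.jsonl"     # one JSON record per newly ingested triple
OUTPUT = "daily_counts"           # summary table for stakeholder teams to query

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadDailyTraffic" >> beam.io.ReadFromText(INPUT)
        | "ParseJson" >> beam.Map(json.loads)
        # Key each new triple by (vertical, predicate) so every stakeholder
        # team can look at its own slice of the summary table.
        | "KeyByVerticalAndPredicate"
            >> beam.Map(lambda t: ((t["vertical"], t["predicate"]), 1))
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "FormatCsvRow"
            >> beam.MapTuple(lambda key, count: f"{key[0]},{key[1]},{count}")
        | "WriteSummaryTable" >> beam.io.WriteToText(OUTPUT)
    )
```

Pre-aggregating like this is the point of the summary tables: stakeholder teams query small per-day counts instead of scanning the raw graph traffic themselves.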

 

On top of these summary tables, I was also responsible for building dashboards with statistics about the newly ingested data, which the engineers used to sanitize the graph and detect anomalies.
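The exact statistics behind those dashboards are not described here, but as a minimal sketch of the kind of check they enable, assuming a simple trailing-window z-score over the daily ingestion counts, the anomaly flagging might look like this (window size and threshold are illustrative):

```python
from statistics import mean, stdev

def flag_anomalies(daily_counts, window=14, threshold=3.0):
    """Flag days whose ingestion count deviates strongly from recent history.

    daily_counts: list of (date, count) tuples, ordered by date.
    Returns the (date, count) pairs that look anomalous.
    """
    anomalies = []
    for i in range(window, len(daily_counts)):
        history = [count for _, count in daily_counts[i - window:i]]
        mu, sigma = mean(history), stdev(history)
        date, count = daily_counts[i]
        # Skip flat histories (sigma == 0) to avoid dividing meaning by noise.
        if sigma > 0 and abs(count - mu) > threshold * sigma:
            anomalies.append((date, count))
    return anomalies
```

A sudden spike or drop in a vertical's daily counts is exactly the kind of signal that prompted engineers to investigate the underlying ingestion.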

This was an internship project in which I liaised with 10 vertical managers to curate the aggregate data they needed to analyze and monitor their daily data-ingestion progress. It also included presenting the project to a level 7 director at Google and more than 150 engineers. I was responsible for system design and implementation, as well as optimizing the pipeline to run in real time within the required system constraints.

I believe this project was a turning point in my understanding of what it means to work with many stakeholders toward specific metrics. It was also a great technical challenge: the Knowledge Graph is by far the largest data lake I have ever dealt with, and it took significant engineering to build summary tables and queries that ran efficiently within those constraints.

How did it work?