Wikipedia data engineering practices with Nuria Ruiz
February 11, 2021
I always say that I have been working with data a long time, so long that Big Data was just called "science".
After finishing college with a Physics degree in 1997, the options were really just three: teach high school, apply for a PhD scholarship (which sounded not too bad but was nothing I wanted to do just yet), or this new thing that had to do with computers and "writing computer programs" for a living. This third option seemed like a more exciting prospect than the other two, and after having written a grand total of 5 Fortran programs in my college years, it sounded like something that I, maybe, could do.
At the time, there was quite a need for people who knew their way around computers, and some emerging companies had programs in which they taught you to work while you got paid. This was really a paid bootcamp at a startup, but neither the term "bootcamp" nor "startup" existed quite yet.
I applied to one of those programs and started my way into the professional world writing financial and HR software. On Windows. When I started at Meta4, we were all physicists or mathematicians.
I later figured out that I could get a job as a programmer on more compelling products than HR software, and after a couple of moves (professionally and across the world) I started writing software in a Physical Oceanography lab in Seattle. We had tons of climatological data. Now, since none of the big data stores existed back then, we proudly stored it all in MySQL 2.23. And we used Perl for such a task.
Years later, Amazon.com, which by then was starting to be a big company, needed programmers to migrate from its old codebase (C++ and HTML, not kidding) to a "new" Perl stack. I worked at Amazon for the better part of a decade and learned there most of what I know about scale, prioritization and delivering software.
From Amazon I moved to Tuenti and later to Wikimedia, the nonprofit that maintains the infrastructure for Wikipedia, where I was responsible for what is now the Data Engineering team for 6 years.
I was a Principal Engineer in Wikimedia's Technology team, which includes infrastructure teams like SRE but also Performance and Research. Wikimedia is a small organization and the Technology team is fewer than 200 people.
We had been working remotely before it was cool, and it is not rare for a team of 7 people to span three time zones. Having a daily sync-up to go through the kanban board is key. The backlog of Wikimedia is public and you can take a look here.
The rest of the day is spent pairing with other teams or team members. Once a month we have an event (on IRC) where we interact with the community, so anyone is invited to ask data questions.
The biggest difference between working at Wikipedia and anywhere else (except very large companies the likes of Google or Netflix) is that, due to privacy and cost reasons, everything runs on-prem. There are no managed service offerings on the cloud. Wikimedia does everything from racking boxes to configuring the CDN. If tomorrow AWS goes completely black, Wikipedia will continue working just the same.
Data at Wikipedia
The data engineering team at Wikimedia maintains what would be a traditional data pipeline for product teams to gather data to aid in product execution. What is different in the open knowledge ecosystem is that much of the metadata about Wikipedia is also public. This means that you can get a myriad of datasets with pageviews for every article, unique devices, editors per country, etc.
All this data is delivered every hour and has been for more than 15 years. It is not an overstatement to say that Wikipedia's open datasets have played a key role in the development of many technologies we take for granted, like NLP. Maintaining all the pipelines for data delivery for a system that at peak gets 200,000 requests per second is, needless to say, quite a task.
Something that is not well known is that the whole Puppet repo that describes Wikipedia's infrastructure is public and you can take a look at it on GitHub. To this day I find it impressive that every single piece of infra of all Wikipedia's systems is described there. We also provide pretty pictures of the stack.
The data stack is very standard. As you mentioned, we have Kafka for data intake and Hadoop as persistent storage, and from those two, data gets ingested into Cassandra or Druid. There are different pipelines with different bells and whistles: some of them have a JSON schema registry and others parse data out of the HTTP requests directly. In Wikipedia, 97% of requests at all times are cached and thus served from the CDN via Varnish. So, in order to gather all this data, we need Varnish to be able to talk to Kafka. For that we have a custom piece called varnishkafka that is a source of joy (#not) for our SRE team. The migration to Apache Traffic Server will eventually render this piece of infra obsolete.
There are two distinct sources of data: data from readers and data from editors. While data for readers is very high volume, it is quite "simple". It can be thought of as simple pageviews. Edit data is, however, a lot more complicated, and in order to harvest it properly we developed a Lambda-ish architecture. We source this data two ways. Once a month we pull it from the MediaWiki database directly and, via Spark, after 2 days of processing in a ~60 node Hadoop cluster, we create large denormalized tables. At the same time we have event streams that publish every edit event to Wikipedia as a JSON blob in real time, and those also get persisted on Hadoop. We made a very conscious decision to use JSON versus Avro in most of our data pipelines and, since then, the Kafka ecosystem has become a lot more friendly towards JSON. Makes sense, because JSON is a lot easier to debug.
The visualization layer is Superset, a nifty UI tool that can be used against Druid but also against Presto, a very fast query engine developed by Facebook that can be deployed on top of Hadoop and is fully ANSI SQL compliant.
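To make the Lambda-ish idea concrete, here is a minimal sketch, not Wikimedia's actual code. It assumes a batch view built from the monthly snapshot and a "speed" view built from JSON edit events newer than that snapshot, merged at read time. The field names (`wiki`, `timestamp`) and all function names are hypothetical and chosen only for illustration.

```python
import json
from collections import Counter


def batch_view(snapshot_rows):
    """Edit counts per wiki computed from the (monthly) batch snapshot."""
    return Counter(row["wiki"] for row in snapshot_rows)


def speed_view(json_events, snapshot_end):
    """Edit counts per wiki from streamed JSON blobs newer than the snapshot.

    Timestamps are assumed to be ISO 8601 strings in one fixed format,
    so plain string comparison orders them chronologically.
    """
    counts = Counter()
    for blob in json_events:
        event = json.loads(blob)
        if event["timestamp"] > snapshot_end:
            counts[event["wiki"]] += 1
    return counts


def merged_view(snapshot_rows, json_events, snapshot_end):
    """Serving layer: batch totals plus whatever the stream has seen since."""
    return batch_view(snapshot_rows) + speed_view(json_events, snapshot_end)


# Hypothetical data: three edits in the snapshot, three events on the stream.
snapshot = [{"wiki": "huwiki"}, {"wiki": "huwiki"}, {"wiki": "enwiki"}]
snapshot_end = "2021-01-31T23:59:59Z"
events = [
    json.dumps({"wiki": "enwiki", "timestamp": "2021-02-01T10:00:00Z"}),
    json.dumps({"wiki": "huwiki", "timestamp": "2021-01-30T09:00:00Z"}),  # already in snapshot
    json.dumps({"wiki": "eswiki", "timestamp": "2021-02-02T12:30:00Z"}),
]

totals = merged_view(snapshot, events, snapshot_end)
```

The event that predates the snapshot boundary is skipped so it is not counted twice. Deciding where that boundary sits is the part of a Lambda-style design that usually needs the most care.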
Everything is developed from the ground up so we favor open source solutions tried and tested at scale. Much of the data pipeline of Wikipedia exists thanks to Facebook and LinkedIn open source efforts.
If we need a piece of software that, for example, pulls data from Kafka, we survey the ecosystem and look at what exists that is tried and tested and fully open source. We do a couple of small prototypes and evaluate results. See for example our recent spike on Airflow and others. Licenses that do not provide the same level of freedom as CC0 are a problem. We cannot, for example, use the licenses for Kafka Connect, which are too restrictive. We have been using Elastic since 2014 but that, with their new license, might need to change.
Something that I have learned in these 20 years of work is that most of the really hard problems have to do with people, rather than technology.
Now, I would say the hardest problems in the data realm in 2021 have to do with Privacy. At Wikipedia Privacy is paramount (there cannot be truly free access to knowledge without a guarantee of privacy) and we had to "invent" privacy-conscious ways to calculate metrics that are the norm for web properties, like Monthly Active Users. You can see how we did it. Still, the hardest problem was probably communicating effectively how much we care about Privacy.
Data quality issues are kind of a "fractal" problem: you are never done eliminating them completely. Here is an example of a problem with data quality that was invisible to data throughput alarms and here is the first idea we had on how to partially solve issues like these (spoiler: entropy counts). Now, these examples make apparent how other quality issues of a more complex nature would slip by.
I would say it is only recently that streaming or event-based architectures have become an achievable reality. Little by little we are moving towards a world with more streaming services.
My two favorite projects are more data science-y than data engineer-y, and in both I collaborated with our team of Researchers, who are the ones who really came in to gauge whether the engineering ideas were mathematically sound.
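As an illustration of why an entropy count can catch what a throughput alarm misses, here is a minimal sketch. It is not the team's implementation, and it assumes the check is run per time window on one categorical field. The field (browser family), the relative `tolerance` threshold and the function names are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_alarm(baseline_entropies, current_entropy, tolerance=0.2):
    """True if the current entropy deviates from the baseline mean
    by more than `tolerance` (expressed as a relative difference)."""
    mean = sum(baseline_entropies) / len(baseline_entropies)
    if mean == 0:
        return current_entropy != 0
    return abs(current_entropy - mean) / mean > tolerance


# Hypothetical hourly samples of a categorical field.
normal_hour = ["chrome"] * 50 + ["firefox"] * 30 + ["safari"] * 20
# Same number of rows, but a parsing bug collapsed the field to one value.
broken_hour = ["chrome"] * 100

baseline = [shannon_entropy(normal_hour)] * 3
volume_unchanged = len(normal_hour) == len(broken_hour)
alarm_on_broken = entropy_alarm(baseline, shannon_entropy(broken_hour))
alarm_on_normal = entropy_alarm(baseline, shannon_entropy(normal_hour))
```

Both hours carry the same number of rows, so a volume-based alarm stays silent, while the entropy of the broken hour drops to zero and trips the check.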
Bot detection: It took us a bit but we finally were able to flag a lot of the automated traffic that Wikipedia gets and tag it as such on our data releases and internal reports. Some of this traffic is benign (in the sense of it only using resources). An example. Other traffic, however, is not so innocuous. There are many attempts to manipulate Wikipedia's top pageview lists, for example. Most recently someone was trying to add a bunch of obscene terms to the top pageview list of Hungarian Wikipedia.
While at first glance this seems like a simple classification problem, getting a labeled dataset to do prediction is actually not that simple. It requires building pseudo-sessions on top of data that has no common identifiers to know that two requests came from the same entity. We ended up using heuristics rather than ML and things worked out pretty well. You can read about the details of it here.
Censorship Alarms: We wanted to identify events in which active censorship of Wikipedia sites is ongoing. Wikipedia was blocked in Turkey for years and it is today blocked in Mainland China. Besides these, countries like Iran block Wikipedia on and off. Looking at that problem in detail, we realized that it had a lot in common with data quality problems, so we used some of the same techniques to alarm when "it seemed" a country had an anomalous traffic pattern. Again, this is less easy than it seems at first sight because you do not want to alarm unnecessarily, and uniform traffic drops do not constitute an event. Anomalous traffic drops do. See how we did it here.
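A rough sketch of what "pseudo-sessions plus heuristics" could look like follows. This is an assumption-laden illustration, not the method Wikimedia shipped: it fingerprints an actor by IP and user agent, splits sessions on long gaps, and flags sessions with a sustained high request rate. The `Request` fields, the thresholds and the function names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    ip: str
    user_agent: str
    timestamp: float  # seconds since epoch
    page: str


def pseudo_sessions(requests, max_gap=1800.0):
    """Group requests by an (ip, user_agent) fingerprint and start a new
    pseudo-session whenever two consecutive requests are more than
    `max_gap` seconds apart."""
    by_actor = defaultdict(list)
    for r in requests:
        by_actor[(r.ip, r.user_agent)].append(r)

    sessions = []
    for actor_requests in by_actor.values():
        actor_requests.sort(key=lambda r: r.timestamp)
        current = [actor_requests[0]]
        for r in actor_requests[1:]:
            if r.timestamp - current[-1].timestamp > max_gap:
                sessions.append(current)
                current = [r]
            else:
                current.append(r)
        sessions.append(current)
    return sessions


def looks_automated(session, max_rate_per_min=30.0, min_requests=10):
    """Heuristic: enough requests, at a rate no human reader sustains."""
    if len(session) < min_requests:
        return False
    span = session[-1].timestamp - session[0].timestamp
    duration_min = max(span / 60.0, 1.0 / 60.0)  # avoid division by zero
    return len(session) / duration_min > max_rate_per_min


# Hypothetical traffic: one script hammering a page, one human reader.
bot = [Request("10.0.0.1", "script/1.0", i * 0.5, "Main_Page") for i in range(100)]
human = [Request("10.0.0.2", "Firefox", i * 60.0, f"Article_{i}") for i in range(5)]

sessions = pseudo_sessions(bot + human)
flagged = [s for s in sessions if looks_automated(s)]
```

The appeal of heuristics like these is that they need no labeled dataset, which is the obstacle described above. The cost is that thresholds have to be tuned by hand.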
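One way to tell a uniform drop from an anomalous one is to compare each country's share of total traffic against a baseline instead of its raw counts. The sketch below rests on that assumption and is not the approach linked above. The country codes, the `min_relative_drop` threshold and the function names are hypothetical.

```python
def traffic_share(counts):
    """Fraction of total traffic contributed by each country."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in counts.items()}


def anomalous_drops(baseline, current, min_relative_drop=0.5):
    """Countries whose share of traffic fell by at least `min_relative_drop`
    relative to their baseline share. A drop that hits every country
    equally leaves shares unchanged, so it is not flagged."""
    base_share = traffic_share(baseline)
    cur_share = traffic_share(current)
    flagged = []
    for country, share in base_share.items():
        if share == 0:
            continue
        drop = 1.0 - cur_share.get(country, 0.0) / share
        if drop >= min_relative_drop:
            flagged.append(country)
    return sorted(flagged)


# Hypothetical hourly request counts per country.
baseline = {"AA": 1000, "BB": 1000, "CC": 2000}
uniform_drop = {"AA": 500, "BB": 500, "CC": 1000}   # e.g. a global outage
blocked = {"AA": 1000, "BB": 50, "CC": 2000}        # only BB collapses

uniform_result = anomalous_drops(baseline, uniform_drop)
blocked_result = anomalous_drops(baseline, blocked)
```

In this toy data the global halving flags nothing, while the collapse confined to one country is reported.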
Lessons learned are many: in environments where the work is principle-based rather than profit-based, defining metrics is not easy. When everyone cares a lot about the larger mission of their work there are going to be a lot of strong discussions, but once consensus is reached execution is fast with zero management oversight. Also, principles are very useful when choices are needed, so in a way, working in a strongly principle-based organization makes some choices easy (ex: only use open source, preserve the right to fork) even if those choices imply a lot of work.
The larger Technology team reports to the CTO and there are several parallel teams: Security, SRE, Data Engineering, Performance... Those teams work independently but coordinate among themselves. The mission of the Data Engineering team is to serve internal customers but also the large external community. I think for a team in an organization as large in impact as Wikimedia it is crucial to have a mission statement, otherwise it is easy to get lost in the many (infinite, really) requests for work. Our mission was to "empower and support data informed decision making across the Foundation and the Community. We make Wikimedia related data available for querying and analysis to both Wikimedia employees and the different Wiki communities and stakeholders."
This mission, notice, involves serving a large group of stakeholders: the community of editors that do not work at the Wikimedia Foundation.
General questions about data engineering
Streaming is becoming a paradigm that is easier and easier to use. Flink was an early project in this space, but the very heavy Java way in which you interact with the product hinders its massive adoption (citation needed, this is just a hunch). Also, tools like dbt that give data scientists and analysts a way to get into engineering with an interface they are comfortable with (SQL) are becoming more and more popular.
I would love to see open source releases around differential privacy. There are a few from Google but they are not that easy to use.
Python, always the best second choice for ANY task.
So many, my claim to fame in that department is having deactivated Mechanical Turk for every single user of it in production when I thought I was shutting down, ahem, my desktop.
Someone more cool-headed than me restarted the service right away. Now, the fact that something like that could even happen tells you how brittle the tools we used in the early days at Amazon were.
There is an emphasis on collecting more and more data without a clear purpose. In all realms, more data just leads to more noise. For example: there have been many debates about collecting individuals' location data during the recent covid pandemic. It is not clear how this data will be used nor how it is useful to the government organizations that just plain missed the earlier signs that covid was spreading across the planet. In general, I find that our ability to collect data nowadays far exceeds our ability to analyze it. Collection is done by machines but true analysis and meaning extraction is a task that needs humans.
I get some news from Data Science Weekly (a newsletter) and from following people on Twitter whose talks I have liked at prior conferences. I recently went to Enigma, a conference at the intersection of privacy, data and security, and highly recommend checking out prior talks there (@enigmaconf).
Ditto for PEPR, the conference on Privacy Engineering Practice and Respect.
I would really love to understand Apple's internal frameworks, ideas and principles well when it comes to data management and privacy.