Sampling Random Thoughts - When Data Driven App Smells Bad

Recently I have been involved in reviewing one “Big Data” project. I can’t say that volume of data processed by that app was that big, but the project was treated as real big data project by people who developed it. They used Hadoop just because that is what you use for big data. And even project’s codename had phrase “Big Data” in it.

In short, that project was a data analytics application with ETL, data warehouse and reporting. A variety of technologies were used to build it: Java, Hadoop, Python, Pandas, R with UI implemented in PHP and Shiny Apps embedded as IFRAMEs. But honestly that project was a real mess. And everything that could go wrong actually went wrong. Yet it is a good example of common problems and anti-patterns that might happen with data analytics apps. So I’d like to talk about major ones.

Naive Productionizing

There is nothing more permanent than prototypes. Code written as a demo or as a part of research becomes a part of a production system and entire system is built around that prototype code.

It might happen as following. Data scientist starts research using a dump of data, usually in a CSV format, and as a result he/she has a few scripts, it might be Jupyter notebooks written with python, R or etc. Then those notebooks are transformed into a bunch of standalone scripts with a few tweaks, so that scripts can interact with other parts of the system. But surprisingly code that worked well with sample data on a data scientist’s laptop doesn’t work with real data. So developers start to add more workarounds just to make those scripts working in a real production environment introducing more mess.

CSV Driven Architecture

You can find such architecture when somebody tries to glue multiple apps or services together passing data between them in CSV files. For example it can be multi step data analysis process, where each step is a separate script that gets and returns CSV data. There is a very big chance to fall in this if you are building system around existing notebooks, as mentioned in the previous section.

There is nothing wrong with CSV format, it is good for development, research, data exploration or even as interchange format with external systems. But using CSV or any other text format (like JSON and etc.) as a storage for intermediate results is not efficient. CPU time is wasted during serialization/deserialization on writes/reads, not mentioning that text files take more space.

In general you should avoid unnecessary data transformations. Try to do all steps in one job, so you don’t need any intermediate storage. If you need to reuse result of some step, prefer caching it in memory, Spark’s RDD and tables caching are good examples. Another option is to load results into database in a temporary table and query it.

Overlooking Database solutions

Rather than develop queries on top of database, the solution is to read raw data, e.g. logs, and process records one-by-one without using database. There is even more extreme case exists, when database is used for storage, but raw data is unloaded to CSV or is read using DB cursor in memory before being processed. There are no excuses not to use SQL queries, if your data is not huge so relational databases can be used. Databases are designed to efficiently execute queries, so in most cases it has better performance than custom-written solution. Same also applies to NoSQL solutions.

So try to design you system to write data into database and run analytical queries there. If the amount of data is big you still have options like MemSQL, or you can use columnar oriented storage like parquet or ORC File with distributed query execution engines like Presto or Spark. Of course there are cases when specific analysis required that’s might be hard to implement in query, but at least try to get as much as possible from storage layer. Push down filtering, aggregation and etc. to database and then process resulted smaller dataset with custom logic.

Non incremental queries

In general processing same data all over again is a bad smell. Suppose you need a per-month report, so you get all historical data and aggregate it by month. But in order to keep it up to date each month you regenerate entire report using all historical data. This solution is simple and yet is not efficient: it wastes resources crunching all historical data all over again. Making this query incremental is more efficient solution. In this case just add aggregated data for a new month to previously calculated result.

Example above demonstrates very simple case, although in more complex cases like dashboards, it is possible to make incremental queries by updating previous result.

Improper scaling

Imagine you have a working app that processes data it might be a java app, python script, etc. And you are pretty satisfied with its performance. But one day app begins to crash with out of memory error. Suddenly the volume of data has increased, maybe due to natural growth, or you started to use the app with real data in production instead of testing environment. So you decide to simply use more powerful instance to run this app. Now days it’s so easy to do using cloud providers. But there are might be a couple of problems with such fix. First, this is not a long term solution, since amount of data volume might continue to increase and you will have to upgrade again and again. Second, if you simply used more powerful instance, chances are that you also got more CPU cores, which will do nothing if your app is single threaded, which is a case for python/pandas. So you will simply waste money on unused CPU resources.

The opposite is also true. E.g. Hadoop is used when much simpler solutions are suitable. Those posts are good examples of this point.

I don’t say that scaling up is worse than scaling out, the point is you must be aware of tradeoffs, limitation and consequences of each option. Be sure that you do not waste resources. Use solutions that support scaling out, or use cluster managers like Mesos for better resources utilization.

Reinventing the wheel

There was a Hadoop job that generated CSV output in a project I mentioned in the beginning. And CSV generation was implemented by simply joining fields using comma delimiter. If you think that writing CSV is simple as concatenation of fields using comma, and parsing CSV is simple as splitting string by comma, then you are wrong. CSV is not only about commas. So some rows in resulted CSV were malformed, making them not processable by other tools. The worst thing here is that script on a next step just skipped malformed rows without failing. And missing data from bad rows cannot be treated as random noise in data.

So the lesson is never ever write your own data serialization library for CSV, JSON or any other format, many languages and frameworks has well-developed libraries for this. It is always a bad idea to reinvent the wheel, especially for such common tasks, like writing/reading CSV. But this anti-pattern is not only about data formats. It is also applies to integrating with third party services and etc.

Takeaways

I don’t aim to hate any particular technology like Hadoop, pandas or etc., you need to know your options.

As mentioned in the beginning it all starts with wrong attempt to reuse research or prototype code in production. What could have been done instead? Yes, prototypes are important. And based on its results developers need to figure out what real business needs are. Since not all reports that data scientists came up with might be useful. Then understand what really happens in those prototype apps:

What is input data and what is an output?
What are business requirements?
Is it real time reports or not?
Ask how data transformations can be simplified: combining or even removing some intermediate steps?
Understand what is better data storage, database, plain files, etc.
Understand data flows, can all or some calculations be pushed down to the storage layer.

And last but not least, do not forget to try out multiple options, experiment with different approaches and technologies, and run benchmarks.

That’s all that I have at this moment. Surely there are much more anti-patters in data processing area. Have you seen any of those anti-patterns? I would love to hear your stories and thoughts on this topic.