Data Engineers Must Be Adept in Several Areas - 14 to be exact
In the Big Data industry we spend an enormous amount of time and effort deciphering the role of Data Scientists, drawing Data Science unicorns (figuratively) and discussing to the nth degree the relative importance of programming vs. problem solving skills in candidates. I can’t lie, at QuantHub we share the same obsession with all things Data Science.
Recently though, I was at a large Data and Analytics conference and a speaker threw up a slide similar to the image above. Staring up at the (gasp!) 14 skills on the slide, several of which implied that even more underlying skills were needed, I was reminded that our focus is often on communicating to customers and industry players about the rare combination of diverse skills needed to fill a Data Scientist role. Why this focus? Because Data Science seems to be the immediate need that everyone is seeking to fill en masse in the race to deploy AI solutions.
But what about Data Engineers and these 14 skills? Are these not just as rare and diverse a set of unicorn-like skills?
Reactions to the Data Engineering Slide
Well, when I put this slide out to some folks on LinkedIn and asked if a Data Engineer can meet all of these requirements, here are some comments I received from industry professionals:
“Ah - the search for the unicorn! I find the statistics is often the missing spoke, but with a good foundation, the right person can develop this.” - Analytics recruiting consultant
Here’s a very telling response:
“I actually felt pretty great about myself with this diagram which is unusual for me. Then I realized that like others it’s taken 20 years to acquire, hundreds of data sets, close to a hundred companies and thousands of hours training others and problem solving with data. What I do know for sure is that the interested should pursue the foundation and don’t cancel themselves out because they decide they can’t. You can be a solid addition to any team if you build the right foundation.” – Data Management consultant
Also very telling was this reaction:
“Oh my -- you've hit a nerve! I could go for hours on this topic but won't. Is it my imagination or did we overlook the fact that Engineers are now responsible for deployments, monitoring, and even environment configuration. (As I heard someone call it --- "Dev STOPS not Dev Ops"). If your engineers are doing non-solution development work - Dev Stops. And we engineers aren't trained in these disciplines so on occasion it becomes "Dev Oooops". I've got plenty of examples of the wrong person making the wrong decision resulting in increased costs or even risk of data exposure. I'll get off the soapbox now...” – BI and Technical PM
14 Data Engineering Skills – A Summary
Yikes. A brief overview of the skills on the slide gives a sense of how complex a Data Engineering job is:
- Programming languages – At a minimum, a sound working knowledge of Python is required. But mastery of Java, Scala, R, or any number of other languages, along with query tools such as Presto and Hive, is often necessary depending on which part of the data pipeline is involved. In an ideal world, the Data Engineer would work alongside a Data Scientist, making sure that their code is reusable and machine learning-friendly at scale.
- Database Management - Extensive knowledge of database languages and tools is required to do data engineering. Data Engineers must ensure that different databases are available to all users and functions without any hiccups. SQL and NoSQL are required skills here, along with advanced DBMS knowledge/skills.
- Analytics/BI tools – One would think that this is the realm of the Data Scientists. It is for the most part. However, a Data Engineer must build a pipeline that supports data analysis and machine learning, so it helps to understand the terminology and outputs of the end users. Data Engineers also need to use statistical modeling on the job, for example, to measure the usage rate of data in a database.
- Cloud/On Premises - Companies and their external data suppliers store data in various cloud systems and on premises. A Data Engineer has to bring all these databases together so that everyone in the company can use the data.
- Solid Knowledge of Operating Systems – Since pipelines run on top of operating systems, a Data Engineer must know the ins and outs of different networks, virtual machines, server management, Linux and other UNIX-based systems, Windows, and more.
- Containers – Containers are a lightweight alternative to traditional virtual machines that makes it easier and less costly for teams to deploy, manage, and scale distributed applications. They provide many advantages for Data Scientists. The rapid rise of containerized applications is a good example of how the nature of a Data Engineer’s job is constantly evolving with new skill sets required.
- Domain and Business Expertise – In most organizations, there’s a tremendous amount of legacy business information contained in the company data. Without that domain knowledge, subtleties in the data are often missed, leading to data quality issues. Additionally, without domain and business expertise, it’s difficult to imagine the work of the data engineer aligning very well with the strategy of the business.
- Optimization – The key skill here is not just to be able to build a data pipeline but to build one that is scalable and efficient. Higher-level skills are needed to design and build a data warehouse that optimizes query performance and, when the data warehouse becomes very large, to find new ways of keeping analyses fast.
- Data Governance and Security – While Data Engineers are not typically responsible for data governance, they must ensure systems are in place for data access and user control. They need to be aware of data governance concepts and be sure that any tools and platforms they put in place support proper data governance.
- Creativity - Clearly, Data Engineers are expected to have a wide array of technical expertise. As with Data Science, however, the job also requires critical thinking and the ability to solve problems creatively. This might include creating solutions that don’t yet exist.
- Collaboration - Data Engineers must also be able to work effectively with other data experts and to communicate results and recommendations to colleagues without technical backgrounds. Most problems with big data are people and team issues, not technical issues (at least not initially). Technology usually gets blamed because it’s far easier to blame technology than to look inward at the team itself. Until you solve your personnel issues, you won’t hit the really tough technical issues or create the value with big data you set out to create.
- DevOps - There is a skillset and mindset that comes with being in an Ops role and it can be difficult to find developers who have this mindset.
- Machine learning/AI – You might think that machine learning is the territory of a Data Scientist. But if AI is the top of the pyramid of business needs, then the ability to collect and move data is a primary need for a business to get to the top of the pyramid.
- Streaming/Real Time - With advancements in technology, more and more prediction is done in real time by deploying a model into the streaming pipeline and scoring every data point in the data stream. We also increasingly live in a world of real-time information and decision making. Building a streaming data pipeline (rather than a batch-based one) is yet another new skill set that Data Engineers must master.
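To make the streaming item above a little more concrete, here is a minimal Python sketch of scoring every data point as it arrives rather than in a batch. Everything named here is a hypothetical stand-in: `toy_model` is a made-up linear rule in place of a trained model, and `event_feed` is an in-memory generator in place of a real message queue consumer.

```python
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, float]


def score_stream(events: Iterable[Event],
                 model: Callable[[Event], float]) -> Iterator[Event]:
    """Score each event as it arrives and yield it with its prediction attached."""
    for event in events:
        # Build a new dict so the incoming event is not mutated.
        yield {**event, "score": model(event)}


def toy_model(event: Event) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear rule."""
    return 0.5 * event["amount"] + 2.0 * event["retries"]


def event_feed() -> Iterator[Event]:
    """Hypothetical event source; a real pipeline would read from a queue."""
    yield {"amount": 10.0, "retries": 0.0}
    yield {"amount": 4.0, "retries": 3.0}


# Events are scored one at a time; nothing waits for a full batch.
scored = list(score_stream(event_feed(), toy_model))
```

Because `score_stream` is a generator, each event is scored only when the downstream consumer asks for it, which is the property that separates this from a batch job that loads everything first.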
Phew. That IS a lot of skills (and sub-skills)!
Moreover, regarding the overall skill set required of a Data Engineer, the ability to create a data pipeline is one thing. It’s another thing to be able to create a system that allows an organization to rapidly deploy data pipelines, monitor them and ensure fault tolerance of the entire system, all in a cost-effective manner that satisfies end user needs and business goals. Achieving this might entail bringing together perhaps 10-30 different big data technologies. Again, that’s a lot of skills!
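As a rough illustration of what "monitor them and ensure fault tolerance" can mean at the smallest scale, the Python sketch below retries a failing pipeline step and counts the outcomes. It is only a sketch under stated assumptions: `run_with_retries`, the `metrics` dictionary, and `flaky_extract` are hypothetical names invented for this example, and a production system would typically rely on an orchestration tool rather than hand-rolled code.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], T], metrics: Dict[str, int],
                     max_attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Run a pipeline step, retrying on failure and counting outcomes."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            result = step()
        except Exception:
            # Record the failure so it is visible to whoever monitors the pipeline.
            metrics["failed_attempts"] = metrics.get("failed_attempts", 0) + 1
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            time.sleep(delay_seconds)
        else:
            metrics["succeeded"] = metrics.get("succeeded", 0) + 1
            return result
    raise RuntimeError("unreachable")  # Keeps type checkers satisfied.


calls = {"count": 0}


def flaky_extract() -> List[int]:
    """Hypothetical extract step that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source temporarily unavailable")
    return [1, 2, 3]


metrics: Dict[str, int] = {}
rows = run_with_retries(flaky_extract, metrics)
```

Multiply this by every step, every pipeline, and every technology in the stack, and the scale of the monitoring and fault-tolerance problem becomes clearer.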
Data Engineers are Equally Important to Data Scientists
All of this has reminded me of the sometimes-overlooked importance of the Data Engineering role. Gartner shed some light on this subject when it said back in 2016 that only 15% of big data projects make it into production. That really is a dismal result for all the effort going into big data.
While there must be numerous reasons for this low success rate, one school of thought is that companies are so focused on getting to the insights from Data Science that they fail to put in place the data pipelines and workflows that make data useful to the business on an ongoing basis, according to service level agreements and within the time frame needed to make it valuable. Building those pipelines and workflows is the job of a Data Engineer.
The importance of the Data Engineer was reflected in the words of one Netflix Data Scientist who stated: Good data engineering lets Data Scientists scale. Netflix follows the “one for one rule” - it has as many Data Engineers as Data Scientists, and Data Engineers are equally important.
Finding Data Engineers
Both those in the Data Engineering profession and those trying to hire Data Engineers have a tough job. To find a Data Engineer, you need to find someone who has developed a boatload of skills across a wide variety of disciplines - even more than the Data Engineering slide entails. The role combines a complex set of tasks into one single job. (Sound familiar, Data Scientists?) And to be a Data Engineer, you must embody that unicorn.
The problem is, there is currently no coherent or formal education or career path available for Data Engineers. Most folks in this role got there by learning on the job, rather than following a detailed route or set of academic courses - like our friend the Data Management consultant. And one software developer who commented in reaction to the slide is also living proof:
“I can cover almost all of the roles at various levels, but it's taken 20 years and without a team even with all of that ability a single person isn't going to produce magic.”
And another development manager seconded, “Yeah, only so many hours in a day.”
So what can you do?
Hiring practices that focus on finding a single person who can cover basically all roles are limiting: the pool of candidates will be so small that hiring will take forever, if you can find the “right” person at all. It's certainly possible to have most or all of those skills, but it's pretty tough to find them in a single person who hasn’t been working for at least 20 years.
In a recent post, we advocated an approach to building Data Science capabilities that moves away from expecting a single “unicorn” (or even two unicorns) to have all the necessary skills to do the job, toward a more “portfolio”-based approach. We would argue that the same approach is necessary for the Data Engineering role.
Even though we test for a lot of skills that apply to Data Engineers, it would be difficult to develop an assessment that tests for all of these skills in one go and expect one person to ace it.
As with Data Scientists, our recommendation would be to decide which specific skill sets you need and build a portfolio of talent with those skills. For instance, you might form a team of a data product manager/owner, a Data Scientist, and a Data Engineer and “cross pollinate” skill sets. Our friend the software developer of 20 years recommended a team of three: a highly skilled coder with an understanding of data science functions, a business expert or business analyst, and a statistics expert.
Some would argue that this portfolio approach would be more expensive. However, if your data workflow is not efficient, the resulting loss of Data Science effectiveness and efficiency, along with Data Scientist frustration and turnover, will cost you more.
Lastly, because of a shortage of Data Engineers and the fact that they are pretty expensive, it makes a lot of sense to look internally for software engineers, or perhaps even Data Scientists, who can bridge their skills to those of a Data Engineer role. You can use a test like QuantHub to assess strengths and weaknesses and then provide the training, tools, and mentoring they need to be able to fill the role of Data Engineer.
Along these lines, in its recent whitepaper “Data Engineering is Critical to Driving Data and Analytics Success” Gartner also recommends finding Data Engineers by hiring recent graduates and developing them internally.
At QuantHub we test for Data Engineering skills in addition to Data Science skills because we recognize that both roles are needed to get the job done. With ever-increasing volumes of enterprise data and new technologies appearing all the time, Data Engineers have become vital members of any analytics team. As evidenced by these 14 skills, their role brings a lot to the table in terms of capabilities that impact the outcomes of Data Science and analytics efforts across the organization.
The problem of finding people who possess these multiple skill sets will just get worse. So, we might as well learn from the world of Data Science and start building Data Engineering teams using some of the methods we see happening in that field – hire graduates and entry-level employees with a long-term view toward developing them into Data Engineers, hire from within where possible, and hire a team (rather than a person) that fills out the portfolio of Data Engineering skills your organization needs.