Key Skills for Data Engineers and Resources for Learning

Mai Nguyen
4 min read · Feb 25, 2021

Two years ago, I discovered the Data Science field and fell in love with it. Coming from an analytical and financial background, I initially thought Data Scientist would be the right career path for me. According to the Data Science Skills Matrix on TowardsDataScience, there are three domains of skills that one needs to become a Data Scientist: Modeling & Statistics, Communications and Expertise, and Data Engineering and Programming. At that time, I was experienced with the first two domains, but I was not at all confident in coding and programming. Therefore, I set out to improve my programming skills by jumping into the data engineering field. Fast forward to the present: after working as a Data Engineer for over a year, I am infatuated with this field and want to become an excellent engineer.

In this article, I list the key skills that I am learning and plan to develop for my career. The list is based on the article “10 Key skills, to help you become a data engineer” from https://www.startdataengineering.com/post/10-key-skills-data-engineer/. I also note learning resources, a to-do list, and my own progress here so that I can track them.

1. Linux (Not yet started)

Most applications are built on Linux systems, so it is crucial to understand how to work with them. The key concepts to know are:
- File system commands, such as ls, cd, pwd, mkdir, rmdir
- Commands to get metadata about your data, such as head, tail, wc, grep, ls -lh
- Data processing commands, such as awk, sed
- Bash scripting concepts, such as control flow, looping, passing input parameters

Resources:
- Data Science at the Command Line: https://www.datascienceatthecommandline.com/1e/
- Terminal Commands for Mac users: Macintosh Terminal Pocket Guide: Take Command of Your Mac (better)

2. SQL (In progress)

SQL is crucial for accessing your data, whether for running analysis or for use by your application. The key concepts to know are:
- Basic CRUD, such as select, where, join (all types of joins), group by, having, window functions
- SQL internals, such as indexes (the different types and how they work) and transaction concepts such as locks and race conditions
- Data modeling, such as normalized schemas for OLTP; star and snowflake schemas, denormalization, and facts and dimensions for OLAP; and key-value stores
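
To make the basic query concepts concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their rows are made up for illustration; the point is only to show where, join, group by and having working together in one query.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0);
""")

# Inner join, row filter (WHERE), aggregation (GROUP BY), group filter (HAVING).
query = """
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE o.amount > 0
    GROUP BY c.name
    HAVING SUM(o.amount) > 30
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # only customers whose orders add up to more than 30
```

Cy has no orders, so the inner join drops that customer; a left join would keep the row with a NULL total instead.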

Resources:
- Basic CRUD operations: https://www.w3schools.com/sql/default.asp

3. Scripting
Knowledge of a scripting language such as Bash or Python is very helpful for automating the multiple steps required to process data. The key concepts to know are:
- Basic data structures and concepts, such as lists, dictionaries, map, filter, reduce
- Control flow and looping concepts, such as if, for loops, list comprehensions (Python)
- Popular data processing libraries, such as pandas or Dask in Python
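
As a small illustration of these building blocks, the sketch below uses only the Python standard library. The readings list is invented sample data.

```python
from functools import reduce

# Made-up sample data: a list of dictionaries.
readings = [
    {"sensor": "a", "value": 3},
    {"sensor": "b", "value": -1},
    {"sensor": "a", "value": 5},
]

# map / filter / reduce
values = list(map(lambda r: r["value"], readings))   # pull out one field
valid = list(filter(lambda v: v >= 0, values))       # drop negative values
total = reduce(lambda acc, v: acc + v, valid, 0)     # fold into a single sum

# The same filter written as a list comprehension.
valid_lc = [r["value"] for r in readings if r["value"] >= 0]

# Control flow and looping: sum the valid values per sensor in a dictionary.
totals_by_sensor = {}
for r in readings:
    if r["value"] < 0:
        continue
    totals_by_sensor[r["sensor"]] = totals_by_sensor.get(r["sensor"], 0) + r["value"]

print(total, valid_lc, totals_by_sensor)
```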

Resources:
- Bash scripting:
- Dask (python):

4. Distributed Data Storage
Knowledge of how distributed data stores such as HDFS or AWS S3 work.
The key concepts to know are data replication, serialization, partitioned data storage, and file chunking.
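
To get an intuition for file chunking and replication, here is a toy sketch in plain Python. It is not how HDFS or S3 actually place data (real systems consider racks, failures and load); the round-robin placement, the function names and the node names are my own assumptions for illustration.

```python
from typing import Dict, List


def chunk_bytes(data: bytes, chunk_size: int) -> List[bytes]:
    """Split a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each chunk to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        idx: [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        for idx in range(num_chunks)
    }


chunks = chunk_bytes(b"abcdefghij", chunk_size=4)
placement = place_replicas(len(chunks), ["node-1", "node-2", "node-3"], replication=2)
print(chunks)
print(placement)
```

With a replication factor of 2, losing any single node still leaves one copy of every chunk, which is the basic idea behind data replication.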

Resources:

5. Distributed Data processing
Knowledge of how data is processed in a distributed fashion. The key concepts to know are:
- Distributed data processing concepts, such as MapReduce, and in-memory data processing with tools such as Apache Spark
- Different types of joins across data sets, such as map-side and reduce-side joins
- Common techniques and patterns for data processing, such as partitioning, reducing data shuffles, and handling data skew when partitioning
- Optimizing data processing code to take advantage of all the cores and memory available in the cluster
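
The MapReduce idea can be sketched on a single machine in a few lines of Python. This is only a toy word count to show the map, shuffle and reduce phases; a real framework would run each phase in parallel across many machines, and the function names here are my own.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["spark and hadoop", "spark streaming"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The shuffle step is the one that moves data between machines in a real cluster, which is why reducing shuffles matters so much for performance.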

Resources:

6. Building data pipelines
Knowledge of how to connect different data systems to build a data pipeline. The key concepts to know are:
- A data orchestration tool, such as Airflow
- Common pitfalls and how to avoid them, such as data quality checks after processing
- Building idempotent data pipelines
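
One common way to make a load step idempotent is to replace the whole partition for a run date instead of appending to it, so that a retry does not create duplicates. The sketch below shows that pattern with Python's built-in sqlite3 module, followed by a simple data quality check. The daily_sales table, the function names and the sample rows are hypothetical.

```python
import sqlite3


def load_partition(conn, run_date, rows):
    """Delete then insert all rows for run_date, so re-running gives the same result."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM daily_sales WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_sales (run_date, store, amount) VALUES (?, ?, ?)",
            [(run_date, store, amount) for store, amount in rows],
        )


def check_quality(conn, run_date, expected_rows):
    """Fail loudly if the row count is off or any amount is missing."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM daily_sales WHERE run_date = ?", (run_date,)
    ).fetchone()
    (nulls,) = conn.execute(
        "SELECT COUNT(*) FROM daily_sales WHERE run_date = ? AND amount IS NULL",
        (run_date,),
    ).fetchone()
    if count != expected_rows or nulls > 0:
        raise ValueError(f"quality check failed: {count} rows, {nulls} null amounts")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (run_date TEXT, store TEXT, amount REAL)")

batch = [("hanoi", 120.0), ("saigon", 95.5)]
load_partition(conn, "2021-02-25", batch)
load_partition(conn, "2021-02-25", batch)  # simulated retry of the same run
check_quality(conn, "2021-02-25", expected_rows=len(batch))

row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(row_count)  # still one copy of the batch after the retry
```

In an orchestration tool such as Airflow, each of these functions would typically be its own task, with the quality check running after the load.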

Resources:

7. OLAP database
Knowledge of how OLAP databases operate and when to use them. The key concepts to know are:
- What a column store is and why it is better for most types of aggregation queries
- Data modeling concepts such as partitioning, facts and dimensions, data skew
- Figuring out client data query patterns and designing your database accordingly
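
A rough way to picture the difference between a row store and a column store is with plain Python data structures. This is only a mental model with made-up data, not how a real OLAP engine stores bytes on disk.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "country": "VN", "amount": 10.0},
    {"order_id": 2, "country": "US", "amount": 25.5},
    {"order_id": 3, "country": "VN", "amount": 4.5},
]

# Column-oriented layout: each column is stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Aggregating one column from rows touches every field of every record...
total_from_rows = sum(row["amount"] for row in rows)

# ...while the column layout only needs to read the "amount" column.
total_from_columns = sum(columns["amount"])

print(columns["amount"], total_from_rows, total_from_columns)
```

Because an aggregation usually reads a few columns across many rows, the column layout lets the database skip the columns it does not need and compress similar values together.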

Resources:

8. Queuing systems
Knowledge of queuing systems and when and how to use them. The key concepts to know are:
- What a data producer and a data consumer are
- Knowledge of offsets and log compaction
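
The sketch below models a queue as an append-only Python list to show what producers, consumers, offsets and log compaction mean. It is a toy in the spirit of a log-based system such as Kafka, not its real API; the function names, the Consumer class and the keys are all hypothetical.

```python
log = []  # append-only log of (key, value) records; the list index is the offset


def produce(key, value):
    """Producer: append a record and return the offset it was written at."""
    log.append((key, value))
    return len(log) - 1


class Consumer:
    """Consumer: remembers the offset of the next record it has not read yet."""

    def __init__(self):
        self.offset = 0

    def poll(self, max_records=10):
        records = log[self.offset:self.offset + max_records]
        self.offset += len(records)
        return records


def compact(records):
    """Log compaction: keep only the latest record per key, with its original offset."""
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)
    return sorted((offset, key, value) for key, (offset, value) in latest.items())


produce("user-1", "a")
produce("user-2", "b")
produce("user-1", "c")

consumer = Consumer()
first = consumer.poll(max_records=2)
second = consumer.poll()
compacted = compact(log)
print(first, second, consumer.offset, compacted)
```

Because the consumer only stores an offset, it can stop and later resume from where it left off, and several consumers can read the same log independently.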

Resources:

9. Stream processing
Knowledge of what stream processing is and how to use it. The key concepts to know are:
- What stream processing is and how it differs from batch processing
- Different types of stream processing, such as event-based processing and micro-batching
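
The difference between event-based processing and micro-batching can be sketched with two Python generators. One simplification to note: real micro-batch engines usually cut batches by time interval, while this toy cuts them by record count to keep the example deterministic. The function names are my own.

```python
def process_per_event(events):
    """Event-based: emit an updated running total as soon as each event arrives."""
    running_total = 0
    for event in events:
        running_total += event
        yield running_total


def process_micro_batches(events, batch_size):
    """Micro-batching: buffer a few events, then process them as one small batch."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield sum(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        yield sum(batch)


events = [1, 2, 3, 4, 5]
per_event = list(process_per_event(events))
per_batch = list(process_micro_batches(events, batch_size=2))
print(per_event, per_batch)
```

Event-based processing produces a result for every event with low latency, while micro-batching trades some latency for fewer, larger units of work.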

Resources:

10. JVM language
Knowledge of a JVM-based language such as Java or Scala is extremely useful, since many open source data processing tools, e.g. Apache Spark and Apache Flink, are written in JVM languages.
