So how does a team of six engineers - heavily experienced in web development with technologies like ReactJS, NodeJS, and Java - go about adopting Python into their work?
The application development and cloud computing technology landscape is always changing and an important part of our role as engineers is to stay up to date on those changes. Sometimes it is through solo work - such as learning a new framework or skill. But sometimes it is through team-based work - such as adopting and migrating a whole project to a new language.
In fact, our entire team consisted of experienced engineers who have delivered multiple web and distributed applications in the cloud, but none had exposure to ETL or data-driven projects.
About one month into the rewrite effort, we were hitting limitations around using Java to migrate legacy system code. We wanted to handle simple File Operations, as well as complicated queries using Spark, in a dynamically typed language with minimal bootstrap code. This was one of our reasons for considering whether it was time to switch languages.
For our project, we faced the huge task of re-writing multiple jobs running on a legacy ETL platform. This involved enterprise API integrations, as well as complex data analysis and refinement.
Inflexibility of Existing Languages
Due to the nature of these jobs, none of the languages we were most experienced with was a great choice. That’s because they:
- Were statically typed, like Java
- Involved heavy bootstrap code
- Lacked extensive support for data manipulation libraries such as Pandas
- Lacked extensive external community support for Spark integrations or data analysis
Flexibility with Python: API Integration, Data Analysis, and Others
Python seemed to be a good choice for us as it was flexible enough to support a wide array of use cases. It also fit in well as it:
- Was dynamically typed
- Supported re-writing Bash-based or ETL jobs in fewer lines of code
- Had well-supported libraries for working with REST interfaces
- Had excellent support for data manipulation libraries such as Pandas and Spark
There were a few other areas that really cemented our use of Python - these were File Operations and Community Support.
File Operations were key to our project as our process involved reading multiple source files.
Python worked well in both simple and complex cases. It:
- Was well suited for simple file manipulations using Pandas
- Worked in complex scenarios using Spark queries on HDFS
- Needed significantly fewer lines of code to accomplish this than Java or NodeJS
- Could handle File Operations in memory when source files were only a few MB
- Made it simple to call APIs for various enterprise layers
As we were assessing whether to adopt a new language, Community Support both inside and outside Capital One was key for us. We wanted a language that was well supported by an active open source community that:
- Constantly updates security enhancements
- Resolves outstanding questions or issues
- Actively merges new feature requests from engineers
Outside of Capital One
- Python has a much more extensive community of engineers in the Data Analysis space as compared to Java
- PySpark has much better support than Spark Integration with Java
Within Capital One
- Capital One has a very active community of Python engineers and experts to help teams get started and maintain their Python projects
- This internal community allowed us to seek guidance, as well as go through multiple code reviews
We went through several key stages from planning to production.
Key to learning and adopting a new language was putting in time for foundational work: automating compliance with PEP8 and adopting Python tooling like Black and Flake8.
Let’s go through some of the key elements of these stages.
Define Standards and Automating Adoption
As our team was new to Python, we spent the first few days defining standards on how we would code to comply with PEP8. But defining standards was not enough; we also needed to automate our workflow to enforce them.
- Added a pre-commit hook for Black, which automatically formats code on local commits.
- Black didn't catch all violations, which is where Flake8 came in.
- Installed Flake8 as a pre-commit hook, which blocked any commit that had outstanding PEP8 compliance errors.
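To illustrate why both tools were useful, here is a minimal sketch of a hypothetical module (the function name `total` is made up for this example, not taken from our codebase). Black only rewrites layout, while Flake8 reports problems such as an unused import, which it flags as F401.

```python
import os  # Flake8 reports F401 here (imported but unused); Black leaves the import in place


def total(values):
    # Black only handles layout (spacing, quotes, line breaks); it would not
    # report the unused import above.
    return sum(values)
```

With both hooks installed, a commit containing this file would be formatted by Black and then stopped by Flake8 until the unused import is removed.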
After automating our workflow to comply with standards and setting up a base repo, we started on the core dev work.
Logging was key given that Python is dynamically typed; it was our solution for tracking down problems.
- As a team, we decided on a common logging format:
```
requestid - machine_instanceid
timestamp - YYYY-MM-DD HH:MM:SS,milliseconds
loglevel - INFO, ERROR
modulename - function_name
state - START/END/INPROGRESS
type - SCRIPT/EXTERNAL_API/etc.
modresponse - Success/Error
duration - Tracking External API calls
message - custom message as needed
errormessage - err message
```
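As a rough illustration, a format like this could be wired into Python's standard `logging` module. This is a sketch under assumptions, not our production code: the `|` separator, the default values, and the helper names `build_logger` and `log_event` are all hypothetical. The default `asctime` layout already matches the `YYYY-MM-DD HH:MM:SS,milliseconds` timestamp.

```python
import io
import logging

# Field names follow the shared format; the "|" separator is an assumption.
LOG_FORMAT = (
    "%(requestid)s | %(asctime)s | %(levelname)s | %(module)s.%(funcName)s | "
    "%(state)s | %(type)s | %(modresponse)s | %(duration)s | "
    "%(message)s | %(errormessage)s"
)

# Hypothetical defaults so every record can be formatted even if a field is omitted.
DEFAULT_FIELDS = {
    "requestid": "-",
    "state": "INPROGRESS",
    "type": "SCRIPT",
    "modresponse": "-",
    "duration": "-",
    "errormessage": "-",
}


def build_logger(name, stream=None):
    """Create a logger that writes records in the shared format."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def log_event(logger, level, message, **fields):
    """Log a message, filling any missing custom fields with defaults."""
    extra = {**DEFAULT_FIELDS, **fields}
    # stacklevel=2 (Python 3.8+) attributes the record to the caller.
    logger.log(level, message, extra=extra, stacklevel=2)


# Usage: write one record to an in-memory buffer and print it.
buffer = io.StringIO()
logger = build_logger("etl_job", stream=buffer)
log_event(
    logger,
    logging.INFO,
    "Fetched source file",
    requestid="instance-01",
    state="END",
    type="SCRIPT",
    modresponse="Success",
)
print(buffer.getvalue(), end="")
```

Keeping every field in a fixed position makes the lines easier to parse downstream, at the cost of placeholder values when a field does not apply.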
- We leveraged the ELK stack (Elasticsearch, Logstash, and Kibana) for logging.
- By using Kibana as the web UI for our logs, we could see our execution details as well as trace exceptions.
- We adopted a simple library called requests to handle our API calls.
- Spark vs. Pandas: By design, performing an operation on a dataframe in Spark creates a new dataframe/reference. This works well with large datasets but is a hindrance when the dataset is small. Hence, for smaller filtered datasets under 5MB, we decided to go with Pandas for quick dataframe manipulations.
- Automating compliance with coding standards was a huge time saver as most of our team was new to Python.
- We quickly realized that to test alongside all the dev work, we needed a TDD approach, which is where pytest came into our workspace. This proved extremely helpful.
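To make the last few points concrete, here is a small sketch of how the 5MB rule of thumb could be written test-first with pytest-style tests. The function `choose_engine`, the helper `_write_file`, and the use of binary megabytes for the threshold are assumptions for illustration only.

```python
import os
import tempfile

# 5 MB cut-off from the rule of thumb above; binary megabytes are an assumption.
PANDAS_MAX_BYTES = 5 * 1024 * 1024


def choose_engine(path, max_pandas_bytes=PANDAS_MAX_BYTES):
    """Return "pandas" for files under the threshold, otherwise "spark"."""
    return "pandas" if os.path.getsize(path) < max_pandas_bytes else "spark"


def _write_file(directory, name, size_bytes):
    """Create a file of exactly size_bytes bytes and return its path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as handle:
        handle.write(b"x" * size_bytes)
    return path


# pytest collects functions whose names start with "test_".
def test_small_file_uses_pandas():
    with tempfile.TemporaryDirectory() as tmp:
        path = _write_file(tmp, "small.csv", 1024)
        assert choose_engine(path) == "pandas"


def test_file_at_threshold_uses_spark():
    with tempfile.TemporaryDirectory() as tmp:
        path = _write_file(tmp, "large.csv", 2048)
        # A lowered threshold keeps the test fast while still covering the boundary.
        assert choose_engine(path, max_pandas_bytes=2048) == "spark"
```

Running `pytest` against a file like this would execute both tests; passing the threshold as a parameter keeps the boundary case cheap to check.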
In addition to being the right choice of tool for our job, exploring and learning Python allowed the team to work more closely together and bond more than ever.
As a team, we learned something new together and solved multiple issues as we hit walls, which took our team bonding to great heights!
I hope this has been helpful. I would love to learn what languages you and your team have adopted recently, especially in data-driven projects like this. Let me know in the comments!
These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2020 Capital One.