Email: email@example.com PH: 469-551-9178
Senior Data Engineer with 8+ years of professional IT experience and strength in Big Data
using the Hadoop framework, covering analysis, design, development, testing, documentation,
deployment, and integration with SQL and Big Data technologies.
Strong coding experience with Python and SQL for ETL.
Strong experience using major components of the Hadoop ecosystem like HDFS, YARN,
MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume,
Strength in Big Data architecture like Hadoop (Azure, Hortonworks, Cloudera) distributed
system, MongoDB, NoSQL.
Strength in Data Extraction, aggregations, and consolidation of data within AWS Glue using
Hands-on experience with ETL tools such as AWS Glue, using Data Pipeline to move data to AWS
Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide
ETL Solutions and Data Warehouse tools for reporting and data analysis.
Strong data orchestration experience using tools such as AWS Step Functions, Lambda, AWS
Data Pipeline, Apache Airflow.
Developed custom Airflow operators using Python to generate and load CSV files into GS from
SQL Server and Oracle databases.
Experienced in parsing structured and unstructured data files using DataFrames in PySpark.
Working experience with Spark (Spark Streaming, Spark SQL) with Scala and Kafka.
Extensive knowledge of RDBMSs such as Oracle, Microsoft SQL Server and MySQL.
Extensive experience working on various databases and database script development using SQL
Wrote to the Glue metadata catalog, which in turn enabled querying the refined data from Athena,
achieving a serverless querying environment.
Processed data in AWS Glue by reading data from MySQL data stores and loading into RedShift
Goldman Sachs & Co January 2022 – Present
Played a lead role in gathering requirements, analyzing the entire system, and estimating
development and testing efforts.
Involved in designing system components such as Sqoop, Hadoop processes involving MapReduce
and Hive, Spark, and FTP integration to downstream systems.
Wrote optimized Hive and Spark queries using window functions and customized Hadoop
shuffle and sort parameters.
Developed ETL jobs using PySpark with both the DataFrame API and the Spark SQL API.
Using Spark, performed various transformations and actions; the final result data was saved
back to HDFS and from there loaded into the target Snowflake database.
Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for
small data sets processing and storage, experienced in Maintaining the Hadoop cluster on AWS
Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and
Configured Spark Streaming to consume ongoing data from Kafka and store the stream
data in HDFS.
Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources
like S3, ORC/Parquet/text files into AWS Redshift.
Used various Spark transformations and actions for cleansing the input data.
Used Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous
Enforced standards and best practices around data catalog and data governance efforts.
Created DataStage jobs using different stages like Transformer, Aggregator, Sort, Join, Merge,
Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture,
Change Apply, Sample, Surrogate Key, Column Generator, Row Generator, etc.
Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow for ETL batch
processing to load into Snowflake for analytical processes.
Built ETL pipelines for data ingestion, transformation, and validation on AWS,
working alongside the data steward on data compliance.
Worked on scheduling all jobs using Airflow scripts using python added different tasks to DAG,
Used PySpark for extracting, filtering, and transforming data in data pipelines.
Skilled in monitoring servers using Nagios, CloudWatch and using ELK Stack Elasticsearch Kibana
Used dbt (data build tool) for transformations in the ETL process, along with AWS Lambda and AWS SQS.
Scheduled all jobs using Airflow scripts written in Python, adding tasks to DAGs
and defining dependencies between the tasks.
Experience in developing Spark applications using Spark SQL in Databricks for data extraction,
transformation, and aggregation from multiple file formats, analyzing and transforming the
data to uncover insights into customer usage patterns.
Responsible for estimating the cluster size, monitoring and troubleshooting of the Spark
Created Unix Shell scripts to automate the data load processes to the target Data Warehouse.
Responsible for implementing monitoring solutions in Ansible, Terraform, Docker, and Jenkins.
Environment: Red Hat Enterprise Linux 5, HDP 2.3, Hadoop, MapReduce, HDFS, Hive 0.14, Shell Script,
Sqoop 1.4.4, Python 3.2, PostgreSQL, Spark 2.4, Airflow, Snowflake.
Scipher Medicine, Waltham, MA August 2020 – December
Migrate data from on-premises to AWS storage buckets.
Develop a Python script to transfer data from on-premises to AWS S3.
Develop a Python script to call REST APIs and extract data to AWS S3.
Ingest data through cleansing and transformation steps, leveraging AWS
Lambda, AWS Glue and Step Functions.
Develop a Python script to extract data from Netezza databases and transfer it to AWS S3.
Develop Lambda functions and assign IAM roles to run Python scripts along with various
triggers (SQS, EventBridge, SNS).
Create a Lambda deployment function and configure it to receive events from S3 buckets.
Write UNIX shell scripts to automate jobs and schedule cron jobs for job automation
using crontab commands.
Develop various Mappings with the collection of all Sources, Targets, and Transformations using
Explore Spark to improve the performance and optimization of the existing algorithms in
Hadoop using Spark context, Spark-SQL, PostgreSQL, DataFrames, OpenShift, Talend, pair RDDs.
Develop Mappings using Transformations like Expression, Filter, Joiner and Lookups for better
data massaging and to migrate clean and consistent data.
Design and implement incremental Sqoop jobs to read data from DB2 and load it into Hive
tables, connected to Tableau through HiveServer2 for generating interactive reports.
Use Sqoop to channel data from different sources of HDFS and RDBMS.
Write programs using Spark to move data from the storage input location to the output location,
running data loading, validation, and transformation on the data.
Develop Spark applications using PySpark and Spark-SQL for data extraction, transformation and
aggregation from multiple file formats.
Use Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS
using Python and NoSQL databases such as HBase and Cassandra.
Collect data using Spark Streaming from an AWS S3 bucket in near real time, perform the
necessary transformations and aggregations on the fly to build the common learner data model,
and persist the data in HDFS.
Design, develop, and test dimensional data models using Star and Snowflake schema
methodologies under the Kimball method.
Work on Dimensional and Relational Data Modeling using Star and Snowflake Schemas,
OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
Automate the data processing with Oozie to automate data loading into the Hadoop Distributed
Erwin 9.8, Big Data 3.0, Hadoop 3.0, Oracle 12c, PL/SQL, Scala, Spark-SQL, PySpark, Python,
Kafka 1.1, SAS, Azure SQL, MDM, Oozie 4.3, SSIS, T-SQL, ETL, HDFS, Cosmos, MS Access.
First Republic Bank, New York, NY August 2019 – July 2020
Evaluated client needs and translated their business requirements into functional specifications,
thereby onboarding them onto the Spark ecosystem.
Developed Python scripts to load data into Hive and was involved in ingesting data into the Data
Warehouse using various data loading techniques.
Analyzed SQL scripts and designed the solutions to implement using PySpark.
Extracted, transformed and loaded data to generate CSV data files with Python programming
and SQL queries.
Extracted and updated the data into HDFS using Sqoop import and export.
Developed PySpark UDFs to incorporate external business logic into Hive scripts and developed
join data set scripts using Hive join operations.
Created Airflow Scheduling scripts in Python.
Designed data warehouse, ETL/ELT, and data load architectures within the Snowflake ecosystem.
Generated workflows through Apache Airflow, then Apache Oozie, for scheduling the Hadoop
jobs that control large data transformations.
Created, debugged, scheduled and monitored jobs using Airflow and Oozie.
Created various Hive external tables and staging tables and joined the tables as per the
requirement. Implemented static partitioning, dynamic partitioning, and bucketing.
Worked with various HDFS file formats like Parquet and JSON for serializing and deserializing.
Worked with Spark to improve the performance and optimization of the existing algorithms in
Hadoop using Spark Context, Spark-SQL, PySpark, pair RDDs, Spark YARN.
Used Spark for interactive queries, processing of streaming data and integration with popular
NoSQL databases for huge volumes of data.
Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-
structured data coming from various sources.
Tuned and optimized Cassandra queries, including compaction strategy.
Created Bash scripts to automate the configuration and build of new Cassandra nodes.
Supported 2 Confidential Cassandra production clusters (6-12 nodes): decommissioned nodes,
added/replaced nodes, managed backups, tuned Cassandra databases, managed alerts via OpsCenter,
and integrated with Nagios.
Queried and analyzed data from Datastax Cassandra for quick searching, sorting and grouping
Used the Spark Cassandra Connector to load data to and from Cassandra.
Supported the preparation of long-term BI plans and Snowflake data warehouse concepts and
ensured that they were in line with business objectives.
Utilized advanced SQL queries on transactional data stores to engineer and administer data
pipelines between our Snowflake data warehouse and application layer.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the
data using the SQL Activity.
Developed ETL Processes in AWS Glue to migrate data from external sources like S3,
ORC/Parquet/Text Files into AWS RedShift.
Installed application on AWS EC2 instances and configured the storage on S3 buckets.
Stored data in AWS S3, used similarly to HDFS, and ran EMR programs on the stored data.
Used the AWS-CLI to suspend an AWS Lambda function.
Used AWS CLI to automate backups of ephemeral data-stores to S3 buckets, EBS.
Designed the data models used in data-intensive AWS Lambda applications that
perform complex analysis and create analytical reports for end-to-end traceability, lineage,
and definition of key business elements from Aurora.
Implemented Cluster for NoSQL tool HBase as a part of POC to address HBase limitations.
Used Spark DataFrame operations to perform required validations on the data and to perform
analytics on the Hive data.
Developed Apache Spark applications by using spark for data processing from various streaming
Worked on architecture and components of Spark, and efficient in working with Spark Core,
Migrated MapReduce jobs to Spark jobs to achieve better performance.
Designed the MapReduce and Yarn flow and writing MapReduce scripts, performance tuning
Used HUE for running Hive queries.
Created partitions by day using Hive to improve performance.
Developed Oozie workflow engine to run multiple Hive, Pig, Sqoop and Spark jobs.
Worked on auto scaling the instances to design cost effective, fault tolerant and highly reliable
Python, SQL, Spark, Snowflake, Hadoop (HDFS, MapReduce), YARN, Hive, Pig, HBase, Oozie, Hue,
Sqoop, Flume, Cassandra, Oracle, NiFi, Azure Databricks, Azure Data Lake, AWS Services
(Lambda, EMR, Autoscaling).
Homesite Insurance, Boston, MA December 2017 – July 2019
Worked on the complete Big Data flow of the application, from upstream data ingestion into
HDFS through processing and analyzing the data in HDFS, using Python.
Developed Python scripts for regular expression (regex) project in the Hadoop/Hive
Used Spark-Streaming APIs to perform necessary transformations and actions on the data
obtained from Kafka.
Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of
Analyzed the SQL scripts and designed the solution to implement using Scala.
Used Spark-SQL to load JSON data, created a SchemaRDD, loaded it into Hive tables, and
handled structured data using Spark SQL.
Used the AWS Glue catalog with a crawler to get the data from S3 and perform SQL query operations.
Used Elasticsearch for indexing and full-text searching.
Data ingestion, Airflow Operators for Data Orchestration, and other related python libraries.
Worked on Airflow 1.8 (Python2) and Airflow 1.9 (Python3) for orchestration and familiar with
building custom Airflow operators and orchestration of workflows with dependencies involving
Coded and developed a custom Java-based Elasticsearch wrapper client using the Jest API.
Used AWS Glue for data transformation, validation, and cleansing.
Used Python Boto3 to configure the AWS Glue, EC2 and S3 services.
Used AWS services like EC2, S3, Autoscaling and DynamoDB.
Worked with data streaming process with Kafka, Apache Spark, Hive.
Worked with various HDFS file formats like Avro, Sequence File, JSON and various compression
formats like Snappy, bzip2.
Tested Apache Airflow for building high performance batch and interactive data processing
Configured Flume to extract the data from the web server output files to load into HDFS.
Used Flume to collect, aggregate and store the web log data from different sources like web
servers, mobile and network devices and pushed into HDFS.
Worked on analytics queries for retrieval of data from Hadoop and Presto.
Utilized Flume to filter the input data, retrieving only the data needed to perform
analytics, by implementing Flume interceptors.
Worked on analyzing the Hadoop cluster and different big data analytic tools including Pig, Hive.
Implemented real-time analytics on Cassandra data using the Thrift API.
Designed column families in Cassandra, ingested data from RDBMS, performed
transformations and exported the data to Cassandra.
Queried and analyzed data from Cassandra for quick searching, sorting, and grouping through
Designed and developed automated processes using shell scripting for data movement.
Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-
processing with Pig.
Loaded data from UNIX file system to HDFS using Shell Scripting.
Python, Spark, Hadoop (HDFS, MapReduce), Hive, Presto, Cassandra, Pig, Sqoop, Airflow,
Hibernate, Spring, Oozie, AWS Services EC2, S3, Autoscaling, Elasticsearch, DynamoDB, UNIX
Shell Scripting, Tez.
Macy’s, New York, NY October 2015 – November 2017
Data Engineer/ Hadoop
Imported data from relational data sources to HDFS using Sqoop and Python.
Experienced in various data modeling concepts like star schema, snowflake schema in the
Collected and aggregated large amounts of log data and staging data in HDFS for further analysis
Worked with NoSQL database HBase in creating HBase tables to load large sets of semi-
structured data coming from various sources.
Integrated Map Reduce with HBase to import bulk amount of data into HBase using Map Reduce
Developed Pig UDFs to pre-process the data for analysis and migrated ETL operations into
the Hadoop system using Python scripts.
Used Pig as ETL tool to do transformations, event joins, filtering and some pre-aggregations
before storing the data into HDFS.
Worked on testing and migration to Presto.
Used Hive to query data moved into HBase from various sources.
Migrated HiveQL queries to Impala to minimize query response time.
Coded optimized Teradata batch processing scripts for data transformation, aggregation and
load using BTEQ.
Worked with Microsoft Azure Cloud services, Storage Accounts and Virtual Networks.
Provisioned IaaS and PaaS virtual machines and Web/Worker roles on Microsoft Azure Classic
and Azure Resource Manager.
Migrated on-premises or Azure Classic instances to an Azure ARM subscription with
Azure Site Recovery.
Experienced in Azure Fabric Services.
Experienced in full SDLC that includes Analyzing, Designing, Coding, Testing, implementation &
Production Support as per quality standards using Waterfall, Agile methodology.
Experienced with various Azure services like Compute (Web Roles, Worker Roles), Azure Websites,
Caching, SQL Azure, NoSQL, Storage, Network services, Azure Active Directory, API
Management, Scheduling, Auto Scaling, and PowerShell Automation.
Migrated Teradata tables to Hadoop using Hive and orchestration via internal python
Used Scala to convert Hive/AWS queries into RDD transformations in Apache Spark.
Developed Kafka producer and consumers, HBase clients, Apache Spark and Hadoop
MapReduce jobs along with components on HDFS, Hive.
Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
Used Teradata to build Hadoop project and as ETL project.
Extensively used Teradata utilities like TPump, FastLoad, FastExport, MultiLoad and TPT.
Built Tableau visualizations against data sources (flat files, Excel) and relational databases (Teradata).
Developed small distributed applications in our projects using Zookeeper and scheduled the
workflows using Oozie.
Developed several shell scripts, which act as wrappers to start these Hadoop jobs and set the
Moved all log files generated from various sources to HDFS for further processing through
Developed testing scripts in Python, prepared test procedures, analyzed test results data and
suggested improvements to the system and software.
Python, Spark, Hadoop (HDFS, MapReduce), Hive, Pig, Sqoop, Flume, YARN, HBase, Oozie, Java,
SQL scripting, Teradata Utilities BTEQ, FastLoad, MultiLoad & TPump, Linux shell scripting, Eclipse
Dhruv Soft Services Private Limited, Hyderabad, India April 2014 – July 2015
Worked with the BI team in gathering report requirements and used Sqoop to export data into HDFS
Worked on the below phases of analytics using R, Python and Jupyter Notebook.
a. Data collection and treatment: Analyzed existing internal data and external data, worked on
entry errors and classification errors, and defined criteria for missing values.
b. Data mining: Used cluster analysis to identify customer segments, decision trees to classify
profitable and non-profitable customers, and market basket analysis for customer purchasing
behavior and part/product association.
Developed multiple Map Reduce jobs in Java for data cleaning and preprocessing.
Assisted with data capacity planning and node forecasting.
Installed, Configured and managed Flume Infrastructure.
Administered Pig, Hive and HBase, installing updates, patches and upgrades.
Worked closely with the claims processing team to obtain patterns in filing of fraudulent claims.
Worked on performing major upgrade of cluster from CDH3u6 to CDH4.4.0.
Developed Map Reduce programs to extract and transform the data sets and results were
exported back to RDBMS using Sqoop.
Performed ETL from source systems to Azure data storage services using a combination of Azure
Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure
DW) and processed the data in Azure Databricks.
VISPR Patterns were observed in fraudulent claims using text mining in R and Hive.
Exported the required information to RDBMS using Sqoop to make the data available to
the claims processing team to assist in processing claims.
Developed Map Reduce programs to parse the raw data, populate staging tables and store the
refined data in partitioned tables in the EDW.
Created tables in Hive and loaded the structured data resulting from MapReduce jobs.
Developed many HiveQL queries to extract the required information.
Created Hive queries that helped market analysts spot emerging trends by comparing fresh data
with EDW reference tables and historical metrics.
Imported the data (mostly log files) from various sources into HDFS using Flume.
Enabled speedy reviews and first mover advantages by using Oozie to automate data loading
into the Hadoop Distributed File System and PIG to pre-process the data.
Provided design recommendations and thought leadership to sponsors/stakeholders that
improved review processes and resolved technical problems.
Extracted data tables using custom-generated queries from Excel files and Access / SQL
databases, and provided the extracted data as input to Tableau.
Managed and reviewed Hadoop log files.
Tested raw data and executed performance scripts.
HDFS, PIG, HIVE, Map Reduce, Linux, HBase, Flume, Sqoop, R, VMware, Eclipse, Cloudera and
Bachelor’s in engineering, Computer Science, Osmania University, 2012