LightningFlow Instructions
- 1. Recommended steps - Perform the following steps before launching the LightningFlow EC2. They are recommended but NOT mandatory:
- 1.1 Create an IAM role to be attached to the LightningFlow EC2 instance. For example, if you intend to use the LightningFlow EC2 to run Spark ETL jobs on objects present in S3, then appropriate S3 permissions need to be added to the IAM role (a minimal boto3 sketch follows this list).
- 1.2 Create a Windows EC2 instance (preferably a t2.small) for accessing the Airflow UI. Alternatively, if you intend to allow access to the Airflow UI from your organization intranet, this step is not required. However, please see the instructions on allowing HTTP in the "Connecting to the Airflow Instance" section below.
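If you prefer to create the instance role programmatically, the following is a minimal sketch using boto3. The role name, instance profile name, and the attached managed S3 policy are illustrative placeholders only (not part of the product); scope the permissions to your own use case.

```python
# Hypothetical sketch: create an instance role for the LightningFlow EC2 with boto3.
# Role/profile names and the attached S3 policy are placeholders -- scope them to your use case.
import json
import boto3

iam = boto3.client("iam")

assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Create the role that the EC2 instance will assume.
iam.create_role(
    RoleName="lightningflow-ec2-role",                      # placeholder name
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Attach the S3 permissions needed by your Spark ETL jobs (example managed policy).
iam.attach_role_policy(
    RoleName="lightningflow-ec2-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",  # replace with a scoped policy
)

# EC2 attaches roles through an instance profile.
iam.create_instance_profile(InstanceProfileName="lightningflow-ec2-profile")
iam.add_role_to_instance_profile(
    InstanceProfileName="lightningflow-ec2-profile",
    RoleName="lightningflow-ec2-role",
)
```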
- 2. Launching the LightningFlow EC2:
- 2.1 From AWS Marketplace, click on the "Continue to Subscribe" button at the top right corner.
- 2.2 Under Terms and Conditions, review the EULA (End User License Agreement) and click on "Accept Terms".
- 2.3 Please wait for the subscription "Effective Date" to appear.
- 2.4 Click on "Continue to Configuration".
- 2.5 On the "Configure this Software" page, select the appropriate options from the drop-down menus and click on "Continue to Launch".
- 2.6 On the "Launch this Software" page, under Choose Action, select "Launch through EC2" or "Launch from website".
- 2.7 If you select "Launch from website"
- EC2 instance type: Select the instance type. We recommend m4.2xlarge for optimum performance.
- VPC Settings - Select the VPC.
- Subnet Settings - Select the appropriate subnet.
- Security group settings - "Create New Based On Seller Settings"
- Enter a name and description for the security group. In the inbound rules, for SSH (Port 22), select "Custom IP" for "Source (IP or Group)" and enter the CIDR range that you wish to allow. For a quick test, select "My IP". If you want to access the Airflow EC2 and Airflow UI from a particular machine, enter that machine's security group, or its specific IP followed by /32.
- For HTTP (Port 8080, the Airflow UI) and HTTP (Port 8998, Apache Livy), please update "Source (IP or Group)" in the same way.
- Key Pair Settings - We recommend creating a new key pair using "Create a key pair in EC2". This opens a new tab; click on "Create Key Pair", enter a "Key pair name", and download the .pem file.
- Click on "Launch"
- 2.8 If you select "Launch through EC2", select the appropriate options and click on "Launch" to create the LightningFlow EC2.
- Please note: Under "Add Storage" you may increase the root volume or add an EBS volume; the default root volume is 100 GB. If you intend to run Spark jobs, it is recommended to add an EBS volume sized for your dataset. A programmatic equivalent using boto3 is sketched after this section.
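For teams automating the launch, the sketch below shows a roughly equivalent boto3 call. The AMI ID, subnet, security group, key pair, and instance profile names are placeholders; take the actual AMI ID from your Marketplace subscription.

```python
# Hypothetical sketch: launch the LightningFlow EC2 programmatically with boto3.
# The AMI ID, subnet, security group, key pair and profile names below are placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",          # LightningFlow AMI ID from your Marketplace subscription
    InstanceType="m4.2xlarge",                # recommended instance type
    KeyName="lightningflow-keypair",          # key pair created in step 2.7
    SubnetId="subnet-xxxxxxxx",
    SecurityGroupIds=["sg-xxxxxxxx"],         # security group with ports 22, 8080 and 8998 opened
    IamInstanceProfile={"Name": "lightningflow-ec2-profile"},
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",            # root device name depends on the AMI; default root size is 100 GB
        "Ebs": {"VolumeSize": 200, "VolumeType": "gp2"},
    }],
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```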
- 3. Connecting to the Airflow Instance:
- 3.1 Use puttygen to generate a .ppk file from the .pem file (key pair) downloaded when creating the instance.
- 3.2 If you need to connect (SSH) directly to the LightningFlow EC2, ensure that the Security group has SSH (Port 22) enabled from your machine IP. For example, if your machine IP is 172.5.5.5 then add 172.5.5.5/32 in the "Source". If you need to connect from anywhere within the organization, then add the organization CIDR range in the "Source". If you need to connect from a Jump server, ensure that either the IP or the Security group of the Jump server is allowed in the inbound rule.
- 3.3 To allow access to the Airflow UI from any machine within your organization, ensure HTTP (Port 8080) is allowed from the entire organization CIDR. If you intend to limit access to the Airflow EC2 for security reasons, allow HTTP (Port 8080) only from a specific machine or CIDR range. A boto3 sketch for adding these inbound rules follows this section.
- 3.4 To view the Airflow UI, either from a local machine or a specific machine, browse to the following URL: http://{LightningFlow EC2 Public/Private IP}:8080
- 3.5 Once the LightningFlow EC2 is launched, go to the EC2 console and obtain the Instance ID.
- 3.6 From the browser, open http://InstanceIP:8080
- 3.7 Log in using 'admin' as the username and the 'Instance ID' as the password.
- 3.8 To enable LDAP authentication, please follow the Airflow guide available at https://airflow.apache.org/docs/stable/security.html#ldap
- 3.9 To run Airflow CLI commands on the EC2, switch to the airflow user with the command 'su - airflow'. Please note: after any changes to airflow.cfg, restart the Airflow webserver using the command "sudo service airflow-webserver restart".
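If you manage the security group programmatically, the following is a minimal boto3 sketch that opens ports 22, 8080, and 8998 to a single CIDR. The security group ID and CIDR range are placeholders.

```python
# Hypothetical sketch: open SSH (22), Airflow UI (8080) and Livy (8998) to a specific CIDR with boto3.
# Replace the security group ID and CIDR range with your own values.
import boto3

ec2 = boto3.client("ec2")

ALLOWED_CIDR = "172.5.5.5/32"   # a single machine; widen to your organization CIDR if required
SG_ID = "sg-xxxxxxxx"           # security group attached to the LightningFlow EC2

ec2.authorize_security_group_ingress(
    GroupId=SG_ID,
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": port, "ToPort": port,
         "IpRanges": [{"CidrIp": ALLOWED_CIDR, "Description": desc}]}
        for port, desc in [(22, "SSH"), (8080, "Airflow UI"), (8998, "Apache Livy")]
    ],
)
```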
- 4. Adding DAGs:
- 4.1 Place your DAGs (.py files) in /home/airflow/airflow/dags/
- 4.2 The Airflow scheduler will pick up the newly added DAGs, which will then be visible in the Airflow UI. A minimal example DAG is sketched below.
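As an illustration, here is a minimal DAG you could drop into /home/airflow/airflow/dags/. The file name and task are hypothetical, and the import path assumes an Airflow 1.10.x release (the Airflow 2.x paths differ slightly).

```python
# Minimal example DAG -- save as e.g. /home/airflow/airflow/dags/hello_lightningflow.py
# (import path assumes an Airflow 1.10.x release; Airflow 2.x uses airflow.operators.bash).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="hello_lightningflow",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A trivial task, just to confirm the scheduler picks the DAG up.
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from LightningFlow'",
    )
```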
- 5. Sample DAG with Spark Livy Operator
- 5.1 Navigate to the folder /home/airflow/airflow/dags/
- 5.2 An example DAG with the name example_spark_livy_operator.py is available. This DAG can be reused as a template to create new DAGs that call custom Spark scripts. For example, the given DAG template calls a Spark script /home/airflow/dataprocessing/spark_script.py. Custom Spark scripts can be placed in the /home/airflow/dataprocessing/ directory and called from within new DAGs created using the template.
- 5.3 On the Airflow Web UI, the sample DAG would be visible as "example_spark_livy_operator".
- 5.4 Once a new DAG is created, the scheduler will pick it up automatically; after refreshing the Airflow Web UI, the new DAG should be available. A sketch of submitting a Spark script through Livy's REST API is shown below.
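The bundled example_spark_livy_operator.py remains the reference template. For orientation only, the sketch below shows one way a DAG task can submit /home/airflow/dataprocessing/spark_script.py to the local Livy server (port 8998) through its REST batch API and poll for completion. The DAG ID, task name, and polling interval are illustrative and not taken from the bundled example.

```python
# Hypothetical sketch of a DAG task that submits /home/airflow/dataprocessing/spark_script.py
# to the local Livy server via its REST batch API and waits for completion.
# The bundled example_spark_livy_operator.py remains the reference template.
import time
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path

LIVY_URL = "http://localhost:8998"

def submit_spark_batch():
    # Submit the batch; the path format should match how the bundled example
    # references the script (a plain local path or the "local:" scheme).
    resp = requests.post(
        LIVY_URL + "/batches",
        json={"file": "local:/home/airflow/dataprocessing/spark_script.py"},
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    batch_id = resp.json()["id"]

    # Poll the batch state until Livy reports a terminal state.
    while True:
        state = requests.get("{}/batches/{}/state".format(LIVY_URL, batch_id)).json()["state"]
        if state in ("success", "dead", "killed"):
            break
        time.sleep(30)
    if state != "success":
        raise RuntimeError("Livy batch {} finished in state {}".format(batch_id, state))

with DAG(
    dag_id="custom_spark_livy_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_spark = PythonOperator(
        task_id="run_spark_script",
        python_callable=submit_spark_batch,
    )
```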
For details on configuring the AWS Access Key ID and Secret Access Key for enabling S3 access, please follow the steps documented here (a minimal PySpark illustration also follows the link):
https://3mlabs-static-website.s3.amazonaws.com/Pyspark+S3+connectivity.html
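The linked page is the authoritative walkthrough. As a hedged illustration, a PySpark script typically passes S3 credentials through the Hadoop s3a properties, or simply relies on the attached IAM role; the bucket and paths below are placeholders, not part of the product.

```python
# Hypothetical sketch: reading from / writing to S3 in a PySpark script using s3a.
# Prefer the attached IAM role where possible; explicit keys are shown only for illustration
# and should never be hard-coded in production scripts. Assumes the hadoop-aws connector
# jars are available on the Spark classpath (see the linked page).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lightningflow-s3-example")
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")      # omit when using the IAM role
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")  # omit when using the IAM role
    .getOrCreate()
)

# Example read/write against a placeholder bucket.
df = spark.read.csv("s3a://your-bucket/input/", header=True)
df.write.mode("overwrite").parquet("s3a://your-bucket/output/")
```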
If you need any assistance with configuring, deploying, and running scalable jobs on AWS EMR in a production environment, please reach out to us at info@lightning-analytics.com for more details.
Generic information about the solution
- 6. Use Cases:
- 6.1 The AMI can be used to launch an EC2 instance. The AMI is packaged with the Airflow components (webserver, scheduler, and worker configurations), a local Spark cluster, Apache Livy, and a PostgreSQL database. Once the instance is ready, it can be used to run DAGs for orchestrating ETL jobs. For example:
- 6.2 Ingestion pipelines: Separate DAGs can be created to build ingestion pipelines from different source systems. Each ingestion pipeline can implement its own logic to connect and authenticate to the source system, pull the files, and write them to the specified S3 bucket (a sketch follows this list).
- 6.3 ETL pipelines: Each ETL pipeline DAG can implement logic for cleansing, transformation, and aggregation, and write the results to S3 or Redshift.
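As a hedged illustration of the ingestion pattern in 6.2, the DAG below pulls a file over HTTPS and lands it in S3. The source URL, bucket, and object key are placeholders, and the import path assumes an Airflow 1.10.x release.

```python
# Hypothetical ingestion task: pull a file from a source system over HTTPS and land it in S3.
# The source URL, bucket and object key are placeholders for your own systems.
from datetime import datetime

import boto3
import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path

def ingest_file():
    payload = requests.get("https://example.com/exports/daily.csv", timeout=60)
    payload.raise_for_status()
    boto3.client("s3").put_object(
        Bucket="your-landing-bucket",
        Key="raw/daily.csv",
        Body=payload.content,
    )

with DAG(
    dag_id="example_ingestion_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_daily_export", python_callable=ingest_file)
```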
- 7. Typical deployment use case:
- 7.1 The solution just launches an EC2 instance with all the required components pre-integrated. Once the instance is launched, no specific configuration is required to be able to start running Airflow DAGs
- 7.2 The deployment requires an IAM role to be created before launching the EC2 instance. The role will need to be specified when launching the instance.
- 7.3 Typical IAM permissions include S3, Redshift, Secrets Manager, EMR, Glue, and Athena permissions, depending on the use case. A sample least-privilege S3 policy is sketched below.
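As an illustration of least privilege (see also section 18), the sketch below attaches an inline policy that grants read/write on a single bucket only. The role name, policy name, and bucket are placeholders; statements for Redshift, Secrets Manager, EMR, Glue, Athena, or KMS would be added the same way, only as the use case demands.

```python
# Hypothetical least-privilege policy scoped to a single S3 bucket (bucket name is a placeholder).
# Add further statements (Redshift, Secrets Manager, EMR, Glue, Athena, KMS) only as required.
import json
import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListBucket",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::your-etl-bucket"],
        },
        {
            "Sid": "ReadWriteObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": ["arn:aws:s3:::your-etl-bucket/*"],
        },
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="lightningflow-ec2-role",          # the role described in section 7.2
    PolicyName="lightningflow-s3-access",       # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)
```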
- 8. Typical deployment options:
- 8.1 The solution is available as an AMI; hence, the deployment depends upon the use case. The AMI can be used to launch instances in single-AZ, multi-AZ, or multi-region configurations for high availability when used in an auto-scaling cluster.
- 9. Expected time to complete the deployment:
- 9.1 The typical overall time to launch the EC2 is between 2 and 10 minutes, depending upon the deployment tooling (e.g., a CloudFormation stack, or Jenkins with Terraform). Once the IAM role is created and the configuration options are selected, the remaining time is simply how long the EC2 takes to launch, which is typically less than 2 minutes.
- 10. Technical pre-requisites:
- 10.1 An IAM role for the EC2 instance, as discussed in Section 7.2 above.
- 10.2 If a KMS key is used for default S3 bucket encryption, the IAM role must have kms:Encrypt and kms:Decrypt permissions on the KMS key.
- 10.3 The AMI is pre-built on CentOS and configured with the required libraries and a built-in PostgreSQL database, so no separate prerequisites are needed.
- 11. Required Technical skills or specialized knowledge:
- 11.1 The deployment requires familiarity with AWS services such as IAM, S3, and EC2.
- 11.2 Basic knowledge of Apache Airflow is necessary to work with Airflow DAGs. Knowledge of EMR configuration and tuning may also be required, depending on the use case.
- 12. Environment configuration:
- 12.1 No specific environment configuration is required, except an IAM role as specified in Sec. 7.2
- 13. Architecture Diagram:
- The following architecture diagram depicts the Airflow implementation for workflow orchestration and persistence.
- 14. Standard deployment:
- There is no standard deployment that is applicable for the solution packaged as an AMI
- 15. Typical customer deployment:
- 16. Customer data storage:
- Typically customer data can be stored in S3
- 17. Use of root permissions:
- Warning! Using the root user to deploy or install packages is strictly not recommended. Please use the 'centos' user to run CLI commands for deployment.
- 18. Policy of least privilege:
- Per AWS guidelines on granting IAM privileges, please follow the standard best practice of least privilege and add only the specific permissions that are required. For more information, please refer to the AWS IAM best practices documentation.
- 19. Application monitoring and EC2 health checks
- To monitor the health of the application, CloudWatch alarms on EC2 and application metrics can be configured to send SNS notifications to subscribed users. For more information, please refer to the Amazon CloudWatch documentation. A boto3 sketch of a basic instance health-check alarm follows.
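As a hedged illustration, the boto3 sketch below creates a CloudWatch alarm on the EC2 status-check metric and routes it to an SNS topic; the instance ID and topic ARN are placeholders, and users must be subscribed to the topic separately.

```python
# Hypothetical sketch: a CloudWatch alarm on the EC2 status check that notifies an SNS topic.
# The instance ID and SNS topic ARN are placeholders; subscribe users to the topic separately.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="lightningflow-ec2-status-check",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed",
    Dimensions=[{"Name": "InstanceId", "Value": "i-xxxxxxxxxxxxxxxxx"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:lightningflow-alerts"],  # placeholder topic ARN
    AlarmDescription="Alert when the LightningFlow EC2 fails its status checks.",
)
```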