Splunk on AWS EC2 Cloud

What is Splunk?

Splunk is a log management, monitoring and reporting tool with search capabilities, aimed at IT system administrators. It collects logs, metrics and other data from applications, servers and network devices and indexes them in a searchable repository, from which it can generate graphs, reports and alerts. Splunk is easy to set up on an AWS machine, using EBS volumes for archival storage and either periodically syncing the archive from EBS to an S3 bucket or taking EBS snapshots to back up the logs for future use.

It is generally hard to track logs across servers. We already have monitoring tools such as Nagios and Zabbix; Splunk is a broader solution that provides visibility into dynamic and complex environments. For example, suppose an application seems very slow, not because the application itself has an issue but because the server is short of free memory. Details of this kind can be obtained from the Splunk server.

Why do we go for Splunk?

In auto-scaled environments, where instances run behind a load balancer, servers scale up and down, and sometimes an instance gets terminated without any alert. In such situations it helps to have the login sessions and the server access logs from the time the server went down, so that we can track the reason. Managing logs on each server is hard, and the logs are spread across different locations. To address this problem, we set up Splunk to listen on a TCP port and have all the other servers pass their logs to this host, which gives a centralized, indexed log repository for all of your services.

Here I will guide you through deploying Splunk on AWS EC2 and configuring the Splunk forwarder on a remote machine. Splunk is very flexible and easy to install on any server. Consult Splunk's hardware capacity planning guidance to size the hardware for your deployment.

Once you have installed the Splunk server, start it using the command given below:
[NOTE: Here Splunk is installed in the /opt location]

/opt/splunk/bin/splunk start

Now you can access the Splunk web UI using the URL given below:


Splunk needs to be configured so that it can receive data from the remote machine. To do this, follow these steps:

1. Log in to the Splunk web UI.
2. Go to Manager -> Forwarding and receiving -> Receive data
3. Click on the New button and add the default port, i.e. 9997
4. Click on the Save button to save the settings.
NOTE: Make sure that the port is open on the server so that it can accept data from the remote machine.
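One quick way to confirm that the receiving port is reachable from the remote machine is a plain TCP connection test. The sketch below is only an illustration and is not part of Splunk; the function name port_is_open is made up for this example, and 9997 is the default receiving port mentioned above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable: treat as closed.
        return False
```

Run it on the forwarder machine as port_is_open("<ip of the Splunk server>", 9997). A False result usually points to a security group or firewall rule that still blocks the port, or to a receiver that has not been enabled yet.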

Next, install the Splunk forwarder on the remote machine. Once you have installed the forwarder, start it as shown below:

/opt/splunk/bin/splunk start

Then enable the forwarder using the commands below and restart Splunk.

./splunk enable app SplunkLightForwarder -auth
Splunk username: admin
Password: changeme
./splunk add forward-server -auth admin
./splunk restart

After a few minutes you can see Splunk indexing all the logs on the real-time dashboard.

In a typical Splunk deployment, a deployment server pushes configuration to deployment clients, which are grouped into server classes. The Splunk deployment server is a centralized manager for several Splunk instances, known as deployment clients. A deployment client is a Splunk instance installed on a remote machine, which passes its logs on to the Splunk deployment server.



Splunk collects data from remote machines, covering both machine-to-machine and human-to-machine interaction. It indexes the collected data and uses it to generate reports and drive alerts. Email alerts can be configured for specific conditions; for example, we can have an alert mail sent whenever a log containing error messages is found. Splunk handles these large volumes of data and provides visibility and intelligence to IT and the data warehouse. It can also perform real-time and historical analysis of the bulk data from the remote machines.

It is easy to use and to install, and its simple deployment method sets it apart from other tools. Splunk is very useful to development teams for finding and fixing bugs, and it also provides real-time insights.
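To make the alert condition concrete, here is a small sketch of the kind of check such an alert performs: scan log lines and collect the ones that contain an error message. This is plain Python for illustration only, not Splunk's search language or alerting API; the function name find_error_lines and the sample log lines are invented for this example.

```python
def find_error_lines(lines, keyword="error"):
    """Return (line_number, text) pairs for lines containing keyword, ignoring case."""
    matches = []
    for number, text in enumerate(lines, start=1):
        if keyword.lower() in text.lower():
            matches.append((number, text.rstrip("\n")))
    return matches


# Invented sample log lines, used only to exercise the function.
sample = [
    "2012-01-01 10:00:01 INFO service started\n",
    "2012-01-01 10:00:05 ERROR out of memory\n",
    "2012-01-01 10:00:09 INFO request served\n",
]
alerts = find_error_lines(sample)
```

In this sample only the second line matches, so an alert mail would be triggered for that one event.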

Mapreduce using Hadoop + pig/hive on AWS EC2 hadoop cluster

This article discusses running MapReduce jobs using the Apache tools Pig and Hive. Before we can process the data, we need to upload the files to be processed to HDFS or S3. We recommend uploading to HDFS and keeping the important files in S3 as a backup. S3 is easily accessible from the command line using tools like s3cmd. HDFS is a fault-tolerant cluster filesystem which protects your data against instance failures.


MapReduce is a programming model and an associated implementation for processing and generating large data sets. We can specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
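The model is easier to see with a tiny example. The sketch below runs a word count in a single Python process: a map function emits intermediate (word, 1) pairs, the pairs are grouped by key, and a reduce function merges the values for each key. It only imitates the programming model; it does not use Hadoop, and the names map_fn, reduce_fn and run_job are made up for this illustration.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_job(records):
    """Run map, shuffle (group by key) and reduce in a single process."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())


# Input records are (line number, line text) pairs.
lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The end")]
word_counts = run_job(lines)
```

On a real cluster the map and reduce calls run on different machines and the grouping step is done by the framework, but the two functions you write have the same shape.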

The main entities involved when Hadoop runs a job are:

  1. The client, which submits the MapReduce job.
  2. The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
  3. The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
  4. The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.

Hadoop Map/Reduce is very powerful, but

o   Requires a Java Programmer.

o   Harder to write and also time consuming.

o   Difficult to update frequently.

A solution is to run jobs using Pig (Pig Latin) or Hive (HiveQL).


Pig:

• Is an engine for executing programs on top of Hadoop

• Provides a language, Pig Latin, to specify these programs

Pig has two main parts:

– A high level language to express data analysis

– Compiler to generate mapreduce programs (which can run on top of Hadoop)

Pig Latin is the name of the language in which Pig scripts are written. Pig also provides an interactive shell, called Grunt, for executing simple commands. Pig Latin is a high-level language. Pig runs on top of Hadoop: it reads the data to be processed from the Hadoop HDFS filesystem and submits the jobs to the Hadoop MapReduce system.

A sample MapReduce job (like a Hello World program) using Pig is given below.

It is assumed that you are on one of the machines that is part of a Hadoop cluster with the NameNode/DataNode as well as the JobTracker/TaskTracker set up.

We will be executing Pig Latin commands using the Grunt shell. Switch to the hadoop user first.

Suppose we have a file 'users' on our local filesystem which contains the data to be processed. First we have to upload it to HDFS. Then run:

# pig -x mapreduce

This command will take you to the Grunt shell. Pig Latin statements are generally organized in the following manner:

A LOAD statement reads data from the file system. Then we process the data and write the output to the file system using a STORE statement. A DUMP statement displays output on the screen.

grunt> Users = load 'users' as (name, age);

grunt> Fltrd = filter Users by age >= 18 and age <= 25;

grunt> Pages = load 'pages' as (user, url);

grunt> Jnd = join Fltrd by name, Pages by user;

grunt> Grpd = group Jnd by url;

grunt> Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;

grunt> Srtd = order Smmd by clicks desc;

grunt> Top5 = limit Srtd 5;

grunt> store Top5 into 'top5sites';
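If the Pig Latin statements are new to you, it may help to see the same steps in plain Python on a handful of rows. This is only a sketch of what the script computes: it runs in memory, not on Hadoop, the sample rows are invented, and it assumes user names are unique.

```python
from collections import Counter

# Invented sample rows standing in for the 'users' and 'pages' files.
users = [("alice", 20), ("bob", 30), ("carol", 25), ("dave", 17)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("carol", "a.com"),
    ("bob", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
]


def top_sites(users, pages, limit=5):
    # filter Users by age >= 18 and age <= 25
    filtered = {name for name, age in users if 18 <= age <= 25}
    # join Fltrd by name, Pages by user; group by url; COUNT as clicks
    clicks = Counter(url for user, url in pages if user in filtered)
    # order by clicks desc; limit
    return clicks.most_common(limit)


top5 = top_sites(users, pages)
```

Here only alice and carol pass the age filter, so a.com ends up with two clicks and b.com and c.com with one each. Pig produces the same kind of result, but compiles the steps into MapReduce jobs that run across the cluster.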

We can also view the progress of the job through the web interface http://<ipaddress of jobtracker machine>:50030.

Tools like PigPen (an Eclipse plugin) are available to help us create Pig scripts, test them using the example generator and then submit them to a Hadoop cluster.

Another tool is Oozie, a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop Map/Reduce and Pig jobs.

Pig tasks can be modeled as a workflow in Oozie. These are deployed to the Oozie server using a command line utility. Once deployed, the workflows can be started and manipulated as necessary using the same utility. Once the workflow is started, Oozie will run through each flow. The web console for the Oozie server can be used to monitor the progress of the various workflow jobs being managed by the server.



Pig was causing some slowdowns at Facebook, as it required training to bring business intelligence users up to speed. So the development team decided to write Hive, which has an SQL-like syntax.

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop. It provides tools for querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called HiveQL, that enables users familiar with SQL to query the data. It also allows custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Some HiveQL queries are given below; they are very similar to SQL.

# show tables;

# describe <tablename>;

# SELECT * FROM <tablename> LIMIT 10;

#  CREATE TABLE table_name

#  ALTER TABLE table_name RENAME TO new_table_name

#  DROP TABLE table_name

NoSQL databases like Cassandra provide support for Hadoop. Cassandra supports running Hadoop MapReduce jobs against the Cassandra cluster. With proper cluster configuration, MapReduce jobs can retrieve data from Cassandra and then output results either back into Cassandra or into a file system.

Cassandra Cluster on AWS EC2 with Cassandra 0.7.x and Ubuntu 10.04

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.

In a cluster, Cassandra nodes exchange information about one another using a mechanism called Gossip. The nodes in a cluster need to know about one another. Nodes designated as “seeds” are the centre of this communication mechanism. It’s customary to pick a small number of relatively stable nodes to serve as your seeds. Make sure that each seed also knows of at least one other. Having two seed nodes is preferred.

Let’s have a look at how we can bring up a Cassandra cluster with Cassandra 0.7.x on Ubuntu 10.04.

First of all you have to install Java (JDK). As that is out of scope for our discussion, please do it on your own, and let’s start with Cassandra.

Add the following repositories to your apt sources list:

vim /etc/apt/sources.list.d/cassandra.list

[bash]deb http://www.apache.org/dist/cassandra/debian 07x main
deb-src http://www.apache.org/dist/cassandra/debian 07x main[/bash]

Import the following keys and add them to apt-key:


gpg --keyserver keyserver.ubuntu.com --recv-keys 4BD736A82B5C1B00

gpg --export --armor 4BD736A82B5C1B00 | sudo apt-key add -

gpg --keyserver keyserver.ubuntu.com --recv-keys F758CE318D77295D

gpg --export --armor F758CE318D77295D | sudo apt-key add -



[bash]apt-get update[/bash]

Run the update and make sure that there are no errors accessing the packages.

Install Cassandra on all the nodes (machines) with which we intend to build the cluster.

[bash]apt-get install cassandra --yes[/bash]

Now edit the configuration file for Cassandra:

vim /etc/cassandra/cassandra.yaml

Here I will discuss the important directives that have to be edited for the cluster to take effect.


eg:  initial_token:  136112946768375385385349842972707284582

This parameter determines the position of each node in the Cassandra ring. The initial token for the first seed node should be '0'. Here is a simple Python script that helps to calculate the token values.


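#! /usr/bin/python
# Print evenly spaced initial_token values over the ring range 0 .. 2**127.
# Written to run on both Python 2 and Python 3.

import sys

if len(sys.argv) > 1:
    # Node count given as the first command-line argument.
    num = int(sys.argv[1])
else:
    sys.stdout.write("How many nodes are in your cluster? ")
    sys.stdout.flush()
    num = int(sys.stdin.readline())

for i in range(0, num):
    # Integer division keeps each token an exact integer.
    print('node %d: %d' % (i, (i * (2 ** 127) // num)))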


Executing this script will prompt you for the number of nodes in your cluster. It will then output the initial token for each node.

For example, for a 2-node cluster the tokens will be:

node 0: 0

node 1: 85070591730234615865843651857942052864
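The arithmetic behind these values is simply token(i) = i * 2**127 / N for node i out of N nodes. As a sketch, the same calculation can be wrapped in a function so that the values can be checked without the interactive prompt; the function name initial_tokens is made up for this example.

```python
def initial_tokens(node_count):
    """Return evenly spaced tokens: node i gets i * 2**127 // node_count."""
    ring_size = 2 ** 127
    return [i * ring_size // node_count for i in range(node_count)]


tokens = initial_tokens(2)
```

For two nodes the second token is exactly half of 2**127, which matches the value shown above. For a four-node cluster, nodes 1 to 3 get one quarter, one half and three quarters of the ring.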

auto_bootstrap: false

You can set this to false as we are just going to start the cluster for the first time.


-< ip address >

As mentioned earlier, the seeds listed here control the communication between the nodes.

You can give the IPs of the two nodes to which you assigned the first two initial tokens generated by the script above.





These seed entries should be the same on all nodes of the cluster.




You can leave both empty.

Starting Cassandra

To start Cassandra you can use either an init script or the command “cassandra”. Here I will use the second option.

As the Cassandra service was started during installation, some values will already be stored in the /var/lib/cassandra/data directory. So before starting Cassandra, follow these steps:


1)      /etc/init.d/cassandra stop

2)      rm -rf /var/lib/cassandra/data

3)      mkdir /var/lib/cassandra/data


After doing these steps on all the nodes, run the following command to start Cassandra on each node, starting from seed node 1:

[bash]# cassandra &[/bash]

After starting Cassandra on all the nodes you can check the cluster status using the following command

[bash]nodetool -h <ip of the node >  -p 8080 ring[/bash]


[bash]nodetool -h localhost -p 8080 ring[/bash]

Achieving HIPAA on AWS / EC2 with Windows Server 2008

When you are creating a HIPAA compliant system on a cloud service like AWS / EC2 / S3, you have to carefully examine the different levels of data security provided by the cloud service provider.

At a minimum level, the following should be ascertained:

i) Where is the cloud provider’s data center physically located? Some jurisdictions restrict Protected Health Information ( PHI ) from being stored on servers located outside of the country.

ii) Is the cloud provider contractually obligated to protect the customer’s data at the same level as the customer’s own internal policies?

iii) What are the cloud provider’s backup and recovery policies?

iv) What are the provider’s policies on data handling/management and access control? Do adequate controls exist to prevent impermissible copying or removal of customer data by the provider, or by unauthorized employees of the company?

v) What happens to data when it is deleted? This is very important, as customers will be storing data on virtual machines. Also, what happens to cloud hardware when the hardware is replaced?

In this blog we are only looking at the security measures to be taken by the application developer to make sure that a web application built on AWS / EC2 using Windows Server 2008 / .NET / MSSQL / IIS 7 is HIPAA compliant. The basic requirement is to encrypt all data at rest and in transit.

1 ) Encrypting Data in transit between the user ( clients ) and the server ( Webserver )


Steps used to Implement SSL on IIS are the following:

1. Open IIS Manager.
2. Click on the server name.
3. Double-click the “Server Certificates” button in the “Security” section.
4. Click on “Create Self-Signed Certificate”.
5. Enter a certificate name and click OK.
6. Select the name of the server to which the certificate was installed.
7. From the “Actions” menu (on the right), click on “Bindings.” This will open the “Site Bindings” window.
8. In the “Site Bindings” window, click “Add.” This will open the “Add Site Binding” window.
9. Under “Type” choose https. The IP address should be the IP address of the site, and the port over which traffic will be secured by SSL is usually 443. The “SSL Certificate” field should specify the certificate that was created in step 5.
10. Click “OK.” SSL is now installed.

2 ) Encrypting Data at Rest ( Document Root )

EFS with IIS

You can use EFS ( Encrypting File System ) in Windows Server 2008 to automatically encrypt your data when it is stored on the hard disk.

Encrypt a Folder:

1. Open Windows Explorer.
2. Right-click the folder that you want to encrypt , and then click Properties.
3. On the General tab, click Advanced.
4. Under Compress or Encrypt attributes, select the Encrypt contents to secure data check box and then click OK.
5. Click OK.
6. In the Confirm Attribute Changes dialog box that appears, use one of the following steps:
i) If you want to encrypt only the folder, click Apply changes to this folder only, and then click OK.
ii) If you want to encrypt the existing folder contents along with the folder, click Apply changes to this folder, subfolders and files, and then click OK.

The folder becomes an encrypted folder. New files that you create in this folder are automatically encrypted.

3 ) Encrypting MSSQL Database ( Data at Rest )

TDE ( Transparent Data Encryption )

TDE is a new feature built into MSSQL Server 2008 Enterprise Edition. Data is encrypted before it is written to disk and decrypted when it is read from disk. The “transparent” aspect of TDE is that the encryption is performed by the database engine, and SQL Server clients are completely unaware of it. No code needs to be written to perform the encryption and decryption, so there is no need to change any code ( database queries ) in the application.


i) Create a Master Key

A master key is a symmetric key that is used to create certificates and asymmetric keys. Execute the following script to create a master key:

USE master;

ii)Create Certificate

Certificates can be used to create symmetric keys for data encryption or to encrypt the data directly. Execute the following script to create a certificate:

WITH SUBJECT = 'TDE Certificate'

iii) Create a Database Encryption Key and Protect it by the Certificate

1. Go to Object Explorer in the left pane of MSSQL Server Management Studio.
2. Right-click the database that requires TDE.
3. Click Tasks and navigate to Manage Database Encryption.
4. Select the encryption algorithm (AES 128/192/256) and select the certificate you have created.
5. Then mark the check box for Set Database Encryption On.

You can query the is_encrypted column in sys.databases to determine whether TDE is enabled for a particular database.

SELECT [name], is_encrypted FROM sys.databases

4 ) Encrypting Data in transit between the Webserver and the MSSQL Database

MSSQL secure connection using SSL

i) Creating a self-signed cert using makecert
makecert -r -pe -n "CN=YOUR_SERVER_FQDN" -b 01/01/2000 -e 01/01/2036 -eku -ss my -sr localMachine -sky exchange -sp "Microsoft RSA SChannel Cryptographic Provider" -sy 12 c:\test.cer

ii) Install this cert

Copy c:\test.cer to your client machine and run c:\test.cer from a command window. Select “Install Certificate” -> click “Next” -> select “Place all certificates in the following store” -> click “Browse” -> select “Trusted Root Certification Authorities” -> select OK and Finish.

iii) Open SQL Server Configuration Manager

Expand SQL Server Network Configuration, right-click “Protocols for MSSQLSERVER”, then click “Properties”. On the “Certificate” tab select the certificate just installed. On the “Flags” tab, set “ForceEncryption” to YES.

Now SSL is ready to be used on the server. The only modification needed in the .NET code is the connection string. It will be

connectionString="Data Source=localhost;Initial Catalog=mydb;User ID=user1;Password=pas@123;Encrypt=true;TrustServerCertificate=true"
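Since the whole change lives in the connection string, it is easy to check it mechanically, for example in a deployment script. The sketch below is an illustration in Python and is not part of .NET or SQL Server; the helper names parse_connection_string and encryption_requested are made up, and the simple parser assumes that values contain no semicolons or quoted sections.

```python
def parse_connection_string(text):
    """Split 'key=value;key=value' into a dict with lower-cased keys."""
    settings = {}
    for part in text.split(";"):
        if not part.strip():
            continue
        key, _, value = part.partition("=")
        settings[key.strip().lower()] = value.strip()
    return settings


def encryption_requested(text):
    """Return True when the connection string sets Encrypt=true."""
    return parse_connection_string(text).get("encrypt", "").lower() == "true"


# The connection string from the example above.
conn = ("Data Source=localhost;Initial Catalog=mydb;User ID=user1;"
        "Password=pas@123;Encrypt=true;TrustServerCertificate=true")
```

Calling encryption_requested(conn) on the string above returns True, while a connection string without the Encrypt key returns False.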

Simulating multiple IP-Camera with h.264 stream in Amazon EC2 using Wowza

When you are setting up a Wowza-based streaming application that needs to stream and record more than a thousand cameras, you need to see at the testing stage how the system behaves when it is fed multiple H.264 camera streams. But when you have only one camera for testing purposes, you cannot overload it by taking a thousand streams from it. And if the camera gives an MPEG-4 stream, Wowza will not play it, since H.264 is the only format it supports. We worked around this in Amazon EC2: we launched a large Wowza instance from a paid AMI and installed VLC on it. Using VLC we transcoded the MPEG-4 video stream to H.264. An illustration is given below.
vlc -vvv rtsp://camera.hostname:port/stream-name --sout "#transcode{venc=x264{keyint=60,profile=baseline,level=3.0,nocabac, qpmax=36,qpmin=10,me=hex,merange=24,subme=9,qcomp=0.6},vcodec=x264,vb=128,scale=1, width=640,height=480,acodec=mp4a,channels=1,fps=15,samplerate=4750} :rtp{dst=local.amazon.ip.ofwowzainstance,port-video=10000,port-audio=10002 ,sdp=file:///wowza-installation-dir/content/vlc.sdp}" -R -d

Next we added a username and password to the file /usr/local/WowzaMediaServer/conf/admin.password so that we could access the stream manager. Then we started the Wowza server and accessed the stream manager using the URL http://public-dns-name-of.instance:8086/streammanager/

After logging in with the username and password from /usr/local/WowzaMediaServer/conf/admin.password, we clicked on “start receiving stream” under rtplive.

In the configuration window we entered rtplive/_definst_ as the Application, rtp as the MediaCaster Type and vlc.sdp as the Stream Name, and clicked “OK” to submit and start the stream. The RTSP URL to access the stream was rtsp://public-dns-name-of.instance:8086/rtplive/vlc.sdp, and this gives an H.264 stream equivalent to a stream from an H.264 camera. The advantage of this setup is that you need not overload a single IP camera by taking 1000 streams from it, as this single RTSP output can be used multiple times to simulate a multiple IP-camera system and fed as input to the Wowza streaming infrastructure we are developing in Amazon EC2.
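As a sketch of how the single output can stand in for many cameras in a test harness, the snippet below builds a list of simulated camera entries that all point at the same RTSP URL. The function name simulated_cameras and the camera naming scheme are made up for this illustration; the URL is the placeholder used above.

```python
def simulated_cameras(stream_url, count):
    """Return (camera_name, url) pairs that all point at the same stream."""
    return [("camera-%04d" % n, stream_url) for n in range(1, count + 1)]


# Placeholder URL from the text above, not a real host.
STREAM_URL = "rtsp://public-dns-name-of.instance:8086/rtplive/vlc.sdp"
cameras = simulated_cameras(STREAM_URL, 1000)
```

Each entry can then be registered with the application under test as if it were a separate H.264 camera.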