• Call: 1-73222-666-55

DevOPS on AWS Cloud using Opscode Chef

Rule the Cloud‘ with Chef
Chef is Infrastructure as Code,an API for your entire infrastructure. Assuming that you are well versed with cloud if not still you should have atleast heard of cloud computing and it is still an evolving paradigm and Cloud computing companies are the newest buzz in the IT sector. Chef is used in conjunction with cloud  from cloud providers say Amazon’s AWS. If a software thats being developed is a mix of technology which is interdependent and works in perfect harmony then why not the people behind it, this thought has led to the emergence of a new cultral trend called DevOPS. Now if you setup a number of instances on the cloud then whats next – new instances on cloud are just like bare metal server and the configuration has to be done from scratch and it would be feasible to do so manually for couple of them what if the count just got bigger say 100 live instances with different unix distros, although a script could be written but still it will not suffice,  in the long run considering management too. Here the CHEF comes into play

“chef is sysadmin robot performing configuration tasks automatically and much more quickly than a single admin could ever hope to” – Jesse Robbins, Opscode CEO.

CHEF is an open source configuration management tool using pure-Ruby,the chef domain specific language for writting system configuration related stuff (recipes and cookbook)

CHEF brings a new feel with its interesting naming conventions relating to cookery like Cookbooks (they contain codes for a software package installation and configuration in the form of Recipes), Knife (API tool), Databags (act like global variables) etc

Although there are many configuration management tools prevailing in the industry CHEF was able to secure its position in the race.

“CHEF take a step farther passes puppet and cfengine — like doing “LIVE SEARCH” within  configuration management like loadbalancer can call out to get a list of the app servers you need to balance  or an applicaton server can call out, get a reference to the master database server  etc …..the centralised chef server is indexing all the information about your infrasturctre  so that you could search in the command line using knife you know in real time so that application could lever that data..” by Seth Chisamore from the OPSCODE.

A techonology peak that isnt fluffy – Cloud
For those folks new to cloud- Its a whole bunch of activites which began as an innovation, recently given out as products and now they have become so widespread and so feature complete that they became suitable for utility services.

So if you dont want cloud in your business its like saying you dont want to use the electricity instead you built your own generator and use it according to your need. Now what do we loose if we continue with that is the competitive edge ie you get the pressure to keep your stuff upgraded inorder to find your place relative to the others in the ecosystem.

Cloud is API oriented, everything you see in cloud is ulitmately programmable.

Virtualization is the foundation of Cloud but virtualization is not Cloud by itself. It certainly enables many of the things we talk about when we talk Cloud but it is not necessary sufficient to be a cloud. Google app engine is a cloud that does not incorporate virtualization. One of the reasons that virtualization is great is because you can automate the procurement of new boxes.

A Culture thats on path to revolutionize IT – DevOPS
Devops is something that orginated in webshops predominantly and it require a kind of tools thats really not available except for home grown tools which the big webshops built over and over again. So the organisation who wanted to use devops started using the tools that enable this transition as most organisations depends on web as a source of revenue in a variety of different ways, even the enterprise desire to be as agile as the webshops. This has begun a revolution from the website permeate into the enterprise base more frequently.

Considering a real life example for Devops say facebook, the most popular social networking site here the developers/QA/operations – there is alot of communications, cross talk happening between them like the developers has to write codes, QA who has to make sure the good code goes out, the operations team has to make sure its up and running. Finally all of these has to be in records which altogether seems to be inefficient, this led to the evolving of the entire system. According to the conventional practices where the developers writes the code and throws it off to the testing. Once the testing is done then it moves to the operations etc. Contrary to that the developers , operations team are all involved in the entire lifecycle of the project as a team. This creates a symbiotic relationship. Now the operations people could understand what the engineers needs the most and the developers are able to see the value that operation people brings as they make architecture decisions.

Cloud with your DevOps offers some fantastic properties. The ability to leverage all the advancements made in software development around repeatability and testability with your infrastructure. The ability to scale up as need be real time (autoscaling) and among other things being able to harness the power of self healing systems. DevOps better with Cloud.

Configuration management say CHEF is one of the most fundamental elements allowing DevOps in the cloud. It allows you to have different VMs that have just enough OS that they can be provisioned, automatically through virtualization, and then through configuration management can be assigned to a distinct purpose within the cloud. The CM system handles turning the lightly provisioned VM into the type of server that it is intended to be.

DevOps & Chef
DevOps is nonthing but a cultural movement where everybody say the developers, QA, Operations, Testing etc get along. A project group formation with a mixed skillset that blurs the line between say a developer and sysadmin. This helps the project to meet its deadlines
and avoid unexpected situations. Cloud computing act like a catalyst to this movement. Thereby the CHEF also hops in.

Chef forms a critical layer in the Devops stack.Thanks to the concept of infrastructure as code and virtualization, we can define and build our infrastructure based on text files. Those files can be version-controlled and tested like regular code. The artifact (ami, image), can then be deployed on an infrastructure. The following image gives you an overview on the similarities.

Inadvertently the issues like “what if the application” or “what if the infrasturcture” are resolved, the fact is that application is the infrastructure and infrastructure is the application and we are here to enable business, also it helped bring peoples in the team into better alignment across the board.

Chef configuration is written in pure ruby.

Devops == Ruby

For those who think Bash is enough as a scripting language – Bash becomes a liability not an asset once your script exceeds 100 lines and a total nightmare if you need to parse or output HTML, CSV, XML, JSON, etc. A significant point to be noted is that Chef uses Ruby in its recipes unlike puppet where it uses its own configuration language that is based on Ruby although chef is heavily inspired from puppet. If you chose chef then you are effectively scripting your infrastructure with ruby.

Though Chef was only released on January 15th , 2009 it has gotten rapid adoption and gained a large number of contributors. According to the Opscode wiki there are 545 approved contributors to Opscode projects and 106 companies. Beyond that the #chef IRC channel is typically attended by over 100 users and Opscode staff, signs of a healthy, growing open source community.

Springsource division of VMware have signed on to contribute to the project. They are even being very public about it as seen in this endorsement:

“We are excited about the open source contributions the Springsource Division of VMware has made to Opscode Chef.” said Javier Soltero, CTO of Springsource Management Products at VMware. “Chef is an important tool for automating infrastructure management and we look forward to its continued growth and success.”

Moreover on my experience of using chef I really enjoyed the quick response I could get from the Opscode Support Team for all my queries and they had always being able to direct me towards a solution.

Automation Using Chef to create an Instance on Amazon Cloud Service Provider with Apache webserver configured in it.

Memo
chef-workstation – is the place where we customize our cookbooks and maintains the chef-repo
chef node – is the management node that we create using chef, it configures itself based on its runlist and downloaded cookbooks

The really cool thing with Chef is that you can rerun cookbooks against a node and it will not do anything it has already done i.e it will not change the end result on the target node as defined by the recipes being run against it. So you will always get the same outcome no matter what state the node and actions will not be taken if already done (and conversely run if detected it has not been run).  When reading about Chef you will see this described as being idempotent (There I’ve saved you looking it up).

Prerequisites – an AWS account, EC2 API configured, OS – Ubuntu.

1. Sign up an account at http://www.opscode.com/hosted-chef/# , Here we use the OHC (opscode hosted chef) where we get to create upto 5 nodes for free!!

2.Verify your opscode account.

3.Download the files

Create an organization in the Console page at www.manage.opscode.com, and then download the following files:

  • Your Organization validation key. This is used to automatically register new Chef Clients (like servers you manage).
  • The Knife configuration file.
  • Your User key. This is used to authenticate your user with Hosted Chef.
  • Edit knife.rb  to add aws access key and secret access key
  • knife[:aws_access_key_id]     = “Your AWS Access Key”
  • knife[:aws_secret_access_key] = “Your AWS Secret Access Key”

At this stage I have a chef ready user environment, an OpsCode organisation set up and now I want to start by spinning up an ec2 instance. I will not be going into any depth regarding  the ec2 specifics as that would make this post far too long.

4.Setting Up chef-Workstation

Install Ruby and Development Tools

#sudo apt-get update
#sudo apt-get install ruby ruby-dev libopenssl-ruby rdoc ri irb build-essential wget ssl-cert git-core
#sudo gem update –system

Install RubyGems

#cd /tmp
#wget http://production.cf.rubygems.org/rubygems/rubygems-1.8.10.tgz
#tar zxf rubygems-1.8.10.tgz
#cd rubygems-1.8.10
#sudo ruby setup.rb –no-format-executable

Install Chef

#sudo gem install chef

5.To verify chef installation

#chef-client -v

6.Build the chef repository

#cd ~
#git clone https://github.com/opscode/chef-repo.git

Knife reads configuration files in .chef. so we need to create those as well

#mkdir -p ~/chef-repo/.chef

Copy the keys and knife configuration you downloaded earlier into this directory:

#cp USERNAME.pem ~/chef-repo/.chef
#cp ORGANIZATION-validator.pem ~/chef-repo/.chef
#cp knife.rb ~/chef-repo/.chef

Run the following command to confirm knife is working with the Hosted Chef API.

#cd ~/chef-repo
#knife client list

output : “ORGANIZATION-validator”

7.Now i need to download the apache2 cookbook on to my workstation, customize if required and then upload it to my account on the opscode platform

#knife cookbook site install apache2

this will notify git and also pulls down the desired cookbook

8.Upload the cookbook using the following command

#knife cookbook upload apache2

9.Enter the following command, sit back and  enjoy the show!!!

#knife ec2 server create -G default -I ami-1212ef7b -f m1.small -S <aws ssh key id> -i <ssh identity file> -x root -r ‘recipe[apache2]‘


Before proceeding it would probably be a good idea to take time out and read the Opscode  Chef Recipe wiki which has a nice clear explanation on cookbook name spaces. Also remind yourself of the components that make up a cookbook it’s worth noting that recipes manage resources and those resources will be executed in the order they occur.

Splunk on AWS EC2 CloudSplunk

Whats is Splunk ?

Splunk is a log, monitoring and reporting tool for IT system administrators with search capabilities. It crawls logs, metrics, and other data from applications, servers and network devices and indexes it in a searchable repository from which it can generate graphs, SQL reports and alerts. Splunk can be easily set on the AWS machine archival storage as EBS volumes and periodically syncing the archive from EBS to S3 Bucket or taking EBS snapshots for backup of the logs for the future use.

Generally its hard to track the logs from the server. We do have different monitoring tools such as Nagios, Zabix, here is a new tool named Splunk, which is a kind of bigger solution for providing monitor the visibility inside all the dynamic and complex environment. For example you have an application seems to be very slow, its not because the app have some issue , its because of the lack of free memory on the server. Such kind of details can be obtained from inside the splunk server.

Why do we go for Splunk ?

In auto-scaled where the instances are running under load-balancer scenarios, the servers gets scale up and down, and also there are some situations like some instance gets terminated without any alert. During this situation it will be good to get the login sessions during the server-down state, also the server access logs, so that we can track the reason for the server down. Managing logs on server is really hard, and also the logs will be available on different location. Inorder to address this problem, here we have setup Splunk to listen on a TCP port for any network traffic passes all others servers log to this host, then you will have a centralized, indexed log repository for all of your services.

Here i will guide you on deploying the splunk on the AWS EC2 and configuring splunk forwarder on the remote machine. Splunk is very flexible and is easy to install on any servers. You can select the appropriate hardware capacity planning for your Splunk deployment from here.

Once you have installed the Splunk server , follow the steps given below to start the app:

Now start the Splunk using the command given below:
[NOTE: The here Splunk is installed in /opt location]

/opt/splunk/bin/splunk start

Now you can access the Splunk web UI using the URL given below:

http://domain.com:8000

The Splunk need to be configure in such a way that it should be able to receive the data from the remote machine. For this you will need is to follow the following steps:

1. Login to Splunk WebUI eg. http://10.10.10.35:8000
2. Go to Manager –> Forwarding and receiving –> Receive data
3. Click on New Button and add default port i.e. 9997
4. Click on save button to save the settings.
NOTE: Make sure that the port is opened for the server to accept the data from the remote machine.

Next you will need to install Splunk forwarder on the remote machine. Once you have installed the forwarder start the app as shown below:

/opt/splunk/bin/splunk start

Then enable the forwarder using the command and restart the Splunk app.

./splunk enable app SplunkLightForwarder -auth
Splunk username: admin
Password: changeme
./splunk add forward-server 10.10.10.35:9997 -auth admin
./splunk restart

Now after few minutes you can see the Splunk dashboard indexes all it logs on the realtime dashboard.

Generally in Splunk deployment , we have a deployment server which pushes the configuration on to the deployment client, grouped into server class. The Splunk deployment server is a centralized manager which manages several splunk instances known as deployment client. The deployment client is the Splunk instance installed on the remote machine and parse the log on to the Splunk deployment server.

 

 

The Splunk generally collects the data from the remote machine which contain  the machine-to-machine and also from human-to-machine interaction. With these collected data it indexes to the engine and generates the reports and also drives alert. The email alert can be configured for the specific conditions like. For example we can configure the alert mail when it finds any log containing the error messages. The Splunk will access all these large volume of data and also provides the visibility and intelligence to IT and data ware house. And also will be able to perform the real-time and historic analysis of all the bulk data from the remote machine.

Its easy to use, also to install and also easier deploy method make this application different from others. The Splunk will be very useful for the developer team for finding and fixing the bugs and also helps to provide real time insights.

Mapreduce using hadoop + pig/hive on AWS EC2 hadoop cluster

This article discuss about running mapreduce jobs using the apache tools called pig and hive.Before we can process the data we need to upload the files to be processed to HDFS/S3.  We recommend uploading to hdfs and keeping the important files in s3 for backup is a better practice. s3 is easily accessible from commandline using tools like s3cmd. HDFS is a failover cluster filesystem which provides enough protection to your data over instance failures.

Mapreduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets. We can specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The main steps hadoop takes to run a job are

  1. The client, which submits the MapReduce job.
  2. The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
  3. The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
  4. The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.

Hadoop Map/Reduce is very powerful, but

o   Requires a Java Programmer.

o   Harder to write and also time consuming.

o   Difficult to update frequently.

A solution is to Run jobs using pig(Piglatin)/hive(HiveQL).

Pig

• An engine for executing programs on top of Hadoop

• It provides a language, Pig Latin, to specify these programs

Pig has Two main parts:

– A high level language to express data analysis

– Compiler to generate mapreduce programs (which can run on top of Hadoop)

Pig Latin is the name of the language with which Pig scripts are written. Pig also provides an interactive shell for executing simple commands, called Grunt. Pig Latin is a high level language. Pig runs on top of Hadoop. It collect the data for processing from Hadoop HDFS filesystem and Submit the jobs to the Hadoop mapreduce system.

A sample mapreduce job (like a Hello World program) using pig is given below

It is assumed that you are on one of the machines which is a part of a hadoop cluster having NameNode/DataNode as well as JobTracker/TaskTracker setup.

We will be executing piglatin commands using grunt shell. Switch to hadoop user first .

Consider we have a file ‘users’ on our local filesystem which contain data to be processed.First we have to upload it to hdfs. Then

# pig -x mapreduce

this command will take you to grunt shell. Pig Latin statements are generally

organized in the following manner:

A LOAD statement reads data from the file system.Then we process the data.And writes output to the file system using STORE statement. A DUMP statement displays output to the screen.

grunt> Users = load ‘users’ as (name, age);

grunt> Fltrd = filter Users by age >= 18 and age <= 25;

grunt> Pages = load ‘pages’ as (user, url);

grunt> Jnd = join Fltrd by name, Pages by user;

grunt> Grpd = group Jnd by url;

grunt> Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;

grunt> Srtd = order Smmd by clicks desc;

grunt> Top5 = limit Srtd 5;

grunt> store Top5 into ‘top5sites’;

We can also view the progress of the job through the web interface http://<ipaddress of jobtracker machine>:50030.

Tools like PigPen (an eclipse plugin) are available  that helps us create pig-scripts, test them using the example generator and then submit them to a hadoop cluster.

There is another tool called oozie – Oozie is a server based Workflow Engine specialized in running workflow jobs with actions that run Hadoop Map/Reduce and Pig jobs.

Pig tasks can be modeled as a workflow in oozie. These are deployed to the Oozie server using a command line utility. Once deployed, the workflows can be started and manipulated as necessary using the same utility. Once the workflow is started Oozie will run through each flow.. The web console for Oozie server can be used to monitor the progress of various workflow jobs being managed by the server.

Hive

Pig, was causing some slowdowns at Facebook company as it needed training to bring business intelligence users up to speed. So the development team decided to write Hive which has an SQL like syntax.

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop. It provides tools for querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called HiveQL, that enables users familiar with SQL to query the data. Also it allows custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Some of the queries in HiveQL are given below, which is very similar to the SQL.

# show tables;

# describe <tablename>;

# SELECT * FROM <tablename> LIMIT 10;

#  CREATE TABLE table_name

#  ALTER TABLE table_name RENAME TO new_table_name

#  DROP TABLE table_name

NoSQL databases like Cassandra provide support for hadoop. Cassandra supports running Hadoop MapReduce jobs against the Cassandra cluster. With proper cluster configuration, MapReduce jobs can retrieve data from Cassandra and then output results either back into Cassandra, or into a file system.

HADOOP Cluster on AWS EC2 with hadoop-0.20 and ubuntu-10.04

Let’s start with a small introduction- what is hadoop ?. Hadoop is an open-source project administered by the Apache Software Foundation. Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.

Dealing with big data requires two things:

  • Inexpensive, reliable storage; and
  • New tools for analyzing unstructured and structured data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers.If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers.

The main services running in a hadoop cluster will be

1)namenode

2)jobtracker

3)secondarynamenode

These three will be running only on a single node(machine) ; that machine is the central machine which controls the cluster.

4)datanode

5)tasktracker

These two services will be running on all other nodes in the cluster.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

Above the file systems comes the MapReduce  engine, which consists of one Job Tracker, to which client applications submit MapReduce jobs. The Job Tracker pushes work out to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible.

The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads current name-node image and edits log files, joins them into new image and uploads the new image back to the (primary and the only) name-node.

Now Let us have a look at how to build a hadoop cluster using Cloudera hadoop-0.20 on ubuntu-10.04

You should install sun –jdk  first. Then add the following repositories to the apt sources list.

vim /etc/apt/sources.list.d/cloudera.list


deb http://archive.cloudera.com/debian lucid-cdh3u0 contrib

deb-src http://archive.cloudera.com/debian lucid-cdh3u0 contrib

Import key

curl -s http://archive.cloudera.com/debian/archive.key | apt-key add -

Then run

apt-get update

For Namenode/Jobtracker ( These two services should run only on a single central machine in the cluster)


apt-get install hadoop –yes

apt-get install hadoop-0.20-namenode

apt-get install hadoop-0.20-jobtracker

apt-get install hadoop-0.20-secondarynamenode

Configuration

vim /etc/hadoop/conf/hadoop-env.sh

Append these


export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.24/   ( your java home comes here )

export HADOOP_CONF_DIR=/etc/hadoop/conf

export HADOOP_HOME=/usr/lib/hadoop-0.20

export HADOOP_NAMENODE_USER=hdfs

export HADOOP_SECONDARYNAMENODE_USER=hdfs

export HADOOP_DATANODE_USER=hdfs

export HADOOP_JOBTRACKER_USER=mapred

export HADOOP_TASKTRACKER_USER=mapred

export HADOOP_IDENT_STRING=hadoop

vim /etc/hadoop/conf/core-site.xml


<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://< ip address of this machine >:8020</value>

</property>

</configuration>

vim /etc/hadoop/conf/hdfs-site.xml


<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>dfs.name.dir</name>

<value>/var/lib/hadoop-0.20/name</value>

</property>

<property>

<name>dfs.data.dir</name>

<value>/var/lib/hadoop-0.20/data</value>

</property>

<property>

<name>dfs.replication</name>

<value>2</value>

</property>

</configuration>

vim /etc/hadoop/conf/mapred-site.xml


<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>mapred.job.tracker</name>

<value>< ip address of this machine >:8021</value>

</property>

<property>

<name>mapred.system.dir</name>

<value>/var/lib/hadoop-0.20/system</value>

</property>

<property>

<name>mapred.local.dir</name>

<value>/var/lib/hadoop-0.20/mapred</value>

</property>

</configuration>

——————————————————————————————————————————————


mkdir  / var/lib/hadoop-0.20/name

mkdir  / var/lib/hadoop-0.20/data

mkdir  / var/lib/hadoop-0.20/system

mkdir  / var/lib/hadoop-0.20/mapred

chown -R hdfs /var/lib/hadoop-0.20/name

chown -R hdfs /var/lib/hadoop-0.20/data

chown -R mapred /var/lib/hadoop-0.20/mapred

Now format NameNode

yes Y | /usr/bin/hadoop namenode –format

Start namenode

/etc/init.d/hadoop-0.20-namenode start

Check the log Files for error:

less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-<ip>.log

Also you can check whether the Namenode process is up or not using the command

# jps

Start the SecondaryNamenode

/etc/init.d/hadoop-0.20-secondarynamenode start

Log: less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-<ip>.log


sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-0.20/system

sudo -u hdfs hadoop fs -chown mapred /var/lib/hadoop-0.20/system

Now Start the JobTracker

/etc/init.d/hadoop-0.20-jobtracker start

Log : less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-ip-10-108-39-34.log

Now  jps  command will show the three processes up

# jps

19233 JobTracker

18994 SecondaryNameNode

18871 NameNode

For Datanode/Tasktracker ( These two services should be running on all the other machines in the cluster )


apt-get install hadoop-0.20-datanode

apt-get install hadoop-0.20-tasktracker

Configuration

vim /etc/hadoop/conf/core-site.xml


<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

&nbsp;

<!-- Put site-specific property overrides in this file. -->

&nbsp;

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://< ip address of the namenode >:8020</value>

</property>

</configuration>

vim /etc/hadoop/conf/hdfs-site.xml


<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

&nbsp;

<!-- Put site-specific property overrides in this file. -->

&nbsp;

<configuration>

<property>

<name>dfs.name.dir</name>

<value>/var/lib/hadoop-0.20/name</value>

</property>

<property>

<name>dfs.data.dir</name>

<value>/var/lib/hadoop-0.20/data</value>

</property>

<property>

<name>dfs.replication</name>

<value>2</value>

</property>

</configuration>

vim /etc/hadoop/conf/mapred-site.xml


<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

&nbsp;

<!-- Put site-specific property overrides in this file. -->

&nbsp;

<configuration>

<property>

<name>mapred.job.tracker</name>

<value>< ip address of jobtracker  >:8021</value>

</property>

<property>

<name>mapred.system.dir</name>

<value>/var/lib/hadoop-0.20/system</value>

</property>

<property>

<name>mapred.local.dir</name>

<value>/var/lib/hadoop-0.20/mapred</value>

</property>

</configuration>

———————————————————————————————————————————————


mkdir  /var/lib/hadoop-0.20/data/

chown -R hdfs /var/lib/hadoop-0.20/data

mkdir /var/lib/hadoop-0.20/mapred

chown -R mapred /var/lib/hadoop-0.20/mapred

Start the DataNode

/etc/init.d/hadoop-0.20-datanode start

Log : less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-datanode-<ip>.log

Start the TaskTracker

/etc/init.d/hadoop-0.20-tasktracker start

Log: less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-<ip>.log

You can now check the interface

http://< namenode-ip >:50070   – for HDFS overview

and

http://< jobtracker –ip>:50030  – for Mapreduce overview

Cassandra Cluster on AWS EC2 with Cassandra 7.x and ubuntu 10.04

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together  Dynamo’s fully distributed design  and Bigtable’s ColumnFamily-based data model.

In a cluster, Cassandra nodes exchange information about one another using a mechanism called Gossip. The nodes in a cluster needs to know one another.  Nodes named “seed”s are the centre of this communication mechanism. It’s customary to pick a small number of relatively stable nodes to serve as your seeds. Do make sure that each seed also knows of at least one other. Having two nodes is what is preferred.

Lets have a look at how we can bring a Cassandra cluster up with Cassandra 7.x on ubuntu 10.04

First of all you have to install the java/jdk .  As that is out of scope for our discussion please do it on your own and let’s start with cassandra.

Add the following repositories to your apt sources list

vim /etc/apt/sources.list.d/cassandra.list

deb http://www.apache.org/dist/cassandra/debian 07x main
deb-src http://www.apache.org/dist/cassandra/debian 07x main

Import the following keys and add it to apt-key


gpg --keyserver keyserver.ubuntu.com --recv-keys 4BD736A82B5C1B00

gpg --export --armor 4BD736A82B5C1B00 | sudo apt-key add -

gpg --keyserver keyserver.ubuntu.com --recv-keys F758CE318D77295D

gpg --export --armor F758CE318D77295D | sudo apt-key add –

Execute

apt-get update

and make sure that no error is there with accessing the packages.

Installing cassandra on all nodes(machines) with  which we intend to build the cluster.

apt-get install cassandra  --yes

Now edit the configuration file for Cassandra

vim /etc/cassandra/cassandra.yaml

Here  I will discuss the important directives that has to be edited for the cluster to take effect

initial_token:

eg:  initial_token:  136112946768375385385349842972707284582

This parameter determines the position of each node in the Cassandra ring. Initial token for the first seed node should be ‘0’.Here is a simple Python script that helps to calculate the token values.


#! /usr/bin/python

import sys

if (len(sys.argv) > 1):

num=int(sys.argv[1])

else:

num=int(raw_input("How many nodes are in your cluster? "))

for i in range(0, num):

print 'node %d: %d' % (i, (i*(2**127)/num))

executing this script will prompt you for the no. of nodes in your cluster. Then it will output the initial tokens for each node.

For eg: Consider a 2 node cluster, the tokens will be

node 0: 0

node 1: 85070591730234615865843651857942052864

auto_bootstrap: false

You can set this to false as we are just going to start the cluster for the first time.

seeds:

-< ip address >

As I told you earlier, the seeds mentioned here will control the communication between the nodes.

You can give the ips of the two nodes here  for which you assigned the first two initial tokens generated by the script above.

Example:

Seeds:

-192.168.1.10

-192.168.1.13

This seed entries should be the same on all nodes of the cluster.

listen_address:

&

rpc_address:

You can leave both empty.

Starting  the Cassandra

For starting Cassandra you can either use an init script/ or the command “cassandra”. Here I will use the second option.

As Cassandra service was started during the installation some values will be stored in /var/lib/cassandra/data directory. So Before starting Cassandra follow these steps.


1)      /etc/init.d/cassandra stop

2)      rm –rf  /var/lib/cassandra/data

3)      mkdir /var/lib/cassandra/data

After doing these steps on all the nodes please run the following  command to start Cassandra on each node starting from the seed node 1

# cassandra &

After starting Cassandra on all the nodes you can check the cluster status using the following command

nodetool -h <ip of the node >  -p 8080 ring

or

nodetool -h localhost -p 8080 ring

Achieving HIPAA on AWS / EC2 with Windows Server 2008

When you are creating a HIPAA compliant system on cloud service like AWS / EC2 / S3, you have to carefully examine the different levels of data security provided by the Cloud Service provider

At a minimum level, the following should be ascertained:

i) Where is the Cloud provider’s data center physically located. In some countries, HIPAA restricts Protected Health Information ( PHI ) to be stored on servers located outside of the country.

ii) Whether the cloud provider contractually obligated to protect the customer’s data at the same level as the customer’s own internal policies?

iii) Cloud provider’s Backup and Recovery policies

iv) What are the provider’s policies on data handling/management and access control? Do adequate controls exist to prevent impermissible copying or removal of customer data by the provider, or by unauthorized employees of the company?

v) What happens to data when it is deleted? This is very important as customers will be storing data on virtual Machines. Also What happens to cloud hardware when the hardware is replaced?

In this blog we are only looking at the different security levels to be taken by the application developer to make sure that a web application built on AWS / EC2 using Windows Server 2008 / .NET / MSSQL / IIS 7 / is HIPAA compliant. The basic requirement is to encrypt all the data at rest and transit

1. Encrypting Data in transit between the user ( clients ) and the server ( Webserver )

SSL over HTTP ( HTTPS )

Steps used to Implement SSL on IIS are the following:

1.Open IIS Manager.
2.Click on the server name.
3.Double-click the "Server Certificates" button in the "Security" section
4.Click on self-signed certificate
5.Enter certificate name and click ok
6. Select the name of the server to which the certificate was installed. 

7. From the "Actions" menu (on the right), click on "Bindings." This will open the "Site Bindings" window

8. In the "Site Bindings" window, click "Add" This will open the "Add Site Binding" window

9. Under "Type" choose https. The IP address should be the IP address of the site , and the port over which traffic will be secured by SSL is usually 443. The "SSL Certificate" field should specify the certificate that was installed in step 5.

10.Click "OK." . SSL is now installed .
 

2 ) Encrypting Data at Rest ( Document Root )

EFS with IIS

You can use EFS ( Encrypted File System ) in Windows 2008 Server to automatically encrypt your data when it is stored on the hard disk.

Encrypt a Folder:

 1. Open Windows Explorer.
 2. Right-click the folder that you want to encrypt , and then click Properties.
 3. On the General tab, click Advanced.
 4. Under Compress or Encrypt attributes, select the Encrypt contents to secure data check  box and then click OK.
 5. Click OK.
 6. In the Confirm Attribute Changes dialog box that appears, use one of the following steps:
 i) If you want to encrypt only the folder, click Apply changes to this folder only, and then click OK.
 ii) If you want to encrypt the existing folder contents along with the folder, click Apply changes to this folder, subfolders and files, and then click OK.
 

The folder becomes an encrypted folder. New files that you create in this folder are automatically encrypted


3 ) Encrypting MSSQL Database ( Data at Rest )

TDE ( Transparent Data Encryption )

TDE is a new feature inbuilt in MSSQL Server 2008 Enterprise Edition . Data is encrypted before it is written to disk; data is decrypted when it is read from disk. The “transparent” aspect of TDE is that the encryption is performed by the database engine and SQL Server clients are completely unaware of it. There is absolutely no code that needs to be written to perform the encryption and decryption .So there is no need for changing any code ( Database Queries ) in the Application .

STEPS

i) Create a Master Key

A master key is a symmetric key that is used to create certificates and asymmetric keys. Execute the following script to create a master key:

USE master;
CREATE MASTER KEY
ENCRYPTION BY PASSWORD = 'Pass@word1';
GO
 

ii)Create Certificate

Certificates can be used to create symmetric keys for data encryption or to encrypt the data directly. Execute the following script to create a certificate:

CREATE CERTIFICATE TDECert
WITH SUBJECT = 'TDE Certificate'
GO
 

iii) Create a Database Encryption Key and Protect it by the Certificate

1.Go to object explorer in the left pane of the MSSQL SERVER Management Studio
2.Right Click on the database on which TDE Requires
3.Click Tasks and Navigate to Manage Database Encryption
4. Select the encrytion algorithm (AES 128/192/256) and select the certificate you have created
5.Then Mark the check Box for Set Database Encryption On
 

You can query the is_encrypted column in sys.databases to determine whether TDE is enabled for a particular database.

SELECT [name], is_encrypted FROM sys.databases
GO
 


4 ) Encrypting Data in transit between the Webserver and the MSSQL Database

MSSQL secure connection using SSL

i) Creating a self-singned cert using makecert

makecert -r -pe -n "CN=YOUR_SERVER_FQDN" -b 01/01/2000 -e 01/01/2036 -eku 1.3.6.1.5.5.7.3.1 -ss my -sr localMachine -sky exchange -sp "Microsoft RSA SChannel Cryptographic Provider" -sy 12 c:\test.cer
 

ii) Install this cert

Copy c:\test.cer into your client machine, run c:\test.cer from command window, select "Install Certificate". -&gt; click "Next" -&gt; select "Place all certificates in the following store" --&gt; click "Browser" -&gt; select "Trusted Root Certification Authorities" -&gt; select OK and Finish
 

iii) Open SQL Server Configuration Manager

Expand SQL Server Network Configuration, right-click "Protocols for MSSQLSERVER" then click "properties". On the "Certificate" tab select the certificate just installed . On the "Flags" tab, set "ForceEncryption" YES.
 

Now SSL is ready to be used on the server. The only modification needed in the .NET code is connection string. It will be

connectionString="Data Source=localhost;Initial Catalog=mydb;User ID=user1;Password=pas@123;Encrypt=true;TrustServerCertificate=true"
 

SSL for Tomcat on AWS EC2

To launch an AWS/EC2 instance, at first setting up a security group to specify what network traffic is allowed to reach the instance. Then select an AMI and launch an instance from it. And create a volume in the same zone of the instance and attach with it. Format the device and mount it to a directory. After that follow the steps to create SSL for Tomcat:

1. For the tomcat we need java, so create a directory to save the Java Binary file.

 mkdir /usr/java
cd /usr/java 

2. Download jdk binary file (jdk-x-linux-ix.bin) here
Use URL http://www.oracle.com/technetwork/java/archive-139210.html

3. Execute the Binary file

 /usr/java/jdk-x-linux-ix.bin 

Now we have the Java in our device. Then Download the Tomcat and install it followed by the instructions:-

1. Create a directory to save the tomcat

 mkdir /usr/tomcat
cd /usr/tomcat 

2. Download tomcat source file (apache-tomcat-x.tar.gz) here
Use URL http://apache.hoxt.com/tomcat/tomcat-6/v6.0.32/bin/

3. Extract that file

 tar -zxvf apache-tomcat-x.tar.gz 

4. Edit the catalina.sh file

 vim /usr/tomcat/apache-tomcat-x/bin/catalina.sh 

#** Add at the top **
JAVA_HOME=/usr/java/jdk1.x.x_x

save and exit
5. Start the tomcat

 /usr/tomcat/apache-tomcat-x/bin/startup.sh 

6. We can see the logs by using the given command

tail -f /usr/tomcat/apache-tomcat-x/logs/catalina.out 

7. Take the browser and enter the URL http://localhost
Now we can see the tomcat index page

8. To stop the tomcat

/usr/tomcat/apache-tomcat-x/bin/shutdown.sh 

Now configure the SSL Certificate for tomcat. When you choose to activate SSL on your web server you will be prompted to complete a number of questions about the identity of your website and your company. Your web server then creates two cryptographic keys – a Private Key and a Public Key. The Public Key does not need to be secret and is placed into a Certificate Signing Request (CSR) – a data file also containing your details.

Create a self signed certificate authority (CA) and keystore.

1. Make a directory to hold the certs and keystore. This might be something like:

 mkdir /usr/tomcat/ssl
cd /usr/tomcat/ssl 

2. Generate a private key for the server and remember it for the next steps

openssl genrsa -des3 -out server.key 1024

Generating RSA private key, 1024 bit long modulus
…………………..++++++
…++++++
e is 65537 (0×10001)
Enter pass phrase for server.key:
Verifying – Enter pass phrase for server.key:

3. Generate a CSR (Certificate Signing Request). Give the data after executing this command

openssl req -new -key server.key -out server.csr

Enter pass phrase for server.key:
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter ‘.’, the field will be left blank.
—–
Country Name (2 letter code) [GB]:
State or Province Name (full name) [Berkshire]:
Locality Name (eg, city) [Newbury]:
Organization Name (eg, company) [My Company Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (eg, your name or your server’s hostname) []:
Email Address []:

Please enter the following ‘extra’ attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:

4. Remove the passphrasse from the key

cp server.key server.key.org
openssl rsa -in server.key.org -out server.key

Enter pass phrase for server.key.org:
writing RSA key

5. Generate the self signed certificate

openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt

Signature ok
subject=/C=GB/ST=Berkshire/L=Newbury/O=My Company Ltd
Getting Private key

You should then submit the CSR. During the SSL Certificate application process, the Certification Authority will validate your details and issue an SSL Certificate containing your details and allowing you to use SSL. Typically an SSL Certificate will contain your domain name, your company name, your address, your city, your state and your country. It will also contain the expiration date of the Certificate and details of the Certification Authority responsible for the issuance of the Certificate.

Create a certificate for tomcat and add both to the keystore

1. Change the path to ssl

cd /usr/tomcat/ssl

2. Create a keypair for ‘tomcat’

keytool -genkey -alias tom  -keyalg RSA -keystore tom.ks

Enter keystore password:
Re-enter new password:
What is your first and last name?
[Unknown]:
What is the name of your organizational unit?
[Unknown]:
What is the name of your organization?
[Unknown]:
What is the name of your City or Locality?
[Unknown]:
What is the name of your State or Province?
[Unknown]:
What is the two-letter country code for this unit?
[Unknown]:

Is CN=Unknown, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown correct?
[no]: yes

Enter key password for <tom>
(RETURN if same as keystore password):
Re-enter new password:

3. Generate a CSR (Certificate Signing Request) for tomcat

keytool -keystore tom.ks -alias tom -certreq -file tom.csr

Enter keystore password:

4. create unique serial number

echo 02 > serial.txt

5. Sign the tomcat CSR

openssl x509 -CA server.crt -CAkey server.key -CAserial serial.txt -req -in tom.csr -out tom.cer -days 365

Signature ok
subject=/C=Unknown/ST=Unknown/L=Unknown/O=Unknown/OU=Unknown/CN=Unknown
Getting CA Private Key

6. Import the server CA certificate into the keystore

keytool -import -alias serverCA -file server.crt -keystore tom.ks

Enter keystore password:
Owner: O=My Company Ltd, L=Newbury, ST=Berkshire, C=GB
Issuer: O=My Company Ltd, L=Newbury, ST=Berkshire, C=GB
Serial number: ee13c90cb351968b
Valid from: Thu May 19 02:12:51 EDT 2011 until: Fri May 18 02:12:51 EDT 2012
Certificate fingerprints:
MD5: EE:F0:69:01:4D:D2:DA:A2:4E:88:EF:DC:A8:3F:A9:00
SHA1: 47:97:72:EF:30:02:F7:82:BE:CD:CA:F5:CE:4E:ED:89:73:23:4E:24
Signature algorithm name: SHA1withRSA
Version: 1
Trust this certificate? [no]: yes
Certificate was added to keystore

7. Add the tomcat certificate to the keystore

keytool -import -alias tom -file tom.cer -keystore tom.ks

Enter keystore password:
Certificate reply was installed in keystore

To configure a secure (SSL) HTTP connector for Tomcat, verify that it is activated in the $TOMCAT_HOME/conf/server.xml file. Edit this file and add the following lines.

Tomcat configuration

1. Edit the given portion of tomcat configuretion file and change the port as 80

vim /usr/tomcat/apache-tomcat-6.0.13/conf/server.xml
“””””” <Connector port="8080" protocol="HTTP/1.1"
                    connectionTimeout="20000"
                      redirectPort="8443" /> “”””””

              <Connector port="80" protocol="HTTP/1.1"
                    connectionTimeout="20000"
                      redirectPort="8443" />

2. Add the given portion to server.xml and give your password in the password portion


<Connector port="443" protocol="HTTP/1.1" SSLEnabled="true"
      maxThreads="150" scheme="https" secure="true"
                 keystoreFile="tom.ks"
                keystorePass="password"
           clientAuth="false" sslProtocol="TLS" />

When you start the Tomcat Your web server will match your issued SSL Certificate to your Private Key. Your web server will then be able to establish an encrypted link between the website and your customer’s web browser.

Start the tomcat with SSL Certificate

1. Restart tomcat

/usr/tomcat/apache-tomcat-6.0.13/bin/shutdown.sh
/usr/tomcat/apache-tomcat-6.0.13/bin/startup.sh

2. Go to https://Public DNS name:443/

Then your browser shows a security issue. Click the Approve button. Then you can enter to the tomcat with your certificate. When a browser connects to a secure site it will retrieve the site’s SSL Certificate and check that it has not expired, it has been issued by a Certification Authority the browser trusts, and that it is being used by the website for which it has been issued. If it fails on any one of these checks the browser will display a warning to the end user letting them know that the site is not secured by SSL.

You are Done !!!