• Call: +1 (858) 429-9131

Posts Tagged ‘Search Engines’

It’s Phab! That makes your life easier

We have been using plenty of different tools for tracking bugs/product management/project management/to do lists/code review; such as ClearCase, ClearQuest, Bugzilla, Github, Asana, Pivotal Tracker, Google Drive etc. We found Phabricator as a “Too Good To Be True” software engineering web application platform originally developed at Facebook. It has code review, wiki, repository browsing,tickets and a lot more to make Phab more fabulous.

Phabricator is an open source collaboration of web applications which help software companies to build better software. It is a suite of applications. Following are the most important tools in phabricator :
Maniphest – Bug tracker/task management tracker
Diffusion- source code browser
Differential – code review tool that allows developers to easily submit reviews to one another via command line tool when they check in code using Git or Subversion
Phriction – wiki tool

How to setup and configure the code review and project management tool – Phabricator

Installation

Server – 4GB Digital ocean droplet
OS – Ubuntu 14.04

1. Install dependencies

apt-get install mysql-server apache2 dpkg-dev php5 php5-mysql php5-gd php5-dev php5-curl php-apc php5-cli php5-json

2. Get code

#cd /var/www/codereview

git clone https://github.com/phacility/libphutil.git

git clone https://github.com/phacility/arcanist.git

git clone https://github.com/phacility/arcanist.git

3. Configure virtual host entry

#add below lines

#######################################################################

DocumentRoot /var/www/codereview/webroot
RewriteEngine on
RewriteRule ^/rsrc/(.*) – [L,QSA]
RewriteRule ^/favicon.ico – [L,QSA]
RewriteRule ^(.*)$ /index.php?__path__=$1 [B,L,QSA]
Order allow,deny
allow from all
#######################################################################
4. Enable the virtual host entry for phabricator.

# a2ensite phabricator.conf
# service apache2 reload

5. Configure the MySQL database configuration for phabricator

– create database
# /var/www/codereview/phabricator/bin/config set mysql.user mysql_username
# /var/www/codereview/phabricator/bin/config get mysql.pass mysql_password
# /var/www/codereview/phabricator/bin/config get mysql.host mysql_host
# /var/www/codereview/phabricator/bin/config storage upgrade
-tweak mysql

Open /etc/mysql/my.cnf and add the following line under [mysqld] section:

sql-mode = STRICT_ALL_TABLES

#service mysql restart

Set the Base URI of Phabricator install

# /var/www/codereview/phabricator/bin/config set phabricator.base-uri

(eg: phabricator.your-domain.com)

Configure Outbound Email – External SMTP (Google Apps)

Set the following configuration keys using /var/www/codereview/phabricator/bin/config set value

– metamta.mail-adapter -> PhabricatorMailImplementationPHPMailerAdapter
– phpmailer.mailer -> smtp
– phpmailer.smtp-host -> smtp.gmail.com
– phpmailer.smtp-port -> 465
– phpmailer.smtp-user -> Your Google apps mail id
– phpmailer.smtp-password -> set to your password used for authentication
– phpmailer.smtp-protocol -> ssl

Start the phabricator daemons

You can start all the phabricator deamons using the script
# /var/www/codereview/phabricator/bin/phd start
To start daemons at the boot time, add this entry to the file /etc/rc.local

/var/www/codereview/phabricator/bin/phd start

Diffusion repository hosting with git

1. Install git

#apt-get install git

2. Create a local repository directory:

#mkdir -p /data/repo

3. Edit the repository.default-local-path key to the new local repository directory.

Go to the Config -> Repositories -> repository.default-local-path

4. Configure System user accounts

Phabricator uses as many as three user accounts. These are system user accounts on the machine Phabricator runs on, not Phabricator user accounts.

* daemon-user – The user the daemons run as

We will configure the root user to run the daemons

* www-user – The user the web server run as

We will use www-data to be the web user

* vcs-user – The user that users will connect over SSH as

We will configure git user to the vcs-user

To enable SSH access to repositories, edit /etc/sudoers file using visudo to contain:

#includedir /etc/sudoers.d
git ALL=(root) SETENV: NOPASSWD: /usr/bin/git-upload-pack, /usr/bin/git-receive-pack, /usr/bin/git

Since we are going to enable SSH access to the repository, ensure the following holds good.

– Open /etc/shadow and find the line for vcs-user, git.

The second field (which is the password field) must not be set to !!. This value will prevent login. If it is set to !!, edit it and set it to NP (“no password”) instead.

– Open /etc/passwd and find the line for the vcs-user, git.
The last field (which is the login shell) must be set to a real shell. If it is set to something like /bin/false, then sshd will not be able to execute commands. Instead, you should set it to a real shell, like /bin/sh.

– Use phd.user as our daemon user;
# /var/www/phab/phabricator/bin/config phd.user root
# /var/www/phab/phabricator/bin/config set diffusion.ssh-user git

5. Configuring SSH

We will move the normal sshd daemon to another port, say 222. We will use this port to get a normal login shell. We will run highly restrictive sshd on port 22 managed by Phabricator.

Move Normal SSHD

– make a backup of sshd_config before making any changes.

#cp /etc/ssh/sshd_config /etc/ssh/sshd_config.backup

– Update /etc/ssh/sshd_config, change the port to some othert port like 222.

Port 222

– Restart sshd and verify that you are able to connect to the new port

ssh -p 222 user@host

Configure and start Phabricator SSHD

We now configure and start a second SSHD instance which will run on port 22. This instance will use special locked down configuration that uses Phabricator to handle the authentication and command execution.

– Create a phabricator-ssh-hook.sh file

– Create a sshd_phabricator config file

– Start a copy of sshd using the new configuration

Create phabricator-ssh-hook.sh: Copy the template in phabricator/resources/sshd/ phabricator-ssh-hook.sh to somewhere like /usr/lib/phabricator-ssh-hook.sh and edit it to have the correct settings

##############################################################

#!/bin/sh

# NOTE: Replace this with the username that you expect users to connect with.
VCSUSER=”git”

# NOTE: Replace this with the path to your Phabricator directory.
ROOT=”/var/www/codereview/phabricator”

if [ “$1” != “$VCSUSER” ];
then
exit 1
fi

exec “$ROOT/bin/ssh-auth” $@
##############################################################

Make it owned by root and restrict editing;

#sudo chown root /usr/lib/phabricator-ssh-hook.sh
#chmod 755 /usr/lib/phabricator-ssh-hook.sh

Create sshd_config for Phabricator: Copy the template in /phabricator/sshd/sshd_config.phabricator.example to somewhere like /etc/ssh/sshd_config.phabricator

Start Phabricator SSHD

#sudo /usr/sbin/sshd -f /etc/ssh/sshd_config.phabricator

Note:-
Add this entry to the /etc/rc.local to start the daemon on startup.

If you did everything correctly, you should be able to run this;

#echo {} | ssh git@phabricator.your-company.com conduit conduit.ping

and get a response like this;

{“result”:”phab-server”,”error_code”:null,”error_info”:null}

You should now be able to access your instance over ssh on port 222 for normal login and administrative purposes. Phabricator SSHD runs on port 22 to handle authentication and command execution.

6. To create a git repository

Go to Diffusion -> New Repository -> Create a New Hosted Repository

Upgrade Phabricator

Since phabricator is under development, you should update frequently. To update phabricator:

– Stop the web server
– Run git pull in libphutil/, arcanist/, and phabricator.
– Run phabricator/bin/storage upgrade.
– Restart the web server.
Also you can use a script similar to this one to automate the process:
http://www.phabricator.com/rsrc/install/update_phabricator.sh

Apache on the Cloud – The things you should know

    LAMP forms the base of most web applications.  As the load on an server increases, the bottlenecks in the underlying infrastructure become more apparent in the form of slow response to user requests.

     To overcome this slow response  the primary choice of most people is to add more hardware resources ( incase of AWS increasing the instance type). This will definitely  increases performance but will cost you more money.  The webserver and database eat most of the resources. Most commonly used web server is apache and database is MySQL. So if we can optimize these two we can improve the performance.

   Apache optimization techniques can often provide significant acceleration boosts  even when other acceleration techniques are in use, such as a CDN.  mod_pagespeed is a module from Google for Apache HTTP Servers that can improve the page load times of your website. you can read more on this from here.  If you want to deploy a PHP app on AWS Cloud, Its better to using some kind of caching mechanism.  Its already discussed in our blog .

      Once we came into a situation where we have to use a micro instance for a web server with less than 500 hits a day

      When the site started running live, and we feel like disappointed. when accessing website, it would sometimes pause for several seconds before serving the requested page. It took  hours to figure out what was going on. finally we run the command top and quickly discovered that when the site was accessing by certain amount of users the CPU would spike, but the spike was not the typical user or system CPU. For testing what’s happening in  server we used the apache benchmark tool ‘ab’ and run the following command on  localhost.

                                             #ab -n 100 -c 10 http://mywebserver.com/

      This will show  how fast our web server can handle 100 requests, with a maximum of 10 requests running concurrently. In the meantime we were monitoring the output of top command on web server.

     For further investigation we started with  sar – Linux command to  Collect, report, or save system activity information

  #sar 1

      According to amazon documentation “Micro instances (t1.micro) provide a small amount of consistent CPU resources and allow you to increase CPU capacity in short bursts when additional cycles are available”.

       If you use 100% CPU for more than a few minutes, Amazon will “steal” CPU time from the instance, meaning that they throttle your instance.  This last  as long as five minutes, and then you get a few seconds of 100% again, then the restrictions are back.  This will effect your website, making it slow, and even timing-out requests. basically means the physical hardware is busy and the hypervisor can’t give the VM the amount of CPU cycles it wants.

   Real tuning required on prefork. This is where we can tell apache to only generate so many processes. The defaults values  are high, and which cant be handled by micro instance. Suppose you get 10 concurrent requests for a php page and require around 64MB of RAM when requested (you have to make sure that  php memory_limit is above that value). That’s around 640MB of RAM on micro instance of 613MB RAM.  This is the case  with 10 connections – apache is configured to allow 256 clients by default,  We need to  scale these down , normally with 10-12 MaxClients. As per out case, this is still a huge number because 10-12 concurrent connections would use all our memory. If you want to be really cautious, make sure that your max memory usage is less than 613MB. Something like 64M php memory limit and 8 max clients keeps you under your limit with space to spare – this helps ensure that our MySQL process when your server is under load.

           Maxclients an important tuning parameter regarding the performance of the Apache web server. We can calculate the value of this for a t1.micro instance

Theoretically,

MaxClients =(Total Memory – Operating System Memory – MySQL memory) / Size Per Apache process.

t1.micro have a server with 613MB of Total memory. Suppose We are using RDS instead of mysql server.

Stop apache and run

#ps aux | awk ‘{sum1 +=$4}; END {print sum1}’.

 we will get the amount of memory thats used by processes other than apache.

Suppose we get a value around 30.

from top command we can check the average memory that each apache resources use.

suppose its 60mb.

Max clients = (613 – 30 ) 60 = 9.71 ~ 10 approx …

       Micro instances are awesome, especially when cost becomes a major concern, however that they are not right for all applications. A simple website with only a few hundreds  hits a day will do just fine since it will only need CPU in short bursts.

      For Servers that serves dynamic content, better approach is to employ a reverse-proxy. This would be done this apache’s mod_proxy or Squid. The main advantages of this configurations are content caching, load balancing etc. Easy method is to use mod_proxy and the ProxyPass directive to pass content to another server. mod_proxy supports a degree of caching that can offer a significant performance boost. But another advantage is that since the proxy server and the web server are likely to have a very fast interconnect, the web server can quickly serve up large content, freeing up a apache process, why the proxy slowly feeds out the content to clients

If you are using ubuntu, you can enable module by

                                        #a2enmod proxy

                                        #a2enmod proxy_http    

and in apache2.conf

                                         ProxyPass  /  http://192.168.1.46/

                                         ProxyPassReverse  /   http://192.168.1.46/

         The ProxyPassreverse directive captures the responses from the web server and masks the URL as it would be directly responded by the Apache  hiding the identity/location of the web server. This is a good security practice, since the attacker won’t be able to know the ip of our web server.

      Caching with Apache2 is another important consideration.  We can configure apache  to set the Expires HTTP header, max-age directive of the Cache-Control HTTP header of static files ,such as images, CSS and JS files, to a date in the future so that these files will be cached by your visitors browsers. This saves bandwidth and makes web site appear faster if a user visits your site for a second time, static files will be fetched from the browser cache

                                      #a2enmod expires

  edit  /etc/apache2/sites-available/default

  <IfModule mod_expires.c>
               ExpiresActive On
               ExpiresByType image/gif “access plus 4 weeks”
               ExpiresByType image/jpg “access plus 4 weeks”

</IfModule>

This would tell browsers to cache .jpg, .gif  files for four week.

       If your server requires a large amount of read / write operations, you might consider provisioned IOPS ebs volumes on your server. This is really effective if you use database server on ec2 instances.  we can use iostat on the command line to take a look at your read/sec and write/sec. You can also use CloudWatch metrics to determine read and write operations.

       Once we move to the security side of apache, our major concern is DDos attacks. If a server is under a DDoS attack, it is quite difficult to detect the attack before the damage is done.  Attack packets usually have spoofed source IP addresses. Hence, it is more difficult to trace them back to their real source. The limit on the number of simultaneous requests that will be served by Apache is decided by the MaxClients directive, and is set to safe limit, by default. Any connection attempts over this limit will normally be queued up.

     If you want to protect your apache against DOS,  DDOS attacks use mod_evasive module.  This module is designed specifically as a remedy for Apache DoS attacks. This module will allow you to specify a maximum number of requests executed by the same IP address. If the limit is reached, the IP address is blacklisted for the time period you specify.

Make your websites run faster, automatically — try mod_pagespeed for Apache

Google just released the first stable version of mod_pagespeed, the company’sopen-source module for Apache that can automatically optimize your  web pages to improve download and rendering speeds. With this release, Google is declaring this tool ready for broader adoption, though it’s worth noting that a number of large hosting providers like DreamHost, Go Daddy and content delivery network EdgeCast have already been using it in production for quite a while now.

“mod_pagespeed” speeds up your site and reduces page load time. This open-source Apache HTTP server module automatically applies web performance best practices to pages, and associated assets (CSS, JavaScript, images) without requiring that you modify your existing content or workflow.

FEATURES:-

1. Automatic website and asset optimization

2. Latest web optimization techniques

3. 40+ configurable optimization filters

4. Free, open-source, and frequently updated

5. Deployed by individual sites, hosting providers, CDN’s

How does mod_pagespeed speed up web-sites?

“mod_pagespeed” improves web page latency and bandwidth usage by changing the resources on that web page to implement web performance best practices. Each optimization is implemented as a custom filter in mod_pagespeed, which are executed when the Apache HTTP server serves the website assets. Some filters simply alter the HTML content, and other filters change references to CSS, JavaScript, or images to point to more optimized versions.

“mod_pagespeed” implements custom optimization strategies for each type of asset referenced by the website, to make them smaller, reduce the loading time, and extend the cache lifetime of each asset. These optimizations include combining and minifying JavaScript and CSS files, inlining small resources, and others. mod_pagespeed also  dynamically optimizes images by removing unused meta-data from each file, resizing the images to specified dimensions, and re-encoding images to be served in the most efficient format available to the user.

“mod_pagespeed” ships with a set of core filters designed to safely optimize the content of your site without affecting the look or behavior of your site.   In addition, it provides a number of more advanced filters which can be turned on by the site owner to gain higher performance improvements.

“mod_pagespeed” can be deployed and customized for individual web sites, as well as being used by large hosting providers and CDN’s to help their       users improve performance of their sites, lower the latency of their pages, and decrease bandwidth usage.

Installing mod_pagespeed on CentOS (cPanel/WHM)

  1. root@server1# cd /usr/src 
  2. root@server1[/usr/src]# mkdir mod_pagespeed/
  3. root@server1[/usr/src]# cd mod_pagespeed
  4. root@server1[/usr/src/mod_pagespeed]# wget https://dl-ssl.google.com/dl/linux/direct/mod-pagespeed-beta_current_x86_64.rpm
  5. root@server1[/usr/src/mod_pagespeed]# rpm2cpio mod-pagespeed-beta_current_x86_64.rpm | cpio -idmv
  6. root@server1[/usr/src/mod_pagespeed]# cp usr/lib/httpd/modules/mod_pagespeed.so /usr/local/apache/modules
  7. root@server1[/usr/src/mod_pagespeed]# chmod 755 /usr/local/apache/modules/mod_pagespeed.so
  8. root@server1[/usr/src/mod_pagespeed]# mkdir -p /var/mod_pagespeed/{cache,files} —–> Create pagespeed directories.
  9. root@server1[/usr/src/mod_pagespeed]# chown nobody:nobody /var/mod_pagespeed/*
  10.  root@server1[/usr/src/mod_pagespeed]# /usr/local/apache/bin/apxs -c -i /home/cpeasyapache/src/httpd-2.2.22/modules/filters/mod_deflate.c  —-> Enable mod_deflate (required for mod_pagespeed)
  11. root@server1[/usr/src/mod_pagespeed]# vim /usr/local/apache/conf/pagespeed.conf —>edit the mod_pagespeed configuration file

In this file include    

                                  1. LoadModule pagespeed_module modules/mod_pagespeed.so

                                  2. LoadModule deflate_module modules/mod_deflate.so

                                  3. ModPagespeedFileCachePath “/var/mod_pagespeed/cache/”

                                  4. ModPagespeedGeneratedFilePrefix “/var/mod_pagespeed/files/”

And then open /usr/local/apache/conf/includes/pre_main_global.conf and add:

Include conf/pagespeed.conf

# Rebuild Apache config and restart apache.

/scripts/buildhttpdconf

/etc/init.d/httpd restart

Once your web server fires up, it’ll be mod_pagespeed-enabled.

You can verify it by using any web-page test tool. Here I am using Pingdom tool. I have share the screenshots of with and without mod_pagespeed module.

 

Website without mod_pagespeed module

 

 

Website with mod_pagespeed module

 

 

 

 

From CAP, Puppet Now Chef, Evolution of Configuration Management Tools

CHEF, PUPPET & CAPISTRANO are used basically for two purposes  :

Application Deployment is all of the activities that make a software system available for use.

Configuration Management is software configuration management is the task of tracking and controlling changes in the software. Configuration management practices include revision control and the establishment of baselines.

Let me enlighten on how we evolved from the beginning when we were using tools like ssh, scp to the point where we began to abstract and began to equip our-self with these sophisticated yet simple to use tools. Earlier the following tools like

  • ssh which is used as a configuration management solution for admins.
  • scp act as a secure channel for application deployment.

The need for any other tools was out of question until things got complicated!!!

HISTORY

Earlier an Application Deployment  was just a few steps away such as

  1. scp app to production box
  2. restart server (optional)
  3. profit

And these software refreshing/updates were done

  1. Manual (ssh)
  2. with shell scripts living on the servers
  3. or not done at all

CAPISTRANO
(Introduced by Jamis Buck, written in Ruby, initially for Rails project)

Capistrano is a developer tool for deploying web applications. It is typically installed on a workstation, and used to deploy code from your source code management (SCM) to one, or more servers.In its sim­plest form, Capis­trano al­lows you to copy code from your source con­trol repos­i­tory (SVN or Git) to your server via SSH, and per­form pre & post-de­ploy func­tions like restart­ing a web­server, bust­ing cache, re­nam­ing files, run­ning data­base mi­gra­tions and so on.

Nice things cap introduced :

  1. Automate deploys with one set of files
  2. The files don’t have to live on the production server
  3. The language (Ruby) allows some abstraction

Now application deployment step can be coded and tested like rest of the project. It has also become the de facto way to deploy the Ruby on Rails applications. It has also had tools like webistrano build on top of it to provide a graphical interface to the command line tool.

Drawback : The tool seems to be widely used but not well supported.

PUPPET

(Written in Ruby and evolved from cfengine)

Luke Kanies came up with the idea for Puppet in 2003 after getting fed up with existing server-management software in his career as a systems administrator. In 2005 he quit his job at BladeLogic, a maker of data-center management software, and spent the next 10 months writing code to automate the dozens of steps required to set up a server with the right software, storage space, and network configurations. The result: scores of templates for different kinds of servers, which let systems administrators become, in Kanies’s metaphor, puppet masters, pulling on strings to give computers particular personalities and behaviors. He formed Puppet Labs to begin consulting for some of the thousands of companies using the software—the list includes Google, Zynga, and Twitter etc

Puppet is typically used in a client server formation, with all your clients talking to one or more servers. Each client contacts the servers periodically (every half an hour by default), downloads the latest configuration and makes sure it is sync with that configuration.

The Server in Puppet is called Puppet Master.
Puppet Manifests contains all the configuration details which are declarative as opposed to imperative.

The DSL is not Ruby as you are not writing scripts you are writing definitions, Install order is determined through dependencies.
The Puppet Master is idempotent which will make sure the client machines match the definitions.This is good as you can implement changes across machines automatically just by updating the manifest in the Puppet Master.

CHEF
(written in ruby evolved from puppet)

CHEF is an open source configuration management tool using pure-Ruby, the chef domain specific language for writing system configuration related stuff (recipes and cookbook)
CHEF brings a new feel with its interesting naming conventions relating to cookery like Cookbooks (they contain codes for a software package installation and configuration in the form of Recipes), Knife (API tool), Databags (act like global variables) etc

Chef Server – deployment scripts called Cookbooks and Recipes, configuration instructions called Nodes, security details etc. The clients in the chef infrastructure are called Nodes. Chef recipes are imperative as opposed to declarative. The DSL is extended Ruby so you can write scripts as well as definitions. Install order is script order NO dependency checking.

CHEF & PUPPET

Chef and Puppet automatically set up and tweak the operating systems and programs that run in massive data centers and the new-age “cloud” services, designed to replace massive data centers.

Chef Recipes is more programmer friendly as it is easily understood by a developer unlike a Puppet Manifest.

And when it comes to features in comparison to puppet, chef is rather more intriguing .
For example “Chef’s ability to search an environment and use that information at run time is very appealing.

Knife is Chef’s powerful command line interface. Knife allows you to interact with your entire infrastructure and Chef code base. Use knife to bootstrap a server, build the scaffolding for a new cookbook, or apply a role to a set of nodes in your environment. You can use knife ssh to execute commands on any number of nodes in your environment. knife ssh + search is a very powerful combination.

The part of defining dependencies in Puppet was overly verbose and cumbersome. With Chef, order matters and dependencies would be met if we specified them in the proper order.

We can deploy additional software applications on virtual machine instances without dealing with the overhead of doing everything manually,” Stowe explains. “We can do it with code — recipes that define how various applications and libraries are deployed and configured.” According to Stowe, creating and deploying a new software image now takes minutes or hours rather than hours or weeks. They call this technique DevOps because it applies traditional programming techniques to system administration tasks. “It’s just treating IT operations as a software development problem, – Stowe, CEO of Cycle Computing, a Greenwich, Connecticut-based start-up that uses Chef to manage the software underpinning the online “supercomputing” service it offers to big businesses and academic outfits. “Before this, there were ways of configuring servers and managing them, but DevOps has gotten it right.”

Lets CATEGORIZE

Let me help you to know onto which buckets does the above tools fell into and other similar tools…

App Deploy Capistrano, ControlTier, Fabric, Fun, mCollective
SysConfig Chef, Puppet, cfengine, Smart Frog, Bcfg2
Cloud/VM Xen, Ixc, openVZ, Eucalyptus, KVM
OS Install Kickstart, Jumpstart, Cobbler, OpenQRM, xCAT

DevOPS on AWS Cloud using Opscode Chef

Rule the Cloud‘ with Chef
Chef is Infrastructure as Code,an API for your entire infrastructure. Assuming that you are well versed with cloud if not still you should have atleast heard of cloud computing and it is still an evolving paradigm and Cloud computing companies are the newest buzz in the IT sector. Chef is used in conjunction with cloud  from cloud providers say Amazon’s AWS. If a software thats being developed is a mix of technology which is interdependent and works in perfect harmony then why not the people behind it, this thought has led to the emergence of a new cultral trend called DevOPS. Now if you setup a number of instances on the cloud then whats next – new instances on cloud are just like bare metal server and the configuration has to be done from scratch and it would be feasible to do so manually for couple of them what if the count just got bigger say 100 live instances with different unix distros, although a script could be written but still it will not suffice,  in the long run considering management too. Here the CHEF comes into play

“chef is sysadmin robot performing configuration tasks automatically and much more quickly than a single admin could ever hope to” – Jesse Robbins, Opscode CEO.

CHEF is an open source configuration management tool using pure-Ruby,the chef domain specific language for writting system configuration related stuff (recipes and cookbook)

CHEF brings a new feel with its interesting naming conventions relating to cookery like Cookbooks (they contain codes for a software package installation and configuration in the form of Recipes), Knife (API tool), Databags (act like global variables) etc

Although there are many configuration management tools prevailing in the industry CHEF was able to secure its position in the race.

“CHEF take a step farther passes puppet and cfengine — like doing “LIVE SEARCH” within  configuration management like loadbalancer can call out to get a list of the app servers you need to balance  or an applicaton server can call out, get a reference to the master database server  etc …..the centralised chef server is indexing all the information about your infrasturctre  so that you could search in the command line using knife you know in real time so that application could lever that data..” by Seth Chisamore from the OPSCODE.

A techonology peak that isnt fluffy – Cloud
For those folks new to cloud- Its a whole bunch of activites which began as an innovation, recently given out as products and now they have become so widespread and so feature complete that they became suitable for utility services.

So if you dont want cloud in your business its like saying you dont want to use the electricity instead you built your own generator and use it according to your need. Now what do we loose if we continue with that is the competitive edge ie you get the pressure to keep your stuff upgraded inorder to find your place relative to the others in the ecosystem.

Cloud is API oriented, everything you see in cloud is ulitmately programmable.

Virtualization is the foundation of Cloud but virtualization is not Cloud by itself. It certainly enables many of the things we talk about when we talk Cloud but it is not necessary sufficient to be a cloud. Google app engine is a cloud that does not incorporate virtualization. One of the reasons that virtualization is great is because you can automate the procurement of new boxes.

A Culture thats on path to revolutionize IT – DevOPS
Devops is something that orginated in webshops predominantly and it require a kind of tools thats really not available except for home grown tools which the big webshops built over and over again. So the organisation who wanted to use devops started using the tools that enable this transition as most organisations depends on web as a source of revenue in a variety of different ways, even the enterprise desire to be as agile as the webshops. This has begun a revolution from the website permeate into the enterprise base more frequently.

Considering a real life example for Devops say facebook, the most popular social networking site here the developers/QA/operations – there is alot of communications, cross talk happening between them like the developers has to write codes, QA who has to make sure the good code goes out, the operations team has to make sure its up and running. Finally all of these has to be in records which altogether seems to be inefficient, this led to the evolving of the entire system. According to the conventional practices where the developers writes the code and throws it off to the testing. Once the testing is done then it moves to the operations etc. Contrary to that the developers , operations team are all involved in the entire lifecycle of the project as a team. This creates a symbiotic relationship. Now the operations people could understand what the engineers needs the most and the developers are able to see the value that operation people brings as they make architecture decisions.

Cloud with your DevOps offers some fantastic properties. The ability to leverage all the advancements made in software development around repeatability and testability with your infrastructure. The ability to scale up as need be real time (autoscaling) and among other things being able to harness the power of self healing systems. DevOps better with Cloud.

Configuration management say CHEF is one of the most fundamental elements allowing DevOps in the cloud. It allows you to have different VMs that have just enough OS that they can be provisioned, automatically through virtualization, and then through configuration management can be assigned to a distinct purpose within the cloud. The CM system handles turning the lightly provisioned VM into the type of server that it is intended to be.

DevOps & Chef
DevOps is nonthing but a cultural movement where everybody say the developers, QA, Operations, Testing etc get along. A project group formation with a mixed skillset that blurs the line between say a developer and sysadmin. This helps the project to meet its deadlines
and avoid unexpected situations. Cloud computing act like a catalyst to this movement. Thereby the CHEF also hops in.

Chef forms a critical layer in the Devops stack.Thanks to the concept of infrastructure as code and virtualization, we can define and build our infrastructure based on text files. Those files can be version-controlled and tested like regular code. The artifact (ami, image), can then be deployed on an infrastructure. The following image gives you an overview on the similarities.

Inadvertently the issues like “what if the application” or “what if the infrasturcture” are resolved, the fact is that application is the infrastructure and infrastructure is the application and we are here to enable business, also it helped bring peoples in the team into better alignment across the board.

Chef configuration is written in pure ruby.

Devops == Ruby

For those who think Bash is enough as a scripting language – Bash becomes a liability not an asset once your script exceeds 100 lines and a total nightmare if you need to parse or output HTML, CSV, XML, JSON, etc. A significant point to be noted is that Chef uses Ruby in its recipes unlike puppet where it uses its own configuration language that is based on Ruby although chef is heavily inspired from puppet. If you chose chef then you are effectively scripting your infrastructure with ruby.

Though Chef was only released on January 15th , 2009 it has gotten rapid adoption and gained a large number of contributors. According to the Opscode wiki there are 545 approved contributors to Opscode projects and 106 companies. Beyond that the #chef IRC channel is typically attended by over 100 users and Opscode staff, signs of a healthy, growing open source community.

Springsource division of VMware have signed on to contribute to the project. They are even being very public about it as seen in this endorsement:

“We are excited about the open source contributions the Springsource Division of VMware has made to Opscode Chef.” said Javier Soltero, CTO of Springsource Management Products at VMware. “Chef is an important tool for automating infrastructure management and we look forward to its continued growth and success.”

Moreover on my experience of using chef I really enjoyed the quick response I could get from the Opscode Support Team for all my queries and they had always being able to direct me towards a solution.

Automation Using Chef to create an Instance on Amazon Cloud Service Provider with Apache webserver configured in it.

Memo
chef-workstation – is the place where we customize our cookbooks and maintains the chef-repo
chef node – is the management node that we create using chef, it configures itself based on its runlist and downloaded cookbooks

The really cool thing with Chef is that you can rerun cookbooks against a node and it will not do anything it has already done i.e it will not change the end result on the target node as defined by the recipes being run against it. So you will always get the same outcome no matter what state the node and actions will not be taken if already done (and conversely run if detected it has not been run).  When reading about Chef you will see this described as being idempotent (There I’ve saved you looking it up).

Prerequisites – an AWS account, EC2 API configured, OS – Ubuntu.

1. Sign up an account at http://www.opscode.com/hosted-chef/# , Here we use the OHC (opscode hosted chef) where we get to create upto 5 nodes for free!!

2.Verify your opscode account.

3.Download the files

Create an organization in the Console page at www.manage.opscode.com, and then download the following files:

  • Your Organization validation key. This is used to automatically register new Chef Clients (like servers you manage).
  • The Knife configuration file.
  • Your User key. This is used to authenticate your user with Hosted Chef.
  • Edit knife.rb  to add aws access key and secret access key
  • knife[:aws_access_key_id]     = “Your AWS Access Key”
  • knife[:aws_secret_access_key] = “Your AWS Secret Access Key”

At this stage I have a chef ready user environment, an OpsCode organisation set up and now I want to start by spinning up an ec2 instance. I will not be going into any depth regarding  the ec2 specifics as that would make this post far too long.

4.Setting Up chef-Workstation

Install Ruby and Development Tools

#sudo apt-get update
#sudo apt-get install ruby ruby-dev libopenssl-ruby rdoc ri irb build-essential wget ssl-cert git-core
#sudo gem update –system

Install RubyGems

#cd /tmp
#wget http://production.cf.rubygems.org/rubygems/rubygems-1.8.10.tgz
#tar zxf rubygems-1.8.10.tgz
#cd rubygems-1.8.10
#sudo ruby setup.rb –no-format-executable

Install Chef

#sudo gem install chef

5.To verify chef installation

#chef-client -v

6.Build the chef repository

#cd ~
#git clone https://github.com/opscode/chef-repo.git

Knife reads configuration files in .chef. so we need to create those as well

#mkdir -p ~/chef-repo/.chef

Copy the keys and knife configuration you downloaded earlier into this directory:

#cp USERNAME.pem ~/chef-repo/.chef
#cp ORGANIZATION-validator.pem ~/chef-repo/.chef
#cp knife.rb ~/chef-repo/.chef

Run the following command to confirm knife is working with the Hosted Chef API.

#cd ~/chef-repo
#knife client list

output : “ORGANIZATION-validator”

7.Now i need to download the apache2 cookbook on to my workstation, customize if required and then upload it to my account on the opscode platform

#knife cookbook site install apache2

this will notify git and also pulls down the desired cookbook

8.Upload the cookbook using the following command

#knife cookbook upload apache2

9.Enter the following command, sit back and  enjoy the show!!!

#knife ec2 server create -G default -I ami-1212ef7b -f m1.small -S <aws ssh key id> -i <ssh identity file> -x root -r ‘recipe[apache2]’


Before proceeding it would probably be a good idea to take time out and read the Opscode  Chef Recipe wiki which has a nice clear explanation on cookbook name spaces. Also remind yourself of the components that make up a cookbook it’s worth noting that recipes manage resources and those resources will be executed in the order they occur.

HADOOP Cluster on AWS EC2 with hadoop-0.20 and ubuntu-10.04

Let’s start with a small introduction- what is hadoop ?. Hadoop is an open-source project administered by the Apache Software Foundation. Apache Hadoop is a Java software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System (GFS) papers.

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.

Dealing with big data requires two things:

  • Inexpensive, reliable storage; and
  • New tools for analyzing unstructured and structured data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers.If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers.

The main services running in a hadoop cluster will be

1)namenode

2)jobtracker

3)secondarynamenode

These three will be running only on a single node(machine) ; that machine is the central machine which controls the cluster.

4)datanode

5)tasktracker

These two services will be running on all other nodes in the cluster.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

Above the file systems comes the MapReduce  engine, which consists of one Job Tracker, to which client applications submit MapReduce jobs. The Job Tracker pushes work out to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible.

The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads current name-node image and edits log files, joins them into new image and uploads the new image back to the (primary and the only) name-node.

Now Let us have a look at how to build a hadoop cluster using Cloudera hadoop-0.20 on ubuntu-10.04

You should install sun –jdk  first. Then add the following repositories to the apt sources list.

vim /etc/apt/sources.list.d/cloudera.list

[bash]

deb http://archive.cloudera.com/debian lucid-cdh3u0 contrib

deb-src http://archive.cloudera.com/debian lucid-cdh3u0 contrib

[/bash]

Import key

[bash]curl -s http://archive.cloudera.com/debian/archive.key | apt-key add -[/bash]

Then run

[bash]apt-get update[/bash]

For Namenode/Jobtracker ( These two services should run only on a single central machine in the cluster)

[bash]

apt-get install hadoop –yes

apt-get install hadoop-0.20-namenode

apt-get install hadoop-0.20-jobtracker

apt-get install hadoop-0.20-secondarynamenode

[/bash]

Configuration

vim /etc/hadoop/conf/hadoop-env.sh

Append these

[bash]

export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.24/   ( your java home comes here )

export HADOOP_CONF_DIR=/etc/hadoop/conf

export HADOOP_HOME=/usr/lib/hadoop-0.20

export HADOOP_NAMENODE_USER=hdfs

export HADOOP_SECONDARYNAMENODE_USER=hdfs

export HADOOP_DATANODE_USER=hdfs

export HADOOP_JOBTRACKER_USER=mapred

export HADOOP_TASKTRACKER_USER=mapred

export HADOOP_IDENT_STRING=hadoop

[/bash]

vim /etc/hadoop/conf/core-site.xml

[bash]

<?xml version=”1.0″?>

<?xml-stylesheet type=”text/xsl” href=”configuration.xsl”?>

<!– Put site-specific property overrides in this file. –>

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://< ip address of this machine >:8020</value>

</property>

</configuration>

[/bash]

vim /etc/hadoop/conf/hdfs-site.xml

 

[bash]

<?xml version=”1.0″?>

<?xml-stylesheet type=”text/xsl” href=”configuration.xsl”?>

<!– Put site-specific property overrides in this file. –>

<configuration>

<property>

<name>dfs.name.dir</name>

<value>/var/lib/hadoop-0.20/name</value>

</property>

<property>

<name>dfs.data.dir</name>

<value>/var/lib/hadoop-0.20/data</value>

</property>

<property>

<name>dfs.replication</name>

<value>2</value>

</property>

</configuration>

[/bash]

vim /etc/hadoop/conf/mapred-site.xml

[bash]

<?xml version=”1.0″?>

<?xml-stylesheet type=”text/xsl” href=”configuration.xsl”?>

<!– Put site-specific property overrides in this file. –>

<configuration>

<property>

<name>mapred.job.tracker</name>

<value>< ip address of this machine >:8021</value>

</property>

<property>

<name>mapred.system.dir</name>

<value>/var/lib/hadoop-0.20/system</value>

</property>

<property>

<name>mapred.local.dir</name>

<value>/var/lib/hadoop-0.20/mapred</value>

</property>

</configuration>

[/bash]

——————————————————————————————————————————————

[bash]

mkdir  / var/lib/hadoop-0.20/name

mkdir  / var/lib/hadoop-0.20/data

mkdir  / var/lib/hadoop-0.20/system

mkdir  / var/lib/hadoop-0.20/mapred

chown -R hdfs /var/lib/hadoop-0.20/name

chown -R hdfs /var/lib/hadoop-0.20/data

chown -R mapred /var/lib/hadoop-0.20/mapred

[/bash]

Now format NameNode

[bash]yes Y | /usr/bin/hadoop namenode –format[/bash]

Start namenode

[bash]/etc/init.d/hadoop-0.20-namenode start[/bash]

Check the log Files for error:

less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-<ip>.log

Also you can check whether the Namenode process is up or not using the command

[bash]# jps[/bash]

Start the SecondaryNamenode

[bash]/etc/init.d/hadoop-0.20-secondarynamenode start[/bash]

Log: less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-<ip>.log

[bash]

sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-0.20/system

sudo -u hdfs hadoop fs -chown mapred /var/lib/hadoop-0.20/system

[/bash]

Now Start the JobTracker

[bash]/etc/init.d/hadoop-0.20-jobtracker start[/bash]

Log : less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-ip-10-108-39-34.log

Now  jps  command will show the three processes up

# jps

19233 JobTracker

18994 SecondaryNameNode

18871 NameNode

For Datanode/Tasktracker ( These two services should be running on all the other machines in the cluster )

[bash]

apt-get install hadoop-0.20-datanode

apt-get install hadoop-0.20-tasktracker

[/bash]

Configuration

vim /etc/hadoop/conf/core-site.xml

 

[bash]

<?xml version=”1.0″?>

<?xml-stylesheet type=”text/xsl” href=”configuration.xsl”?>

&nbsp;

<!– Put site-specific property overrides in this file. –>

&nbsp;

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://< ip address of the namenode >:8020</value>

</property>

</configuration>

[/bash]

vim /etc/hadoop/conf/hdfs-site.xml

[bash]

<?xml version=”1.0″?>

<?xml-stylesheet type=”text/xsl” href=”configuration.xsl”?>

&nbsp;

<!– Put site-specific property overrides in this file. –>

&nbsp;

<configuration>

<property>

<name>dfs.name.dir</name>

<value>/var/lib/hadoop-0.20/name</value>

</property>

<property>

<name>dfs.data.dir</name>

<value>/var/lib/hadoop-0.20/data</value>

</property>

<property>

<name>dfs.replication</name>

<value>2</value>

</property>

</configuration>

[/bash]

vim /etc/hadoop/conf/mapred-site.xml

[bash]

<?xml version=”1.0″?>

<?xml-stylesheet type=”text/xsl” href=”configuration.xsl”?>

&nbsp;

<!– Put site-specific property overrides in this file. –>

&nbsp;

<configuration>

<property>

<name>mapred.job.tracker</name>

<value>< ip address of jobtracker  >:8021</value>

</property>

<property>

<name>mapred.system.dir</name>

<value>/var/lib/hadoop-0.20/system</value>

</property>

<property>

<name>mapred.local.dir</name>

<value>/var/lib/hadoop-0.20/mapred</value>

</property>

</configuration>

[/bash]

———————————————————————————————————————————————

[bash]

mkdir  /var/lib/hadoop-0.20/data/

chown -R hdfs /var/lib/hadoop-0.20/data

mkdir /var/lib/hadoop-0.20/mapred

chown -R mapred /var/lib/hadoop-0.20/mapred

[/bash]

Start the DataNode

[bash]/etc/init.d/hadoop-0.20-datanode start[/bash]

Log : less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-datanode-<ip>.log

Start the TaskTracker

[bash]/etc/init.d/hadoop-0.20-tasktracker start[/bash]

Log: less /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-<ip>.log

You can now check the interface

http://< namenode-ip >:50070   – for HDFS overview

and

http://< jobtracker –ip>:50030  – for Mapreduce overview

IBM application servers on AWS

IBM and Amazon announced that they have teamed up to provide cloud computing solutions. With the announced now Amazon will be able to provide IBMs application servers in their cloud computing environment. Amazon is already proving Microsoft Winodws AMIs in their EC2 instances. With the solutions like IBM DB2, Informix, Webshephere flavours and Lotus Web content management solution both IBM and Amazon has strengthened their offerings in cloud computing offerings. Time will tell what Gogrid & Google AppEngine has in store to compete with AWS