Compute Infrastructure-as-a-Service
Today’s software development world is hosted on massive computing machines: lots of memory, lots of disk space, lots of CPU power. However, software development and testing are still often done at small scale; developers use vi, run unit tests in Python and run build scripts written for ant and mvn. How can one best use these massive machines for development at small scale and still run tests on them at large scale when necessary?
Unless you’ve been living in a cave the last few years, IT marketing material has firmly burned virtualization into your mind. In particular, taking those massive machines and cutting them up into many smaller virtual machines is the solution much of the industry has converged on. I agree! Here are my notes on how I moved my group into this era.
CloudStack
Enter Citrix, Cloud.COM and now the Apache Foundation; CloudStack is an incubating Apache project. CloudStack is a very slick application which effectively implements Amazon’s EC2 UI, API and features — including a nice web front-end for starting and managing your VMs, storage and networks. However, as with all new software and certainly a piece of software which is as complex as a data center in a box, there are bugs and lots of knobs to turn for configuration.
Setup
I took a very conservative approach to working with CloudStack. I only needed many VMs running on the same network with little isolation, and with merely workable storage space and reliability. I did not need very high performance or high reliability; I only needed to slice up a few machines in the same physical datacenter. Further, I am currently using CloudStack 3.0.2 on CentOS 6.3 with the KVM hypervisor. As CloudStack development moves VERY quickly, I expect the upcoming 4.0 release will be very different; and as the OS vendors do not stand still either, I’m sure a different Red Hat-based distro, or even another CentOS version, would behave quite differently.
I followed the CloudStack Quick Install guide and set up a Basic Zone. As I wanted my VMs reachable from the world outside CloudStack, I needed to select a network offering supporting Security Groups (DefaultSharedNetworkOfferingWithSGService).
Further, I reused another machine I had handy with a terabyte of disk space as my NFS server, but I also enabled local storage for user VMs to make use of every disk I could get at. I used one IP network for both my management and guest networks. (I do hope to get the machines running bonded 1GigE for their physical connections soon, though.) Were I to do it again, simply to reduce IP usage, I would use a second, non-routable network (RFC1918 address space) and set up the management server to act as a NAT box to my broader network.
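As a quick illustration of what "RFC1918 address space" covers, here is a minimal Python sketch that checks whether an IPv4 address falls in one of the three private blocks. The function name is my own; the sample address 172.20.7.136 is the host address that shows up in the agent log later in these notes.

```python
import ipaddress

# The three private IPv4 blocks reserved by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the IPv4 address lies in an RFC 1918 private block."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETWORKS)
```

A management network carved from any of these blocks is not routable on the public Internet, which is why the management server would then have to NAT for it.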
Configuration
CloudStack is much more centralized than some other infrastructure-as-a-service cloud offerings. One only needs to understand a few roles and daemons to know the major touch points of CloudStack:
- Management Server
  - This runs the Tomcat server which hosts the UI and does most of the coordination activities amongst the various CloudStack components.
  - /etc/init.d/cloud-management
  - /var/log/cloud/management/management-server.log
  - /var/log/cloud/management/catalina.out
- Usage-Server
  - This is the usage server which collects metrics from CloudStack for external analysis (e.g. billing)
  - /etc/init.d/cloud-usage
  - /var/log/cloud/usage/usage.log
- Agent
  - This runs on the CloudStack machines which host guest VMs.
  - /etc/init.d/cloud-agent
  - /var/log/cloud/agent/agent.log
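The touch points above are easy to keep in a small lookup table. The following Python sketch only records the paths listed here and builds the command lines I found myself typing most; the helper names are hypothetical, and the paths are those of my 3.0.2 install, so they may differ on other releases.

```python
from typing import Dict, List, NamedTuple


class Component(NamedTuple):
    """Where one CloudStack daemon is controlled and where it logs."""
    init_script: str
    logs: List[str]


# Paths as listed above for CloudStack 3.0.2 on CentOS 6.3.
COMPONENTS: Dict[str, Component] = {
    "management": Component(
        "/etc/init.d/cloud-management",
        [
            "/var/log/cloud/management/management-server.log",
            "/var/log/cloud/management/catalina.out",
        ],
    ),
    "usage": Component("/etc/init.d/cloud-usage", ["/var/log/cloud/usage/usage.log"]),
    "agent": Component("/etc/init.d/cloud-agent", ["/var/log/cloud/agent/agent.log"]),
}


def status_command(name: str) -> List[str]:
    """Build the argv that asks the init script for a daemon's status."""
    return [COMPONENTS[name].init_script, "status"]


def tail_command(name: str, lines: int = 50) -> List[str]:
    """Build the argv that tails every log file of a daemon."""
    return ["tail", "-n", str(lines)] + COMPONENTS[name].logs
```

Both helpers only return argument lists; passing them to subprocess.run (as root, on the right machine) is left to the caller.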
This centralization of components makes configuration and debugging easier, but managing all the how-to documents for a system as big as CloudStack still became a bit daunting. Below are my most-used how-tos and the pitfalls I ran across.
Agent Reboots
One very confusing issue arose when I initially set up my compute hardware as CloudStack agents: the machines would immediately reboot, and keep rebooting! (This was not the behavior I was expecting.) This taught me to check the logs early and check the logs often, as I found (in /var/log/cloud/agent/agent.log):
2012-10-09 16:18:50,466{GMT} WARN [resource.computing.KVMHAMonitor] (Thread-27:) write heartbeat failed: Failed to create /mnt/031d9475-063d-30b5-b910-7ee710ff81b0/KVMHA//hb-172.20.7.136; reboot the host
Luckily, others had been here before. The fix was nicely documented and ever so easy: sed -i 's/reboot/#reboot/g' /usr/lib64/cloud/agent/scripts/vm/hypervisor/kvm/kvmheartbeat.sh. It also showed me an invaluable setting to enable DEBUG messages from the CloudStack agent: sed -i 's/INFO/DEBUG/g' /etc/cloud/agent/log4j-cloud.xml
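Since "check the logs early" was the lesson, here is a small Python sketch that scans agent log lines for the heartbeat failure shown above. The regular expression is written against that one log line from my install, so treat the message format as an assumption for other versions; the function name is my own.

```python
import re
from typing import Iterable, List

# Matches the KVMHAMonitor warning quoted above and captures the stated reason.
HEARTBEAT_FAILURE = re.compile(
    r"write heartbeat failed: (?P<reason>.*?); reboot the host"
)


def find_heartbeat_failures(lines: Iterable[str]) -> List[str]:
    """Return the failure reason from every heartbeat warning in the log lines."""
    reasons = []
    for line in lines:
        match = HEARTBEAT_FAILURE.search(line)
        if match:
            reasons.append(match.group("reason"))
    return reasons
```

Feeding it the lines of /var/log/cloud/agent/agent.log would show whether a host is about to be rebooted by the heartbeat script, and which path it failed to write.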
Agent dies at start with: Unable to start agent: Unable to find the guid
Next, I had issues with starting the agent. The wizard or I would run cloud-setup-agent, and then checking /etc/init.d/cloud-agent status would show the agent dead. This too was an easy fix which, again, someone else had documented. One simply needs to add the following to /etc/cgconfig.conf and restart the cgconfig service:
group virt { cpu { cpu.shares = 9216; } }
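If you provision hosts with a script, the stanza can be appended idempotently. This is a minimal Python sketch under my own assumptions: it works on the text of the file rather than the file itself, and the function name is hypothetical. The stanza is the same one shown above, just spread over several lines.

```python
import re

# The same group definition as above, laid out one setting per line.
VIRT_STANZA = """group virt {
    cpu {
        cpu.shares = 9216;
    }
}
"""


def ensure_virt_group(config_text: str) -> str:
    """Return config_text with the virt group appended unless one is already present."""
    if re.search(r"^\s*group\s+virt\s*\{", config_text, flags=re.MULTILINE):
        return config_text
    # Make sure the stanza starts on its own line.
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return config_text + separator + VIRT_STANZA
```

Writing the result back to /etc/cgconfig.conf and restarting the cgconfig service remain manual steps here.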
Set your hostname
The Quick Install Guide says to ensure your hostname is set (e.g. checked via the hostname --fqdn command); beyond that, ensure that both /etc/hosts and /etc/sysconfig/network carry your fully-qualified hostname. One error you may see otherwise can be found in the ever helpful CloudStack forum.
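A rough Python equivalent of eyeballing hostname --fqdn is sketched below. The "at least one dot and not localhost" rule is my own heuristic, not something CloudStack defines, and socket.getfqdn() only approximates what the hostname command reports.

```python
import socket


def looks_fully_qualified(name: str) -> bool:
    """Heuristic: two or more non-empty labels, and not a localhost-style name."""
    labels = name.rstrip(".").split(".")
    if len(labels) < 2 or any(not label for label in labels):
        return False
    return labels[0] != "localhost"


def check_local_hostname() -> bool:
    """Apply the heuristic to whatever name the resolver reports for this host."""
    return looks_fully_qualified(socket.getfqdn())
```

If check_local_hostname() returns False on a would-be agent, /etc/hosts and /etc/sysconfig/network are the first places I would look.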
Automatic VM Password Generation
To add password generation and reset support to your own templates, you can follow the instructions for CloudStack 4.0; I have tested the Linux script, at least. (There is also the ability to use ssh key-pairs, like Amazon EC2 does. I have not yet tried that, but it is well documented, if not supported by the UI.)
LDAP
Setting up LDAP for CloudStack is quite easy, but it requires some configuration outside the UI, via the API, as documented in the instructions (or original). (There are some notes on using port 8096, as the documentation does.) There is also one bug, CS-14680, which has to be worked around, as the LDAP authentication does not use MD5 hashing like the built-in MySQL authentication does.
Due to CS-14680, if you need to allow authentication against both LDAP and the built-in MySQL, then a few HTML changes are necessary too. The changes are documented in CS-16325.
Lastly, as one needs to set up the accounts for CloudStack to use from LDAP, there is a Ruby script which can synchronize your LDAP server to CloudStack. But remember, if setting up accounts from an LDAP server which might control sensitive services (e.g. Active Directory, in my case), you will likely want to use SSL on your Management Server, so that passwords are encrypted in transit.
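To make the MD5 mismatch behind CS-14680 concrete, here is a tiny Python sketch of what an MD5 hex digest of a password looks like. It only illustrates the hashing step mentioned above; how and where CloudStack applies it internally is not shown in these notes, so the function is an assumption-level illustration rather than CloudStack's code.

```python
import hashlib


def md5_hex(password: str) -> str:
    """Return the hex MD5 digest of a password, as a one-way hash."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()
```

A digest like this can be compared against a stored digest, but it cannot be turned back into the clear-text password an LDAP simple bind needs, which is one reason SSL on the Management Server matters once LDAP is in play.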
Usage
Cleanly Restart a Host
If you need to restart one of your VM hosting machines, there is a bit more forethought required than one would normally have for a Linux box. The steps are:
- Mark it in maintenance
- Then, restart
- Mark it as available
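The three steps can also be driven through the API rather than the UI. The Python sketch below only builds the request URLs; it assumes the unauthenticated integration port (8096, as mentioned in the LDAP notes) is enabled on the management server, that the base URL shown is right for your install, and that you already know the host's ID. prepareHostForMaintenance and cancelHostMaintenance are the CloudStack API commands for entering and leaving maintenance; the helper names and the placeholder host ID are mine.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode

# Assumption: integration API port enabled on the management server itself.
DEFAULT_BASE = "http://localhost:8096/client/api"


def api_url(base: str, command: str, **params: str) -> str:
    """Build a CloudStack API URL asking for a JSON response."""
    query = {"command": command, "response": "json"}
    query.update(params)
    return base + "?" + urlencode(query)


def restart_plan(host_id: str, base: str = DEFAULT_BASE) -> List[Tuple[str, Optional[str]]]:
    """Return the ordered steps; the restart itself has no API call here."""
    return [
        ("mark in maintenance", api_url(base, "prepareHostForMaintenance", id=host_id)),
        ("restart the machine", None),
        ("mark as available", api_url(base, "cancelHostMaintenance", id=host_id)),
    ]
```

Both API commands are asynchronous, so in practice one would wait for the host to actually reach maintenance before rebooting it.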
If a machine is not properly shut down:
- Get it back online by toggling the maintenance state of the host
- Look at the zone’s system VMs; they might be wedged in the Starting state and need to be unstuck
- May need to enable/disable zone
- Restart the management server
- Ensure the VMs are not running on the host they claim (using virsh) and set them to stopped in MySQL
Storage Migration
One very cool feature of CloudStack is that you can migrate your VMs (live!), and you can migrate the storage they are running upon too (storage migration). This is especially useful if you use local storage and need to move a VM off a host for maintenance; but beware, there is a good performance optimization still left to be made to lessen the load on secondary storage when doing a storage migration.
Local Storage
Using local storage is a huge help if your infrastructure does not have much shared storage. However, if you are like me, it is easiest to create templates which have relatively small root disks (say 20GB); for many needs, you will then have to attach the bulk of the storage as an extra volume. While there is a check box to have a system’s root disk be local, there is no equivalent for a disk offering (for making said extra volumes).
I tried to implement local storage disk offerings by using storage tags. I set the tag “LOCAL” on the local primary storage pools and made a disk offering requiring its volumes to be made only on pools with the tag “LOCAL”, but this failed. I could create the volume (but that only makes a database record in CloudStack and does not actually pick out storage); when I attached the volume to the VM (at which point CloudStack would actually create it), it failed. I got:
2012-10-18 22:27:45,385 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-95:job-524) Checking if storage pool is suitable, name: cloud0.domain ,poolId: 211
2012-10-18 22:27:45,385 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-95:job-524) Is localStorageAllocationNeeded? false
2012-10-18 22:27:45,385 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-95:job-524) Is storage pool shared? false
2012-10-18 22:27:45,385 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-95:job-524) StoragePool is not of correct type, skipping this pool
2012-10-18 22:27:45,385 DEBUG [storage.allocator.FirstFitStoragePoolAllocator] (Job-Executor-95:job-524) FirstFitStoragePoolAllocator returning 0 suitable storage pools
2012-10-18 22:27:45,385 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-95:job-524) No suitable pools found for volume: Vol[142|vm=106|ROOT] under cluster: 6
2012-10-18 22:27:45,385 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-95:job-524) No suitable pools found
2012-10-18 22:27:45,385 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-95:job-524) No suitablestoragePools found under this Cluster: 6
2012-10-18 22:27:45,385 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-95:job-524) Could not find suitable Deployment Destination for this VM under any clusters, returning.
This was annoying: the volume I explicitly wanted local, CloudStack was trying to make shared, and it was ruling out the local pools! Thankfully, someone who was trying to run their cloud without any shared storage had hit upon a solution in CS-11840. (Despite the original filer claiming this failed, UPDATE disk_offering SET use_local_storage = 1 WHERE display_text LIKE "%LOCAL%"; worked for me on 3.0.2.) This did not solve the whole problem immediately, however, as I was trying with local-storage VMs and local-storage volumes, and got the following error:
2012-10-18 16:11:51,537 DEBUG [cloud.async.AsyncJobManagerImpl] (http-6443-exec-4:null) submit async job-440, details: AsyncJobVO {id:440, userId: 3, accountId: 3, sessionKey: null, instanceType: Volume, instanceId: 104, cmd: com.cloud.api.commands.AttachVolumeCmd, cmdOriginator: null, cmdInfo: {"response":"json","id": "f0089a1b-32f4-4e49-89fa-0dfe0935b4b4","sessionkey":"r2D/wVCutA/UwOkAXxtfzSDjU7o\u003d","ctxUserId":"3","virtualMachineId":"fe907255-063a-4e72-95a3-43abe53f1867 ","_":"1350591111196","projectid":"6c8ef680-752f-47d4-a0a1-fe9d68197a18","ctxAccountId":"3","ctxStartEventId":"4420"}, cmdVersion: 0, callbackType: 0, callbackAddress: null, status: 0, processStatus: 0, resultCode: 0, result: null, initMsid: 964251491601, completeMsid: null, lastUpdated: null, lastPolled: null, created: null}
2012-10-18 16:11:51,538 DEBUG [cloud.async.AsyncJobManagerImpl] (Job-Executor-11:job-440) Executing com.cloud.api.commands.AttachVolumeCmd for job-440
2012-10-18 16:11:51,721 INFO [cloud.api.ApiDispatcher] (Job-Executor-11:job-440) Please specify a volume that has been created on a shared storage pool.
But once I realized that I do not care whether the small root disk is shared, as long as the massive storage volume is local, I had success with shared-storage VMs having local volumes.
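To see what the UPDATE from CS-11840 does before running it against a live cloud database, it can be rehearsed on a throwaway table. The Python sketch below uses an in-memory SQLite database as a stand-in for MySQL; the three-column table is a deliberately tiny, hypothetical imitation of disk_offering (the real table has many more columns), and the offering names are made up.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a toy disk_offering table with one shared and one LOCAL offering."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE disk_offering ("
        "id INTEGER PRIMARY KEY, display_text TEXT, "
        "use_local_storage INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO disk_offering (display_text) VALUES (?)",
        [("Small shared volume",), ("Big LOCAL volume",)],
    )
    conn.commit()
    return conn


def mark_local_offerings(conn: sqlite3.Connection) -> int:
    """Flag every offering whose description mentions LOCAL; return rows changed."""
    cur = conn.execute(
        "UPDATE disk_offering SET use_local_storage = 1 "
        "WHERE display_text LIKE '%LOCAL%'"
    )
    conn.commit()
    return cur.rowcount
```

The point of the rehearsal is the WHERE clause: only offerings whose display text contains LOCAL are touched, so the naming of the disk offering matters.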
Deleting a Zone
If you need to delete a zone, make sure to follow the correct steps, or it is possible to get the zone wedged in an un-deletable state (e.g. CS-14297: [Can] not delete primary storage without going into the database). The steps and their correct order are outlined in CS-15991.
Troubleshooting
Much troubleshooting is done by investigating the MySQL database which underlies CloudStack; sometimes things get wedged enough that they require changes there. The database schema is very easy to understand. One needs to be restrained in modification (as referential integrity can be compromised, causing confusion, if a row is removed or an incorrect ID is entered), though there are constraints in place to try to prevent errant state.
Storage Issues
Storage issues can be some of the most insidious issues one will encounter with CloudStack. Errors can be bizarre! Issues I ran across with templates alone:
- Templates are listed in the UI under “Templates” but not visible when I go to create a VM
  - Usually you can see that the template is still downloading or had an error when you click on it in the UI under Templates.
  - These issues can often be resolved by verifying the Secondary Storage VM (SSVM) is working okay. First, make sure your SSVM even started by going to Infrastructure->Zones->System VMs and ensuring your SSVM is “Running”. Luckily, there is a nice write-up on how to check the SSVM for its other common sicknesses.
- Deleting a template fails, or a template is wedged with Failed post download script: Checksum failed, not proceeding with install (fixed in CS-14555). This happened to me on the built-in CentOS 5.6 VM template, which I simply wanted to remove, but since it was wedged the UI had no option to remove it. As such, I followed the (now slightly outdated) steps in a forum thread.
- Uploading a template/ISO fails
  - “Connection Refused” shown as the template status when trying to upload: this is by design, as templates can only be uploaded from accepted sites. I simply had to change the secstorage.allowed.internal.sites configuration variable to allow the host.
  - “Please specify a valid qcow2”: uploading a template may fail due to the file name not ending in .qcow2.
  - Trying to upload an ISO kept failing for me, reporting java.lang.IllegalStateException: java.lang.IllegalStateException: unsupported protocol: 'ftp' with an http:// URL, so I filed a bug, CLOUDSTACK-370, which awesomely got two responses before I could even git clone the CloudStack source and reproduce the issue.
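The upload pitfalls above lend themselves to a pre-flight check before handing a URL to CloudStack. This Python sketch is only my approximation of those pitfalls: the allowed_hosts set is a simplified stand-in for whatever secstorage.allowed.internal.sites contains (I am not reproducing its real matching rules), the accepted schemes are an assumption, and the function name and example hostnames are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def template_url_problems(
    url: str,
    allowed_hosts: Iterable[str],
    allowed_schemes: Iterable[str] = ("http", "https"),
) -> List[str]:
    """Return human-readable reasons a qcow2 template URL is likely to be refused."""
    problems = []
    parts = urlparse(url)
    if parts.scheme not in set(allowed_schemes):
        problems.append("unsupported scheme: %r" % parts.scheme)
    if parts.hostname not in set(allowed_hosts):
        problems.append("host not in the allowed sites: %r" % parts.hostname)
    if not parts.path.endswith(".qcow2"):
        problems.append("file name does not end in .qcow2")
    return problems
```

An empty list does not guarantee the upload works (the SSVM still has to be healthy), but a non-empty one has saved me a round trip through the UI.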
One painful issue I encountered was when my CloudStack hosts acquired hostnames in DNS, a few days after the cloud was set up. The NFS server providing my primary storage had the cloud machines’ IP addresses in its /etc/exports ACL to ensure they could mount and write with no root squashing. But when the hosts entered DNS, the server started rejecting their mount requests and write updates too! (I was using a wild-card for the hosts’ IP addresses for access control, which may explain why the hosts were rejected after acquiring hostnames; this NFS guide says: Do not use wildcards in IP addresses, as they are intermittent in IP addresses..) This led to the quizzical error when trying to start a new VM:
2012-10-18 22:18:06,516 DEBUG [storage.allocator.AbstractStoragePoolAllocator] (Job-Executor-93:job-522) Cannot allocate this pool 207 for storage since its allocated percentage: Infinity has crossed the allocated pool.storage.allocated.capacity.disablethreshold: 0.85, skipping this pool
Luckily, again I was not the first to encounter this weird Infinity issue; indeed, like the previous poster, my MySQL database had 0 for the allocated and available bytes of my primary storage, and after fixing /etc/exports all was happy.
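For the /etc/exports fix, listing each host explicitly avoids the wildcard trap. Below is a minimal Python sketch that renders one exports line from a list of client addresses and refuses wildcards outright; the export path and the second address are hypothetical, and the rw,no_root_squash options simply mirror the "write with no root squashing" requirement described above.

```python
from typing import Iterable, Sequence


def exports_line(
    path: str,
    hosts: Iterable[str],
    options: Sequence[str] = ("rw", "no_root_squash"),
) -> str:
    """Render one /etc/exports line granting each listed client the same options."""
    opts = ",".join(options)
    clients = []
    for host in hosts:
        if "*" in host or "?" in host:
            raise ValueError("wildcards are not accepted here: %r" % host)
        clients.append("%s(%s)" % (host, opts))
    return "%s %s" % (path, " ".join(clients))
```

For example, exports_line("/export/primary", ["172.20.7.136"]) renders a line for a single agent; after editing /etc/exports the NFS server still needs to re-read it (exportfs -ra on CentOS).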
Adding Hosts
For a few quick tests, I had removed one of my agent nodes and came across this strange issue when trying to add the host back to the original zone:
libvir: Storage error : Storage pool not found: no pool with matching uuid
2012-10-15 03:52:00,287{GMT} WARN [utils.nio.Task] (Agent-Handler-1:) Caught the following exception but pushing on
java.lang.NullPointerException
    at com.cloud.agent.storage.LibvirtStorageAdaptor.createStoragePool(LibvirtStorageAdaptor.java:504)
    at com.cloud.agent.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:57)
    at com.cloud.agent.resource.computing.LibvirtComputingResource.initialize(LibvirtComputingResource.java:2978)
    at com.cloud.agent.Agent.sendStartup(Agent.java:316)
    at com.cloud.agent.Agent$ServerHandler.doTask(Agent.java:846)
    at com.cloud.utils.nio.Task.run(Task.java:79)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
This had me puzzled for a bit, as I had not needed to use virsh(1) much before this adventure. It seems the old storage pools were still present, which was preventing their being re-created. I believe the error was akin to:
[clayb@cloud_machine ~]$ sudo virsh pool-create /tmp/test_pool.xml
error: Failed to create pool from /tmp/test_pool.xml
error: operation failed: Storage source conflict with pool: '031d9475-063d-30b5-b910-7ee710ff81b0'
Eventually, I did the following to good success:
- Checked if virsh reported any storage pools in existence (since this host was not successfully added, it should not have had any) — virsh pool-list
- Ensured all pools were destroyed with virsh pool-destroy <pool> (followed by virsh pool-undefine <pool> for any pool with a persistent definition)
- Cleaned-up any residual files in /etc/libvirt/storage/
- Cleaned-up my residual files in my machine’s local storage volume /var/lib/libvirt/images
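The virsh part of that cleanup can be sketched in Python as below. It parses the table printed by virsh pool-list --all and builds (but does not run) the commands to stop each pool and drop its definition; pool-destroy and pool-undefine are the libvirt subcommands for those two actions. The parsing assumes the usual Name/State/Autostart table layout, and the function names are my own.

```python
from typing import List


def parse_pool_list(output: str) -> List[str]:
    """Extract pool names from the table printed by `virsh pool-list --all`."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line
        if fields[:2] == ["Name", "State"]:
            continue  # header row
        if set(line.strip()) == {"-"}:
            continue  # separator row of dashes
        names.append(fields[0])
    return names


def cleanup_commands(pool_names: List[str]) -> List[List[str]]:
    """Build the virsh commands that stop each pool and remove its definition."""
    commands = []
    for name in pool_names:
        commands.append(["virsh", "pool-destroy", name])
        commands.append(["virsh", "pool-undefine", name])
    return commands
```

Some of these commands are expected to fail harmlessly (an inactive pool cannot be destroyed, and a transient pool disappears once destroyed), so I would run them one by one and read the output rather than abort on the first error.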
Passwords
From what I have seen in the MySQL database and configuration files, CloudStack encrypts passwords; indeed, this is a change for the 3.0 release. To encrypt passwords like CloudStack, one can do the following.
Admin Password Reset
When needing to reset the administrator password for CloudStack, one must resort to modifying the MySQL database, but the procedure is quite painless.
System VM passwords
There is a useful setting, system.vm.random.password, if you want to ensure the system VMs are only accessible via SSH key. I have verified that the /etc/shadow file has a different hash for root between system VM instances, but I did have a problem the first time I set this. I got the following log message after seeing the management server was in a wonky state (no MySQL logins worked):
[cbaenziger1@cloud_machine ~]$ grep 'Error while decrypting:' /var/log/cloud/management/management-server.log
2012-10-15 06:27:28,120 DEBUG [utils.crypt.DBEncryptionUtil] (main:null) Error while decrypting: VG3fYbhx
Your failed decrypting string will likely vary; mine did! I verified and tried resolving the issue by doing the following:
mysql> USE cloud;
mysql> SELECT name,value FROM configuration WHERE value LIKE "%VG3fYbhx%";
| system.vm.password | VG3fYbhx |
mysql> UPDATE configuration SET value = "false" WHERE name = "system.vm.random.password";
Query OK, 1 row affected (0.07 sec)
Rows matched: 1 Changed: 1 Warnings: 0
But, I still had issues starting the Management Server:
2012-10-15 06:38:28,977 DEBUG [utils.crypt.DBEncryptionUtil] (main:null) Error while decrypting: VG3fYbhx
2012-10-15 06:38:28,978 ERROR [utils.component.ComponentLocator] (main:null) Unable to load configuration for management-server from components.xml
net.sf.cglib.core.CodeGenerationException: org.jasypt.exceptions.EncryptionOperationNotPossibleException-->null
Realizing that the value in system.vm.password did not look like an encrypted password, I looked in the database for another encrypted string I could use and ended up copying the value from secstorage.copy.password. Then I could start the Management Server. I have since re-enabled system.vm.random.password, but I do not see the value in system.vm.password changing.
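The grep I used above can be taken one step further: pull out just the strings that failed to decrypt, so they can be looked up in the configuration table. A small Python sketch, assuming the "Error while decrypting:" message format seen in my management-server.log; the function name is hypothetical.

```python
import re
from typing import Iterable, List

# Captures the value that DBEncryptionUtil reported it could not decrypt.
DECRYPT_ERROR = re.compile(r"Error while decrypting: (?P<value>\S+)")


def failing_values(lines: Iterable[str]) -> List[str]:
    """Return each distinct undecryptable value, in order of first appearance."""
    seen: List[str] = []
    for line in lines:
        match = DECRYPT_ERROR.search(line)
        if match and match.group("value") not in seen:
            seen.append(match.group("value"))
    return seen
```

Each value returned is a candidate for the SELECT ... WHERE value LIKE query shown earlier, which is how I traced mine back to system.vm.password.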
Default Passwords
I have also seen one security disclosure on CloudStack. And while CloudStack seems solid, I will certainly keep my cloud infrastructure off the hostile Internet, just as I do my Hadoop cluster (which as of CDH3U5 does not have such holistic security).
Make sure to change the default for the admin user too!