Ever felt bewildered when troubleshooting a problem with your CentOS server? A handful of tools and best practices for troubleshooting can help you solve most of your problems swiftly and efficiently.
Some problems are easier than others to diagnose. For example, it's clear what the cause (and the solution) is when you see an error such as Web application cannot write to /var/www/cache/. Unfortunately, such detailed output is rarely available because it may raise performance or security issues.
In many production environments, detailed logging that shows the result of every operation is too expensive, in terms of additional CPU and disk input/output operations. Nevertheless, checking the logs is the first step to follow once you realize that the solution to a problem is not obvious. By default, all logs in CentOS are located in the directory /var/log/. Some important log files are:
Logs differ between applications and services but most of them describe events specifying the time of occurrence, a severity level and a message. An example log entry from a MySQL server looks like this:
120503 7:34:22 [ERROR] Can't create IP socket: Too many open files in system
The first columns show the date (120503, i.e. 3 May 2012) and time. ERROR is the severity level. Severity rises from debug, info, and notice, which are informative rather than describing an issue, through warning and error, which point to an actual problem, to the most alarming levels: critical, alert, and emergency.
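As a rough sketch of how to pull only the problem entries out of a busy log, the helper below keeps lines whose bracketed severity tag is warning or worse. The function name filter_problems is made up for illustration, and the pattern assumes the "[LEVEL]" tag format of the MySQL example above; other services format severities differently.

```shell
# filter_problems: print only log lines tagged warning or worse.
# Reads the files given as arguments, or standard input if none are given.
# Assumption: severities appear as bracketed tags such as [ERROR] or [Warning].
filter_problems() {
    grep -iE '\[(warn|warning|error|crit|critical|alert|emerg|emergency)\]' "$@"
}
```

You could then run something like filter_problems /var/log/mysqld.log, where the log path is an assumption that depends on how MySQL is configured on your server.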
If the logs you examine don't reveal much information, try increasing the logging level by changing the configuration of a service or application you are troubleshooting. For example, if you are debugging web problems with Apache, edit the file /etc/httpd/conf/httpd.conf and change LogLevel warn to LogLevel debug. Most services and applications support debug level, which makes the logger describe in detail what is happening with a service.
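If you prefer not to edit the file by hand, the change can be scripted. The following is a minimal sketch rather than an official procedure: set_loglevel is a hypothetical helper name, and it assumes the file contains a single active (uncommented) LogLevel directive.

```shell
# set_loglevel: rewrite the active LogLevel directive in an Apache config file.
#   $1 = new level (for example debug or warn)
#   $2 = path to the config file
# Commented-out lines are left alone, and the original file is kept
# next to the edited one with a .bak suffix.
set_loglevel() {
    sed -i.bak "s/^\([[:space:]]*LogLevel[[:space:]][[:space:]]*\).*/\1$1/" "$2"
}
```

For example, set_loglevel debug /etc/httpd/conf/httpd.conf followed by service httpd restart should switch Apache to debug logging. Remember to set the level back to warn once you are done, because debug output grows quickly.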
Of course, information from logs is not always sufficient to help you zero in on the cause of a problem. After all, logs contain mostly predefined messages about break points that have been designed by programmers. That's why sometimes logs not only lack information but may also contain "unknown" errors or misleading messages.
You may run into problems if an application or service is unable to connect to your network. Check whether the application is really listening on the IP address and port that you expect by executing the command netstat -ntulp, where n is for numeric output, t is for TCP, u is for UDP, l is for listening sockets, and p also shows the PID and program name along with the rest of the attributes. Here is an example of output you might see:
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address    Foreign Address  State   PID/Program name
tcp        0      0 0.0.0.0:22       0.0.0.0:*        LISTEN  2661/sshd
tcp        0      0 127.0.0.1:25     0.0.0.0:*        LISTEN  3546/master
Two things are important in the above example. First, sshd is listening on all local addresses (0.0.0.0) on port 22. This is how it is supposed to look when it works properly. However, master, which is the Postfix mail daemon, is listening only on the local 127.0.0.1 address. This is a typical example of a problem; remote clients would not be able to connect to and use the mail server. To fix it you would have to edit one of Postfix's configuration files, /etc/postfix/main.cf, and change inet_interfaces = localhost to inet_interfaces = all.
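On a server with many services, spotting loopback-only listeners by eye gets tedious. The sketch below filters netstat output for sockets bound to 127.x.x.x or ::1. The function name loopback_only is invented for this example, and it assumes the column layout shown above, with the local address in the fourth column and the program in the last.

```shell
# loopback_only: read "netstat -ntulp" output on standard input and print
# the program and local address of every socket bound only to loopback.
# Assumption: column 4 is the local address and the last column is PID/Program.
loopback_only() {
    awk '$1 ~ /^(tcp|udp)/ && $4 ~ /^(127\.|::1:)/ { print $NF, $4 }'
}
```

Run it as netstat -ntulp | loopback_only. With the sample output above it would report 3546/master 127.0.0.1:25, which is the Postfix listener that remote clients cannot reach.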
Sometimes programs don't listen on the interfaces (usually the external ones) or ports they are supposed to, as with the mail server in the above example. That may be intentional: binding only to the local interface or using a non-standard port can be a good security measure against unauthorized access. Other times the port specified for the application is the problem. The port could already be in use, or the application may lack the privileges needed to bind to it; ports below 1024 require the application to be started with superuser privileges.
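To check whether a port is already taken before you start a service on it, you can search the same netstat output by port number. This is only a sketch: port_owner is a hypothetical helper, and it assumes the netstat -ntulp layout with the local address in the fourth column.

```shell
# port_owner: read "netstat -ntulp" output on standard input and print the
# PID/program of whatever is listening on the port given as $1.
# Prints nothing if the port is free.
port_owner() {
    awk -v port="$1" '
        $1 ~ /^(tcp|udp)/ {
            # The port is the part after the last colon (works for IPv6 too).
            n = split($4, parts, ":")
            if (parts[n] == port) print $NF
        }'
}
```

For instance, netstat -ntulp | port_owner 80 prints the process holding port 80, if any. Run netstat as root, otherwise the program column shows only a dash for processes you do not own.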
Next, check the firewall. CentOS has a strict firewall policy enabled by default that allows only SSH connections from outside and blocks external access to any other installed services. You can check the current firewall rules using the command /sbin/iptables -L -n, where L says to list the rules and n specifies numeric output. Here is the output from a default CentOS 6 iptables setup:
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED
ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           state NEW tcp dpt:22
REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
This shows that only ssh is allowed for new incoming connections, while everything is allowed for outgoing connections. In more detail, there are three chains: INPUT for incoming packets, FORWARD for routed packets, and OUTPUT for outgoing packets. Next to each chain name is the default policy, which is ACCEPT here, meaning that a packet matching none of the chain's rules is allowed. Rules are evaluated from top to bottom and the first match wins, so the final REJECT rule in INPUT blocks everything that the ACCEPT rules before it did not let through. If the policy were DROP instead, unmatched packets would be discarded and the rules would only need to list what to allow. For more information, check the iptables howto.
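When the rule list is long, it can help to reduce it to just the destination ports that are explicitly accepted. The helper below is a sketch under the assumption that rules carry a dpt:PORT field, as in the listing above; the name allowed_ports is made up for this example, and rules that use port ranges or the multiport match are not handled.

```shell
# allowed_ports: read "iptables -L INPUT -n" output on standard input and
# print the destination port of every ACCEPT rule that names one.
allowed_ports() {
    awk '$1 == "ACCEPT" {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^dpt:/) print substr($i, 5)   # strip the "dpt:" prefix
    }'
}
```

Running /sbin/iptables -L INPUT -n | allowed_ports against the default ruleset prints just 22. If the port your service uses is missing from the list, the firewall is a likely suspect.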
The fastest way to eliminate the firewall as the cause of your problem is to disable it by executing the command service iptables stop, but think twice before you disable software designed to keep your system safe from malicious outsiders. If disabling the firewall resolves your issue, amend the firewall rules and save the changes. Here are the commands that enable TCP port 80 for incoming connections to Apache in CentOS 6, save the changes, and restart the firewall service:
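/sbin/iptables -I INPUT 5 -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
service iptables save
service iptables restart
The rule is inserted (-I) at position 5 rather than appended (-A) so that it lands ahead of the final REJECT rule; a rule appended after REJECT would never be reached. Position 5 matches the default ruleset shown above, so adjust it if your rules differ.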
The next step is to investigate running processes and the commands and files associated with them. You may unearth a stale or hung process that has to be killed before another application can start properly. Knowing all commands and files associated with a process can tell you how the process was started and by whom, which may help you troubleshoot security problems.
First, use the command /bin/ps auxf to list the current processes. The arguments tell ps to list all processes (ax) along with their owners (u) as a process tree (f), so you can see the command that started each process and which parent spawned it. If the list is too long, pipe the output to grep to search for something specific.
Make sure that the owner has the correct access privileges to all necessary files. For example, an Apache web process under the name nobody should have full permissions for all files inside its document root directory. That's why the most practical and secure thing to do is to make the user nobody the owner of these files.
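To see at a glance what a given account is running, you can filter the ps listing on its first column instead of grepping the whole line, which avoids false hits on command arguments. This is a sketch only, and procs_of_user is a hypothetical helper name.

```shell
# procs_of_user: read "ps aux" (or "ps auxf") output on standard input and
# keep the header line plus every process owned by the user given as $1.
procs_of_user() {
    awk -v user="$1" 'NR == 1 || $1 == user'
}
```

For example, /bin/ps auxf | procs_of_user nobody lists the processes running as nobody. To find files in the document root that this user does not own, something like find /var/www/html ! -user nobody should work, where /var/www/html is an assumed document root that may differ on your server.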
Knowing what files are associated with a program is useful because problems with a program's files can reflect on the program itself. The best way to see what files a program has open is to run /usr/sbin/lsof. Without any additional arguments, lsof lists all open files system-wide for all users. You can refine the output by grepping for a specific command (process name), user, or file name.
One common Linux situation is having files locked by one process, preventing them from being written by another. lsof can debug such cases; you can grep its output for the locked file name. If the locking process is responsive, stop it by running its initialization script: service name_of_service stop. If it's not, you can kill the process by executing kill -9 pid_of_process.
lsof shows even deleted files associated with currently running processes. This may help you to troubleshoot situations with files deleted on purpose, as may happen if you've been the victim of a security breach. Checking deleted files is also useful when troubleshooting free disk space problems. Even though a file has been deleted, its disk space might not have been reclaimed yet because a process has not fully released it. In such cases you have to stop the process to let the system reclaim the deleted file's space.
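A quick way to find out which processes are holding deleted files is to filter the lsof output for entries marked "(deleted)". The following is a sketch that assumes the usual lsof layout with the command name in the first column and the PID in the second; holders_of_deleted is an invented name.

```shell
# holders_of_deleted: read lsof output on standard input and print one
# "PID COMMAND" line for each process that still holds a deleted file open.
holders_of_deleted() {
    awk '/\(deleted\)$/ { print $2, $1 }' | sort -u
}
```

Run it as /usr/sbin/lsof | holders_of_deleted. Stopping or restarting the listed services lets the system reclaim the space.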
Finally, strace can help you trace system calls and signals. Use it by prepending it to the command you are debugging, like this: /usr/bin/strace some_command. Missing files show up as "No such file or directory" messages. However, these errors do not necessarily indicate a problem, because a file may legitimately be absent when a feature is disabled or a package is not installed, for example.
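Because strace output is verbose, it helps to reduce it to the paths that could not be found. The helper below is a rough sketch: missing_files is a hypothetical name, and it assumes the common strace line format in which the first quoted argument is the path and a failed call ends in "= -1 ENOENT".

```shell
# missing_files: read strace output on standard input and print the first
# quoted argument (normally the path) of every call that failed with ENOENT.
missing_files() {
    sed -n 's/^[^"]*"\([^"]*\)".*= -1 ENOENT.*$/\1/p'
}
```

Since strace writes its trace to standard error, the usage would be /usr/bin/strace some_command 2>&1 | missing_files. Expect some harmless entries in the result, such as optional configuration files that were never created.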
Problems with missing dependencies usually occur when your operating system or applications have been built via custom installations. CentOS installations using yum are straightforward because all dependencies are resolved automatically. However, when you perform manual installations, you can run into missing libraries and programs. It gets even worse when you are running a 64-bit version of CentOS and you are installing a program that has 32-bit dependencies. An example of such an error is some_app: error while loading shared libraries: libstdc++.so.5: cannot open shared object file: No such file or directory.
To rectify the problem, carefully read the installation manual and ensure that all required packages are installed. If you are not sure what package provides a given file or library, use yum whatprovides. If, for instance, you got the error about the missing libstdc++.so.5, execute yum whatprovides */libstdc++.so.5, and you will see that this file is provided by the compat-libstdc++ package.
If you have already installed the package that should provide the missing file, check whether the custom installation requires 32-bit libraries. The easiest way to do this is to run the command ldd, as in ldd /full/path_to_custom_command, and check whether the dependencies are searched for in /lib/ instead of /lib64/. In CentOS you can install packages for both 32- and 64-bit architectures side by side, even though this is not a good practice from a performance or stability point of view.
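ldd marks every library it cannot resolve with "not found", so the unresolved ones are easy to extract. This is a minimal sketch, with missing_libs being a made-up helper name.

```shell
# missing_libs: read ldd output on standard input and print the name of
# every shared library that the dynamic linker could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

For example, ldd /full/path_to_custom_command | missing_libs prints one library name per line, and each name can then be passed to yum whatprovides to find the package that ships it.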
In some extreme cases you may not be able to perform any of the above troubleshooting steps because your system has crashed and cannot even boot. In such cases, turn to the CentOS installation disk and its rescue mode.
First, prepare to boot your system from installation media containing a CentOS installation image. Even the minimal installation image is good for this. Then, from the first boot-up menu, choose "Rescue installed system" and follow the wizard that guides you through the process. At one point you are informed that the rescue environment will try to find your installation and mount it in /mnt/sysimage/. Choose "Continue" so that the system is mounted in read/write mode.
When the rescue wizard completes, choose to start a shell in the final step. This gives you a bash shell, and you can enter your damaged installation by chrooting into it with the command chroot /mnt/sysimage/. After that you can work as if your old system had booted properly and perform all the troubleshooting you need.
All of these steps we've talked about may be helpful for you even if you have minimal CentOS background, though the better you understand CentOS, the more efficiently you can troubleshoot problems with it.