Module 15: System Troubleshooting & Recovery

1. Learning Objectives

Master systematic troubleshooting methodologies
Diagnose and resolve boot, network, and service failures
Analyze system performance issues and resource constraints
Recover from filesystem corruption and storage failures
Troubleshoot SELinux, firewall, and security-related issues
Implement preventive maintenance and monitoring strategies
Document troubleshooting procedures and solutions

2. Key Concepts

Troubleshooting Methodology

Problem identification: Define symptoms and scope
Information gathering: Collect system state and logs
Analysis: Correlate data and identify root causes
Solution implementation: Apply fixes systematically
Verification: Confirm problem resolution
Documentation: Record solutions for future reference
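
The documentation step is easiest to keep up when notes are recorded as each phase happens. A minimal sketch, assuming a hypothetical helper `note` and a hypothetical `INCIDENT_LOG` variable (neither belongs to any standard tool):

```shell
# Hypothetical helper: append a timestamped, phase-tagged note to an incident log.
# INCIDENT_LOG is an assumed variable name; set it to choose the file.
INCIDENT_LOG="${INCIDENT_LOG:-./incident-notes.log}"

note() {
    # usage: note PHASE "free-form text"
    phase="$1"; shift
    printf '%s [%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$phase" "$*" >> "$INCIDENT_LOG"
}
```

Possible usage: `note identify "httpd not responding since 14:00"`, then `note verify "curl -I returns 200"` once the fix is confirmed.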

System State Analysis

Boot process: GRUB, kernel, systemd initialization
Service status: systemd unit states and dependencies
Resource utilization: CPU, memory, disk, network usage
Log analysis: System and application log examination
Configuration validation: Syntax and logical correctness

Recovery Techniques

Boot recovery: Rescue mode, emergency mode, single-user mode
Filesystem repair: fsck, xfs_repair, data recovery
Service restoration: Dependency resolution, configuration fixes
Network recovery: Interface configuration, routing, DNS
Security recovery: SELinux troubleshooting, permission fixes

Diagnostic Tools

System information: lscpu, lsmem, lsblk, lspci, lsusb
Performance monitoring: top, htop, iotop, vmstat, iostat
Network diagnostics: ping, traceroute, netstat, ss, tcpdump
Storage analysis: df, du, lsof, fuser, blkid
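
Some of these tools may not be present on a minimal installation, so it can help to check before an incident. A small sketch, with `missing_tools` as a hypothetical helper name:

```shell
# Print the name of each listed command that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Possible usage: `missing_tools lscpu lsblk iostat iotop traceroute tcpdump` prints one line per tool that still needs installing.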

3. Essential Commands

System Information Gathering

# Hardware information
lscpu                                               # CPU information
lsmem                                               # Memory information
lsblk                                               # Block device information
lspci                                               # PCI device information
lsusb                                               # USB device information
dmidecode                                           # DMI/SMBIOS information

# System state
uptime                                              # System uptime and load
uname -a                                            # Kernel and system info
hostnamectl                                         # System hostname info
timedatectl                                         # Time and timezone info
systemctl status                                    # Overall system status

Process and Resource Analysis

# Process monitoring
ps aux                                              # Process snapshot
ps -ef --forest                                    # Process tree
top -b -n1                                         # One-time top output
htop                                                # Interactive process viewer
pstree                                              # Process tree visualization

# Resource utilization
free -h                                             # Memory usage
df -h                                               # Disk usage
du -sh /path/*                                     # Directory sizes
lsof                                                # Open files
fuser -v /path/file                                # Processes using file
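
Disk usage output is easier to act on when filtered against a threshold. The sketch below reads `df -P` output on standard input; the function name `full_filesystems` and the default of 90 percent are illustrative assumptions, and mount points containing spaces are not handled:

```shell
# Read `df -P` output on stdin; print mount points at or above a usage threshold.
# The default of 90 percent is an arbitrary illustration value.
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # "93%" -> "93"
            if (use + 0 >= limit + 0) print $6, use "%"
        }'
}
```

Possible usage: `df -P | full_filesystems 85`.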

Network Diagnostics

# Network connectivity
ping -c 4 target                                    # Test connectivity
traceroute target                                   # Route tracing
mtr target                                          # Combined ping/traceroute

# Network configuration
ip addr show                                        # Interface addresses
ip route show                                       # Routing table
ss -tulnp                                          # Socket statistics
netstat -rn                                        # Routing table (legacy)

# DNS resolution
nslookup hostname                                   # DNS lookup
dig hostname                                        # Detailed DNS query
host hostname                                       # Simple DNS lookup

Service Troubleshooting

# Service analysis
systemctl status service_name                       # Service status
systemctl is-active service_name                   # Check if active
systemctl is-enabled service_name                  # Check if enabled
systemctl list-dependencies service_name           # Service dependencies

# Configuration validation
nginx -t                                            # Nginx config test
httpd -t                                            # Apache config test
sshd -t                                             # SSH config test (-T also prints the effective settings)
postfix check                                       # Postfix config test
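
These validation commands are most useful when they gate a restart, so a broken configuration never replaces a running service. A minimal sketch, with `validate_then` as a hypothetical helper that takes both commands as strings:

```shell
# Run a validation command; run the follow-up action only if validation succeeds.
# usage: validate_then "VALIDATION COMMAND" "ACTION COMMAND"
validate_then() {
    if sh -c "$1"; then
        sh -c "$2"
    else
        echo "validation failed; skipped: $2" >&2
        return 1
    fi
}
```

Possible usage: `validate_then "httpd -t" "systemctl restart httpd"`.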

Log Analysis

# System logs
journalctl -b                                       # Current boot logs
journalctl --since "1 hour ago"                    # Recent logs
journalctl -u service_name                         # Service-specific logs
journalctl -p err                                  # Error-level messages
journalctl -f                                      # Follow logs (tail -f)

# Traditional logs
tail -f /var/log/messages                          # System messages
grep -i error /var/log/messages                    # Error messages
awk '/ERROR/ {print $1, $2, $3, $NF}' /var/log/secure  # Extract error info
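
When a log contains many error lines, grouping them by the program that wrote them shows where to look first. The sketch below assumes the traditional syslog layout (`Mon DD HH:MM:SS host program[pid]: message`); `errors_by_program` is a hypothetical name:

```shell
# Read syslog-style lines on stdin; count case-insensitive "error" matches per program.
# Assumes the traditional "Mon DD HH:MM:SS host program[pid]: msg" layout.
errors_by_program() {
    awk 'tolower($0) ~ /error/ {
             tag = $5
             sub(/\[[0-9]+\]:?$/, "", tag)   # strip a "[pid]:" suffix
             sub(/:$/, "", tag)              # strip a bare trailing colon
             count[tag]++
         }
         END { for (t in count) print count[t], t }' | sort -k1,1nr -k2
}
```

Possible usage: `errors_by_program < /var/log/messages`.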

4. Asghar Ghori's Approach

Systematic Problem Analysis

Ghori emphasizes a structured troubleshooting workflow:

# Step 1: Problem definition and scope
echo "Problem: Service X not responding"
echo "Scope: Single service on one server"
echo "Impact: Users cannot access application"
echo "Timeline: Started 2 hours ago"

# Step 2: Initial system health check
uptime && free -h && df -h
systemctl --failed                                 # Failed services
journalctl -p err --since "3 hours ago"           # Recent errors

# Step 3: Service-specific analysis
systemctl status httpd
journalctl -u httpd --since "3 hours ago"
httpd -t                                           # Config validation

Boot Troubleshooting Methodology

Ghori's systematic boot problem resolution:

# Boot analysis workflow
# 1. Identify boot stage failure
dmesg | grep -i error                              # Kernel messages
journalctl -b | grep -i fail                      # Boot failures

# 2. GRUB issues
grub2-editenv list                                 # Check default kernel
grub2-mkconfig -o /boot/grub2/grub.cfg            # Regenerate GRUB config

# 3. Filesystem issues
mount | grep " / "                                 # Check root filesystem
fsck /dev/sda1                                     # Filesystem check (unmounted)

# 4. Service startup issues
systemctl list-units --failed                     # Failed services
systemctl list-jobs                               # Pending jobs

Network Troubleshooting Steps

# Ghori's network diagnosis process
# 1. Physical/Link layer
ip link show                                       # Interface status
ethtool eth0                                       # Interface details (use the name shown by ip link)

# 2. Network layer
ip addr show                                       # IP configuration
ip route show                                      # Routing table
ping -c 4 gateway_ip                              # Gateway connectivity

# 3. Application layer
ss -tulnp | grep :80                              # Service listening
curl -I http://localhost                          # Local service test
nmap -p 80 target_server                          # Remote service test

5. Sander van Vugt's Approach

Advanced Diagnostic Techniques

Van Vugt focuses on deep system analysis:

# Comprehensive system performance analysis
# 1. CPU analysis
vmstat 1 10                                        # CPU and memory stats
mpstat 1 5                                         # Per-CPU statistics
sar -u 1 10                                        # CPU utilization over time

# 2. Memory analysis
vmstat -s                                          # Memory statistics
slabtop                                            # Kernel slab allocation
cat /proc/meminfo                                  # Detailed memory info

# 3. I/O analysis
iostat -x 1 10                                     # Extended I/O statistics
iotop                                              # I/O by process
lsof +D /path                                      # Files open in directory
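
To turn a `vmstat` run into a single number, the I/O wait column can be averaged. This sketch finds the `wa` column by its header name rather than a fixed position, since the column layout differs between versions; `avg_iowait` is a hypothetical name. Note that the first sample `vmstat` prints is an average since boot, so it is included in the mean:

```shell
# Read `vmstat INTERVAL COUNT` output on stdin; print the mean of the "wa" column.
# The column is located by header name because its position varies between versions.
avg_iowait() {
    awk '
        col == 0 {                       # still looking for the header row
            for (i = 1; i <= NF; i++) if ($i == "wa") col = i
            next
        }
        $col ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n) printf "%.1f\n", sum / n; else print "no samples" }'
}
```

Possible usage: `vmstat 1 10 | avg_iowait`.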

Root Cause Analysis Framework

Van Vugt's systematic root cause identification:

# Multi-layer analysis approach
# 1. Hardware layer
dmesg | grep -i "hardware error"                   # Hardware issues
mcelog --client                                    # Machine check errors
smartctl -a /dev/sda                               # Disk health

# 2. Kernel layer
dmesg | grep -i "kernel"                           # Kernel messages
cat /proc/sys/kernel/tainted                       # Kernel taint status
modinfo module_name                                # Module information

# 3. Application layer
strace -p PID                                      # System call tracing
ltrace -p PID                                      # Library call tracing
gdb --pid PID                                      # Debug running process

Advanced Log Correlation

# Van Vugt's log correlation methodology
# 1. Timeline reconstruction
journalctl --since "2023-01-01 14:00" --until "2023-01-01 15:00" | head -50
aureport --start 01/01/2023 14:00:00 --end 01/01/2023 15:00:00

# 2. Multi-source correlation
# Combine system logs, application logs, and audit logs
tail -f /var/log/messages /var/log/secure /var/log/audit/audit.log

# 3. Pattern analysis
awk '/pattern/ {count++} END {print count}' /var/log/messages
grep -E "ERROR|CRITICAL|FATAL" /var/log/application.log | sort | uniq -c

6. Command Examples and Scenarios

Scenario 1: Service Startup Failure

# Problem: Web server won't start after system reboot
# Systematic diagnosis:

# 1. Check service status
systemctl status httpd
systemctl is-enabled httpd

# 2. Check configuration
httpd -t
ls -la /etc/httpd/conf/httpd.conf

# 3. Check dependencies
systemctl list-dependencies httpd
systemctl status network.target

# 4. Check logs
journalctl -u httpd -b                             # Service logs from the current boot
grep httpd /var/log/messages

# 5. Check ports and firewall
ss -tulnp | grep :80
firewall-cmd --list-services
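
A frequent cause of this failure is another process already bound to the port. The sketch below reads `ss` output on standard input and reports whether a TCP listener exists on a given port; `port_in_use` is a hypothetical name, and UDP sockets (shown as UNCONN) are deliberately not matched:

```shell
# Read `ss -tln` or `ss -tulnp` output on stdin; succeed if a TCP socket
# listens on the given port. The local address is the third field after
# the LISTEN state (e.g. 0.0.0.0:80, [::]:80, *:80).
port_in_use() {
    awk -v port="$1" '
        {
            for (i = 1; i <= NF; i++)
                if ($i == "LISTEN" && $(i + 3) ~ (":" port "$")) found = 1
        }
        END { exit found ? 0 : 1 }'
}
```

Possible usage: `ss -tln | port_in_use 80 && echo "port 80 already has a listener"`.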

Scenario 2: System Performance Degradation

# Problem: System running slowly, high load average
# Performance analysis:

# 1. Overall system health
uptime                                             # Load averages
free -h                                            # Memory usage
df -h                                              # Disk space

# 2. Process analysis
top -b -n1 -o %CPU | head -15                     # CPU-intensive processes
ps aux --sort=-%mem | head -10                    # Memory-intensive processes

# 3. I/O analysis
iostat -x 1 5                                     # I/O wait times
iotop -ao                                          # I/O by process

# 4. Network analysis
ss -s                                              # Socket summary
netstat -i                                         # Interface statistics
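
A load average only means something relative to the number of CPUs. As a rough sketch (the helper name `load_per_cpu` is an assumption), dividing the one-minute load by the CPU count gives a comparable figure:

```shell
# Print the load average divided by the CPU count, to two decimals.
# usage: load_per_cpu LOAD NCPU
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        if (ncpu + 0 <= 0) { print "invalid CPU count" > "/dev/stderr"; exit 1 }
        printf "%.2f\n", load / ncpu
    }'
}
```

Possible usage: `load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`. A value that stays above 1.00 suggests more runnable or I/O-blocked tasks than CPUs, since the Linux load average also counts tasks in uninterruptible sleep.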

Scenario 3: Boot Failure Recovery

# Problem: System won't boot, dropped to emergency shell
# Recovery procedure:

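# 1. Check filesystem integrity (never check a filesystem that is mounted read-write)
fsck /dev/sda2                                     # Check root filesystem (ext4; for XFS use xfs_repair on the unmounted device)
mount -o remount,rw /                              # Then remount root writable to edit files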

# 2. Check and fix fstab
cat /etc/fstab                                     # Review mount points
blkid                                              # Verify UUIDs

# 3. Regenerate initramfs if needed
dracut -f                                          # Force regenerate initramfs

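# 4. Fix GRUB if necessary
# grub2-install applies to BIOS systems; on UEFI, reinstall the grub2-efi and shim packages instead
grub2-install /dev/sda
grub2-mkconfig -o /boot/grub2/grub.cfg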
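
For step 2, comparing `/etc/fstab` with `blkid` by eye is error-prone. The sketch below lists every `UUID=` entry in an fstab file that does not appear in saved `blkid` output; `missing_fstab_uuids` is a hypothetical name, and the leading space in the pattern keeps `PARTUUID=` values from matching:

```shell
# Compare UUID= entries in an fstab file against UUIDs reported by blkid.
# usage: missing_fstab_uuids FSTAB_FILE BLKID_OUTPUT_FILE
# Prints each UUID that fstab references but blkid did not report.
missing_fstab_uuids() {
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); gsub(/"/, "", $1); print $1 }' "$1" |
    while read -r uuid; do
        grep -q " UUID=\"$uuid\"" "$2" || printf '%s\n' "$uuid"
    done
}
```

Possible usage: `blkid > /tmp/blkid.out` followed by `missing_fstab_uuids /etc/fstab /tmp/blkid.out`. Any UUID printed points at an fstab line worth correcting or commenting out.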

7. Lab Exercises

Lab 15A: Service and Configuration Troubleshooting (Ghori-focused)

Time Limit: 30 minutes
Objective: Diagnose and resolve common service configuration issues

Prerequisites:
- RHEL 9 system with intentionally misconfigured services
- Apache httpd and SSH services installed

Setup (Instructor creates these issues):
1. Apache httpd service fails to start due to configuration syntax error
2. SSH service running but refusing connections due to permission issue
3. Network service configured with conflicting IP addresses
4. Cron service not executing jobs due to permission problems

Tasks:
1. Identify and fix Apache configuration syntax error
2. Resolve SSH connection issues and verify remote access
3. Correct network configuration conflicts
4. Troubleshoot and fix cron job execution problems
5. Document all findings and solutions

Verification Commands:

systemctl status httpd sshd                        # Service status
curl http://localhost                               # Test web service
ssh localhost id                                   # Test SSH access
ip addr show                                        # Network configuration
crontab -l && grep CRON /var/log/cron             # Cron verification

Lab 15B: Performance and Resource Troubleshooting (van Vugt-focused)

Time Limit: 35 minutes
Objective: Analyze and resolve system performance issues using advanced diagnostic techniques

Prerequisites:
- RHEL 9 system with performance monitoring tools installed
- Simulated high load conditions

Setup (Instructor creates these conditions):
1. Memory leak causing system slowdown
2. High I/O wait times due to disk issues
3. Network connectivity problems affecting services
4. CPU-intensive process consuming resources

Tasks:
1. Identify memory leak source and implement solution
2. Diagnose and resolve I/O performance bottleneck
3. Troubleshoot network connectivity issues
4. Find and manage resource-intensive processes
5. Create monitoring strategy to prevent recurrence

Verification Commands:

free -h && vmstat 1 3                              # Memory status
iostat -x 1 3                                      # I/O performance
ss -tulnp && ping -c 4 8.8.8.8                    # Network status
top -b -n1 | head -15                             # Process overview

Lab 15C: Synthesis Challenge - Complete System Recovery

Time Limit: 45 minutes
Objective: Perform comprehensive system recovery using integrated troubleshooting methodologies

Prerequisites:
- RHEL 9 system with multiple simulated failures
- Access to rescue media and documentation

Setup (Multiple interconnected issues):
1. Boot failure due to corrupted filesystem
2. Network services not starting due to SELinux denials
3. Storage issues affecting application data
4. Security configuration preventing user access
5. Logging system failures hiding other issues

Tasks:
1. Recover system from boot failure using rescue mode
2. Resolve SELinux issues preventing service startup
3. Repair storage problems and recover application data
4. Fix security configuration issues
5. Restore logging functionality and analyze root causes
6. Implement preventive measures and monitoring
7. Create comprehensive incident report

Advanced Requirements:
- Combine both Ghori's systematic approach and van Vugt's deep analysis
- Use multiple diagnostic tools and correlation techniques
- Document complete recovery timeline and lessons learned

Verification Commands:

systemctl status && systemctl --failed             # Overall system health
mount && df -h                                     # Storage status
getenforce && ausearch -m AVC -ts recent          # SELinux status
journalctl --disk-usage && journalctl -p err      # Logging status
ss -tulnp | grep -E ":22|:80|:443"                # Critical services

8. Troubleshooting Common Issues

Boot Failure Scenarios

# GRUB not loading
# Symptoms: System boots directly to BIOS/UEFI
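# Solution: Reinstall GRUB bootloader (grub2-install applies to BIOS systems;
# on UEFI, reinstall the grub2-efi and shim packages instead)
grub2-install /dev/sda
grub2-mkconfig -o /boot/grub2/grub.cfg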

# Kernel panic
# Symptoms: Kernel panic messages, system halt
# Solution: Boot with older kernel or recovery mode
# From GRUB menu: select older kernel version

# Root filesystem corruption
# Symptoms: Cannot mount root filesystem
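# Solution: Boot to rescue mode and repair the filesystem while it is not mounted read-write
mount -o remount,ro /                              # Only needed if root is already mounted
fsck /dev/sda2                                     # ext4; replace with correct device
# XFS (the RHEL default): fsck does nothing; run xfs_repair on the unmounted device from rescue media
mount -o remount,rw /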
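
The right repair tool depends on the filesystem type: for XFS, the default on RHEL, `fsck.xfs` simply exits successfully and `xfs_repair` must be run on the unmounted device. The sketch below only prints a suggested command and never runs it; `repair_cmd` is a hypothetical name:

```shell
# Print (do not run) a suggested check command for a filesystem type.
# usage: repair_cmd FSTYPE DEVICE
repair_cmd() {
    case "$1" in
        xfs)            echo "xfs_repair $2" ;;
        ext2|ext3|ext4) echo "e2fsck -f $2" ;;
        *)              echo "unsupported filesystem type: $1" >&2; return 1 ;;
    esac
}
```

Possible usage: `repair_cmd "$(lsblk -no FSTYPE /dev/sda2)" /dev/sda2`, with the device name replaced by the real one.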

Network Connectivity Issues

# No network connectivity
# Symptoms: Cannot reach external hosts
# Diagnosis and resolution:

# 1. Check interface status
ip link show                                       # Interface up/down status
nmcli device status                                # NetworkManager status

# 2. Check IP configuration
ip addr show                                       # IP addresses assigned
ip route show                                      # Routing table

# 3. Test connectivity layers
ping -c 4 127.0.0.1                               # Loopback test
ping -c 4 gateway_ip                              # Gateway test
ping -c 4 8.8.8.8                                 # External IP test
ping -c 4 google.com                              # DNS resolution test

# 4. Fix common issues
systemctl restart NetworkManager                   # Restart network service
nmcli connection up connection_name                # Bring up connection
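
The layered tests in step 3 can be run in order until one fails, which names the layer to investigate. A minimal sketch: `first_failing_layer` and the `PROBE` override are hypothetical names, and the default external IP and DNS name are the ones used in step 3:

```shell
# Probe targets in order (loopback, gateway, external IP, DNS name) and report
# the first layer that fails. PROBE is an assumed override hook; by default it
# sends a single ping with a 2-second timeout.
PROBE="${PROBE:-ping -c 1 -W 2}"

first_failing_layer() {
    # usage: first_failing_layer GATEWAY_IP [EXTERNAL_IP] [DNS_NAME]
    for pair in "loopback:127.0.0.1" "gateway:$1" "external:${2:-8.8.8.8}" "dns:${3:-google.com}"; do
        layer="${pair%%:*}"; target="${pair#*:}"
        if ! $PROBE "$target" >/dev/null 2>&1; then
            echo "$layer"
            return 1
        fi
    done
    echo "none"
}
```

Possible usage: `first_failing_layer 192.168.1.1`. A result of `dns` with the earlier layers passing points at name resolution rather than routing.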

High Load and Performance Issues

# System running slowly
# Symptoms: High load average, slow response
# Analysis and solutions:

# 1. Identify resource constraints
uptime                                             # Load averages
free -h                                            # Memory availability
df -h                                              # Disk space

# 2. Find resource consumers
ps aux --sort=-%cpu | head -10                     # CPU usage
ps aux --sort=-%mem | head -10                     # Memory usage
iotop -ao                                          # I/O activity

# 3. Address specific issues
# Kill runaway processes
kill -TERM PID
kill -KILL PID                                     # If TERM doesn't work

# Clean up disk space
du -sh /var/log/* | sort -h                       # Find large log files
journalctl --vacuum-time=1week                    # Clean journal logs
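
The TERM-then-KILL sequence in step 3 works best with a grace period in between, so the process can shut down cleanly. A sketch, with `stop_process` as a hypothetical name and a 10-second default chosen only for illustration:

```shell
# Send SIGTERM, wait up to TIMEOUT seconds, then send SIGKILL if still alive.
# usage: stop_process PID [TIMEOUT_SECONDS]
stop_process() {
    pid="$1"; timeout="${2:-10}"
    kill -0 "$pid" 2>/dev/null || { echo "cannot signal $pid" >&2; return 1; }
    kill -TERM "$pid" 2>/dev/null
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0     # exited after TERM
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -KILL "$pid" 2>/dev/null                  # last resort
}
```

Possible usage: `stop_process 12345 15`.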

Service Dependencies and Failures

# Service won't start due to dependencies
# Symptoms: Service fails with dependency errors
# Resolution approach:

# 1. Check service dependencies
systemctl list-dependencies service_name
systemctl status dependency_service

# 2. Start dependencies manually
systemctl start dependency_service
systemctl enable dependency_service

# 3. Check for circular dependencies
systemctl show service_name | grep -E "Requires|After|Before"

# 4. Resolve configuration issues
# Check service configuration files
# Validate syntax where applicable
service_name -t                                    # If the service supports it, e.g. httpd -t, nginx -t, sshd -t

9. Best Practices

Troubleshooting Methodology

Document all symptoms before making changes
Follow a systematic approach from general to specific
Make one change at a time and test the results
Keep a detailed log of all actions taken
Back up configuration files before modifications
Have a rollback plan for all changes
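
Backing up a configuration file before editing it takes one command and doubles as the rollback plan. A minimal sketch; `backup_file` and its naming scheme are illustrative assumptions:

```shell
# Copy a file to FILE.bak.YYYYmmdd-HHMMSS, preserving mode and timestamps,
# and print the backup path. The naming scheme is an illustrative assumption.
backup_file() {
    src="$1"
    [ -f "$src" ] || { echo "not a regular file: $src" >&2; return 1; }
    dest="$src.bak.$(date '+%Y%m%d-%H%M%S')"
    cp -p "$src" "$dest" && printf '%s\n' "$dest"
}
```

Possible usage: `backup_file /etc/httpd/conf/httpd.conf` before editing; to roll back, copy the printed backup path over the original with `cp -p`.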

Information Gathering

Collect system information immediately when an issue occurs
Preserve log files and system state for analysis
Use multiple information sources for correlation
Take screenshots or save command output
Interview users about what they were doing when the issue occurred

Solution Implementation

Test solutions in a non-production environment first
Implement the least disruptive solution first
Monitor the system closely after implementing fixes
Document all changes made for future reference
Verify that the solution doesn't create new problems

Preventive Measures

Implement comprehensive monitoring and alerting
Perform regular system health checks
Keep the system and applications updated
Maintain current documentation and runbooks
Test backups and disaster recovery procedures regularly
Train the team on common troubleshooting procedures

10. Integration with Other RHCSA Topics

Service Management Integration

Understand systemd service dependencies and failures
Troubleshoot service startup and runtime issues
Analyze service logs and performance metrics
Implement service monitoring and alerting

Storage Integration

Diagnose filesystem corruption and recovery procedures
Troubleshoot LVM and storage performance issues
Implement storage monitoring and capacity planning
Recover from storage hardware failures

Security Integration

Troubleshoot SELinux denials and policy issues
Diagnose firewall rule conflicts and connectivity problems
Investigate security incidents and unauthorized access
Implement security monitoring and incident response

Network Integration

Diagnose network connectivity and performance issues
Troubleshoot DNS resolution and service discovery
Analyze network traffic and security events
Implement network monitoring and capacity planning

Module 15 Summary: System troubleshooting draws on every other RHCSA skill and requires a solid understanding of how Linux system components interact. This module covers systematic troubleshooting from basic problem identification through complex system recovery. Combining Ghori's structured diagnostic workflow with van Vugt's deeper analysis techniques prepares you for both the RHCSA exam and system administration in production environments.