SATURDAY, APRIL 27, 2024

Posts Tagged ‘Solaris’

Oracle® Solaris® 11 Update 4 – Security

#solaris #solaris114beta

This post describes the security and compliance features that are new in this release. These new features help prevent new threats through anti-malware protection and enable you to meet the strictest compliance obligations.

The engineering concept is security at every level. A number of built-in defense technologies prevent attackers from gaining access and establishing a foothold in your data center.

Secure Sandboxes
Sandboxes are uniquely named sets of process attributes that can be used to specify security and resource isolation requirements. In Oracle Solaris 11.4, you can execute untrusted processes in temporary sandboxes.

Persistent and hierarchical sandboxes can be created by using the sandboxadm command. Both temporary and persistent sandboxes can be entered by using the sandbox command. Sandboxes are suitable for constraining both privileged and unprivileged applications. Enhanced exploit mitigation controls leveraging SPARC Silicon Secured Memory automatically protect key applications and the system kernel.
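As a sketch of how this might look in practice (the exact options live in sandbox(1) and sandboxadm(8); the flags shown here are assumptions, not authoritative syntax):

$ sandbox make test              # run an untrusted command in a temporary sandbox
# sandboxadm create -n builder   # hypothetical: create a persistent, named sandbox
$ sandbox -n builder /bin/bash   # hypothetical: enter the named sandbox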
For more information, see “Configuring Sandboxes for Project Isolation” in Securing Users and Processes in Oracle Solaris 11.4 and the sandboxing(7), sandbox(1), and sandboxadm(8) man pages.

Scheduled Compliance Assessment
Oracle Solaris systems can run periodic assessments of their security posture, with compliance results either pushed or pulled to a central location over a secure transport. Compliance integration with the Oracle Solaris Web Dashboard lets you graph historical assessment status at both the security-benchmark and individual-check level. With multinode compliance you can centrally gather a compliance assessment for multiple instances, which is very useful when developing and deploying applications where you want to ensure the entire set of instances is compliant and ready to roll out without checking each instance individually.
An Oracle Solaris instance should be able to declare the compliance benchmarks that it is assessed against, and should periodically assess itself against that policy. This release introduces two new features that support running compliance reports against specific compliance benchmarks.
The existing SMF compliance service is modified as follows:
■ The current :default instance performs guide generation, so this instance is renamed to the :generateguide instance and is enabled by default.
■ policy is the new SMF property group that is added to declare the policy or benchmark that the system is to be assessed against:
<property_group name='policy' type='application'>
  <propval name='value_authorization' type='astring'
    value='solaris.compliance.assess' />
  <propval name='benchmark' type='astring' value='solaris' />
  <propval name='profile' type='astring' value='Baseline' />
  <propval name='tailoring' type='astring' value='' />
</property_group>
View and update compliance policy values with two new compliance subcommands that are implemented over the SMF RAD transport:
compliance set-policy [ -b benchmark [ -p profile ]] [ -t tailoring ]
compliance get-policy
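For example, to declare that this instance should be assessed against the Baseline profile of the solaris benchmark and then read the policy back, a minimal sketch using the subcommands above (see compliance(8)):

# compliance set-policy -b solaris -p Baseline
# compliance get-policy
# compliance assess        # run an assessment against the declared policy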
The get-policy output is not intended to be parsable, and its format can change without notice.
For more information, see Oracle Solaris 11.4 Security Compliance Guide.

Per File Auditing
Per file auditing in Oracle Solaris 11.4 provides fine-grained, on-access auditing of specific files and directories. With this feature, system and security administrators can target specified files for auditing when they are accessed in particular ways, allowing much easier collection and analysis of audit data.
For example:
# chmod A+everyone@:write_data/read_data:successful_access/failed_access:audit /data/db1
This audit ACE ensures that an audit record is generated for any read or write of the /data/db1 file, whether the access succeeds or is denied, by any user on the system. Audit ACEs can also be added for metadata changes.
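Records generated by the audit ACE can then be pulled back out of the audit trail with the usual tools; for example, a quick sketch selecting only the records for that one file (see auditreduce(8) and praudit(8)):

# auditreduce -o file=/data/db1 | praudit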
For more information, see “What’s New in the Audit Service in Oracle Solaris 11.4” in Managing Auditing in Oracle Solaris 11.4.

Verified Boot Auditing
In Oracle Solaris 11.4, this new feature generates audit records that indicate the signature verification results of kernel modules. The feature checks the Verified Boot boot_policy value when Oracle Solaris 11.4 boots and writes the value to an audit record for the AUE_SYSTEMBOOT event. When Verified Boot is enabled with the boot_policy property set to warning or enforce, Oracle Solaris audit produces AUE_MODLOAD audit events if an elfsign signature verification fails when a module is about to be loaded. With Verified Boot enabled, you can keep track of kernel modules that have invalid signatures or that were not loaded into the system.
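For example, a sketch of extracting module-load verification events from the audit trail with the standard tools (assuming the audit trail is in its default location; see auditreduce(8)):

# auditreduce -m AUE_MODLOAD | praudit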
For more information, see “New Feature – Auditing Verified Boot” in Managing Auditing in Oracle Solaris 11.4.

Privileged Command Execution History Reporting
Oracle Solaris 11.4 introduces the admhist utility, which provides a summary of system administration related events that have occurred on the system, in a helpful, easy-to-understand format. The admhist utility draws on the same audit data that the praudit and auditreduce utilities use for more detailed log analysis.
A variety of options enable you to narrow the results by user, date, time, or type of event. For example, you can identify privileged command executions by a particular user within the last 24 hours:
# admhist -v -a "last 24 hours"
2017-05-09 10:58:55 user1@example.com cwd=/export/home/user1 /usr/sbin/zfs get quota rpool/export/home/user1
2017-05-09 10:59:16 user1@example.com cwd=/export/home/user1 /usr/sbin/zfs set quota 40g
2017-05-09 10:59:27 user1@example.com cwd=/export/home/user1 /usr/sbin/zfs get quota rpool/export/home/user1
2017-05-09 10:59:31 user1@example.com cwd=/export/home/user1 /usr/bin/bash
2017-05-09 10:59:31 user1@example.com cwd=/ /usr/bin/su
The output illustrates that the user user1 switched to the root user and increased the quota. The privileges used throughout the life of a process are examined when the command exits, which is why the su operation is listed at the end of the output.
For more information, see the admhist(8) man page, “New Feature – Per-Privilege Logging of Audit Events” in Managing Auditing in Oracle Solaris 11.4, and Using Oracle Solaris 11.4 Analytics.

KMIP Client Support
Oracle Solaris 11.4 provides client support for using the Key Management Interoperability Protocol (KMIP) version 1.1. A new PKCS#11 provider, pkcs11_kmip, is provided in the Oracle Solaris Cryptographic Framework, which enables PKCS#11 applications to function as KMIP clients and communicate with KMIP-compliant servers.
Oracle Solaris 11.4 also includes a new command, kmipcfg, which initializes and manages the states of the pkcs11_kmip provider.
For more information, see Chapter 5, “KMIP and PKCS #11 Client Applications” in Managing Encryption and Certificates in Oracle Solaris 11.4 and the pkcs11_kmip(7) and kmipcfg(8) man pages.

File and Process Labeling
File and process labeling in Oracle Solaris 11.4 provides a framework for restricting access to sensitive information. Files and directories can now be labeled to grant access only to users or roles with sufficient clearance. The clearance policy also applies to processes with all privileges. Oracle Solaris 11.4 can generate logs of every access to labeled files, which can be used to meet compliance standards such as PCI DSS and HIPAA.
For more information, see “Labels and Clearances” in Securing Files and Verifying File Integrity in Oracle Solaris 11.4 and the clearance(7) man page.

Silicon Secured Memory Security Exploit Mitigations
Silicon Secured Memory (SSM) adds real-time checking of access to data in memory to help protect against malicious intrusion and flawed program code in production, for greater security and reliability. SSM is available via the default system memory allocator, and also inside kernel zones. See “Silicon Secured Memory Support in Oracle Solaris Kernel Zones.”
The system default allocator (libc malloc) is now Application Data Integrity (ADI) aware. Binaries tagged with the sxadm command automatically receive the protection. See the ADIHEAP and ADISTACK protections in the Security Extensions section of the sxadm(8) man page. SSM application programming interfaces are available for advanced customization. See “Protecting Against Malware With Security Extensions” in Securing Systems and Attached Devices in Oracle Solaris 11.4 and the adi(2) man page.
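To see which mitigations are active on a given system, query the sxadm utility; a minimal sketch (the per-extension enable line is illustrative, so check sxadm(8) for the exact syntax):

# sxadm status           # lists extensions such as aslr, nxstack, nxheap, adiheap, adistack
# sxadm enable adiheap   # illustrative; see sxadm(8) for the exact enable syntax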

Packet Filter
Oracle Solaris 11.4 includes the OpenBSD 5.5 Packet Filter (PF) firewall for filtering TCP/IP traffic. The PF firewall is a replacement for IP Filter (IPF) in Oracle Solaris 11.4, and it enables both bandwidth management and packet prioritization. To use the PF firewall, install the pkg:/network/firewall package and enable the svc:/network/firewall:default service instance.
Note – Make sure you configure the firewall first. Enabling the firewall with the default configuration puts the service into a degraded state. A degraded firewall blocks all inbound sessions except ssh; outbound sessions are allowed.
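A minimal sketch of a first configuration, assuming the rules file lives at /etc/firewall/pf.conf (verify the path against pf.conf(7) on your system; the rules themselves use standard PF syntax):

# cat /etc/firewall/pf.conf
block in all                    # drop all inbound traffic by default
pass out all                    # allow all outbound sessions
pass in proto tcp to port 22    # permit inbound ssh

# pfctl -nf /etc/firewall/pf.conf              # syntax-check the rules without loading them
# svcadm enable svc:/network/firewall:default  # then enable the service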
For more information, see Chapter 3, “Oracle Solaris Firewall” in Securing the Network in Oracle Solaris 11.4 and the pfctl(8), pf.conf(7), and pf.os(7) man pages.

ftp-proxy
Oracle Solaris 11.4 supports ftp-proxy, a semi-transparent proxy for FTP that supports IPv4 NAT. Systems running the PF firewall for NAT can use ftp-proxy to allow FTP connections to pass through the firewall.
For more information, see the ftp-proxy(8) man page.

pflogd Daemon
Oracle Solaris 11.4 supports pflogd, a packet logging daemon that safely saves packets logged by the PF firewall. The logged packets are available from a capture datalink; the daemon reads packets from this datalink and stores them in a file.
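Because the log is stored in pcap format, it can be read back with tcpdump; a sketch, where the log file path is an assumption borrowed from OpenBSD (check pflogd(8) for the actual default on Oracle Solaris):

# tcpdump -n -e -r /var/log/pflog    # path is an assumption; see pflogd(8)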
For more information, see the pflogd(8) man page.

Kerberos
Oracle Solaris 11.4 provides an updated version of Kerberos, which includes improvements from the latest version of MIT Kerberos, as well as enhancements made for Oracle Solaris. Kerberos provides network authentication, and optionally provides message integrity and privacy, depending on how an application uses it.
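Day-to-day client usage follows stock MIT Kerberos; for example, acquiring and inspecting a ticket-granting ticket (the principal name here is a placeholder):

$ kinit user1@EXAMPLE.COM    # obtain a TGT; prompts for the Kerberos password
$ klist                      # show cached credentials and their expiry times
$ kdestroy                   # discard the credential cache when finished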
For more information, see Chapter 1, “Kerberos on Oracle Solaris” in Managing Kerberos in Oracle Solaris 11.4 and the kerberos(7) man page.

libsasl2
The Simple Authentication and Security Layer (SASL) framework provides authentication and optional security services for network protocols. Oracle Solaris 11.4 bases its SASL implementation on the open source Cyrus SASL version 2.1.26 with a few changes. The SASL plugins are in the /usr/lib/sasl2 directory, and the default location for the SASL configuration files is the /etc/sasl2 directory. By basing the SASL version on open source, Oracle Solaris 11.4 is able to provide the latest SASL features, including security updates.
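Applications typically read a per-application file from that configuration directory; a minimal sketch of what one might look like (the file name and values are hypothetical, though the option names follow standard Cyrus SASL conventions):

# /etc/sasl2/myapp.conf -- hypothetical application configuration
pwcheck_method: auxprop         # authenticate via an auxiliary property plugin
mech_list: DIGEST-MD5 GSSAPI    # restrict the mechanisms offered to clients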
For more information, see Chapter 2, “Using Simple Authentication and Security Layer” in Managing Authentication in Oracle Solaris 11.4.

libucrypto Library
Oracle Solaris 11.4 includes libucrypto, a lightweight library that provides access to hardware accelerated cryptography. Operations provided include symmetric and asymmetric encryption, digital signatures, message authentication codes, and cryptographic hashes. The libucrypto library provides lightweight access to hardware cryptographic primitives, when you do not need access to key storage, session management, or the standards based APIs provided by libpkcs11.
The libucrypto library enables fast access to hardware-accelerated cryptography. The library is fast for both the programmer and the processor, as it avoids locking and session management overhead.
For more information, see “Simple and Fast ucrypto Provider” in Managing Encryption and Certificates in Oracle Solaris 11.4 and the libpkcs11(3LIB) man page.

PKCS #11 v2.40 Support for Oracle Solaris Cryptographic Framework
The Oracle Solaris Cryptographic Framework has been updated from PKCS #11 v2.20 to PKCS #11 v2.40. The update includes some of the latest mechanisms in PKCS #11 v2.40, including those introduced in PKCS #11 v2.30. A new error code and a new value have also been introduced in PKCS #11 v2.40. The following new mechanisms have been added:
■ AES signing and verification: CKM_AES_XCBC_MAC, CKM_AES_XCBC_MAC_96, CKM_AES_CMAC, CKM_AES_GMAC
■ AES encryption and decryption: CKM_AES_GCM, CKM_AES_CCM, CKM_AES_CFB128
■ SHA-512/t message digesting: CKM_SHA512_224, CKM_SHA512_256, CKM_SHA512_T
■ SHA-512/t HMAC and general-length HMAC: CKM_SHA512_224_HMAC, CKM_SHA512_224_HMAC_GENERAL, CKM_SHA512_256_HMAC, CKM_SHA512_256_HMAC_GENERAL, CKM_SHA512_T_HMAC, CKM_SHA512_T_HMAC_GENERAL
■ SHA-512/t key derivation: CKM_SHA512_224_KEY_DERIVATION, CKM_SHA512_256_KEY_DERIVATION, CKM_SHA512_T_KEY_DERIVATION
■ TLS 1.2: CKM_TLS12_MASTER_KEY_DERIVE, CKM_TLS12_MASTER_KEY_DERIVE_DH, CKM_TLS12_KEY_AND_MAC_DERIVE, CKM_TLS12_KEY_SAFE_DERIVE, CKM_TLS_KDF (replacing CKM_TLS_PRF), and CKM_TLS_MAC (replacing CKM_TLS_PRF)
■ Error code CKR_CURVE_NOT_SUPPORTED for elliptic curves
If a specific elliptic curve is not supported, the error code CKR_CURVE_NOT_SUPPORTED is returned. In the previous version, CKR_TEMPLATE_INCONSISTENT was returned if the curve was not supported.
■ CK_UNAVAILABLE_INFORMATION
When C_GetAttributeValue() is called and an attribute cannot be returned because it is invalid or unavailable, ulValueLen is set to CK_UNAVAILABLE_INFORMATION. The caller must compare ulValueLen with CK_UNAVAILABLE_INFORMATION to determine whether the attribute value is invalid or unavailable, and must treat ulValueLen = 0 as a valid value.
■ Attributes CKA_DESTROYABLE and CKR_ACTION_PROHIBITED
If an object has CKA_DESTROYABLE = CK_FALSE, then a C_DestroyObject request for that object should result in CKR_ACTION_PROHIBITED being returned as the error code.
■ Removing Restrictions with CKU_SO
This change removes the restriction on having read-only (R/O) sessions open while CKU_SO is logged in. While R/O sessions can now coexist with CKU_SO, those sessions behave as CKS_RO_PUBLIC_SESSION. An R/O session cannot be used to call C_Login with CKU_SO. CKR_SESSION_READ_ONLY_EXISTS and CKR_SESSION_READ_WRITE_SO_EXISTS are deprecated.
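You can confirm which mechanisms a given provider actually exposes on your system with cryptoadm; for example, a simple check (output varies by platform and provider):

# cryptoadm list -m | grep CKM_AES_GCM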
For more information, see the SUNW_C_GetMechSession(3EXT), SUNW_C_KeyToObject(3EXT), libpkcs11(3LIB), pkcs11_softtoken(7), pkcs11_kms(7), and pkcs11_tpm(7) man pages.

Defense Against Malware – Tamper Evident Software, Application Sandboxing, Immutable Lifecycle

Oracle® Solaris 11.4

#solaris #solaris114beta

There is a new update coming to Oracle® Solaris® 11 that carries with it a great deal of security additions and enhancements.
I’ll post some of them in the coming days but I first wanted to provide my thoughts on Solaris.

Obviously Solaris is a very robust, secure, mature operating environment, even with the recent announcements relating to Solaris and SPARC®.

Note that Extended Support for Solaris 11 ends November 2034 and Sustaining Support is marked as indefinite. So, let’s put to rest the idea that Oracle is no longer supporting their premier Unix operating environment. Was there restructuring? Yes. Was there an abandonment of Oracle Solaris? No, and the totality of the new features that we will be covering in the coming days bears strong witness to that.

In my opinion, if you are currently running Solaris there is no immediate, urgent need to spend time, money, and effort migrating away from it. When paired with the SPARC processor, Solaris is a secure, fast platform engineered for large-scale enterprise deployment. It provides compliance monitoring, performance monitoring, and zero-overhead virtualization.

I have some difficulty biting my lower lip when I hear “technologists” speak of the gloom and doom of the “M” series processors and/or SPARC, given Oracle’s announcements relating to them.

Oracle JUST released the M8 in 2017. The chip isn’t even a year old. As I understand it (please realize that I do not speak for Oracle and make no guarantees about product availability), Oracle plans to continue to invest in producing robust SPARC-based servers with improvements to I/O, memory, and so on. The M8 remains the constant, but servers will continue to be designed with new technologies.

How can one easily overlook the M8? Clocked at 5 GHz, with a 32KB L1 instruction cache, a 16KB L1 data cache, a 256KB L2 instruction cache, a 128KB L2 data cache, 64MB of shared L3 cache, the ability to issue 4 instructions at a time, 32 cores, 16GB memory pages, and security and database acceleration hardware on the die itself.

Modern applications use many threads working on large shared-memory segments. Bugs or pointer problems in these applications can cause highly unpredictable behavior and consume excessive amounts of an application developer’s time to troubleshoot and diagnose. Silent data corruption and buffer overruns are two of these difficult-to-diagnose problems. For both problems, Silicon Secured Memory dramatically reduces the time it takes for application developers to troubleshoot memory reference bugs. For silent data corruption, Silicon Secured Memory can facilitate immediate action to be taken by the application, preventing costly recovery efforts.

A robust CPU and an enterprise-ready OS! As I stated above, there is no reason to look to move away from SPARC/Solaris in the near future.

In the coming days I’ll post a bit about:

  • Key Features in Oracle Solaris 11.4
  • Security and Compliance Features
  • Data Management Features
  • Networking Features
  • Performance and Observability
  • Virtualization Features
  • System Management Features
  • Installation and Software Management Features
  • Enhancements for Developers


FSS – More Process Scheduling

The last blog post gave brief descriptions of the various scheduling classes in Solaris. I focused on the Time Sharing (TS) class since it is the default. Hopefully we can see that TS (and the IA class, for that matter) makes its decisions based on how the threads are using the CPU: are we CPU intensive or are we I/O intensive? It works well, but it doesn’t give the administrator fine-grained control over resource management.

To address this, the Fair Share Scheduler (FSS) was added in the Solaris 9 release.

The primary benefit of FSS is that it gives the administrator the ability to identify and dispatch processes and their threads based upon their importance, as determined by the business and implemented by the administrator.

We saw the complexity of the TS dispatch table in the earlier post. The FSS table, displayed with dispadmin -c FSS -g, has no such complexity.

FSS Dispatch Table
#
# Fair Share Scheduler Configuration
#
RES=1000
#
# Time Quantum
#
QUANTUM=110

In FSS we use the concept of CPU shares. Shares give the admin a fine level of granularity when carving up CPU resources; we are no longer limited to allocating an entire CPU. The admin designates the importance of a workload by assigning it a number of shares: you dictate importance by assigning more shares to the workloads that carry higher importance. Shares are NOT the same as CPU caps or CPU resource usage. Shares simply define the relative importance of workloads in comparison to other workloads, whereas CPU resource usage is an actual measurement of consumption. A workload may be given 50% of the shares yet at a point in time be consuming only 5% of the CPU. I look at a CPU share as a minimum guarantee of CPU allocation, not as a cap on CPU consumption.

When we assign shares to a workload, we need to be aware of the shares that are already assigned: what matters is the ratio of shares assigned to one workload compared to all of the other workloads.
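For example, give zoneA 30 shares and zoneB and zoneC 10 shares each, 50 shares in total. Under full contention, zoneA is entitled to 30/50 = 60% of the CPU and the other two zones to 20% each. If zoneC goes idle, zoneA and zoneB split the CPU in a 30:10 ratio, that is 75% and 25%: the entitlement is always the ratio of a workload’s shares to the shares of the workloads actively competing at that moment.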

I speak of FSS in a “Horizontal” and a “Vertical” aspect when I’m delivering for Oracle University. In Solaris 9 we were able to define projects in the /etc/project file. This is the vertical aspect. In Solaris 10 Non-Global Zones were introduced and brought with it the Horizontal aspect. I assign shares horizontally across the various zones and then vertically within each zone in the /etc/project file if needed.

By default, Non-Global Zones use the system default scheduling class. If the system is updated with a new default class, zones obtain the new setting when they are booted or rebooted. The recommended scheduler to use with Non-Global Zones is FSS. The preferred way is to set the system default scheduler to FSS so that all zones inherit it.

To display information about the loaded scheduling classes, run priocntl -l


root@solaris:~# priocntl -l
CONFIGURED CLASSES
==================

SYS (System Class)

TS (Time Sharing)
Configured TS User Priority Range: -60 through 60

SDC (System Duty-Cycle Class)

FX (Fixed priority)
Configured FX User Priority Range: 0 through 60

IA (Interactive)
Configured IA User Priority Range: -60 through 60

priocntl can be used to view or set scheduling parameters for a specified process.

To determine the global priority of a process run ps -ecl

root@solaris:~# ps -ecl #The c displays properties of the scheduler, we see the class (CLS) and the priority (PRI)
 F S    UID   PID  PPID  CLS PRI     ADDR     SZ    WCHAN TTY         TIME CMD
 1 T      0     0     0  SYS  96        ?      0          ?           0:01 sched
 1 S      0     5     0  SDC  99        ?      0        ? ?           0:02 zpool-rp
 1 S      0     6     0  SDC  99        ?      0        ? ?           0:00 kmem_tas
 0 S      0     1     0   TS  59        ?    720        ? ?           0:00 init
 1 S      0     2     0  SYS  98        ?      0        ? ?           0:00 pageout
 1 S      0     3     0  SYS  60        ?      0        ? ?           0:01 fsflush
 1 S      0     7     0  SYS  60        ?      0        ? ?           0:00 intrd
 1 S      0     8     0  SYS  60        ?      0        ? ?           0:00 vmtasks
 0 S      0   869     1   TS  59        ?   1461        ? ?           0:05 nscd
 0 S      0    11     1   TS  59        ?   3949        ? ?           0:11 svc.star
 0 S      0    13     1   TS  59        ?   5007        ? ?           0:32 svc.conf
 0 S      0   164     1   TS  59        ?    822        ? ?           0:00 vbiosd
 0 S     16   460     1   TS  59        ?   1323        ? ?           0:00 nwamd

To set the default scheduling class, use dispadmin -d FSS, then run dispadmin -d to ensure it changed and dispadmin -l to confirm the class is loaded.

root@solaris:~# dispadmin -d
dispadmin: Default scheduling class is not set
root@solaris:~# dispadmin -d FSS
root@solaris:~# dispadmin -d
FSS	(Fair Share)
root@solaris:~# dispadmin -l
CONFIGURED CLASSES
==================

SYS	(System Class)
TS	(Time Sharing)
SDC	(System Duty-Cycle Class)
FX	(Fixed Priority)
IA	(Interactive)
FSS	(Fair Share)

Manually move all of the running processes into the FSS class and then verify with the ps command.

root@solaris:~# priocntl -s -c FSS -i all
root@solaris:~# ps -ef -o class,zone,fname | grep -v CLS | sort -k2 | more
 FSS   global auditd
 FSS   global automoun
 FSS   global automoun
 FSS   global bash
 FSS   global bash
 FSS   global bonobo-a
 FSS   global clock-ap
 FSS   global console-
 FSS   global cron
 FSS   global cupsd
 FSS   global dbus-dae
 FSS   global dbus-dae
 FSS   global dbus-lau
 FSS   global dbus-lau

Finally, move init over to the FSS class so all children will inherit it.

root@solaris:~# ps -ecf | grep init
    root     1     0   TS  59 16:33:44 ?           0:00 /usr/sbin/init
root@solaris:~# priocntl -s -c FSS -i pid 1
root@solaris:~# ps -ecf | grep init
    root     1     0  FSS  29 16:33:44 ?           0:00 /usr/sbin/init

With FSS all set, we now assign shares to our Non-Global Zones. The zone name and share count below are examples:

root@solaris:~# zonecfg -z myzone
zonecfg:myzone> set cpu-shares=20
zonecfg:myzone> exit

To display CPU consumption per zone, run prstat -Z.

Solaris Process Scheduling

The Oracle Solaris kernel has a number of process scheduling classes available.

A brief review.

Timesharing (TS) This is the default class for processes and their associated kernel threads. Priorities in the class are dynamically adjusted based upon CPU utilization in an attempt to allocate processor resources evenly.

Interactive (IA) This is an enhanced version of TS. Some texts reference this in conjunction with TS, i.e. TS/IA. This class applies to the in-focus window in the GUI. It provides extra resources to processes associated with that specific window.

Fair Share Scheduler (FSS) This class is “share based” rather than priority based. The threads associated with this class are scheduled based on the associated shares assigned to them and the processor’s utilization.

Fixed-Priority (FX) Priorities for these threads are fixed regardless of how they interact with the CPU. They do not vary dynamically over the life of the thread.

System (SYS) Used to schedule kernel threads. These threads are bound, meaning that unlike the userland threads listed above they do not context switch off the CPU when their time quantum is consumed. They run until they block or complete.

Real-Time (RT) These threads are fixed-priority with a fixed time quantum. They are one of the highest-priority classes, with only interrupts carrying a higher priority.

As it relates to priority ranges, the userland classes (TS/IA/FX/FSS) carry the lowest priorities, 0-59. The SYS class is next, ranging from 60-99. At the top (ignoring INT) is the RT class at 100-159.

We can mix scheduling classes on the same system but there are some considerations to keep in mind.

  • Avoid having the FSS, TS, IA, and FX classes share the same processor set (pset)
  • All processes that run on a processor set must be in the same scheduling class so they do not compete for the same CPUs
  • To avoid starving applications, use processor sets for FSS and FX class applications

TS and IA as well as FSS and RT can be in the same processor set.

We can look at how the TS class (the default) makes its decisions by looking at the dispatch table itself, which dispadmin -c TS -g displays.
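An abbreviated excerpt, reconstructed from the values discussed below (run the command yourself for the full 60-row table; defaults can vary between releases):

root@solaris:~# dispadmin -c TS -g
# Time Sharing Dispatcher Configuration
RES=1000

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait  PRIORITY LEVEL
       ...
       120        10        52          0        52      #    20
       ...
        80        20        53          0        53      #    30
       ...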
This table is indexed by the priority level of the thread. To understand an entry, let’s use priority 30 as an example.
The left-most column is marked ts_quantum, the timesharing quantum. This specifies the time in milliseconds (the RES=1000 line sets the resolution so that the values are read as milliseconds) that the thread will be allocated before it is involuntarily context-switched off the CPU.

A context switch is the process of storing and restoring the state of a process so that execution can resume at the same point at a later time. We store this state in the Light Weight Process (LWP) that the thread was bound to. Basically, a thread binds to an LWP, an LWP binds to a kernel thread (kthr), and the kernel thread is presented to the kernel dispatcher. When the thread is placed on the CPU (a hardware strand/thread), the contents of the LWP are loaded onto the CPU and the CPU starts execution at that point. When the thread is removed from the CPU (it is preempted, its time quantum is consumed, or it sleeps), the contents of the CPU registers are stored back into the LWP, and the thread returns to the dispatch queue to compete again, based on priority, with the other threads requesting access to the CPU.

So, at a priority of 30 the thread has 80 milliseconds to complete its work or it will be forced off the CPU. If it does not complete its work, the system context-switches the thread off the CPU AND changes its priority. The next column, ts_tqexp (timesharing time quantum expired), identifies the new priority of the thread. In this case ts_tqexp is 20. So, we consumed our time quantum, we were involuntarily context-switched off the CPU, and we had our priority lowered when we returned to the dispatch queue. At a priority of 20 our time quantum is now 120 milliseconds. The priority is lowered to keep the thread from “hogging” the CPU, but the time quantum is increased in the hope that when we do get back on the CPU we have more time to complete our work.

The next column, ts_slpret, identifies our new priority when the thread returns from a sleep state. There is no reason to keep a thread on a CPU if there is no work to be done. When the thread enters a sleep state we leave the CPU; this is a VOLUNTARY context switch, and we are placed on the sleep queue. When we leave the sleep queue we are not placed back on the CPU; we are placed back on the dispatch queue to compete with the other threads for access to the CPU. Since we have been off the CPU for a period of time, the priority is advanced: in this case we were initially dispatched at 30, we voluntarily context-switched off the CPU, and when we woke we were given a priority of 53. Notice that at a priority of 53, our new time quantum is 40. The priority increased from 30 to 53 but the time quantum decreased from 80 to 40: we get you back on the CPU faster but limit the amount of time you get on it.

The voluntary and involuntary context switches can be seen in the output of the mpstat command, in the csw and icsw columns.

The last two columns deal with preventing CPU starvation. ts_maxwait is a threshold (in seconds); if a thread exceeds it without gaining access to the CPU, the thread is assigned the priority in ts_lwait. Notice that for all but priority 59 this value is set to 0, so when we exceed 0 (meaning we have been off the CPU for one second) we are assigned the value of ts_lwait. Again using 30 as our example, we would go from a priority of 30 to a priority of 53 if we were denied access to the CPU for one second.

In the middle of all of this we have preemption. The Solaris kernel is fully preemptible. Any thread, even a SYS thread, will be preempted if a higher-priority thread hits the kernel dispatcher while a lower-priority thread is running. The running thread isn’t allowed to complete its time quantum; it is context-switched off the CPU.

And don’t forget, there are IA, FSS, FX, SYS, RT, and INT threads that add to the chaos if allowed, which is why I provided some of the guidance listed earlier.

We see some use of FX and quite a bit more of the FSS with Solaris zones. I’ll talk about FSS in another post.

I’ll spend a better part of a day whiteboarding all of this in the Oracle University Solaris Performance Management Class.


Translation Lookaside Buffer

Teaching Solaris Performance Management this week, we got into a long discussion about T-Series CPUs, multi-threaded vs. multi-process applications, Multiple Page Size Support (MPSS), and the Translation Lookaside Buffer (TLB).

Solaris processes run in a virtual memory address space. When we use that address space, something must map each virtual address to an actual physical address. On the SPARC platform the Hardware Address Translation (HAT) layer, named SFMMU (Spitfire Memory Management Unit), performs this function. The MMU divides the virtual address space into pages. Solaris supports MPSS, so we can change the size of these pages on both SPARC and x86.

The pagesize -a command will display the available page sizes on the system.

$ uname -a
SunOS chicago 5.10 Generic_127112-11 i86pc i386 i86pc Solaris
$ pagesize -a
4096
2097152
$ uname -a
SunOS niagara 5.10 Generic_138888-03 sun4v sparc SUNW Solaris
$ pagesize -a
8192
65536
4194304
268435456

Virtual Memory would not be very effective if every memory address had to be translated by looking up the associated physical page in memory. The solution is to cache the recent translations in a Translation Lookaside Buffer (TLB). A TLB has a fixed number of slots that contain Translation Table Entries (TTE), which map virtual addresses to physical addresses.

Modern servers have multiple cores with multiple hardware strands, allowing the system to dispatch a large number of threads to CPUs. Each of the processes associated with these threads needs to access physical memory locations, placing a burden on the TLB and the HAT. Simply put, there may not be enough space in the TLB to hold the Translation Table Entries (TTEs) for all of the translations required by the large number of running processes.

To speed up handling of TLB miss traps, the processor provides a hardware-assisted lookup mechanism called the Translation Storage Buffer (TSB). The TSB is a virtually indexed, direct-mapped, physically contiguous, and size-aligned region of physical memory which is used to cache recently used Translation Table Entries (TTEs) after retrieval from the page tables. When a TLB miss occurs, the hardware uses the virtual address of the miss combined with the contents of a TSB base address register (which is pre-programmed on context switch) to calculate the pointer into the TSB of the entry corresponding to the virtual address. If the TSB entry tag matches the virtual address of the miss, the TTE is loaded into the TLB by the TLB miss handler, and the trapped instruction is retried. If no match is found, the trap handler branches to a slow path routine called the TSB miss handler. Quite a bit of complex work to handle these “misses”.

Starting with Solaris 10 Update 1, Out-Of-The-Box (OOB) Large Page Support turns on MPSS automatically for an application’s heap and text (libraries). The advantage is improved performance of your userland applications, by reducing the CPU cycles required to service dTLB and iTLB misses: we map a larger amount of memory per TLB entry if we choose larger pages.

For example, if the heap size of a process is 256M, on a Niagara (UltraSPARC-T1) box it will be mapped onto a single 256M page. On a system that doesn’t support large pages, it will be mapped onto 32,768 8K pages.
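If the out-of-the-box policy does not choose the page size you want, you can request one explicitly with ppgsz; a short sketch (testprog and the 4M size are placeholders; see ppgsz(1)):

$ pagesize -a                    # confirm which page sizes the platform supports
$ ppgsz -o heap=4M ./testprog    # request 4M pages for this process's heap
$ pmap -sx `pgrep testprog`      # verify the Pgsz column for the heap mapping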

The pmap command displays the page sizes of memory mappings within the address space of a process. The -sx option directs pmap to show the page size for each mapping.

sol10# pmap -sx `pgrep testprog`
2909:  ./testprog
 Address Kbytes   RSS  Anon Locked Pgsz Mode  Mapped File
00010000    8    8    -    -  8K r-x-- dev:277,83 ino:114875
00020000    8    8    8    -  8K rwx-- dev:277,83 ino:114875
00022000 131088 131088 131088    -  8K rwx--  [ heap ]
FF280000   120   120    -    -  8K r-x-- libc.so.1
FF29E000   136   128    -    -  - r-x-- libc.so.1
FF2C0000   72   72    -    -  8K r-x-- libc.so.1
FF2D2000   192   192    -    -  - r-x-- libc.so.1
FF302000   112   112    -    -  8K r-x-- libc.so.1
FF31E000   48   32    -    -  - r-x-- libc.so.1
FF33A000   24   24   24    -  8K rwx-- libc.so.1
FF340000    8    8    8    -  8K rwx-- libc.so.1
FF390000    8    8    -    -  8K r-x-- libc_psr.so.1
FF3A0000    8    8    -    -  8K r-x-- libdl.so.1
FF3B0000    8    8    8    -  8K rwx--  [ anon ]
FF3C0000   152   152    -    -  8K r-x-- ld.so.1
FF3F6000    8    8    8    -  8K rwx-- ld.so.1
FFBFA000   24   24   24    -  8K rwx--  [ stack ]
-------- ------- ------- ------- -------
total Kb 132024 132000 131168    -

There may be instances where OOB causes poor performance in some of your applications, including application crashes if an application makes improper assumptions about page sizes. If you run into this scenario, adjustments can be made in /etc/system to enable or disable OOB support.

It can also introduce challenges with some caches and their coherency. In multi-threaded applications on CMP and SMP systems, threads from a common PID can be dispatched to different CPUs, each holding its own TTEs in its TLB. When a thread unmaps virtual memory, we have to perform a cleanup: a stale mapping left in a different CPU’s TLB would map to a now-invalid physical memory location and, if allowed to remain, could permit corruption. During a munmap, only the CPUs that have actually run the process are cross-called for cleanup, instead of broadcasting to all running CPUs. However, as we add more processors, this cleanup can take longer. If you think this may be occurring when you migrate to a larger system, consider using processor pools or CPU binding of the process to see if that provides some relief.

We can use the trapstat command to gain some insight into our dTLB and iTLB hit rates. By specifying the -T option, trapstat shows TLB misses broken down by page size. In this example, CPU 0 is spending 7.9 percent of its time handling user-mode TLB misses on 8K pages, and another 2.3 percent of its time handling user-mode TLB misses on 64K pages.

example# trapstat -T -c 0
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
  0 u   8k|      1300  0.1        15  0.0 |    104897  7.9        90  0.0 | 8.0
  0 u  64k|         0  0.0         0  0.0 |     29935  2.3         7  0.0 | 2.3
  0 u 512k|         0  0.0         0  0.0 |      3569  0.2         2  0.0 | 0.2
  0 u   4m|         0  0.0         0  0.0 |       233  0.0         2  0.0 | 0.0
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
  0 k   8k|        13  0.0         0  0.0 |     71733  6.5       110  0.0 | 6.5
  0 k  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k 512k|         0  0.0         0  0.0 |         0  0.0       206  0.1 | 0.1
  0 k   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
==========+===============================+===============================+====
      ttl |      1313  0.1        15  0.0 |    210367 17.1       417  0.2 |17.5

By specifying the -e option, trapstat displays statistics for only specific trap types. Using this option minimizes the probe effect when seeking specific data. This example yields statistics for only the dtlb-prot and syscall-32 traps on CPUs 12 through 15:

example# trapstat -e dtlb-prot,syscall-32 -c 12-15
vct  name               |    cpu12    cpu13    cpu14    cpu15
------------------------+------------------------------------
 6c dtlb-prot           |      817      754     1018      560
108 syscall-32          |     1426     1647     2186     1142

vct  name               |    cpu12    cpu13    cpu14    cpu15
------------------------+------------------------------------
 6c dtlb-prot           |     1085      996      800      707
108 syscall-32          |     2578     2167     1638     1452

cpustat provides another way to monitor events on the CPU, including the workings of the TLB. The following command displays the three CPUs with the highest DTLB_miss rate.

example% cpustat -c DTLB_miss -k DTLB_miss -n 3 1 1

 time cpu event DTLB_miss
1.040 115  tick       107
1.006  18  tick        98
1.045 126  tick        31
1.046  96 total       236

event DTLB_miss
total       236

There is quite a bit more to think about relating to MPSS and the TLBs, I hope this post serves as a starting point for those that are running CMP/SMP systems with multi-threaded applications to perform a deeper dive.

Take a look at pmap -sx, ppgsz, pagesize, and mpss.so.1 for additional direction.