I am a former infrastructure consultant, and I have worked in operations, outsourcing and even in support.
I would like to share my best practices on SAP HANA Operation and Maintenance. I will focus on infrastructure setup and high-availability topics.
Keep in mind that these ideas are my best practices and may not fit your landscape.
All comments are for production systems, but they might help with non-production systems, too.
Infrastructure:
I will focus on Tailored Datacenter Integration (TDI) in virtual environments, because appliances are easier: they are already certified by SAP. On the other hand, they lack the flexibility of TDI.
Another important point is that a TDI installation should be done by vendor consultants or TDI-certified experts.
Some info for TDI users:
On TDI setups, the performance responsibility is on the customer/partner. There is an SAP tool, the hardware check tool (HWCCT, or its successor HCMT), which should be run after each HANA installation.
Please keep your hardware check tool output, which contains performance and landscape control KPIs.
Remember, you can and you should run the hardware check tool after each configuration change.
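As a minimal sketch of such a run, assuming the current HCMT package with its bundled full-analysis plan (directory layout and plan names may differ in your download, so check the tool's SAP Note):
# run the full analysis plan in verbose mode from the unpacked HCMT directory
./hcmt -v -p config/full_analysis.json
# keep the resulting hcmtresult-<timestamp>.zip as your performance/KPI baseline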
Running HANA DB in virtual environments:
Customers are moving to virtual deployments, because they are easier to manage and cost-effective.
It is good practice to run HANA DBs in virtual environments, but there are strict rules for doing so; otherwise there can be a negative performance impact.
Admin teams on each landscape have their own habits and best practices, which sometimes do not fit HANA best practices.
I will give insights about VMware and PowerVM setups.
Let's focus on possible issues with VMware installations:
– High CPU overcommit ratios
– Huge datastores
– Not caring about NUMA affinity
Production HANA installation rules are strict (as of 2020) – please check SAP Note 2393917.
– Four is the maximum number of production HANA servers that can run on a 2-socket system. You can only create 0.5-, 1- or 2-socket VMs; no odd sizes like 1.5 sockets (or 2.5 on a 4-socket host).
– Three production HANA DBs plus any non-production or non-HANA workload is an option, but you cannot share half of a socket with a non-HANA workload.
Half of a CPU socket is the smallest portion you can attach to a HANA DB, including the memory attached to it. You cannot make arbitrary incremental changes to resources.
If you cross the boundary of 1/4 of total RAM (half of one socket's memory on a two-socket host), the next step is another 1/4; you should not add partial memory to the virtual machine.
Reason: each CPU socket has its own DRAM channels (12 DIMM slots per socket in this x86_64 example), with dedicated bandwidth. If you create a VM with half the cores of a socket, that socket's memory bandwidth is guaranteed for the VM.
◉ If you keep the core/vCPU count and add more memory to a virtual machine, the extra memory will be attached via other CPU cores/sockets. That is called "far memory" and has higher latency, which causes a performance impact. I call such VMs sad VMs.
Sounds confusing, but let me give an example.
A two-socket server has 56 cores and 3 TB of memory. It has 24 DIMMs of 128 GB each, and each socket addresses, i.e. is connected to, 12 of those DIMM modules. You should create one HANA DB with 768 GB/14 cores, 1.5 TB/28 cores, or 3 TB/56 cores. You should not create a VM with 8 cores and 2 TB, because those 8 cores cannot address all that memory locally.
This is my first diagram ever; I hope it will help you understand.
VM1 (yellow) has 2 cores and 768 GB of local memory. That is the target.
VM2 (green) has 2 cores but 1024 GB of memory, which is not all available locally. Some portion of the memory will be accessed remotely, which has higher latency.
Core5, Core6 and the remaining memory resources are allocated to other virtual machines.
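A quick way to see whether a guest really got local memory is to inspect the NUMA topology from inside the VM. A minimal check (both tools are usually provided by the numactl package):
# show the NUMA nodes, their CPUs and memory sizes as seen by the guest
numactl --hardware
# per-node allocation counters; growing numa_miss/other_node values hint at remote memory access
numastat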
CPU overcommit is another issue that can have a huge impact on performance. ESX is a very efficient hypervisor, but it also has technical limits. CPU overcommitting means assigning more vCPUs to the virtual machines than are physically available. I see 2x values and that is just fine: the hypervisor will simply switch vCPUs from one real core to another, effectively sharing the CPU time. This operation is called context switching and it is visible from the Linux terminal. It has a small, negligible performance impact, caused by CPU cache misses; cores have caches, which are far faster than conventional memory.
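If you want to observe that context switching from the Linux terminal, a minimal check (pidstat is part of the sysstat package):
# system-wide context switches per second are shown in the "cs" column
vmstat 5
# per-process voluntary/involuntary context switch rates
pidstat -w 5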
Allocating half-socket virtual machines helps the hypervisor pin the CPU-to-memory relationship, and the system will get a better cache hit rate.
A real-life example from a benchmark I ran personally: a CPU-pinned 16-vCPU VMware ESX guest gets a 15-20% better SPECjvm result than a regular guest, and the pinned result is almost identical to bare metal. Pinned CPU assignment means those cores are usable only by that specific guest, running 1:1.
Current CPUs are very powerful, administrators want to use them effectively, and they like to place HANA DB virtual machines like any other workload.
It gets interesting when there is a resource battle between the virtual machines. Again, let me give an example.
We again have a 56-core, 112-vCPU server, and we have assigned 250 vCPUs to virtual machines. On an easy day there will be no performance issue.
At the end of the month there are heavy calculations on every virtual server on that host. Each VM asks the hypervisor for CPU time, but there is not enough available. The admin overcommitted each VM by 8-16 vCPUs, because that is easier to manage.
In each CPU cycle only 112 vCPUs can run (because there are only 112 logical CPUs), and the remaining 250 − 112 = 138 vCPUs wait. If some vCPUs were idle, that would be no problem, but if they also ask for resources, they must wait.
You can monitor this behavior via the %steal value in top/mpstat output. That is the percentage of CPU cycles that is "stolen" from the guest/virtual machine by the hypervisor.
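A minimal way to watch it from inside a guest (mpstat is part of the sysstat package):
# %steal is the share of time the hypervisor ran something else while this vCPU wanted to run
mpstat -P ALL 5
# in top, the same value appears as "st" in the %Cpu(s) summary line
top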
There are many ways to overcome this issue, but sticking to the SAP Notes would be my choice.
PowerVM and PowerPC Architecture:
IBM Power servers have higher memory bandwidth and tools for NUMA/affinity monitoring and tuning. With dedicated CPU assignment, you do not need to care about overcommit or noisy-neighbor issues.
Just keep in mind that the enterprise Power9 E980 has buffered memory with ~210 GB/s per socket, and HANA DB will perform extremely well on it. The P9 E950 and below are on par with x86_64 in memory bandwidth (~130 GB/s).
The Power8 series uses only buffered memory (~200 GB/s).
My best practices for PowerVM are:
◉ Configure dual Virtual I/O Servers (VIOS).
◉ RMC connection is a must.
◉ Use NPIV instead of vSCSI.
◉ If you insist on vSCSI, please check your queue_depth values.
◉ Use shared processors with a desired value of at least 4 cores for the VIOS (1 entitled, 4 vCPU/32 logical CPUs max).
◉ On some landscapes I see one dedicated core for the Virtual I/O Servers, even though default values and documentation indicate one. That is bad practice: a heavy network load will consume 2-3+ cores, and your whole system will lag very badly. I am not even talking about the higher priority of network I/O over disk I/O.
◉ 4 cores/32 logical CPUs are the minimum values for production according to the SAP Notes, but increase them in parallel with the memory size.
◉ Check memory affinity with the lsmemopt command on the HMC; if it is fragmented, optimize it with the Dynamic Platform Optimizer (optmem), as shown in the sketch after this list.
◉ If it is still not in an optimized state, do the following:
◉ Check the CPU/memory ratio and make it sensible (not 4 cores / 2 TB of memory; 4 cores cannot directly access/address that much memory locally).
◉ Start the partition (LPAR) with the largest memory first, so the hypervisor can place it correctly.
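A minimal sketch of that affinity check and optimization from the HMC command line, assuming a placeholder managed-system name MY_SYSTEM (verify the exact options against your HMC release):
# show the current memory-affinity score of the managed system (100 is optimal)
lsmemopt -m MY_SYSTEM -o currscore
# estimate the score the Dynamic Platform Optimizer could reach
lsmemopt -m MY_SYSTEM -o calcscore
# start the optimization; progress can be followed with lsmemopt again
optmem -m MY_SYSTEM -o start -t affinity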
My hypervisor recommendations are finished; let's continue with input/output.
I/O General / Disk Infrastructure:
HANA DB is an in-memory database, and it is not heavily disk-I/O dependent other than at startup.
One exception is the log area: database logs need fast disk access. Backups also do heavy sequential writes.
As a former Linux expert, I have some recommendations.
◉ Use a separate /hana/log logical volume. Each filesystem has its own journal.
◉ Use LVM with multiple LUNs for the HANA data area and stripe the LVM; each LUN will have its own queue_depth. (See the sketch after this list.)
◉ Please configure multiple datastores on the hypervisor/host and match them to the FC paths.
A 25 TB LUN with a queue_depth of 64 is a joke.
◉ Some storage controllers cannot work active/active, and if you create just one datastore (LUN), it can only use the resources of one storage controller!
The other controller's CPU power and caches are not being touched; you will effectively run active/passive storage.
◉ Read-intensive SSDs are not cost-effective for backup targets; they may wear out quickly under heavy backups. You can use NL-SAS instead. Backups are large-block sequential write operations.
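A minimal sketch of such a striped data volume, assuming four hypothetical multipath LUNs named hanadata1-4 and an XFS filesystem (device names, stripe count and stripe size are only illustrative, not a sizing recommendation):
# create physical volumes on the data LUNs
pvcreate /dev/mapper/hanadata1 /dev/mapper/hanadata2 /dev/mapper/hanadata3 /dev/mapper/hanadata4
# one volume group for the data area
vgcreate hanadatavg /dev/mapper/hanadata1 /dev/mapper/hanadata2 /dev/mapper/hanadata3 /dev/mapper/hanadata4
# stripe the logical volume across all four LUNs (-i 4) with a 256 KiB stripe size (-I 256)
lvcreate -n datalv -i 4 -I 256 -l 100%FREE hanadatavg
mkfs.xfs /dev/hanadatavg/datalv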
While creating the file systems, leave 2-3 GB free in the volume group where the log LV resides. I do not know why, but I have been called more than 10 times for a /hana/log fill-up. If that area fills up, the database will stop; if this is an appliance, you are in trouble. On TDI you might add another LUN and resize.
When I was a vendor consultant I always did the extra-free-space trick and won the hearts of my customers. When the time comes, I just tell them: extend the log LV by 500 MB (not the full free space, they might fill it again), grow the filesystem, and please check and fix your log backup retention.
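A minimal sketch of that emergency extension, assuming a hypothetical volume group hanalogvg with a logical volume loglv mounted at /hana/log on XFS:
# extend the log LV by 500 MB from the spare space kept in the VG
lvextend -L +500M /dev/hanalogvg/loglv
# grow the mounted XFS filesystem to the new LV size
xfs_growfs /hana/log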
One last thing: you can grow but cannot shrink XFS filesystems, don't ask me why. That is the downside of XFS. (Did I tell you that I really do not like Btrfs?)
Switch to the noop I/O scheduler, as recommended by SAP, if you are using enterprise storage. Deadline also performs very well if you are on local SAS disks. If you are on SLES 11, the default I/O scheduler is cfq, which does not fit DB workloads. For flash-based disks noop should be the preferred choice; SSDs are so fast that they do not need a scheduler. Noop means no scheduling.
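A minimal check and runtime change (sdb is a placeholder device; make the setting persistent via udev rules or the elevator boot parameter, and note that blk-mq kernels call the equivalent scheduler "none"):
# show the active scheduler for a block device (the one in brackets)
cat /sys/block/sdb/queue/scheduler
# switch it to noop at runtime
echo noop > /sys/block/sdb/queue/scheduler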
Network Infrastructure:
I will cover it in the High Availability topic.
Operating System Guidelines:
HANA DB can run on SUSE or Red Hat distributions. Recent releases of these distributions use the systemd journal (journalctl) for logging, which does not persist some OS logs by default: if the system reboots, the logs are gone!
I would really like to question this decision, but for now I suggest enabling persistent logging:
mkdir /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald
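To verify that the journal is now persistent, a quick check after the next reboot:
# a persistent journal lists entries from previous boots as well
journalctl --list-boots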
If this is a multipath disk setup, check your multipath configuration accordingly; a wrong multipath configuration will have a big impact on disk subsystem performance. The multipath service is not enabled by the initial OS installation when no multipath device is present.
This means that if you attach multipath devices later, you should enable the multipath/device-mapper service manually.
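A minimal sketch on a systemd-based distribution (adapt it to your distribution's tooling and your multipath.conf):
# enable and start the device-mapper multipath daemon
systemctl enable --now multipathd
# list the multipath maps and their paths to verify the configuration
multipath -ll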
Test your performance with native tools (fio, sysbench, iometer) or use the SAP hardware check tool.
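A minimal fio sketch resembling the sequential write pattern of the log area (file name, size, block size and queue depth are only illustrative; run it against a scratch file, never against live data):
# sequential 1M writes with direct I/O against a temporary test file
fio --name=seqwrite --filename=/hana/log/fio_testfile --rw=write --bs=1M --size=4G --direct=1 --ioengine=libaio --iodepth=8
# remove the scratch file afterwards
rm /hana/log/fio_testfile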
Monitoring I/O performance on multipath devices can be tricky; use my uber script mtools.sh.
https://github.com/tayore/ozantools
Never, ever install any application on the root file system, including HANA and /usr/sap (yes, that too). Separate the OS and the application, and if this is a bare-metal server, take an image of the root file system. There are open-source applications like Clonezilla and Relax-and-Recover (ReaR); I have worked with them and they do the job.
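A minimal Relax-and-Recover sketch, assuming an NFS share as the target (the server and path are placeholders; check the ReaR documentation for the options supported by your version):
# /etc/rear/local.conf – minimal example configuration
OUTPUT=ISO
BACKUP=NETFS
BACKUP_URL=nfs://backupserver/rear
# create the rescue ISO plus a backup of the OS filesystems
rear -v mkbackup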