Understanding page frames and pages

Memory in Linux is organized in pages (typically 4 KB in size). Contiguous linear addresses within a page are mapped onto contiguous physical addresses on the RAM chip; contiguous pages, however, can be located anywhere in physical RAM. Access rights and physical address mapping are handled by the kernel at page granularity rather than for every linear address. The term "page" refers both to the set of linear addresses it contains and to the data stored at those addresses.
The paging unit treats all physical RAM as partitioned into fixed-length page frames, each of which can hold one page. A page frame is a constituent of main memory, and hence a storage area; a page is just a block of data, which may be stored in any page frame or on disk, so it is important to distinguish the two. The paging unit translates linear addresses into physical ones. One key task of the unit is to check the requested access type against the access rights of the linear address; if the memory access is not valid, it generates a Page Fault exception (see Chapter 4 and Chapter 8). The data structures that map linear to physical addresses are called page tables; they are stored in main memory and must be properly initialized by the kernel before enabling the paging unit.
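With the standard 4 KB page size, the split of a linear address into a page number and an in-page offset can be sketched as follows (an illustration of the arithmetic, not kernel code):

```python
PAGE_SHIFT = 12              # 4 KB pages: 2**12 bytes
PAGE_SIZE = 1 << PAGE_SHIFT

def split_address(addr):
    """Split a linear address into (page number, offset within the page)."""
    return addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1)

# 0x00100000 is the start of the second megabyte, where the kernel is loaded
print(split_address(0x00100000))   # -> (256, 0)
```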
Pages can optionally be 4 MB in size, but this is advisable only for applications whose typical unit of data is large.
The kernel considers the following page frames as reserved:
- Those falling in physical address ranges unavailable to the kernel
- Those containing the kernel's code and initialized data structures

A page contained in a reserved page frame can never be dynamically assigned or swapped to disk. As a general rule, the Linux kernel is installed in RAM starting from physical address 0x00100000, i.e., from the second megabyte. The total number of page frames required depends on how the kernel is configured; a typical configuration yields a kernel that can be loaded in less than 3 MB of RAM.
The remaining portion of the RAM barring the reserved page frames is called dynamic memory. It is a valuable resource, needed not only by the processes but also by the kernel itself. In fact, the performance of the entire system depends on how efficiently dynamic memory is managed. Therefore, all current multitasking operating systems try to optimize the use of dynamic memory, assigning it only when it is needed and freeing it as soon as possible.
The kernel must keep track of the current status of each page frame. For instance, it must be able to distinguish the page frames that contain pages belonging to processes from those that contain kernel code or kernel data structures. Similarly, it must be able to determine whether a page frame in dynamic memory is free. A page frame in dynamic memory is free if it does not contain any useful data; it is not free when it contains data of a User Mode process, data of a software cache, dynamically allocated kernel data structures, buffered data of a device driver, code of a kernel module, and so on.
Allocating memory to processes

A kernel function gets dynamic memory in a fairly straightforward manner, since the kernel trusts itself: all kernel functions are assumed to be error-free, so the kernel does not need to insert any protection against programming errors.
When allocating memory to User Mode processes, the situation is entirely different:
- Process requests for dynamic memory are considered non-urgent. When a process's executable file is loaded, for instance, it is unlikely that the process will address all of its code pages in the near future. Similarly, when a process invokes malloc() to get additional dynamic memory, it doesn't mean the process will soon access all of the additional memory obtained. Thus, as a general rule, the kernel tries to defer allocating dynamic memory to User Mode processes.
- Because user programs cannot be trusted, the kernel must be prepared to catch all addressing errors caused by processes in User Mode.

When a User Mode process asks for dynamic memory, it doesn't get additional page frames; instead, it gets the right to use a new range of linear addresses, which become part of its address space. This interval is called a "memory region": a range of linear addresses representing one or more page frames, i.e., a set of pages with consecutive page numbers.
Following are some typical situations in which a process gets new memory regions:
- A new process is created.
- A running process decides to load an entirely different program (using exec()). In this case, the process ID remains unchanged, but the memory regions used before loading the program are released and a new set of memory regions is assigned to the process.
- A running process performs a "memory mapping" on a file.
- A process keeps adding data on its User Mode stack until all addresses in the memory region that maps the stack have been used. In this case, the kernel may decide to expand the size of that memory region.
- A process creates an IPC shared memory region to share data with other cooperating processes. In this case, the kernel assigns a new memory region to the process to implement this construct.
- A process expands its dynamic area (the heap) through a function such as malloc(). As a result, the kernel may decide to expand the size of the memory region assigned to the heap.

Demand paging

The term demand paging denotes a dynamic memory allocation technique that consists of deferring page frame allocation until the last possible moment, that is, until the process attempts to address a page that is not present in RAM, thus causing a Page Fault exception.
Fig 9.4
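The deferral can be illustrated with a toy model (not kernel code): a process is granted a range of pages, but a page frame is assigned only when a page is first touched, and that first touch is the "fault".

```python
class AddressSpace:
    """Toy model: a page frame is assigned only on first access (the fault)."""
    def __init__(self):
        self.page_table = {}     # page number -> frame number
        self.faults = 0
        self._next_frame = 0

    def access(self, page):
        if page not in self.page_table:   # page not present: "Page Fault"
            self.faults += 1
            self.page_table[page] = self._next_frame
            self._next_frame += 1
        return self.page_table[page]

mm = AddressSpace()
for page in (7, 7, 3, 7):    # repeated accesses fault only once per page
    mm.access(page)
print(mm.faults, len(mm.page_table))   # -> 2 2
```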
The motivation behind demand paging is that processes do not address all the addresses included in their address space right from the start; in fact, some of these addresses may never be used by the process. Moreover, the program locality principle ensures that, at each stage of program execution, only a small subset of the process pages are really referenced, and therefore the page frames containing the temporarily useless pages can be used by other processes. Demand paging is thus preferable to global allocation (assigning all page frames to the process right from the start and leaving them in memory until program termination), because it increases the average number of free page frames in the system and therefore allows better use of the available free memory. From another viewpoint, it allows the system as a whole to get better throughput with the same amount of RAM.
The price to pay for all these good things is system overhead: each Page Fault exception induced by demand paging must be handled by the kernel, thus wasting CPU cycles. Fortunately, the locality principle ensures that once a process starts working with a group of pages, it sticks with them without addressing other pages for quite a while; Page Fault exceptions may therefore be considered rare events.
An addressed page may not be present in main memory either because the page was never accessed by the process, or because the corresponding page frame has been reclaimed by the kernel.
Overcommitting memory

Linux allows overcommitting memory to processes. As we have seen, even though a process may malloc() 1 GB, Linux does not hand it 1 GB immediately; it issues memory only when the process actually touches it. In addition, Linux can overcommit the total allocation: if 5 processes each ask for 1 GB but RAM and swap together add up to only 4 GB, Linux may still grant all 5 GB without any error. The behavior depends on the vm.overcommit_memory and vm.overcommit_ratio settings; refer to http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt for further details on these parameters. In most cases overcommitting has no negative impact on the system, unless your processes will actually use all of the memory they are granted and no additional memory is left over. On the other hand, overcommitting offers no advantage in server environments where capacity planning and calculations should be performed accurately.
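Under strict accounting (vm.overcommit_memory = 2), the kernel computes the commit limit roughly as SwapTotal + MemTotal * overcommit_ratio / 100, per the vm.txt documentation. A quick check with values similar to the system examined below:

```python
# Values in kB, similar to the system discussed below
mem_total_kb = 12305340
swap_total_kb = 2048276
overcommit_ratio = 50            # default vm.overcommit_ratio

commit_limit_kb = swap_total_kb + mem_total_kb * overcommit_ratio // 100
print(commit_limit_kb)           # -> 8200946; the kernel reports 8200944,
                                 #    the tiny gap being rounding at page size
```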
atop shows the overcommit limit and the current committed memory, but this can be a bit misleading. The calculation is explained below.
atop output:

MEM | tot 11.7G | free 75.5M | cache 3.9G | dirty 66.7M | buff 42.1M | slab 198.8M |
SWP | tot 2.0G | free 2.0G | vmcom 9.2G | vmlim 7.8G |
meminfo output:

[user@server ~]$ cat /proc/meminfo
MemTotal:     12305340 kB
MemFree:         73672 kB
Buffers:         43120 kB
Cached:        4074220 kB
SwapTotal:     2048276 kB
SwapFree:      2047668 kB
Dirty:           62236 kB
Slab:           203948 kB
CommitLimit:   8200944 kB
Committed_AS:  9630052 kB
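The estimates that follow can be reproduced from the dump above (a sketch with the kB values hard-coded; the rough GB arithmetic in the text lands at the same ballpark figure):

```python
meminfo_kb = {
    "MemTotal": 12305340, "MemFree": 73672, "Buffers": 43120,
    "Cached": 4074220, "Slab": 203948, "Committed_AS": 9630052,
}

# Memory actually used by processes = total - (cache + buffers + slab + free)
used_kb = meminfo_kb["MemTotal"] - (
    meminfo_kb["Cached"] + meminfo_kb["Buffers"]
    + meminfo_kb["Slab"] + meminfo_kb["MemFree"])
print(used_kb)                       # -> 7910380 (about 7.5 GB)
print(meminfo_kb["Committed_AS"])    # what processes could grow into (~9.2 GB)
```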
Note: slight differences between the two outputs arise because meminfo was run a few seconds after atop.

From the above we can conclude the following:
- Total memory: 11.7 GB
- Memory used for the disk cache: 3.9 GB
- Memory used for buffers and slab: ~240 MB
- Memory free: ~75 MB
- Memory actually used by processes: 11.7 - (3.9 + 0.24 + 0.075) => ~7.5 GB

Note: this can also be roughly estimated from the RSS of all processes; however, the resident size of each process also includes shared memory, which makes it difficult to estimate.

The Committed_AS field tells us the amount of memory we have committed to these processes => 9630052 kB => ~9.2 GB. These processes could therefore theoretically ask for up to 9.2 GB. The documentation of the CommitLimit field says: "Based on the overcommit ratio ('vm.overcommit_ratio'), this is the total amount of memory currently available to be allocated on the system. This limit is only adhered to if strict overcommit accounting is enabled (mode 2 in 'vm.overcommit_memory')." On our system (and on most default systems) overcommit_memory is set to "1", which means the kernel pretends there is always enough memory until it actually runs out. So the overcommit limit figure is irrelevant here. The only relevant point is that if the processes on this system do come to need 9.2 GB instead of their current ~7.5 GB, that space will most likely be reclaimed from the disk cache (currently at 3.9 GB).

Page faults and swapping

Page faults and swapping are two independent mechanisms. A page fault takes place when a process references an address that has been allocated to it but not yet backed by a page frame. Upon receiving such a reference, the kernel confirms that the address belongs to the process and, if so, allocates a new page frame from memory and assigns it to the process.
Swapping occurs in one of two scenarios -
- When the kernel needs to allocate a page frame to a process and finds that no memory is available. In this case the kernel must swap out the least-used pages of an existing process to the swap space (on disk) and allocate those page frames to the requesting process.
- When the swappiness setting makes the kernel prefer the disk cache. A kernel parameter (vm.swappiness, between 0 and 100, around 60 by default) determines how aggressively the kernel swaps. A value of 100 means the kernel will strongly prefer allocating memory to the disk cache over processes; a value of 60 can result in occasional swapping out of process-owned pages to disk to make room for additional disk cache pages.

In general page faults are rare, since they only occur when a process accesses memory it has not yet touched. In fact, on a long-running server where no new processes are being forked, page faults should almost never occur.
Swapping is bad for performance and should never occur in a well-planned deployment. Swapping almost always signifies that your server does not have adequate memory to run all its processes. In fact, during constant swapping all your memory is used up by existing processes, and no memory is available for the disk cache either. Constant swapping can bring a server to a standstill. It is important to note that lack of memory for the page cache will never by itself cause swapping; swapping occurs only when there is no memory available for your processes.
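When the kernel must free memory it chooses between dropping clean cache pages, flushing dirty ones, and swapping out process pages (described in detail in the next section). A toy decision function, purely illustrative:

```python
def reclaim_action(page):
    """page: {'kind': 'cache' | 'anonymous', 'dirty': bool}"""
    if page["kind"] == "cache":
        if page["dirty"]:
            return "flush to disk, then drop"
        return "drop immediately"            # clean cache pages are cheapest
    return "swap out to swap space"          # anonymous process page

print(reclaim_action({"kind": "cache", "dirty": False}))
print(reclaim_action({"kind": "anonymous", "dirty": True}))
```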
A better description of swapping

When the kernel has no free space and needs to free up memory, it has the following options:
- drop a disk buffer cache page that is not dirty
- flush dirty pages to disk and drop them
- move a page used by a process out to disk (swap-out)

It uses roughly the following algorithm to decide:
- Is there inactive memory that can be reclaimed simply by dropping a page?
- If not, it can do one of the following:
  - write a dirty page to disk and reclaim it
  - reclaim an active, non-dirty disk buffer cache page
  - swap out a User Mode process page to disk
- Depending on the value of swappiness, it will prefer swap-out over reclaiming disk buffers, or vice versa.

Whenever the User Mode process needs a swapped-out page again, that page is swapped back in. In an ideal world there should be no swap-outs, and definitely no swap-ins, since swap-ins signify that the system is low on memory. Seeing swap-ins may also signify that the swappiness value is incorrectly set for the type of workload. For instance, on application servers where the only disk activity may be logging or some similar ancillary activity, one may want to lower swappiness before concluding, on seeing swap-ins and swap-outs, that the system has run out of memory.

VmSize, Resident size and Actual size of a process

The resident size of a process (as shown in top or ps) represents the amount of non-swapped memory the kernel has already allocated to the process. This number is inaccurate when totalled (especially in a multi-process app like postgres or apache) since it includes shared memory. It also does not include the swapped-out portion of the process. VmSize is the total memory of a program, including its resident size, swap size, code, data, shared libraries, etc. The SWAP column in top is calculated as VmSize - RSS, which I believe is an incorrect calculation. Let's take an example and understand these numbers better:
[user@server ~]$ cat /proc/9894/status
Name:   java
State:  S (sleeping)
VmPeak:  4109896 kB
VmSize:  4099492 kB
VmLck:         0 kB
VmHWM:   2855336 kB
VmRSS:   2848964 kB
VmData:  4000304 kB
VmStk:        84 kB
VmExe:        36 kB
VmLib:     65392 kB
VmPTE:      5940 kB
- VmPeak: peak virtual memory size
- VmSize: virtual memory size
- VmLck: locked memory size (see mlock(3))
- VmHWM: peak resident set size ("high water mark")
- VmRSS: resident set size
- VmData, VmStk, VmExe: size of data, stack, and text segments
- VmLib: shared library code size
- VmPTE: page table entries size (since Linux 2.6.10)

We can conclude from the above:
- Total program size is 4099492 kB => 3.9 GB. I actually don't know exactly what this number signifies; I do know it accounts for the resident size of the program plus swap plus other mappings, yet at the time the above snapshot was taken there was zero swap utilization.
- Current physical memory usage by the program = 2848964 kB => ~2.7 GB.
- Maximum physical memory usage by the program since startup = 2855336 kB => ~2.7 GB.

There is another aspect to remember here. Even though the resident size of the above program is ~2.7 GB, that does not mean the program is actually using 2.7 GB at this time. For instance, the java process above keeps requesting additional memory from the kernel whenever it needs more, up to the limit specified for the process; this memory is resident (unless a portion is swapped out). However, after an intensive run, when java clears a large set of objects through a GC, this memory is not given back to the OS. The actual memory used by java at a point in time may be significantly less than its RSS. This can be measured independently, provided the process allows you to do so.
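These numbers can be pulled out of a /proc/<pid>/status dump programmatically; a sketch using the sample values above (a live version would read the file instead of a string):

```python
status = """VmPeak:  4109896 kB
VmSize:  4099492 kB
VmHWM:   2855336 kB
VmRSS:   2848964 kB"""

vm = {}
for line in status.splitlines():
    key, value = line.split(":")
    vm[key] = int(value.split()[0])       # drop the trailing "kB"

print(vm["VmRSS"])                        # current resident set, ~2.7 GB
print(vm["VmHWM"])                        # peak resident set ("high water mark")
```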
Note that the VmHWM parameter is interesting inasmuch as it signifies the amount of physical memory required for the process at peak times.
Types of page faults

Minor page fault: if the page is loaded in memory at the time the fault is generated, but is not marked in the memory management unit as being loaded, it is called a minor or soft page fault. The page fault handler merely needs to make the MMU entry for that page point to the page in memory and mark it as loaded; it does not need to read the page in from disk. This can happen when memory is shared by different programs and the page has already been brought into memory for another program.
Major page fault: if the page is not loaded in memory at the time the fault is generated, it is called a major or hard page fault. The page fault handler needs to find a free page frame in memory (or choose one to evict, writing out its data if it has been modified since it was last written), read the data for the faulting page into that frame, and then make the MMU entry for the page point to it and mark it as loaded. Major faults are more expensive than minor faults and may add disk latency to the interrupted program's execution. This is the mechanism an operating system uses to increase the amount of program memory available on demand: it delays loading parts of the program from disk until the program attempts to use them and a page fault is generated.
Invalid page fault: If a page fault occurs for a reference to an address that's not part of the virtual address space, so that there can't be a page in memory corresponding to it, then it is called an invalid page fault. The page fault handler in the operating system then needs to terminate the code that made the reference, or deliver an indication to that code that the reference was invalid.
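On Linux, a process can observe its own minor and major fault counts via getrusage(2); a small sketch (assumes a POSIX system with the Python resource module available):

```python
import resource

before = resource.getrusage(resource.RUSAGE_SELF)
buf = bytearray(4 * 1024 * 1024)     # writing 4 MB of fresh anonymous memory
after = resource.getrusage(resource.RUSAGE_SELF)

# Touching never-accessed pages raises minor faults; major faults would only
# appear if pages had to be read back from disk, which this snippet avoids.
print(after.ru_minflt >= before.ru_minflt,
      after.ru_majflt >= before.ru_majflt)   # -> True True
```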
Understanding the Linux page cache (More details available in the disk IO section)
The page cache is the main disk cache used by the Linux kernel. In most cases, the kernel refers to the page cache when reading from or writing to disk. New pages are added to the page cache to satisfy User Mode processes' read requests. If the page is not already in the cache, a new entry is added to the cache and filled with the data read from the disk. If there is enough free memory, the page is kept in the cache for an indefinite period of time and can then be reused by other processes without accessing the disk.
Similarly, before writing a page of data to a block device, the kernel verifies whether the corresponding page is already included in the cache; if not, a new entry is added to the cache and filled with the data to be written on disk. The I/O data transfer does not start immediately: the disk update is delayed for a few seconds (unless an explicit fsync() is called), thus giving a chance to the processes to further modify the data to be written (in other words, the kernel implements deferred write operations).
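The deferred-write behavior can be seen from user space: a plain write() lands in the page cache, and fsync() forces the delayed disk update immediately. A minimal sketch:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"this lands in the page cache first")  # dirty page in RAM
    os.fsync(fd)   # force the deferred disk update to happen right now
finally:
    os.close(fd)
    os.unlink(path)
print("written and synced")
```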
Typically the kernel will use as much of the dynamic memory available to it for the page cache as it can, reclaiming page frames from the page cache only periodically, or as and when they are needed by a process or by newer pages that must be written into the page cache. When the system load is low, RAM is filled mostly by the disk caches and the few running processes can benefit from the information stored in them. When the system load increases, RAM is filled mostly by process pages and the caches are shrunk to make room for additional processes. Page reclaiming by default uses an LRU algorithm.
Read http://www.redhat.com/magazine/001nov04/features/vm/ for details on the lifecycle of a memory page
Understanding the PFRA

The objective of the page frame reclaiming algorithm (PFRA) is to pick page frames and make them free. The PFRA is invoked under different conditions and handles page frames in different ways based on their content.
The PFRA is invoked on one of the following -
- Low on memory reclaiming: the kernel detects a "low on memory" condition.
- Periodic reclaiming: a kernel thread is activated periodically to perform memory reclaiming, if necessary.

The types of pages are as follows:
- Unreclaimable:
  - free pages (included in buddy system lists)
  - reserved pages (with the PG_reserved flag set)
  - pages dynamically allocated by the kernel
  - pages in the Kernel Mode stacks of the processes
  - temporarily locked pages (with the PG_locked flag set)
  - memory-locked pages (in memory regions with the VM_LOCKED flag set)
- Swappable:
  - anonymous pages in User Mode address spaces
  - mapped pages of the tmpfs filesystem (e.g., pages of IPC shared memory)
- Syncable:
  - mapped pages in User Mode address spaces
  - pages included in the page cache and containing data of disk files
  - block device buffer pages
  - pages of some disk caches (e.g., the inode cache)
- Discardable:
  - unused pages included in memory caches (e.g., slab allocator caches)
  - unused pages of the dentry cache

In the above table, a page is said to be mapped if it maps a portion of a file. For instance, all pages in the User Mode address spaces belonging to file memory mappings are mapped, as is any other page included in the page cache. In almost all cases, mapped pages are syncable: in order to reclaim the page frame, the kernel must check whether the page is dirty and, if necessary, write the page contents to the corresponding disk file.
Conversely, a page is said to be anonymous if it belongs to an anonymous memory region of a process (for instance, all pages in the User Mode heap or stack of a process are anonymous). In order to reclaim such a page frame, the kernel must save the page contents in a dedicated disk partition or disk file called the "swap area"; therefore, all anonymous pages are swappable.
When the PFRA must reclaim a page frame belonging to the User Mode address space of a process, it must take into consideration whether the page frame is shared or non-shared. A shared page frame belongs to multiple User Mode address spaces, while a non-shared page frame belongs to just one. Notice that a non-shared page frame might belong to several lightweight processes referring to the same memory descriptor. Shared page frames are typically created when a process spawns a child or when two or more processes access the same file by means of a shared memory mapping.
PFRA algorithm considerations:
- Free the "harmless" pages first: pages included in disk and memory caches and not referenced by any process should be reclaimed before pages belonging to the User Mode address spaces of the processes; in the former case, the page frame reclaiming can be done without modifying any Page Table entry. As we will see in the section "The Least Recently Used (LRU) Lists" later in this chapter, this rule is somewhat mitigated by introducing a "swap tendency factor."
- Make all pages of a User Mode process reclaimable: with the exception of locked pages, the PFRA must be able to steal any page of a User Mode process, including the anonymous pages. In this way, processes that have been sleeping for a long period of time will progressively lose all their page frames.
- Reclaim a shared page frame by unmapping all page table entries that reference it at once: when the PFRA wants to free a page frame shared by several processes, it clears all page table entries that refer to the shared page frame, and then reclaims the page frame.
- Reclaim "unused" pages only: the PFRA uses a simplified Least Recently Used (LRU) replacement algorithm to classify pages as active or inactive. If a page has not been accessed for a long time, the probability that it will be accessed in the near future is low and it can be considered "inactive"; on the other hand, if a page has been accessed recently, the probability that it will continue to be accessed is high and it must be considered "active." The main idea behind the LRU algorithm is to associate with each page in RAM a counter storing the age of the page, that is, the interval of time elapsed since the last access to the page. This counter allows the PFRA to reclaim the oldest page of any process. Some computer platforms provide sophisticated support for LRU algorithms; unfortunately, 80x86 processors do not offer such a hardware feature, so the Linux kernel cannot rely on a page counter that keeps track of the age of every page.
To cope with this restriction, Linux takes advantage of the Accessed bit included in each Page Table entry, which is automatically set by the hardware when the page is accessed; moreover, the age of a page is represented by the position of the page descriptor in one of two different lists.

Active vs inactive memory

The PFRA classifies memory into active and inactive. /proc/meminfo reports the current active and inactive memory. Here is an example:
[root@server]# cat /proc/meminfo
MemTotal:    132093140 kB
MemFree:        591272 kB
Buffers:        239488 kB
Cached:      125650056 kB
SwapCached:          0 kB
Active:       25157088 kB
Inactive:    103410468 kB
HighTotal:           0 kB
HighFree:            0 kB
<snip>
This shows that active memory is 25 GB while inactive memory is 103 GB. From Linux kernel 2.6.x onwards, write-back and page-frame reclaiming are handled by the pdflush and kswapd kernel threads together with the page frame reclaiming algorithm.
Linux maintains two lists in the page cache - the Active List and the Inactive List. The Page Frame Reclaiming Algorithm gathers pages that were recently accessed in the active list so that it will not scan them when looking for a page frame to reclaim. Conversely, the PFRA gathers the pages that have not been accessed for a long time in the inactive list. Of course, pages should move from the inactive list to the active list and back, according to whether they are being accessed.
Clearly, two page states ("active" and "inactive") are not sufficient to describe all possible access patterns. For instance, suppose a logger process writes some data in a page once every hour. Although the page is "inactive" for most of the time, the access makes it "active," thus denying the reclaiming of the corresponding page frame, even if it is not going to be accessed for an entire hour. Of course, there is no general solution to this problem, because the PFRA has no way to predict the behavior of User Mode processes; however, it seems reasonable that pages should not change their status on every single access.
The PG_referenced flag in the page descriptor is used to double the number of accesses required to move a page from the inactive list to the active list; it is also used to double the number of "missing accesses" required to move a page from the active list to the inactive list (see below). For instance, suppose that a page in the inactive list has its PG_referenced flag set to 0. The first page access sets the flag to 1, but the page remains in the inactive list. The second page access finds the flag set and causes the page to be moved to the active list. If, however, the second access does not occur within a given time interval after the first one, the page frame reclaiming algorithm may reset the PG_referenced flag.
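The two-access promotion rule above can be modeled with a tiny state machine (a toy illustration, not the kernel's implementation):

```python
class Page:
    def __init__(self):
        self.active = False
        self.referenced = False          # models the PG_referenced flag

    def mark_accessed(self):
        if not self.referenced:
            self.referenced = True       # first access: only set the flag
        elif not self.active:
            self.active = True           # second access: promote the page
            self.referenced = False

p = Page()
p.mark_accessed()
print(p.active)      # -> False: still on the inactive list
p.mark_accessed()
print(p.active)      # -> True: moved to the active list
```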
The active and inactive memory figures can be used to infer the following:
- Active(file) can be used to determine what portion of the disk cache is actively in use.
- Inactive memory is the best candidate for reclaiming, so low inactive memory means that you are low on memory and the kernel may have to swap out process pages, flush the cache to disk, or, in the worst case, if it runs out of swap space, begin killing processes.

Rough PFRA algorithm

- Memory is divided into memory used by processes, disk cache, free memory, and memory used by the kernel.
- Periodically, pages are marked as active or inactive based on whether they have been accessed recently.
- Periodically, or when memory is low, pages are reclaimed from the inactive list first and then from the active list, as follows:
  - The page to be reclaimed must be swappable, syncable or discardable.
  - If the page is dirty, it is written out to disk and reclaimed.
  - If the page belongs to a User Mode process, it is written out to swap space.
  - Pages are reclaimed from the active/inactive lists in an LRU manner, as described above.
  - Depending on the "swappiness" variable, pages of a User Mode process may be preferred over disk cache pages when reclaiming memory.
- If there are very few discardable and syncable pages and the swap space is full, the system runs out of memory and invokes the OOM killer.

OOM

Despite the PFRA's effort to keep a reserve of free page frames, it is possible for the pressure on the virtual memory subsystem to become so high that all available memory becomes exhausted. This situation can quickly induce a freeze of every activity in the system: the kernel keeps trying to free memory in order to satisfy some urgent request, but it does not succeed, because the swap areas are full and all disk caches have already been shrunk. As a consequence, no process can proceed with its execution, and thus no process will ever free up the page frames that it owns.
To cope with this dramatic situation, the PFRA makes use of a so-called out of memory (OOM) killer, which selects a process in the system and abruptly kills it to free its page frames. The OOM killer is like a surgeon that amputates the limb of a man to save his life: losing a limb is not a nice thing, but sometimes there is nothing better to do.
The out_of_memory() function is invoked when free memory is very low and the PFRA has not succeeded in reclaiming any page frames. The function selects a victim among the existing processes, then invokes oom_kill_process() to perform the sacrifice.
Of course the process is not picked at random. The selected process should satisfy several requisites:
- The victim should own a large number of page frames, so that the amount of memory that can be freed is significant. (As a countermeasure against "fork-bomb" processes, the function also considers the amount of memory eaten by all children of the parent.)
- Killing the victim should lose only a small amount of work: it is not a good idea to kill a batch process that has been working for hours or days.
- The victim should be a low static priority process: users tend to assign lower priorities to less important processes.
- The victim should not be a process with root privileges: these usually perform important tasks.
- The victim should not directly access hardware devices (such as the X Window server), because the hardware could be left in an unpredictable state.
- The victim cannot be swapper (process 0), init (process 1), or any other kernel thread.

The function scans every process in the system, uses an empirical formula to compute from the above rules a value that denotes how good selecting that process is, and returns the process descriptor address of the "best" candidate for eviction. The out_of_memory() function then invokes oom_kill_process() to send a deadly signal (usually SIGKILL) either to a child of that process or, if that is not possible, to the process itself. The oom_kill_process() function also kills all clones (LWPs) that share the same memory descriptor with the selected victim.

One indicator of running into OOM is the combination of free memory, inactive memory and free swap in /proc/meminfo, as explained below -
[user@server ~]$ cat /proc/meminfo
MemTotal:   12305340 kB
MemFree:       79968 kB
Buffers:      165376 kB
Cached:      3500048 kB
SwapCached:        0 kB
Active:      9819744 kB
Inactive:    1787500 kB
SwapTotal:   2048276 kB
SwapFree:    2047668 kB
Dirty:         80108 kB
In the above example -
Free memory is 79 MB. This means that whenever the kernel requires additional memory it must reclaim it, by swapping process pages out to swap space or by writing file pages back to disk. The primary candidate for reclaiming is inactive memory, which in the above case is a healthy 1.7 GB. If there is no inactive memory to reclaim, the kernel looks at active memory. Lastly, if no active file pages are available to write to disk, and all active process pages have been swapped out or the swap space is full, the OOM killer is activated. If your server ever has an issue where the OOM killer was activated, you have seriously neglected your memory monitoring. This condition must NEVER take place on any server.
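The check described above can be sketched as a small script. The thresholds (5% of total memory reclaimable, 10% of swap free) are illustrative assumptions, not kernel values; the sample text is taken from the meminfo output above.

```python
# Sketch: flag a potential OOM risk from /proc/meminfo-style output.
# Thresholds are illustrative, not kernel values; the sample values
# come from the example above, where the verdict is "no risk" thanks
# to 1.7 GB of inactive memory and a nearly free swap.
SAMPLE = """\
MemTotal:   12305340 kB
MemFree:       79968 kB
Inactive:    1787500 kB
SwapTotal:   2048276 kB
SwapFree:    2047668 kB
"""

def parse_meminfo(text):
    """Return {field: kB} from 'Name:  value kB' lines."""
    info = {}
    for line in text.splitlines():
        name, rest = line.split(":", 1)
        info[name] = int(rest.split()[0])
    return info

def oom_risk(info, reclaim_pct=5, swap_pct=10):
    # Risky only when both reclaimable memory AND swap are nearly gone
    reclaimable = info["MemFree"] + info["Inactive"]
    low_mem = reclaimable < info["MemTotal"] * reclaim_pct / 100
    low_swap = (info["SwapTotal"] == 0 or
                info["SwapFree"] < info["SwapTotal"] * swap_pct / 100)
    return low_mem and low_swap

print(oom_risk(parse_meminfo(SAMPLE)))   # False
```

A real monitoring script would read /proc/meminfo directly and alert well before the thresholds are hit.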
Using drop_cache

Check http://linux-mm.org/Drop_Caches to learn how to drop the page cache in Linux. You can experiment with this command, in combination with the output of meminfo (Cached, Active memory, Inactive memory) and fincore, to determine how much of your data store is typically loaded into cache, within how much time, and what portion of it is extremely active.
Measuring memory utilization

atop

MEM | tot 126.0G | free 6.4G | cache 113.2G | dirty 924.9M | buff 394.7M | slab 1.8G |
SWP | tot   2.0G | free 2.0G |              | vmcom  10.1G | vmlim 65.0G |
atop shows the system memory as a whole broken up as -
MEM
- tot: total physical memory
- free: free physical memory
- cache: amount of memory used for the page cache
- dirty: amount of the page cache that is currently dirty
- buff: amount of memory used for filesystem metadata
- slab: amount of memory used for kernel mallocs

SWP
- tot: total amount of swap space on disk
- free: amount of swap space that is free

PAG (appears only if there is data to show in the interval)
- scan: number of scanned pages, due to free memory dropping below a particular threshold
- stall: number of times the kernel tried to reclaim pages due to an urgent need
- swin/swout: the number of memory pages the system read from swap space ('swin') and wrote to swap space ('swout')

/proc/meminfo

[bhavin.t@mongo-history-1 ~]$ cat /proc/meminfo
MemTotal:       62168992 kB
MemFree:          287900 kB
Buffers:           12264 kB
Cached:         59953784 kB
SwapCached:            0 kB
Active:         29934172 kB
Inactive:       30168836 kB
Active(anon):     137004 kB
Inactive(anon):       24 kB
Active(file):   29797168 kB
Inactive(file): 30168812 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:             10832 kB
Writeback:             0 kB
AnonPages:        136704 kB
Mapped:           863444 kB
Shmem:                68 kB
Slab:            1526616 kB
SReclaimable:    1498556 kB
SUnreclaim:        28060 kB
KernelStack:        1520 kB
PageTables:       110824 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    31084496 kB
Committed_AS:     393640 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      116104 kB
VmallocChunk:   34359620200 kB
DirectMap4k:    63496192 kB
DirectMap2M:           0 kB
- MemTotal: total usable RAM (i.e. physical RAM minus a few reserved bits and the kernel binary code)
- MemFree: the sum of LowFree + HighFree (essentially total free memory)
- Buffers: relatively temporary storage for raw disk blocks; shouldn't get tremendously large (20 MB or so)
- Cached: page cache. Doesn't include SwapCached
- SwapCached: memory that was once swapped out and has been swapped back in, but is still also in the swap file (if memory is needed it doesn't need to be swapped out AGAIN because it is already in the swap file; this saves I/O)
- Active: memory that has been used more recently, usually not reclaimed unless absolutely necessary
  - anon: active memory that is not file backed (see http://www.linuxjournal.com/article/10678 for a description of anonymous pages). This will typically be the larger chunk of active memory on an app server machine that does not have a DB
  - file: active memory that is file backed. This will typically be the larger chunk of active memory on a data store machine that reads/writes from disk
- Inactive: memory which has been less recently used. It is more eligible to be reclaimed for other purposes
- HighTotal/HighFree: highmem is all memory above ~860 MB of physical memory. Highmem areas are for use by userspace programs, or for the page cache. The kernel must use tricks to access this memory, making it slower to access than lowmem
- LowTotal/LowFree: lowmem is memory which can be used for everything that highmem can be used for, but it is also available for the kernel's own data structures. Among many other things, it is where everything from the Slab is allocated. Bad things happen when you run out of lowmem
- SwapTotal: total amount of swap space available
- SwapFree: amount of swap space that remains unused
- Dirty: memory which is waiting to get written back to the disk
- Writeback: memory which is actively being written back to the disk
- Mapped: files which have been mmapped, such as libraries
- Slab: in-kernel data structures cache
- Committed_AS: the total amount of memory, in kilobytes, estimated to complete the workload. This value represents the worst-case scenario, and also includes swap memory
- PageTables: the total amount of memory, in kilobytes, dedicated to the lowest page table level
- VmallocTotal: the total amount, in kilobytes, of allocated virtual address space
- VmallocUsed: the total amount, in kilobytes, of used virtual address space
- VmallocChunk: the largest contiguous block, in kilobytes, of available virtual address space

/proc/vmstat

This file shows detailed virtual memory statistics from the kernel. Most of the counters explained below are available only if the kernel was compiled with the VM_EVENT_COUNTERS config option turned on. That is because most of these counters serve no function for the kernel itself; they are useful for debugging and statistics purposes.
[user@server proc]$ cat /proc/vmstat
nr_anon_pages 2014051
nr_mapped 11691
nr_file_pages 890051
nr_slab_reclaimable 128956
nr_slab_unreclaimable 9670
nr_page_table_pages 5628
nr_dirty 15158
nr_writeback 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 4737
pgpgin 2280999
pgpgout 76513335
pswpin 0
pswpout 152
pgalloc_dma 1
pgalloc_dma32 27997500
pgalloc_normal 108826482
pgfree 136842914
pgactivate 24663564
pgdeactivate 8083378
pgfault 266178186
pgmajfault 2228
pgrefill_dma 0
pgrefill_dma32 6154199
pgrefill_normal 19920764
pgsteal_dma 0
pgsteal_dma32 0
pgsteal_normal 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 3203616
pgscan_kswapd_normal 4431168
pgscan_direct_dma 0
pgscan_direct_dma32 1056
pgscan_direct_normal 2368
pginodesteal 0
slabs_scanned 391808
kswapd_steal 7598807
kswapd_inodesteal 0
pageoutrun 49495
allocstall 37
pgrotated 154
- nr_anon_pages - anonymous (non file-backed) pages in use
- nr_mapped - pages mapped into userspace (e.g. via mmap)
- nr_file_pages - file-backed pages (page cache)
- nr_slab_reclaimable - kernel slab pages that can be reclaimed
- nr_slab_unreclaimable - kernel slab pages that cannot be reclaimed
- nr_page_table_pages - pages allocated to page tables
- nr_dirty - dirty pages waiting to be written to disk
- nr_writeback - dirty pages currently being written to disk
- nr_unstable, nr_bounce, nr_vmscan_write
- pgpgin / pgpgout - page ins / page outs since last boot
- pswpin / pswpout - swap ins / swap outs since last boot
- pgalloc_dma / pgalloc_dma32 / pgalloc_normal - page allocations per zone since last boot
- pgfree - page frees since last boot
- pgactivate / pgdeactivate - page activations / deactivations since last boot
- pgfault - page faults since last boot (major ones are also counted separately below)
- pgmajfault - major faults since last boot
- pgrefill_* - per-zone scans of the active list to refill the inactive list, since last boot
- pgsteal_* - pages reclaimed per zone since last boot
- pgscan_kswapd_* - pages scanned by kswapd, per zone, since boot
- pgscan_direct_* - pages scanned in direct (allocation-time) reclaim, per zone, since boot
- pginodesteal, slabs_scanned
- kswapd_steal - pages reclaimed by kswapd
- kswapd_inodesteal
- pageoutrun - number of times kswapd ran page reclaim
- allocstall - number of times page reclaim was invoked directly (low memory)
- pgrotated - pages rotated to the tail of the LRU

Of the above, the following are important -
- nr_dirty - the amount of memory waiting to be written to disk. If you have a power loss you can expect to lose this much data, unless your application has some form of journaling (e.g. transaction logs)
- pswpin & pswpout - should stay near zero. Growth here means the kernel is having to write process memory pages to disk to free up memory for some other process or for the disk cache. You may see occasional swapping due to the kernel swapping out a process page in favor of a disk cache page, depending on the swappiness setting
- pgfree / pgactivate / pgdeactivate - page frees, activations and deactivations since last boot
- pgmajfault - shouldn't be too high. Page faults are normal, but major page faults are generally rare; they may involve disk activity and hence should ideally not occur frequently
- allocstall - should not grow often. It signifies that the periodic running of kswapd could not free up adequate pages, and that this many times the kernel had to trigger page reclaim directly

vmstat

[user@server ~]$ vmstat -S M 5
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
 r  b swpd free buff  cache  si so  bi   bo   in    cs   us sy id wa st
 3  0    2 6593  394 115893   0  0 690  767     1     2  32 12 53  4  0
 3  0    2 6585  394 115901   0  0 204 6310  6005 23103  29 15 53  2  0
 2  1    2 6549  394 115912   0  0 182 4707  5102 20867  38 13 48  2  0
[user@server ~]$ vmstat -a -S M 5
procs ----------memory--------- --swap- ----io--- -system- ----cpu-----
 r  b swpd free inact active si so  bi   bo   in    cs   us sy id wa st
 4  0    2 6390 48082  71527  0  0 690  767     1     2  32 12 53  4  0
 2  0    2 6383 48082  71534  0  0  87 4614  5859 21944  34 13 51  1  0
 3  0    2 6376 48082  71543  0  0 137 5164  4925 19994  23 12 64  1  0
vmstat shows the following memory related fields -
- swpd: the amount of virtual memory used
- free: the amount of idle memory
- buff: the amount of memory used as buffers
- cache: the amount of memory used as cache
- inact: the amount of inactive memory (-a option)
- active: the amount of active memory (-a option)

/proc - per process memory stats

[user@server ~]$ cat /proc/7278/status
<snip>
FDSize: 1024
Groups: 26
VmPeak:  3675100 kB
VmSize:  3675096 kB
VmLck:         0 kB
VmHWM:     81160 kB
VmRSS:     81156 kB
VmData:      944 kB
VmStk:        84 kB
VmExe:      3072 kB
VmLib:      2044 kB
VmPTE:       244 kB
StaBrk: 0ac3c000 kB
Brk:    0ac82000 kB
StaStk: 7fff35863220 kB
Threads: 1
- FDSize: number of file descriptor slots currently allocated
- Groups: supplementary group list
- VmPeak: peak virtual memory size
- VmSize: virtual memory size
- VmLck: locked memory size (see mlock(3))
- VmHWM: peak resident set size ("high water mark")
- VmRSS: resident set size
- VmData, VmStk, VmExe: size of the data, stack, and text segments
- VmLib: shared library code size
- VmPTE: page table entries size (since Linux 2.6.10)
- Threads: number of threads in the process containing this thread

[user@server ~]$ cat /proc/7278/statm
918774 20289 20186 768 0 257 0
Table 1-2: Contents of the statm files (as of 2.6.8-rc3)
..............................................................................
Field     Content
size      total program size (pages) (same as VmSize in status)
resident  size of memory portions (pages) (same as VmRSS in status)
shared    number of pages that are shared (i.e. backed by a file)
trs       number of pages that are 'code' (not including libs; broken, includes data segment)
lrs       number of pages of library (always 0 on 2.6)
drs       number of pages of data/stack (including libs; broken, includes library text)
dt        number of dirty pages (always 0 on 2.6)
..............................................................................
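The statm numbers above are in pages and can be converted to kB with a short helper. Assumes a 4 kB page size (check with `getconf PAGESIZE`); the sample line is the one shown above for pid 7278.

```python
# Sketch: decode a /proc/<pid>/statm line into kB, assuming 4 kB pages.
# Field names follow Table 1-2 above.
PAGE_KB = 4

def decode_statm(line):
    fields = ["size", "resident", "shared", "trs", "lrs", "drs", "dt"]
    pages = dict(zip(fields, map(int, line.split())))
    return {name: n * PAGE_KB for name, n in pages.items()}

kb = decode_statm("918774 20289 20186 768 0 257 0")
print(kb["size"])      # 3675096 -> matches VmSize in /proc/7278/status
print(kb["resident"])  # 81156   -> matches VmRSS
```

The round-trip against the status output above is a handy sanity check that you have the page size right.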
[user@server ~]$ cat /proc/7278/stat
7278 (postgres) S 1 7257 7257 0 -1 4202496 36060376 10845160168 0 749 20435 137212 158536835 39143290 15 0 1 0 50528579 3763298304 20289 18446744073709551615 4194304 7336916 140734091375136 18446744073709551615 225773929891 0 0 19935232 84487 0 0 0 17 2 0 0 12
Table 1-3: Contents of the stat files (as of 2.6.22-rc3)
..............................................................................
Field         Content
pid           process id
tcomm         filename of the executable
state         state (R is running, S is sleeping, D is sleeping in an uninterruptible wait, Z is zombie, T is traced or stopped)
ppid          process id of the parent process
pgrp          pgrp of the process
sid           session id
tty_nr        tty the process uses
tty_pgrp      pgrp of the tty
flags         task flags
min_flt       number of minor faults
cmin_flt      number of minor faults with child's
maj_flt       number of major faults
cmaj_flt      number of major faults with child's
utime         user mode jiffies
stime         kernel mode jiffies
cutime        user mode jiffies with child's waited for
cstime        kernel mode jiffies with child's waited for
priority      priority level
nice          nice level
num_threads   number of threads
it_real_value (obsolete, always 0)
start_time    time the process started after system boot
vsize         virtual memory size
rss           resident set memory size
rsslim        current limit in bytes on the rss
start_code    address above which program text can run
end_code      address below which program text can run
start_stack   address of the start of the stack
esp           current value of ESP
eip           current value of EIP
pending       bitmap of pending signals (obsolete)
blocked       bitmap of blocked signals (obsolete)
sigign        bitmap of ignored signals (obsolete)
sigcatch      bitmap of caught signals (obsolete)
wchan         address where process went to sleep
0             (place holder)
0             (place holder)
exit_signal   signal to send to parent thread on exit
task_cpu      which CPU the task is scheduled on
rt_priority   realtime priority
policy        scheduling policy (man sched_setscheduler)
blkio_ticks   time spent waiting for block IO
..............................................................................
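Parsing a stat line has one trap: the comm field is wrapped in parentheses and may itself contain spaces, so it is safer to split on the text after the closing ')'. A sketch using the postgres line shown above (4 kB pages assumed):

```python
# Sketch: pull state, vsize and rss out of a /proc/<pid>/stat line.
# Sample line is the postgres process shown above.
STAT = ("7278 (postgres) S 1 7257 7257 0 -1 4202496 36060376 "
        "10845160168 0 749 20435 137212 158536835 39143290 15 0 1 0 "
        "50528579 3763298304 20289 18446744073709551615 4194304 "
        "7336916 140734091375136 18446744073709551615 225773929891 "
        "0 0 19935232 84487 0 0 0 17 2 0 0 12")

def vsize_rss(stat_line):
    # Split only after the ')' that closes comm, so a comm containing
    # spaces cannot shift the field positions
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    state = rest[0]
    vsize_kb = int(rest[20]) // 1024   # 'vsize' is in bytes (Table 1-3)
    rss_kb = int(rest[21]) * 4         # 'rss' is in pages; 4 kB assumed
    return state, vsize_kb, rss_kb

print(vsize_rss(STAT))   # ('S', 3675096, 81156) -> matches VmSize/VmRSS
```

Again the values agree with the status and statm outputs above, which confirms the field offsets.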
[user@server ~]$ cat /proc/7278/smaps
00400000-00700000 r-xp 00000000 08:03 6424710   /usr/local/postgres/pgsql8.2.3/bin/postgres
Size:              3072 kB
Rss:               2108 kB
Shared_Clean:      2108 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:        0 kB
Swap:                 0 kB
2b3a78a33000-2b3b5493f000 rw-s 00000000 00:09 1114115   /SYSV0052e2c1 (deleted)
Size:           3603504 kB
Rss:            2129800 kB
Shared_Clean:     54300 kB
Shared_Dirty:   2075500 kB
Private_Clean:        0 kB
Private_Dirty:        0 kB
Swap:                 0 kB
smaps shows, for each mapping in a process, the memory distribution across libraries, data and program text, and what portion of it is shared. For instance, above I have snipped out two entries from postgres, showing that the postgres executable is taking 2 MB of shared memory and the postgres internal cache (a SysV shared memory segment) is taking 2 GB of shared memory.
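Totals across all mappings can be computed with a few lines. The sample reuses the two postgres entries shown above; a real script would read /proc/&lt;pid&gt;/smaps directly.

```python
# Sketch: total up Rss and shared memory across the mappings of an
# smaps-style dump (sample: the two postgres entries shown above).
SMAPS = """\
00400000-00700000 r-xp 00000000 08:03 6424710 /usr/local/postgres/pgsql8.2.3/bin/postgres
Size: 3072 kB
Rss: 2108 kB
Shared_Clean: 2108 kB
Shared_Dirty: 0 kB
2b3a78a33000-2b3b5493f000 rw-s 00000000 00:09 1114115 /SYSV0052e2c1 (deleted)
Size: 3603504 kB
Rss: 2129800 kB
Shared_Clean: 54300 kB
Shared_Dirty: 2075500 kB
"""

def totals(text):
    sums = {"Rss": 0, "Shared_Clean": 0, "Shared_Dirty": 0}
    for line in text.splitlines():
        key = line.split(":")[0]
        if key in sums:                     # header lines are skipped
            sums[key] += int(line.split()[1])
    return sums

t = totals(SMAPS)
print(t["Rss"])                                # 2131908 kB resident
print(t["Shared_Clean"] + t["Shared_Dirty"])   # 2131908 kB shared
```

For these two mappings the resident and shared totals are identical, i.e. essentially everything resident here is shared memory.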
[root@server]# pmap -x 30850 | less
Address           Kbytes     RSS   Dirty Mode  Mapping
0000000040000000      36       0       0 r-x-- java
0000000040108000       8       8       8 rwx-- java
0000000041373000 1469492 1469352 1469352 rwx-- [ anon ]
000000071ae00000   45120   44740   44740 rwx-- [ anon ]
000000071da10000   38848       0       0 ----- [ anon ]
0000000720000000 3670016 3670016 3670016 rwx-- [ anon ]
00007ff67286f000      12       0       0 ----- [ anon ]
00007ff672872000    1016      24      24 rwx-- [ anon ]
00007ff672970000      12       0       0 ----- [ anon ]
00007ff672973000    1016      24      24 rwx-- [ anon ]
...
top

Mem:  132093140k total, 128645860k used,  3447280k free,   413200k buffers
Swap:   2096472k total,     2596k used,  2093876k free, 122750144k cached
  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ SWAP CODE DATA nFLT nDRT COMMAND
21827 postgres 15  0 3626m 2.1g 2.0g S 15.5  1.6 10:20.94 1.5g 3072  32m    0    0 postgres
19638 postgres 15  0 3626m 2.1g 2.0g S 14.5  1.6 14:03.23 1.5g 3072  32m    0    0 postgres
27306 postgres 15  0 3618m 2.1g 2.0g R 11.6  1.6  9:34.90 1.5g 3072  24m    0    0 postgres
19673 postgres 15  0 3626m 2.1g 2.0g S 10.9  1.6  8:40.20 1.5g 3072  32m    0    0 postgres
22068 postgres 15  0 3626m 2.1g 2.0g S 10.2  1.6 15:20.89 1.5g 3072  32m    0    0 postgres
 4339 postgres 15  0 3618m 2.1g 2.0g S  8.6  1.6  8:04.42 1.5g 3072  24m    0    0 postgres
top shows the following global memory related fields -
- Mem: physical memory (total, used, free, used for buffers)
- Swap: swap space (total, used, free). The last value on the Swap line is not swap at all: it is the size of the page cache ("cached"), which top displays there purely for layout reasons

top shows the following memory related fields per process -

- %MEM - Memory usage (RES): a task's currently used share of available physical memory
- VIRT - Virtual Image (kb): the total amount of virtual memory used by the task. It includes all code, data and shared libraries, plus pages that have been swapped out. (Note: you can define the STATSIZE=1 environment variable and VIRT will be calculated from the /proc/#/state VmSize field.)
- SWAP - Swapped size (kb): the swapped-out portion of a task's total virtual memory image. SWAP is calculated simply as VIRT - RES, which also counts pages that were merely never faulted in; this field therefore shows misleading data in my opinion
- RES - Resident size (kb): the non-swapped physical memory a task has used. RES = CODE + DATA. RES includes SHR
- CODE - Code size (kb): the amount of physical memory devoted to executable code, also known as the 'text resident set' size or TRS
- DATA - Data+Stack size (kb): the amount of physical memory devoted to other than executable code, also known as the 'data resident set' size or DRS
- SHR - Shared Mem size (kb): the amount of shared memory used by a task. It reflects memory that could potentially be shared with other processes
- nFLT - Page Fault count: the number of major page faults that have occurred for the task. A page fault occurs when a process attempts to read from or write to a virtual page that is not currently present in its address space. A major page fault is one where disk access is involved in making that page available
- nDRT - Dirty Pages count: the number of pages that have been modified since they were last written to disk. Dirty pages must be written to disk before the corresponding physical memory location can be used for some other virtual page

vmtouch

vmtouch is a great tool for learning about and controlling the file system cache of unix and unix-like systems.
You can use it to learn how much of a file is in memory, which files should be evicted from memory, etc.
Example 1

How much of the /bin/ directory is currently in cache?
$ vmtouch /bin/
           Files: 92
     Directories: 1
  Resident Pages: 348/1307  1M/5M  26.6%
         Elapsed: 0.003426 seconds
Example 2

We have 3 big datasets: a.txt, b.txt, and c.txt, but only 2 of them will fit in memory at once. If we have a.txt and b.txt in memory but would now like to work with b.txt and c.txt, we could just start loading up c.txt, but then our system would evict pages from both a.txt (which we want) and b.txt (which we don't want).
So let's give the system a hint and evict a.txt from memory, making room for c.txt:
$ vmtouch -ve a.txt
Evicting a.txt
           Files: 1
     Directories: 0
   Evicted Pages: 42116 (164M)
         Elapsed: 0.076824 seconds
fincore

fincore is a great tool that can be used to measure how much of a file is currently in the disk cache. This can be used to determine rough cache usage for an application.
root@xxxxxx:/var/lib/mysql/blogindex# fincore --pages=false --summarize --only-cached *
stats for CLUSTER_LOG_2010_05_21.MYI: file size=93840384 , total pages=22910 , cached pages=1 , cached size=4096, cached perc=0.004365
stats for CLUSTER_LOG_2010_05_22.MYI: file size=417792 , total pages=102 , cached pages=1 , cached size=4096, cached perc=0.980392
stats for CLUSTER_LOG_2010_05_23.MYI: file size=826368 , total pages=201 , cached pages=1 , cached size=4096, cached perc=0.497512
stats for CLUSTER_LOG_2010_05_24.MYI: file size=192512 , total pages=47 , cached pages=1 , cached size=4096, cached perc=2.127660
stats for CLUSTER_LOG_2010_06_03.MYI: file size=345088 , total pages=84 , cached pages=43 , cached size=176128, cached perc=51.190476
stats for CLUSTER_LOG_2010_06_04.MYD: file size=1478552 , total pages=360 , cached pages=97 , cached size=397312, cached perc=26.944444
stats for CLUSTER_LOG_2010_06_04.MYI: file size=205824 , total pages=50 , cached pages=29 , cached size=118784, cached perc=58.000000

Optimizing memory usage

Optimizing memory usage consists of the following principles -
Ensure memory never runs out

This can be achieved as follows -
- Reduce your application's memory footprint. Try to use memory efficiently within your application; use memory-efficient data structures
- Perform proper capacity planning to determine memory usage during peak loads. Account for concurrently running processes and the disk cache requirements of all running applications, for the potential impact of backup scripts or scripts that read/write large quantities of data from disk (which typically wipe out your disk cache if they are not configured to use O_DIRECT), and for the free memory requirements of the OS and other applications
- Use LWPs (threads) instead of processes for concurrency. Even when using processes, try to use shared memory for inter-process communication and common data
- Monitor your memory utilization and determine whether any process is hogging too much memory

No swapping

Your server should NEVER swap. Note: some swapping may occur if the kernel is configured to prefer using memory for disk caches rather than process space; however this too should be minimal. Swapping is bad and should never occur. You may want to tune /proc/sys/vm/swappiness; details at http://www.westnet.com/~gsmith/content/linux-pdflush.htm and http://people.redhat.com/nhorman/papers/rhel4_vm.pdf. In fact, if you have done appropriate capacity planning you can configure your machine without any swap space altogether (refer http://david415.wordpress.com/2009/11/21/running-linux-with-no-swap/). Of course, you then have to be dead certain about your capacity planning.

Rare page faulting

Page faults occur when a new process is forked or when an existing process requests additional memory allocation. In a constantly running server these situations should not be frequent, so you should see very rare page faulting, especially major faults (minor faults are fine - they require no disk access; major faults may require disk access).
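The "rare page faulting" rule above can be monitored by sampling the pgmajfault counter from /proc/vmstat at an interval. A sketch with two embedded sample snapshots (live values obviously vary per machine; the 60-second interval and the snapshot values are made up):

```python
# Sketch: estimate the major-fault rate from two /proc/vmstat samples.
def read_pgmajfault(vmstat_text):
    for line in vmstat_text.splitlines():
        name, value = line.split()
        if name == "pgmajfault":
            return int(value)
    raise KeyError("pgmajfault not found")

# Two hypothetical snapshots, taken 60 seconds apart:
before = "pgpgin 2280999\npgmajfault 2228\n"
after  = "pgpgin 2281999\npgmajfault 2240\n"

rate = (read_pgmajfault(after) - read_pgmajfault(before)) / 60.0
print(round(rate, 2))   # 0.2 major faults per second
```

A fraction of a major fault per second is unremarkable; a sustained rate of tens per second on a steady-state server is worth investigating.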
Maximize disk cache hits for reads

- Determine your disk cache needs appropriately. For example, if your database is 100 GB and about 10% of it is accessed about 95% of the time, then you need about 10 GB to be available in the disk cache
- You can use fincore to determine what portion of your data files is in the page cache at any given time. This helps you determine which files are being cached and what percentage of them remains in cache in a warm-cache scenario
- You can also use drop_cache and fincore dumps in combination to determine the rate at which your data files are loaded into cache, which also gives you some idea of what portion of your data is most frequently accessed
- Lastly, if you have implemented flashcache in LRU mode, you can judge from the flashcache hits and misses roughly how much data is frequently used and how much RAM you may wish to dedicate to the disk cache
- Disk cache replacement is LRU, which works for normal access scenarios. However, a backup script or any script that reads or writes a large amount of data in one pass can wipe out your disk cache. It is therefore important to optimize sequential backup and similar scripts to use the O_DIRECT mode of data transfer, which bypasses the disk cache and prevents wiping it out. It is also important to run these types of processes when your system has minimal IO load
- Avoid double buffering - it only wastes space. For example, if your application caches disk data in its own heap, you are spending memory twice on the most frequently used data. It may make sense to manage your own app cache, since you can save significant CPU cycles by caching data in the exact format it is needed in, as opposed to the page cache that Linux maintains. In that case, however, you may want to use O_DIRECT when reading data that is not available in your app cache
- Alternatively, you may leave caching entirely to the operating system and not maintain any cache in your own application
- Backup processes, or processes that linearly read/write a large chunk of the disk, should not be allowed to wipe out the page cache
- The disk cache replacement algorithm provided by the operating system is LRU. It is not currently practical to change this, although depending on your application a different algorithm might be more optimal
- Check the disk IO operations. If they are predominantly writes with minimal reads, your disk cache is likely serving most of the reads. If they are predominantly reads, you could improve performance by optimizing your disk cache
- Each application has its own data access behavior. Combining multiple applications on a single machine so that they have to share their dynamic memory results in sub-optimal memory utilization. For example, consider a web hosting server holding both site data and a database. Website data typically comprises static files, HTML files, code files, images, videos and other media, and the total amount of website content is much larger than the database content (on typical servers we have noticed site data can run to terabytes while database sizes are in gigabytes). However, databases generate more IOPS than site content. If you deploy them on the same server and they have to share RAM, a greater portion of the RAM will be dedicated to site data than to database data, even though the latter is more frequently accessed and should get a larger portion of the disk cache.
Just by separating these two applications you can optimize the data that gets stored in the disk cache. Here is a small probabilistic model that shows how segregating the disk caches of different applications can help optimize memory usage:

- Say we have 6 blocks of data: A, B, C, D, E, F
- A is accessed 10 times every minute; B, C, D, E and F are accessed 2 times every minute each
- Say your cache can only store one of these blocks

A accounts for 10 of the 20 accesses per minute, so the probability of finding A in the cache is 50%, and the probability of finding one of B, C, D, E, F is also 50%. But if the cache holds any of B, C, D, E or F, the cache will be useful only 2 times every minute. So half the time the cache is being used sub-optimally.

Tuning disk cache

Refer to http://www.westnet.com/~gsmith/content/linux-pdflush.htm and http://www.cyberciti.biz/faq/linux-kernel-tuning-virtual-memory-subsystem/ for tips on tuning the pdflush and kswapd parameters which control the page reclaim logic of your disk cache.
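Returning to the probabilistic cache model above, the numbers can be computed exactly. The assumptions (not stated in the text) are that accesses are independent and that the one-block LRU cache always holds the most recently accessed block, so the probability a block is cached equals its share of accesses:

```python
# Exact version of the toy one-block cache model: A gets 10 of every
# 20 accesses per minute, B-F get 2 each.
freq = {"A": 10, "B": 2, "C": 2, "D": 2, "E": 2, "F": 2}  # accesses/min
total = sum(freq.values())                                # 20 per minute
p_cache = {b: n / total for b, n in freq.items()}
print(p_cache["A"])          # 0.5 - A is in the cache half the time

# Expected hits per minute: an access to block b hits only if b is
# currently cached, so hits = sum over b of freq[b] * p_cache[b]
hits = sum(n * n for n in freq.values()) / total
print(hits)                  # 6.0 of 20 accesses/min hit (30% hit rate)
```

If A instead had a dedicated one-block cache of its own, all 10 of its accesses would hit, which is the intuition behind segregating the caches of applications with different access patterns.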
Maximize disk cache merges for writes

If your application does not need to fsync() data immediately, you can gain a considerable performance boost from the write-back nature of the disk cache. Most databases, mail servers etc. fsync() on each write since they cannot afford to lose data. However, if you have built a custom data store, you may have a model wherein you write the same data to multiple nodes synchronously. In that case the nodes do not all need to fsync() the data, since a replica is available in case of a total node failure. The total number of physical writes then drops, since many updates cancel previous writes and multiple writes can be merged together, resulting in fewer IOPS; this helps in the case of both flash drives and SATA drives.
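The effect described above can be illustrated with a toy write-back buffer. The block numbers and data are made up; the point is that repeated writes to the same block are absorbed in memory, so only the final version reaches disk at flush time:

```python
# Toy model of write-back merging: a dirty-block buffer absorbs
# repeated writes to the same block, so flush issues one physical
# write per dirty block, not per logical write.
class WriteBackCache:
    def __init__(self):
        self.dirty = {}       # block number -> latest data
        self.flushes = 0      # physical writes actually issued

    def write(self, block, data):
        self.dirty[block] = data          # overwrite merges with prior write

    def flush(self):
        self.flushes += len(self.dirty)   # one physical write per block
        self.dirty.clear()

cache = WriteBackCache()
for i in range(100):
    cache.write(7, f"version-{i}")        # 100 logical writes, same block
cache.write(8, "other")
cache.flush()
print(cache.flushes)   # 2 physical writes for 101 logical writes
```

The trade-off, as noted above, is durability: everything sitting in the dirty buffer is lost on power failure, which is why this only works when a synchronous replica exists elsewhere.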
Tuning kernel vm parameters

Refer to http://www.mjmwired.net/kernel/Documentation/sysctl/vm.txt