US9047019B2 - Shared temporary storage management in a shared disk database cluster - Google Patents

Shared temporary storage management in a shared disk database cluster

Info

Publication number
US9047019B2
Authority
US
United States
Prior art keywords
shared
temporary storage
nodes
units
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US13/291,157
Other versions
US20130117526A1 (en)
Inventor
Colin Joseph FLORENDO
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sybase Inc
Original Assignee
Sybase Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sybase Inc filed Critical Sybase Inc
Priority to US13/291,157 priority Critical patent/US9047019B2/en
Assigned to SYBASE, INC. reassignment SYBASE, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: FLORENDO, COLIN JOSEPH
Publication of US20130117526A1 publication Critical patent/US20130117526A1/en
Application granted granted Critical
Publication of US9047019B2 publication Critical patent/US9047019B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/062Securing storage systems
    • G06F3/0622Securing storage systems in relation to access
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • G06F3/0637Permissions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Definitions

  • the present invention relates to information processing environments and, more particularly, to shared temporary storage management in a shared disk database cluster.
  • Computers are very powerful tools for storing and providing access to vast amounts of information.
  • Computer databases are a common mechanism for storing information on computer systems while providing easy data access to users.
  • a typical database is an organized collection of related information stored as “records” having “fields” of information.
  • a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
  • a database management system or DBMS is typically provided as a software cushion or layer.
  • the DBMS shields the database user from knowing or even caring about underlying hardware-level details.
  • all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level.
  • SDC Shared Disk Cluster
  • Each computer system in a SDC is also referred to as a node, and all nodes in the cluster communicate with each other, typically through private interconnects.
  • SDC database systems provide for transparent, continuous availability of the applications running on the cluster with support for failover amongst servers. More and more, mission-critical systems, which store information on database systems, such as data warehousing systems, are run from such clusters. Products exist for building, managing, and using a data warehouse, such as Sybase IQ available from Sybase, Inc. of Dublin, Calif.
  • Distributed query processing allows SQL queries submitted to one node of the cluster to be processed by multiple cluster nodes, allowing more hardware resources to be utilized to improve performance.
  • Distributed query processing typically requires the nodes to share temporary, intermediate data pertaining to the query in order to process and assemble the final result set, after which the temporary data is discarded.
  • the temporary data consumes space of one or more network storage devices specifically configured for temporary storage use by the database cluster.
  • the simplest solution for shared temporary storage management is to statically reserve a fixed portion of the shared temporary store for each node in the database cluster. This makes exclusive access rights unambiguous, as each node will use its reserved portion of the shared temporary storage.
  • the invention includes system, method, computer program product embodiments and combinations and sub-combinations thereof for temporary storage management in a shared disk database cluster. Included is the reserving of units on-demand and of variable size from shared temporary storage space in the SDC. The utilization of the reserved units of the shared temporary storage space is tracked, and the shared temporary storage space is administered based on the tracking.
  • FIG. 1 illustrates an example of a clustered server configuration.
  • FIG. 2 illustrates a block diagram of an overall approach for shared temporary storage management in accordance with embodiments of the invention.
  • FIGS. 3 a , 3 b , 3 c , 3 d , and 3 e illustrate block diagram representations of an example of shared temporary storage states in accordance with embodiments of the invention.
  • FIG. 4 illustrates an example computer useful for implementing components of embodiments of the invention.
  • the present invention relates to a system, method, computer program product embodiments and combinations and sub-combinations thereof for shared temporary storage management in a shared disk database cluster.
  • FIG. 1 illustrates an example 100 of a shared disk database cluster, which, in general, handles concurrent data loads and queries from users/applications via independent data processing nodes connected to shared data storage.
  • shared database objects can be written by one user and queried by multiple users simultaneously. Many objects of this type may exist and be in use at the same time in the database.
  • Each node is an instance of a database server typically running on its own host computer.
  • a primary node, or coordinator 140 manages all global read-write transactions. Storage data is kept in main or permanent storage 170 which is shared between all nodes, and similarly, temporary data can be shared using shared temporary storage 180 .
  • the coordinator 140 further maintains a global catalog, storing information about DDL (data definition language) operations, in catalog store 142 as a master copy for catalog data. Changes in the global catalog are communicated from the coordinator 140 to other nodes 150 via a table version (TLV) log kept inside shared main store 170 through an asynchronous mechanism referred to herein as ‘catalog replication’.
  • catalog replication the coordinator 140 writes TLV log records which other nodes 150 read and replay to update their local catalog.
  • the one or more secondary nodes 150 a , 150 b , 150 c , etc. each have their own catalog stores 152 a , 152 b , 152 c , etc., configured locally to maintain their own local catalogs.
  • the secondary nodes 150 may be designated as reader (read-only) nodes and writer (read-write) nodes, with one secondary node designated as a failover node to assume the coordinator role if the current coordinator 140 is unable to continue. All nodes are connected in a mesh configuration where each node is capable of executing remote procedure calls (RPCs) on other nodes.
  • the nodes that participate in the cluster share messages and data via Inter-node Communication (INC) 160, which provides a TCP/IP-based communication link between cluster nodes.
  • each node has its own local transaction manager.
  • the transaction manager on the coordinator 140 acts as both local and global transaction manager. Clients may connect to any of the cluster nodes as individual servers, each being capable of running read only transaction on its own using its local transaction manager.
  • secondary nodes 150 can run queries and update inside a write transaction, but only the global transaction manager on the coordinator 140 is allowed to start and finish the transaction (known as global transaction). Secondary nodes 150 internally request the coordinator 140 to begin and commit global transactions on their behalf. Committed changes from write transactions become visible to secondary nodes 150 via catalog replication.
  • the coordinator node 140 manages separate storage pools for permanent 170 and shared temporary storage 180 . All permanent database objects are stored on the shared permanent storage pool 170 . The lifespan of permanent objects is potentially infinite, as they persist until explicitly deleted. The state and contents of the shared permanent storage pool 170 must persist across coordinator 140 restarts, crash recovery, and coordinator node failover, as well as support backup and recovery operations.
  • All nodes also manage their own separate storage pool 154 a, 154 b, 154 c for local temporary data, which consists of one or more local storage devices.
  • Local temporary database objects exist only for the duration of a query. Local temporary database objects are not shared between nodes, and therefore the state and contents of the local temporary storage pool, which are isolated to each single node, do not need to persist across node crashes or restarts.
  • Distributed query processing typically requires the nodes 140 , 150 to share temporary, intermediate data pertaining to the query in order to process and assemble the final result set, after which the temporary data is discarded.
  • each node in the cluster must have read-write access to a portion of the total shared temporary storage 180 for writing result data to share with other nodes.
  • These portions are logical subsets of the total temporary storage 180 and are physically embodied as stripes across multiple physical disks.
  • the management of shared temporary storage 180 for distributed database queries is achieved in a manner that can adapt to dynamic configuration and workload conditions.
  • FIG. 2 a block flow diagram illustrates an overall approach in accordance with an embodiment of the invention for temporary storage management to support symmetric distributed query processing in a shared disk database cluster, where any node in the cluster can execute a distributed query.
  • the approach includes reserving units on-demand and of variable size from the shared temporary space in the SDC (block 210 ), tracking the utilization of reserved units of the shared temporary space (block 220 ), and administering the shared temporary space based on the tracking (block 230 ).
  • a node 150 requests a shared temporary space reservation when needed from the coordinator 140 via IPC calls.
  • the coordinator 140 provides discrete reservation units and controls the size of the reservation units carved out of the global shared temporary storage pool 180 based on the remaining free space, the number of nodes in the cluster, and the current reservations for the requesting node.
  • an initial size of the reserved space is inversely proportional to the number of nodes, to allow all nodes in the SDC to have a fair chance at getting an initial allocation, under an assumption that all nodes in the SDC will eventually require an allocation.
  • the initial request reservation percentage (the percent of the total shared temporary space set aside for initial allocations across all nodes) is a hard coded value but could easily be replaced with a field-adjustable parameter, if desired.
  • If the calculated initial request size is greater than a predetermined maximum request size (e.g., a field-adjustable database option), the size is rounded down to the maximum.
  • If the calculated initial request size is smaller than a minimum request size, it is rounded up to the minimum.
  • Any subsequent requests by a node follow a different calculation, with the initial request sizing potentially larger than subsequent request sizes, to minimize “ramp up” time to reach a shared temporary storage usage steady state when starting a node.
  • a suitable formula representation for the calculation is:
  • Subsequent request size=(subsequent request percentage)*(remaining free space−initial reservation pool size).
  • the ‘subsequent request percentage’ refers to a flat percentage of the remaining space in the shared temporary storage pool (e.g., a hard coded value or a field-adjustable parameter)
  • the ‘remaining free space’ refers to the total free, unreserved space in the global shared temporary storage pool
  • the ‘initial reservation pool size’ refers to the total space in the global shared temporary storage pool multiplied by the initial request reservation percentage ((total storage)*(initial request reservation percentage)).
  • the reserved space size gets smaller as less space is available, thus throttling the reservation unit sizes as space runs short, and allows for nodes with a larger shared temporary workload to reserve as much space as needed.
  • all running nodes always retain at least one reservation unit.
  • reservation unit chains provide a data structure and methodology used to track the discrete reservation units of space reserved for a particular node.
  • the reservation units are added to the chain as a result of the space reservation requests, and removed, such as in a last in, first out (LIFO) manner, via timed expiration of the last link in the chain, with all reservation unit state changes being transactional.
  • the coordinator 140 maintains an active reservation unit chain for each node, including the coordinator node, with each active reservation unit marked with the transaction ID of the reservation unit creation, and each active reservation unit chain marked with a timestamp of the last space reservation request for that node. Similarly, an expired reservation unit chain is maintained for each node, with each expired reservation unit marked with the transaction ID at the time of the expiration event.
  • the tracking of all expired reservation units for the SDC in the coordinator 140 achieves transactional persistence of this management data, as is well appreciated by those skilled in the art.
  • an active reservation unit chain is maintained representing the reservation units received by only that node.
  • Each active reservation unit is marked with the transaction ID of the reservation unit creation and with a timestamp of the last space reservation request for that node.
  • An expired reservation unit chain also is maintained representing the reservation units expired by only that node.
  • the reservation units received by a node provide free space for allocation of shared temporary data.
  • the tracking of the allocation of free space to an object by a node occurs via a bitmap referred to herein as a freelist.
  • each bit in the freelist represents a logical disk block, which is part of the logical storage space consisting of all the physical disk blocks of the network storage devices configured for that storage pool, where a bit value of 0 means that the logical block is free, while a bit value of 1 means that the logical block is in use.
  • the coordinator node 140 owns a global shared temp freelist for tracking which blocks are globally free, meaning they are free to be reserved for exclusive use by the SDC nodes. Being the owner of the global shared temp freelist, the coordinator 140 has to maintain the global shared temp freelist block space in synchronization with changes to the shared temp store space (adding and removing files, RO/RW state, etc.), persist the global shared temp freelist state on coordinator 140 shutdown and failover, perform crash recovery of the global shared temp freelist state in the event of coordinator 140 crash, manage freelist space reservations for all nodes, including itself, return freelist space reservations to the global shared temp freelist once they are released by a node, and return space for shared temporary data logically freed by one node back to the node the space is reserved for.
  • No node, including the coordinator 140, can allocate space for temporary objects directly from the global shared temp freelist; instead, each node must first reserve space for exclusive use and allocate blocks from that reserved space. Accordingly, the coordinator 140 maintains two shared temp freelists, the global shared temp freelist and a proxy shared temp freelist (tracking reserved space usage for the coordinator 140 ).
  • Each secondary node 150 also maintains its own shared temp proxy freelist, which tracks reserved space usage for that node.
  • the initial shared temp proxy freelist on secondary nodes is empty, meaning it contains no free space. Further, as secondary node proxy freelist contents are not expected to persist across secondary server 150 restarts, every time a node is restarted, (e.g., the initial startup, or a startup after clean shut down or crash), the shared temp proxy freelist returns to a “no space available” state.
  • reservation units are bitmaps which set bit positions corresponding to logical disk blocks represented in the global shared temp freelist to be reserved for a particular node. All blocks in the shared temp freelist not freed via space reservation are essentially masked; the secondary node proxy freelist has them marked in use.
  • the object may be logically destroyed by the node which allocated it or it may be logically destroyed by another node participating in the distributed query.
  • the space allocated for that object must be returned to the proxy freelist of the node which allocated it. This is done via a global shared temp garbage collection mechanism, which recycles all non-locally freed shared temp space to the respective owner through the coordinator 140 .
  • the garbage collection logic handles processing of the global free bitmaps maintained by each node.
  • each node maintains one global free bitmap, which is a bitmap representing logical storage blocks to be sent to the coordinator 140 for return either to the node for which the block is reserved or returned to the global shared temporary storage pool if the block is no longer reserved for any node.
  • Secondary nodes 150 send the contents of their global free bitmap to the coordinator node 140 periodically via IPC during the database garbage collection event. The result of a secondary node garbage collection event is to transfer the logical storage blocks from that node's global free bitmap to the coordinator node's global free bitmap.
  • the coordinator node's garbage collection event then periodically processes its global free bitmap and returns de-allocated blocks to the node for which the blocks are still reserved or to the global pool if the blocks are no longer reserved.
  • shared temporary storage is recycled to the proper owner after de-allocation by any node.
  • Reservation unit chain expiration allows for the returning of unused shared temp space currently reserved for a node back to the global storage pool, such as when nodes temporarily hold more reservation units than usual to accommodate a busy period. Reservation unit chain expiration uses the following logic.
  • Each node, including the coordinator node, controls its own reservation unit expiration, driven periodically by a timed database event and based on the local timestamp and an expiration period of each active reservation unit chain.
  • the expiration period is a value expressing the amount of time the current reservation unit chain is valid, and the reservation unit chain timestamp is reset every time a new reservation unit is added to the chain via a successful reservation unit request to the coordinator.
  • the current reservation unit chain is considered expired when the amount of time past the timestamp exceeds the expiration period.
  • expiring reservation unit bitmaps are compared against the shared temp proxy freelist, and all bits in the expiring reservation unit which are currently marked 0 (free) in the shared temp proxy freelist will be marked 1 (to be returned to the global pool) in the global free bitmap, and marked 1 (in use) in the shared temp proxy freelist.
  • the secondary nodes 150 communicate reservation unit expiration and the global free bitmap to the coordinator 140 via an IPC call as part of the periodic garbage collection event.
  • This IPC call sends the unique IDs of each expired reservation unit.
  • the secondary node 150 keeps the expired reservation unit in its local reservation unit chain until it receives a positive acknowledgement from the coordinator 140 that the coordinator 140 has processed the expiration. This is considered necessary to eliminate race conditions and failure scenarios where a given storage block would be left unaccounted for. Expired reservation unit chains for all nodes are maintained persistently across coordinator 140 restart and failover, and reservation unit expiration for the coordinator 140 is directly processed by the coordinator 140 during its periodic garbage collection event.
  • the coordinator 140 performs comparisons as part of the garbage collection event (e.g., using bitwise logical AND comparisons).
  • One comparison involves comparing the global free bitmap against the bitmaps in the active reservation unit chains for all secondary nodes and producing a single return blocks bitmap, which records all the blocks to be returned to all secondary nodes.
  • the global free bitmap is also compared against the bitmaps in the active reservation unit chain for the coordinator. The result of that comparison is used to mark blocks free in the coordinator's shared temp proxy freelist, meaning they are free for the coordinator to allocate.
  • the global free bitmap is compared against the bitmaps in the expired reservation unit chains for all nodes, and the result is used to mark blocks free in the global shared temp freelist, meaning they are free for reservation by specific nodes.
  • the coordinator 140 writes the return blocks bitmap to the shared permanent store, and adds a record to a global version synchronization log.
  • This shared log structure on the shared permanent store is used to propagate metadata changes from the coordinator 140 to all secondary nodes 150 , such as is capable in the environment of the aforementioned Sybase IQ.
  • each secondary node 150 compares the return blocks bitmap against their active and expired reservation units. Any blocks matching with active reservation units are freed in the node's shared temp proxy freelist. Any blocks within the node's expired reservation units are added to the global free bitmap for return to the coordinator 140 , and removed from that node's local expired allocation units. Any blocks outside of these conditions are ignored.
  • Referring now to FIGS. 3 a, 3 b, 3 c, 3 d, and 3 e, block diagram representations of an example of shared temporary storage states in accordance with embodiments of the invention are illustrated.
  • the example refers to a cluster configuration having three nodes, namely, a coordinator, a server 1 and a server 2 .
  • a global freelist 310 has three sets of blocks marked as used, corresponding to reservation units 320 and 330 reserved in response to separate requests by server 1 and included in its active reservation unit chain 340 , and reservation unit 350 reserved in response to a request by server 2 and included in its active reservation unit chain 360 .
  • Object allocations 371 of blocks 1000 - 1999 and 3500 - 3999 of server 1 are reflected as such in the server 1 proxy freelist 370 , while the proxy freelist 380 of server 2 reflects the object allocation 381 of server 2 in blocks 2000 - 2499 .
  • FIG. 3 b represents server 1 expiring its second reservation unit and recording de-allocation of unused storage on that unit, with the bitmap 385 updated accordingly.
  • the global free bitmap 385 of server 1 records the deallocation of objects allocated by server 1
  • the global free bitmap 390 of server 2 records the deallocations by server 2 , which includes objects originally allocated by server 1 and server 2 .
  • the blocks freed by server 1 and server 2 are removed from the global free bitmaps 385 and 390 , respectively, and the secondary servers transfer the global free blocks to coordinator global free bitmap 395 of the coordinator, as represented in FIG. 3 d .
  • the server 2 proxy freelist is updated to free blocks owned by server 2 .
  • the server 1 deallocation is recorded in the return blocks bitmap, which is written to the version synchronization log.
  • when server 1 processes the log, it frees these blocks in its proxy freelist.
  • the coordinator further clears the expired allocation 345 and global storage in the global shared temp freelist 310 for blocks formerly allocated by server 1 .
  • the shared temporary storage management system capably reserves shared temporary store portions for nodes on-demand, rather than statically. This allows for intelligent space management that can adapt to dynamic configuration and workload conditions. Further, the space used for temporary objects is always eventually freed, so that under a steady workload, a minimum steady-state space reservation per node can be maintained. Also, any given node may reserve multiple portions of the shared temporary storage to accommodate a temporary peak in distributed query processing workload, and then return the portion(s) to the global storage pool after the workload peak subsides. This allows for more economical storage configurations, such as in situations where peak workloads typically occur on a subset of nodes at any given time, rather than all nodes simultaneously. In addition, through the throttling of portion sizes as resources in the shared temporary storage pool are consumed due to increased workload in the database cluster, space efficiency is increased, and unnecessary global starvation of the temporary storage space is prevented during peak workloads.
  • FIG. 4 illustrates an example computer system 400 , such as capable of acting as the nodes in the cluster of FIG. 1 , in which the present invention, or portions thereof, can be implemented as computer-readable code.
  • the methods illustrated by flowchart of FIG. 2 can be implemented in system 400 .
  • Various embodiments of the invention are described in terms of this example computer system 400 . After reading this description, it will become apparent to a person skilled in the relevant art how to implement the invention using other computer systems and/or computer architectures.
  • Computer system 400 includes one or more processors, such as processor 404 .
  • Processor 404 can be a special purpose or a general purpose processor.
  • Processor 404 is connected to a communication infrastructure 406 (for example, a bus or network).
  • Computer system 400 also includes a main memory 408 , preferably random access memory (RAM), and may also include a secondary memory 410 .
  • Secondary memory 410 may include, for example, a hard disk drive 412 , a removable storage drive 414 , and/or a memory stick.
  • Removable storage drive 414 may comprise a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory, or the like.
  • the removable storage drive 414 reads from and/or writes to a removable storage unit 418 in a well known manner.
  • Removable storage unit 418 may comprise a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 414 .
  • removable storage unit 418 includes a computer usable storage medium having stored therein computer software and/or data.
  • secondary memory 410 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 400 .
  • Such means may include, for example, a removable storage unit 422 and an interface 420 .
  • Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 422 and interfaces 420 which allow software and data to be transferred from the removable storage unit 422 to computer system 400 .
  • Computer system 400 may also include a communications interface 424 .
  • Communications interface 424 allows software and data to be transferred between computer system 400 and external devices.
  • Communications interface 424 may include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, or the like.
  • Software and data transferred via communications interface 424 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by communications interface 424 . These signals are provided to communications interface 424 via a communications path 426 .
  • Communications path 426 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link or other communications channels.
  • The terms “computer program medium” and “computer usable medium” are used to generally refer to media such as removable storage unit 418 , removable storage unit 422 , and a hard disk installed in hard disk drive 412 . Signals carried over communications path 426 can also embody the logic described herein. Computer program medium and computer usable medium can also refer to memories, such as main memory 408 and secondary memory 410 , which can be memory semiconductors (e.g. DRAMs, etc.). These computer program products are means for providing software to computer system 400 .
  • Computer programs are stored in main memory 408 and/or secondary memory 410 . Computer programs may also be received via communications interface 424 . Such computer programs, when executed, enable computer system 400 to implement the present invention as discussed herein. In particular, the computer programs, when executed, enable processor 404 to implement the processes of the present invention, such as the method illustrated by the flowchart of FIG. 2 . Accordingly, such computer programs represent controllers of the computer system 400 . Where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 400 using removable storage drive 414 , interface 420 , hard drive 412 or communications interface 424 .
  • the invention is also directed to computer program products comprising software stored on any computer useable medium.
  • Such software, when executed in one or more data processing devices, causes the data processing device(s) to operate as described herein.
  • Embodiments of the invention employ any computer useable or readable medium, known now or in the future.
  • Examples of computer useable mediums include, but are not limited to, primary storage devices (e.g., any type of random access memory), secondary storage devices (e.g., hard drives, floppy disks, CD ROMS, ZIP disks, tapes, magnetic storage devices, optical storage devices, MEMS, nanotechnological storage device, etc.), and communication mediums (e.g., wired and wireless communications networks, local area networks, wide area networks, intranets, etc.).

Abstract

System, method, computer program product embodiments and combinations and sub-combinations thereof for temporary storage management in a shared disk database cluster are provided. Included is the reserving of units on-demand and of variable size from shared temporary storage space in the SDC. The utilization of the reserved units of the shared temporary storage space is tracked, and the shared temporary storage space is administered based on the tracking.

Description

BACKGROUND
1. Field of the Invention
The present invention relates to information processing environments and, more particularly, to shared temporary storage management in a shared disk database cluster.
2. Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy data access to users. A typical database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level.
In recent years, users have demanded that database systems be continuously available, with no downtime, as they are frequently running applications that are critical to business operations. In response, distributed database systems have been introduced. Architectures for building multi-processor, high performance transactional database systems include a Shared Disk Cluster (SDC), in which multiple computer systems, each with a private memory, share a common collection of disks. Each computer system in a SDC is also referred to as a node, and all nodes in the cluster communicate with each other, typically through private interconnects.
In general, SDC database systems provide for transparent, continuous availability of the applications running on the cluster with support for failover amongst servers. More and more, mission-critical systems, which store information on database systems, such as data warehousing systems, are run from such clusters. Products exist for building, managing, and using a data warehouse, such as Sybase IQ available from Sybase, Inc. of Dublin, Calif.
Among the advances of data warehouse systems in a shared disk cluster is the ability to achieve distributed query processing. Distributed query processing allows SQL queries submitted to one node of the cluster to be processed by multiple cluster nodes, allowing more hardware resources to be utilized to improve performance. Distributed query processing typically requires the nodes to share temporary, intermediate data pertaining to the query in order to process and assemble the final result set, after which the temporary data is discarded. The temporary data consumes space of one or more network storage devices specifically configured for temporary storage use by the database cluster. In a shared disk cluster, the simplest solution for shared temporary storage management is to statically reserve a fixed portion of the shared temporary store for each node in the database cluster. This makes exclusive access rights unambiguous, as each node will use its reserved portion of the shared temporary storage.
However, such fixed portion allocation does not provide intelligent space management, which can adapt to dynamic configuration and workload conditions. Accordingly, a need exists for a flexible and dynamic approach to shared temporary storage management in an SDC. The present invention addresses these and other needs.
BRIEF SUMMARY
Briefly stated, the invention includes system, method, computer program product embodiments and combinations and sub-combinations thereof for temporary storage management in a shared disk database cluster. Included is the reserving of units on-demand and of variable size from shared temporary storage space in the SDC. The utilization of the reserved units of the shared temporary storage space is tracked, and the shared temporary storage space is administered based on the tracking.
Further embodiments, features, and advantages of the invention, as well as the structure and operation of the various embodiments of the invention, are described in detail below with reference to accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate embodiments of the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the relevant art(s) to make and use the invention.
FIG. 1 illustrates an example of a clustered server configuration.
FIG. 2 illustrates a block diagram of an overall approach for shared temporary storage management in accordance with embodiments of the invention.
FIGS. 3 a, 3 b, 3 c, 3 d, and 3 e illustrate block diagram representations of an example of shared temporary storage states in accordance with embodiments of the invention.
FIG. 4 illustrates an example computer useful for implementing components of embodiments of the invention.
The features and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements. Generally, the drawing in which an element first appears is indicated by the leftmost digit(s) in the corresponding reference number.
DETAILED DESCRIPTION
The present invention relates to a system, method, computer program product embodiments and combinations and sub-combinations thereof for shared temporary storage management in a shared disk database cluster.
FIG. 1 illustrates an example 100 of a shared disk database cluster, which, in general, handles concurrent data loads and queries from users/applications via independent data processing nodes connected to shared data storage. In operation, shared database objects can be written by one user and queried by multiple users simultaneously. Many objects of this type may exist and be in use at the same time in the database.
Each node is an instance of a database server typically running on its own host computer. A primary node, or coordinator 140, manages all global read-write transactions. Storage data is kept in main or permanent storage 170 which is shared between all nodes, and similarly, temporary data can be shared using shared temporary storage 180. The coordinator 140 further maintains a global catalog, storing information about DDL (data definition language) operations, in catalog store 142 as a master copy for catalog data. Changes in the global catalog are communicated from the coordinator 140 to other nodes 150 via a table version (TLV) log kept inside shared main store 170 through an asynchronous mechanism referred to herein as ‘catalog replication’. In catalog replication, the coordinator 140 writes TLV log records which other nodes 150 read and replay to update their local catalog.
Thus, the one or more secondary nodes 150 a, 150 b, 150 c, etc., each have their own catalog stores 152 a, 152 b, 152 c, etc., configured locally to maintain their own local catalogs. The secondary nodes 150 may be designated as reader (read-only) nodes and writer (read-write) nodes, with one secondary node designated as a failover node to assume the coordinator role if the current coordinator 140 is unable to continue. All nodes are connected in a mesh configuration where each node is capable of executing remote procedure calls (RPCs) on other nodes. The nodes that participate in the cluster share messages and data via Inter-node Communication (INC) 160, which provides a TCP/IP-based communication link between cluster nodes.
To handle transactions originating on a node, each node has its own local transaction manager. The transaction manager on the coordinator 140 acts as both local and global transaction manager. Clients may connect to any of the cluster nodes as individual servers, each being capable of running read-only transactions on its own using its local transaction manager. For write transactions, secondary nodes 150 can run queries and updates inside a write transaction, but only the global transaction manager on the coordinator 140 is allowed to start and finish the transaction (known as a global transaction). Secondary nodes 150 internally request the coordinator 140 to begin and commit global transactions on their behalf. Committed changes from write transactions become visible to secondary nodes 150 via catalog replication.
The coordinator node 140 manages separate storage pools for permanent 170 and shared temporary storage 180. All permanent database objects are stored on the shared permanent storage pool 170. The lifespan of permanent objects is potentially infinite, as they persist until explicitly deleted. The state and contents of the shared permanent storage pool 170 must persist across coordinator 140 restarts, crash recovery, and coordinator node failover, as well as support backup and recovery operations.
All nodes also manage their own separate storage pool 154 a, 154 b, 154 c for local temporary data, which consists of one or more local storage devices. Local temporary database objects exist only for the duration of a query. Local temporary database objects are not shared between nodes, and therefore the state and contents of the local temporary storage pool, which are isolated to each single node, do not need to persist across node crashes or restarts.
Distributed query processing typically requires the nodes 140, 150 to share temporary, intermediate data pertaining to the query in order to process and assemble the final result set, after which the temporary data is discarded. Thus, each node in the cluster must have read-write access to a portion of the total shared temporary storage 180 for writing result data to share with other nodes. These portions are logical subsets of the total temporary storage 180 and are physically embodied as stripes across multiple physical disks. In accordance with embodiments of the present invention, the management of shared temporary storage 180 for distributed database queries is achieved in a manner that can adapt to dynamic configuration and workload conditions.
Referring now to FIG. 2, a block flow diagram illustrates an overall approach in accordance with an embodiment of the invention for temporary storage management to support symmetric distributed query processing in a shared disk database cluster, where any node in the cluster can execute a distributed query. The approach includes reserving units on-demand and of variable size from the shared temporary space in the SDC (block 210), tracking the utilization of reserved units of the shared temporary space (block 220), and administering the shared temporary space based on the tracking (block 230).
In operation, a node 150 requests a shared temporary space reservation when needed from the coordinator 140 via IPC calls. In response, the coordinator 140 provides discrete reservation units and controls the size of the reservation units carved out of the global shared temporary storage pool 180 based on the remaining free space, the number of nodes in the cluster, and the current reservations for the requesting node.
In an embodiment, an initial size of the reserved space is inversely proportional to the number of nodes, to allow all nodes in the SDC to have a fair chance at getting an initial allocation, under an assumption that all nodes in the SDC will eventually require an allocation. A suitable formula representation is: Initial request size=1/(number of nodes)*(initial request reservation percentage), where the ‘number of nodes’ refers to a total of the number of nodes currently in the SDC, and the ‘initial request reservation percentage’ refers to the percent of the total shared temporary space to reserve for initial allocations for all nodes. In an embodiment, the percentage is a hard coded value but could easily be replaced with a field-adjustable parameter, if desired.
If the calculated initial request size is greater than a predetermined maximum request size (e.g., a field adjustable database option), the size is rounded down to the maximum. Conversely, if the calculated initial request size is smaller than a minimum request size, it is rounded up to the minimum.
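To make the sizing rule concrete, the sketch below expresses the initial-request calculation in Python. It is an illustrative interpretation rather than the patented implementation: the function and parameter names are invented here, and the multiplication by the total space (to turn the fraction given by the formula into a block count) and the integer rounding are assumptions.

```python
def initial_request_size(total_shared_temp_blocks: int,
                         num_nodes: int,
                         initial_reservation_pct: float,
                         min_request_blocks: int,
                         max_request_blocks: int) -> int:
    """Illustrative sketch of the initial reservation-unit sizing rule.

    The initial reservation pool (total space * initial request reservation
    percentage) is split evenly across the nodes currently in the SDC, then
    the result is rounded down to the maximum or up to the minimum request
    size, as described above.
    """
    size = int(total_shared_temp_blocks * initial_reservation_pct / num_nodes)
    return max(min_request_blocks, min(size, max_request_blocks))
```

For example, with 1,000,000 logical blocks, four nodes, and an initial request reservation percentage of 20%, each node's first request would be sized at 50,000 blocks, subject to the configured bounds.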
Any subsequent requests by a node follow a different calculation, with the initial request sizing potentially larger than subsequent request sizes, to minimize “ramp up” time to reach a shared temporary storage usage steady state when starting a node. A suitable formula representation for the calculation is:
Subsequent request size=(subsequent request percentage)*(remaining free space−initial reservation pool size). The ‘subsequent request percentage’ refers to a flat percentage of the remaining space in the shared temporary storage pool (e.g., a hard coded value or a field-adjustable parameter), the ‘remaining free space’ refers to the total free, unreserved space in the global shared temporary storage pool, and the ‘initial reservation pool size’ refers to the total space in the global shared temporary storage pool multiplied by the initial request reservation percentage ((total storage)*(initial request reservation percentage)). In this manner, the reserved space size gets smaller as less space is available, thus throttling the reservation unit sizes as space runs short, while still allowing nodes with a larger shared temporary workload to reserve as much space as needed. Preferably, in order to reduce overhead for small distributed query workloads, all running nodes always retain at least one reservation unit.
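The subsequent-request formula can be sketched the same way; again the names are hypothetical and the clamping of the result at zero is an assumption, but the throttling behavior follows directly from the formula: as the remaining free space shrinks, so does the next reservation unit.

```python
def subsequent_request_size(total_shared_temp_blocks: int,
                            remaining_free_blocks: int,
                            initial_reservation_pct: float,
                            subsequent_request_pct: float) -> int:
    """Illustrative sketch of the subsequent reservation-unit sizing rule."""
    # Initial reservation pool size = (total storage) * (initial request
    # reservation percentage), as defined above.
    initial_pool = total_shared_temp_blocks * initial_reservation_pct
    # A flat percentage of the free space outside the initial pool; the
    # result shrinks as free space is consumed, throttling unit sizes.
    return max(int(subsequent_request_pct * (remaining_free_blocks - initial_pool)), 0)
```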
In an embodiment, reservation unit chains provide a data structure and methodology used to track the discrete reservation units of space reserved for a particular node. Suitably, the reservation units are added to the chain as a result of the space reservation requests, and removed, such as in a last in, first out (LIFO) manner, via timed expiration of the last link in the chain, with all reservation unit state changes being transactional.
In operation, the coordinator 140 maintains an active reservation unit chain for each node, including the coordinator node, with each active reservation unit marked with the transaction ID of the reservation unit creation, and each active reservation unit chain marked with a timestamp of the last space reservation request for that node. Similarly, an expired reservation unit chain is maintained for each node, with each expired reservation unit marked with the transaction ID at the time of the expiration event. The tracking of all expired reservation units for the SDC in the coordinator 140 achieves transactional persistence of this management data, as is well appreciated by those skilled in the art.
On secondary nodes 150, an active reservation unit chain is maintained representing the reservation units received by only that node. Each active reservation unit is marked with the transaction ID of the reservation unit creation and with a timestamp of the last space reservation request for that node. An expired reservation unit chain also is maintained representing the reservation units expired by only that node.
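A minimal data-structure sketch of the reservation unit chains described above is shown below, assuming Python 3.10+; the class and field names are invented for illustration, and a Python set stands in for the per-unit block bitmap. It captures the transactional markers, the chain timestamp, LIFO expiration, and the rule that a running node retains at least one reservation unit.

```python
import time
from dataclasses import dataclass, field

@dataclass
class ReservationUnit:
    unit_id: int
    blocks: set[int]                    # logical blocks covered (stand-in for the unit bitmap)
    created_txn_id: int                 # transaction ID of the reservation unit creation
    expired_txn_id: int | None = None   # transaction ID recorded at the expiration event

@dataclass
class ReservationUnitChain:
    node_id: int
    active: list[ReservationUnit] = field(default_factory=list)
    expired: list[ReservationUnit] = field(default_factory=list)
    last_request_ts: float = 0.0        # timestamp of the last space reservation request

    def add(self, unit: ReservationUnit) -> None:
        """A granted reservation unit is appended and the chain timestamp is reset."""
        self.active.append(unit)
        self.last_request_ts = time.time()

    def expire_last(self, txn_id: int) -> ReservationUnit | None:
        """Remove the most recently added unit (LIFO) and move it to the expired chain."""
        if len(self.active) <= 1:       # running nodes always retain at least one unit
            return None
        unit = self.active.pop()
        unit.expired_txn_id = txn_id
        self.expired.append(unit)
        self.last_request_ts = time.time()  # chain timestamp reset on expiration
        return unit
```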
The reservation units received by a node provide free space for allocation of shared temporary data. The tracking of the allocation of free space to an object by a node occurs via a bitmap referred to herein as a freelist. Thus, when a node allocates disk space for an object, it updates its freelist to set the blocks for that object in use, and when a node de-allocates disk space for an object, it updates the freelist to set the blocks for that object free. In an embodiment, each bit in the freelist represents a logical disk block, which is part of the logical storage space consisting of all the physical disk blocks of the network storage devices configured for that storage pool, where a bit value of 0 means that the logical block is free, while a bit value of 1 means that the logical block is in use.
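The freelist itself can be sketched as follows; one byte per logical block is used here for readability where a real freelist would pack bits, and the class and method names are hypothetical. The bit convention matches the description above: 0 means free, 1 means in use.

```python
class Freelist:
    """Per-node freelist sketch: one entry per logical disk block (0 = free, 1 = in use)."""

    def __init__(self, num_blocks: int, initially_in_use: bool = True):
        # A secondary node's proxy freelist starts with no space available.
        self.bits = bytearray([1 if initially_in_use else 0] * num_blocks)

    def allocate(self, blocks) -> None:
        """Mark the blocks of a newly allocated object as in use."""
        for b in blocks:
            assert self.bits[b] == 0, "block must be free before allocation"
            self.bits[b] = 1

    def deallocate(self, blocks) -> None:
        """Mark the blocks of a de-allocated object as free."""
        for b in blocks:
            self.bits[b] = 0
```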
The coordinator node 140 owns a global shared temp freelist for tracking which blocks are globally free, meaning they are free to be reserved for exclusive use by the SDC nodes. Being the owner of the global shared temp freelist, the coordinator 140 has to maintain the global shared temp freelist block space in synchronization with changes to the shared temp store space (adding and removing files, RO/RW state, etc.), persist the global shared temp freelist state on coordinator 140 shutdown and failover, perform crash recovery of the global shared temp freelist state in the event of coordinator 140 crash, manage freelist space reservations for all nodes, including itself, return freelist space reservations to the global shared temp freelist once they are released by a node, and return space for shared temporary data logically freed by one node back to the node the space is reserved for.
No node, including the coordinator 140, can allocate space for temporary objects directly from the global shared temp freelist; instead, each node must first reserve space for exclusive use and allocate blocks from that reserved space. Accordingly, the coordinator 140 maintains two shared temp freelists, the global shared temp freelist and a proxy shared temp freelist (tracking reserved space usage for the coordinator 140).
Each secondary node 150 also maintains its own shared temp proxy freelist, which tracks reserved space usage for that node. The initial shared temp proxy freelist on secondary nodes is empty, meaning it contains no free space. Further, as secondary node proxy freelist contents are not expected to persist across secondary server 150 restarts, every time a node is restarted, (e.g., the initial startup, or a startup after clean shut down or crash), the shared temp proxy freelist returns to a “no space available” state. When a node receives a reservation unit in response to a space reservation request, it sets bit positions in its shared temp proxy freelist to 0 (i.e., “free”, allowing shared temporary data allocations to use that free space) for each bit with a 1 value (reserved) in the reservation unit. In an embodiment, reservation units are bitmaps which set bit positions corresponding to logical disk blocks represented in the global shared temp freelist to be reserved for a particular node. All blocks in the shared temp freelist not freed via space reservation are essentially masked; the secondary node proxy freelist has them marked in use.
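Continuing the Freelist sketch above, the effect of receiving a reservation unit can be illustrated as follows: the blocks named in the unit are flipped to free in the node's proxy freelist, while everything else stays masked as in use, so the node can only ever allocate inside its own reservation. The helper and the example block numbers are hypothetical.

```python
def apply_reservation_unit(proxy_freelist: Freelist, reserved_blocks) -> None:
    """Blocks granted by a reservation unit become free space in the proxy freelist."""
    proxy_freelist.deallocate(reserved_blocks)

# A freshly (re)started secondary node: proxy freelist in the "no space available" state.
proxy = Freelist(num_blocks=16)
apply_reservation_unit(proxy, range(4, 8))    # coordinator grants blocks 4-7
proxy.allocate([4, 5])                        # the node allocates a shared temporary object
```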
Depending on the usage of the object which was allocated, the object may be logically destroyed by the node which allocated it or it may be logically destroyed by another node participating in the distributed query. When a shared temporary object is logically destroyed, the space allocated for that object must be returned to the proxy freelist of the node which allocated it. This is done via a global shared temp garbage collection mechanism, which recycles all non-locally freed shared temp space to the respective owner through the coordinator 140.
The garbage collection logic handles processing of the global free bitmaps maintained by each node. In an embodiment, each node maintains one global free bitmap, which is a bitmap representing logical storage blocks to be sent to the coordinator 140 for return either to the node for which the block is reserved or returned to the global shared temporary storage pool if the block is no longer reserved for any node. Secondary nodes 150 send the contents of their global free bitmap to the coordinator node 140 periodically via IPC during the database garbage collection event. The result of a secondary node garbage collection event is to transfer the logical storage blocks from that node's global free bitmap to the coordinator node's global free bitmap. The coordinator node's garbage collection event then periodically processes its global free bitmap and returns de-allocated blocks to the node for which the blocks are still reserved or to the global pool if the blocks are no longer reserved. Thus, shared temporary storage is recycled to the proper owner after de-allocation by any node.
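A simplified sketch of the coordinator's garbage-collection pass is shown below. Bitmaps are modeled as Python sets of block numbers and the global freelist as a set of free blocks; the function name and structures are assumptions, but the routing follows the description: a de-allocated block goes back to the node that still holds it in an active reservation, otherwise back to the global pool.

```python
def coordinator_gc_pass(coordinator_global_free: set,
                        active_reservations: dict,    # node_id -> set of blocks reserved
                        global_freelist_free: set):
    """Return each globally-freed block to its reserving node or to the global pool."""
    return_blocks = {node_id: set() for node_id in active_reservations}
    for block in coordinator_global_free:
        owner = next((n for n, blocks in active_reservations.items()
                      if block in blocks), None)
        if owner is not None:
            return_blocks[owner].add(block)    # still reserved: return to that node
        else:
            global_freelist_free.add(block)    # no longer reserved: back to the global pool
    coordinator_global_free.clear()
    return return_blocks                       # later propagated to the secondary nodes
```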
Reservation unit chain expiration allows for the returning of unused shared temp space currently reserved for a node back to the global storage pool, such as when nodes temporarily hold more reservation units than usual to accommodate a busy period. Reservation unit chain expiration uses the following logic.
Each node, including the coordinator node, controls its own reservation unit expiration, driven periodically by a timed database event and based on the local timestamp and an expiration period of each active reservation unit chain. The expiration period is a value expressing the amount of time the current reservation unit chain is valid, and the reservation unit chain timestamp is reset every time a new reservation unit is added to the chain via a successful reservation unit request to the coordinator. At each garbage collection event, the current reservation unit chain is considered expired when the amount of time past the timestamp exceeds the expiration period.
When a reservation unit chain expires, the last reservation unit in the chain is removed, essentially shortening the chain in a LIFO manner, and the active reservation unit chain timestamp is reset. On all nodes, the expiring reservation unit bitmap is compared against the shared temp proxy freelist; every bit in the expiring reservation unit that is currently marked 0 (free) in the shared temp proxy freelist is marked 1 (to be returned to the global pool) in the global free bitmap and marked 1 (in use) in the shared temp proxy freelist.
Any shared temp blocks that were in use by the node (marked 1 in the shared temp proxy freelist) at the time of expiration are not transferred to the global free bitmap immediately. When these blocks are eventually de-allocated, they are not marked free in the proxy freelist but are instead transferred to the global free bitmap. This decision is made during the de-allocation process by checking whether the bit position of the block being de-allocated is accounted for in the active or the expired reservation unit chain. The end result of de-allocating an “expired” shared temp block is that the corresponding bit position in the global free bitmap is set to 1 (to be globally freed) and the corresponding bit position in the expired reservation unit is set to 0 (no longer accounted for as expired).
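The following sketch, using the same assumed list-of-bits representation as the earlier examples, illustrates the two cases: processing an expiring reservation unit against the proxy freelist, and later de-allocating a block that was still in use when its unit expired.

def expire_reservation_unit(unit, proxy_freelist, global_free_bitmap):
    # Blocks that are free locally are routed back toward the global pool;
    # blocks still in use are handled later, on de-allocation.
    for b, reserved in enumerate(unit):
        if reserved == 1 and proxy_freelist[b] == 0:   # free locally
            global_free_bitmap[b] = 1                  # to be returned globally
            proxy_freelist[b] = 1                      # no longer usable locally

def deallocate_block(b, proxy_freelist, global_free_bitmap, expired_units):
    # A block covered by an expired reservation unit goes to the global free
    # bitmap instead of being marked free in the proxy freelist.
    for unit in expired_units:
        if unit[b] == 1:
            global_free_bitmap[b] = 1    # to be globally freed
            unit[b] = 0                  # no longer accounted for as expired
            return
    proxy_freelist[b] = 0                # otherwise the block is freed locally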
In the first stage of the garbage collection cycle, the secondary nodes 150 communicate reservation unit expiration and the global free bitmap to the coordinator 140 via an IPC call as part of the periodic garbage collection event. This IPC call sends the unique IDs of each expired reservation unit. The secondary node 150 keeps the expired reservation unit in its local reservation unit chain until it receives a positive acknowledgement from the coordinator 140 that the coordinator 140 has processed the expiration. This is considered necessary to eliminate race conditions and failure scenarios where a given storage block would be left unaccounted for. Expired reservation unit chains for all nodes are maintained persistently across coordinator 140 restart and failover, and reservation unit expiration for the coordinator 140 is directly processed by the coordinator 140 during its periodic garbage collection event.
The coordinator 140 performs comparisons as part of the garbage collection event (e.g., using bitwise logical AND comparisons). One comparison involves comparing the global free bitmap against the bitmaps in the active reservation unit chains for all secondary nodes and producing a single return blocks bitmap, which records all the blocks to be returned to all secondary nodes. The global free bitmap is also compared against the bitmaps in the active reservation unit chain for the coordinator. The result of that comparison is used to mark blocks free in the coordinator's shared temp proxy freelist, meaning they are free for the coordinator to allocate. Additionally, the global free bitmap is compared against the bitmaps in the expired reservation unit chains for all nodes, and the result is used to mark blocks free in the global shared temp freelist, meaning they are free for reservation by specific nodes.
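To make the three comparisons concrete, here is an illustrative sketch with bitmaps modeled as Python integers so that a bitwise AND expresses each overlap test; the parameter names are assumptions, not terms from the patent.

def coordinator_comparisons(global_free, secondary_active, coord_active, expired_all):
    # global_free:      blocks de-allocated anywhere in the cluster
    # secondary_active: OR of the active reservation-unit bitmaps of all secondaries
    # coord_active:     OR of the coordinator's active reservation-unit bitmaps
    # expired_all:      OR of the expired reservation-unit bitmaps of all nodes
    return_blocks = global_free & secondary_active  # returned to secondary nodes
    coord_freed = global_free & coord_active        # freed in the coordinator's proxy freelist
    globally_freed = global_free & expired_all      # freed in the global shared temp freelist
    return return_blocks, coord_freed, globally_freed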
At the end of each garbage collection event, the coordinator 140 writes the return blocks bitmap to the shared permanent store and adds a record to a global version synchronization log. This shared log structure on the shared permanent store is used to propagate metadata changes from the coordinator 140 to all secondary nodes 150, as is supported in the environment of the aforementioned Sybase IQ. When a secondary node 150 detects a new return blocks entry in the global version synchronization log, it compares the return blocks bitmap against its active and expired reservation units. Any blocks matching active reservation units are freed in the node's shared temp proxy freelist. Any blocks within the node's expired reservation units are added to the global free bitmap for return to the coordinator 140 and removed from that node's local expired allocation units. Any blocks outside of these conditions are ignored.
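A sketch of the secondary-side processing of a new return-blocks entry, again under the assumed list-of-bits representation; the helper names are illustrative rather than taken from the described system.

def process_return_blocks(return_blocks, active_units, expired_units,
                          proxy_freelist, global_free_bitmap):
    for b, returned in enumerate(return_blocks):
        if returned != 1:
            continue
        if any(unit[b] == 1 for unit in active_units):
            proxy_freelist[b] = 0         # active match: free in the local proxy freelist
        elif any(unit[b] == 1 for unit in expired_units):
            global_free_bitmap[b] = 1     # expired match: return to the coordinator
            for unit in expired_units:
                unit[b] = 0               # remove from the local expired units
        # blocks matching neither condition are ignored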
Referring now to FIGS. 3 a, 3 b, 3 c, 3 d, and 3 e, block diagram representations of an example of shared temporary storage states in accordance with embodiments of the invention are illustrated. The example refers to a cluster configuration having three nodes, namely, a coordinator, a server 1 and a server 2. As shown in FIG. 3 a, a global freelist 310 has three sets of blocks marked as used, corresponding to reservation units 320 and 330 reserved in response to separate requests by server 1 and included in its active reservation unit chain 340, and reservation unit 350 reserved in response to a request by server 2 and included in its active reservation unit chain 360. Object allocations 371 of blocks 1000-1999 and 3500-3999 of server 1 are reflected as such in the server 1 proxy freelist 370, while the proxy freelist 380 of server 2 reflects the object allocation 381 of server 2 in blocks 2000-2499.
Some time later, changes occur: server 1 expires its second reservation unit and records the de-allocation of unused storage on that unit, updating bitmap 385, as represented in FIG. 3 b. As represented in FIG. 3 c, the global free bitmap 385 of server 1 records the de-allocation of objects allocated by server 1, while the global free bitmap 390 of server 2 records the de-allocations by server 2, which include objects originally allocated by server 1 and server 2.
During a garbage collection event on the secondary nodes, the blocks freed by server 1 and server 2 are removed from the global free bitmaps 385 and 390, respectively, and the secondary servers transfer the globally freed blocks to the coordinator's global free bitmap 395, as represented in FIG. 3 d. Also, the server 2 proxy freelist is updated to free the blocks owned by server 2. As represented in FIG. 3 e, during the garbage collection event on the coordinator, the server 1 de-allocation is recorded in the return blocks bitmap, which is written to the version synchronization log. When server 1 processes the log, server 1 frees these blocks in its proxy freelist. The coordinator further clears the expired allocation 345 and frees the corresponding storage in the global shared temp freelist 310 for the blocks formerly allocated by server 1.
In accordance with the embodiments of the invention, the shared temporary storage management system capably reserves shared temporary store portions for nodes on-demand, rather than statically. This allows for intelligent space management that can adapt to dynamic configuration and workload conditions. Further, the space used for temporary objects is always eventually freed, so that under a steady workload, a minimum steady-state space reservation per node can be maintained. Also, any given node may reserve multiple portions of the shared temporary storage to accommodate a temporary peak in distributed query processing workload, and then return the portion(s) to the global storage pool after the workload peak subsides. This allows for more economical storage configurations, such as in situations where peak workloads typically occur on a subset of nodes at any given time, rather than all nodes simultaneously. In addition, through the throttling of portion sizes as resources in the shared temporary storage pool are consumed due to increased workload in the database cluster, space efficiency is increased, and unnecessary global starvation of the temporary storage space is prevented during peak workloads.
Various aspects of the present invention can be implemented by software, firmware, hardware, or a combination thereof. FIG. 4 illustrates an example computer system 400, such as capable of acting as the nodes in the cluster of FIG. 1, in which the present invention, or portions thereof, can be implemented as computer-readable code. For example, the methods illustrated by flowchart of FIG. 2 can be implemented in system 400. Various embodiments of the invention are described in terms of this example computer system 400. After reading this description, it will become apparent to a person skilled in the relevant art how to implement the invention using other computer systems and/or computer architectures.
Computer system 400 includes one or more processors, such as processor 404. Processor 404 can be a special purpose or a general purpose processor. Processor 404 is connected to a communication infrastructure 406 (for example, a bus or network).
Computer system 400 also includes a main memory 408, preferably random access memory (RAM), and may also include a secondary memory 410. Secondary memory 410 may include, for example, a hard disk drive 412, a removable storage drive 414, and/or a memory stick. Removable storage drive 414 may comprise a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory, or the like. The removable storage drive 414 reads from and/or writes to a removable storage unit 418 in a well known manner. Removable storage unit 418 may comprise a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 414. As will be appreciated by persons skilled in the relevant art(s), removable storage unit 418 includes a computer usable storage medium having stored therein computer software and/or data.
In alternative implementations, secondary memory 410 may include other similar means for allowing computer programs or other instructions to be loaded into computer system 400. Such means may include, for example, a removable storage unit 422 and an interface 420. Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 422 and interfaces 420 which allow software and data to be transferred from the removable storage unit 422 to computer system 400.
Computer system 400 may also include a communications interface 424. Communications interface 424 allows software and data to be transferred between computer system 400 and external devices. Communications interface 424 may include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, or the like. Software and data transferred via communications interface 424 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by communications interface 424. These signals are provided to communications interface 424 via a communications path 426. Communications path 426 carries signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link or other communications channels.
In this document, the terms “computer program medium” and “computer usable medium” are used to generally refer to media such as removable storage unit 418, removable storage unit 422, and a hard disk installed in hard disk drive 412. Signals carried over communications path 426 can also embody the logic described herein. Computer program medium and computer usable medium can also refer to memories, such as main memory 408 and secondary memory 410, which can be memory semiconductors (e.g. DRAMs, etc.). These computer program products are means for providing software to computer system 400.
Computer programs (also called computer control logic) are stored in main memory 408 and/or secondary memory 410. Computer programs may also be received via communications interface 424. Such computer programs, when executed, enable computer system 400 to implement the present invention as discussed herein. In particular, the computer programs, when executed, enable processor 404 to implement the processes of the present invention, such as the method illustrated by the flowchart of FIG. 2. Accordingly, such computer programs represent controllers of the computer system 400. Where the invention is implemented using software, the software may be stored in a computer program product and loaded into computer system 400 using removable storage drive 414, interface 420, hard drive 412 or communications interface 424.
The invention is also directed to computer program products comprising software stored on any computer useable medium. Such software, when executed in one or more data processing devices, causes the data processing device(s) to operate as described herein. Embodiments of the invention employ any computer useable or readable medium, known now or in the future. Examples of computer useable media include, but are not limited to, primary storage devices (e.g., any type of random access memory), secondary storage devices (e.g., hard drives, floppy disks, CD ROMS, ZIP disks, tapes, magnetic storage devices, optical storage devices, MEMS, nanotechnological storage devices, etc.), and communication media (e.g., wired and wireless communications networks, local area networks, wide area networks, intranets, etc.).
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be understood by those skilled in the relevant art(s) that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined in the appended claims. It should be understood that the invention is not limited to these examples. The invention is applicable to any elements operating as described herein. Accordingly, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (18)

What is claimed is:
1. A method for shared temporary storage management in a shared disk database cluster (SDC), the method comprising:
reserving units on-demand and of variable size from shared temporary storage space in the SDC, wherein the reserving controls a size of the reserved units based upon configuration of the SDC, remaining space of the shared temporary storage space, and a number of configured nodes, and wherein the shared temporary storage space exists independent from the reserving units;
tracking utilization of the reserved units of the shared temporary storage space, wherein coordinator and secondary nodes of the SDC maintain bitmaps which track the utilization of the reserved units, and each bit in the bitmaps represents a logical disk block and tracks whether the logical disk block is free or in use, and wherein the secondary nodes periodically send content of the bitmaps associated with the secondary nodes to the coordinator node; and
administering the shared temporary storage space based on the tracking.
2. The method of claim 1 wherein tracking further comprises updating the bitmaps with respect to space allocation activity for chains of reserved units.
3. The method of claim 1 wherein tracking further comprises updating the bitmaps with respect to timed expiration periods.
4. The method of claim 3, wherein the expiration period is reset when a new reservation unit is added to the bitmap.
5. The method of claim 1 wherein administering further comprises utilizing a garbage collection event.
6. The method of claim 5 wherein the garbage collection event handles expired reserved units and deallocated temporary storage space for reuse in the SDC.
7. A shared disk database cluster (SDC) system with temporary storage management comprising:
shared-disk storage; and
a plurality of data processing nodes reserving units on-demand and of variable size from shared temporary storage space of the shared-disk storage, wherein the data processing nodes reserving units controls a size of the reserved units based upon configuration of the SDC, remaining space of the shared temporary storage space, and a number of configured nodes, and wherein the shared temporary storage space exists independent from the reserving units, and administering the shared temporary storage space through tracked utilization of the reserved units, wherein coordinator and secondary nodes of the SDC maintain bitmaps which track the utilization of the reserved units, and each bit in the bitmaps represents a logical disk block and tracks whether the logical disk block is free or in use, and wherein the secondary nodes periodically send content of the bitmaps associated with the secondary nodes to the coordinator node.
8. The system of claim 7 wherein the updates to bitmaps include space allocation activity for chains of reserved units.
9. The system of claim 7 wherein the updates to the bitmaps include timed expiration periods.
10. The system of claim 7 wherein administering further comprises utilizing a garbage collection event.
11. The system of claim 10 wherein the garbage collection event handles expired reserved units and deallocated temporary storage space for reuse in the SDC.
12. A non-transitory computer-usable medium having instructions recorded thereon that, if executed by a computing device, cause the computing device to perform a method comprising:
reserving units on-demand and of variable size from shared temporary storage space in a shared disk cluster (SDC), wherein the reserving controls a size of the reserved units based upon configuration of the SDC, remaining space of the shared temporary storage space, and a number of configured nodes, and wherein the shared temporary storage space exists independent from the reserving units;
tracking utilization of the reserved units of the shared temporary storage space, wherein coordinator and secondary nodes of the SDC maintain bitmaps which track the utilization of the reserved units, and each bit in the bitmaps represents a logical disk block and tracks whether the logical disk block is free or in use, and wherein the secondary nodes periodically send content of the bitmaps associated with the secondary nodes to the coordinator node; and
administering the shared temporary storage space based on the tracking.
13. The computer-usable medium of claim 12 wherein tracking further comprises updating the bitmaps with respect to space allocation activity for chains of reserved units.
14. The computer-usable medium of claim 12 wherein tracking further comprises updating the bitmaps with respect to timed expiration periods.
15. The computer-usable medium of claim 12 wherein administering further comprises utilizing a garbage collection event to handle expired reserved units and deallocated temporary storage space for reuse in the SDC.
16. A method comprising:
receiving an initial request, from a first node of a plurality of nodes in a shared disk database cluster (SDC) of nodes performing distributed processing of a query in the SDC, to reserve a portion of a shared temporary storage space that exists independent from the plurality of nodes, wherein each of the plurality of nodes is allocated at least a portion of the shared temporary storage space;
determining whether the initial request is less than a max request size; and
providing a first discrete reservation unit to the first node based on the initial request, wherein the first discrete reservation unit is of a size that is the lesser of the initial request or the max request size, and wherein a second node in the cluster is allocated a second discrete reservation unit of a size different than the size of the first discrete reservation unit.
17. The method of claim 16, wherein the initial request is a request for a percentage of the portion of the shared temporary space allocated to the first node.
18. The method of claim 17, further comprising:
receiving a subsequent request from the first node, wherein the subsequent request is for a percentage of remaining free space in the shared temporary space; and
providing a subsequent discrete reservation unit to the first node, wherein a size of the subsequent discrete reservation unit is determined based on the remaining free space, the number of nodes in the cluster, and any current reservation units provided to the first node.