WO2017135953A1 - Shared memory access - Google Patents

Shared memory access

Info

Publication number
WO2017135953A1
WO2017135953A1 PCT/US2016/016565 US2016016565W WO2017135953A1 WO 2017135953 A1 WO2017135953 A1 WO 2017135953A1 US 2016016565 W US2016016565 W US 2016016565W WO 2017135953 A1 WO2017135953 A1 WO 2017135953A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
shared
access
compute nodes
memory
Prior art date
Application number
PCT/US2016/016565
Other languages
French (fr)
Inventor
Joaquim Gomes Da Costa Eulalio De SOUZA
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp
Priority to PCT/US2016/016565
Publication of WO2017135953A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/176 Support for shared access to files; File sharing support


Abstract

In one example, a system can include non-volatile memory including shared memory storing a plurality of data. The system can include a plurality of compute nodes. Each of the plurality of compute nodes can access the plurality of data directly independent of communicating through an additional node of the plurality of compute nodes and independent of transferring data of the plurality of data via a network.

Description

Background
[0001] Computing systems can utilize hardware, software, and/or logic to execute a number of operations. The computing system can include a number of nodes that access a number of memories. Each of the number of nodes can be associated with a particular portion of the number of memories. When a particular node wants to access memory that is associated with an additional node, the particular node can
communicate with the additional node through a network to access the memory. The particular node can use network communication (such as TCP/IP) based reads to access data from the additional node. The number of nodes can use a disk-centric shared-nothing distributed file system (such as Hadoop Distributed File System (HDFS)) to access memory at the number of nodes.
Brief Description of the Drawings
[0002] Figure 1 illustrates a diagram of an example of a system for shared memory access consistent with the present disclosure.
[0003] Figure 2 illustrates a diagram of an example of a system for shared memory access consistent with the present disclosure.
[0004] Figure 3 illustrates a diagram of an example of a system for shared memory access consistent with the present disclosure.
[0005] Figure 4 illustrates a diagram of an example of a computing system for shared memory access consistent with the present disclosure.
[0006] Figure 5 illustrates a diagram of an example computing device for shared memory access consistent with the present disclosure.
Detailed Description
[0007] A computing framework can include a cluster of hardware and a number of memories. A number of nodes (e.g., computing nodes) in the computing framework can be used to access the memory (e.g., non-volatile memory) through hardware. Each of the number of nodes can access the memory as shared memory. For example, the memory can be shared across the number of nodes without partitioning particular portions of memory to a particular node. A local file system (FS) (e.g., such as Ext4 + DAX) can be used by each node to access data (e.g., input data) stored in the shared memory. In some examples, the local FS can be used instead of a Hadoop Distributed File System (HDFS) protocol. In some examples, the memory can be formed of and/or include memristors that are shared across the nodes.
[0008] The number of nodes can access the memory through a number of layers of protocol and/or applications. For example, the number of layers can include a dataflow engine, a Java interface (e.g., Java Virtual Machine (JVM) interface), an operating system (OS), a file system, and/or a block device (BD). The number of nodes can access the memory by bypassing the JVM interface and avoiding use of the Java language. The JVM interface can be bypassed by using a JVM "shim" ("JNI-fs") which passes to the file system (FS) (e.g., Ext4 + DAX) without further accessing Java, allowing access of the data in native code (e.g., native machine code). In addition, in some examples, the OS, the FS, and/or the BD can be bypassed. In this way, the number of nodes can go directly to using native code and thereby use a thinner software stack. In addition, the number of nodes can bypass going through other nodes using a network, as the memory is shared by the nodes directly.
[0009] Figure 1 illustrates a diagram of an example of a system for shared memory access consistent with the present disclosure. The system in Figure 1 illustrates a number of nodes ("Node 1," "Node 2," "Node 3") 110-1, 110-2, ..., 110-N that can be compute nodes, referenced herein as number of nodes 110. A compute node can refer to a desktop or server system, a virtual machine (VM), a blade in a blade server arrangement, and/or any other suitable computing resource or device. The number of nodes 110-1, 110-2, ..., 110-N can communicate with a number of memories (e.g., non-volatile memories) 128-1, 128-2, 128-3, etc., referenced herein as memories 128. The memories 128 can include a number of non-volatile memories (such as memristors, in some examples). The number of nodes 110 can transfer data from at least one of the memories 128 through a number of communication layers. The communication layers (e.g., a software stack) can include applications ("APP") 112-1, 112-2, 112-3, dataflow engines 114-1, 114-2, 114-3, and Hadoop Distributed File Systems 116-1, 116-2, 116-3. The dataflow engine 114-1 refers to a cluster computing framework designed for a cluster of hardware in a particular computing architecture environment.
[0010] In Figure 1, the particular computing architecture is referred to as a "shared nothing" architecture. In a "shared nothing" architecture, memory resources (e.g., memories 128) can be directly connected to a particular compute node (and not any other compute node). In this architecture, a particular compute node (e.g., node 110-1) uses a network 118 to access memories (e.g., memory 128-2) that are not directly connected to the compute node. For example, the particular node 110-1 would communicate with node 110-2 through network 118 to access memory 128-2. However, the particular node 110-1 can access memory 128-1 without transferring data via the network 118, as the particular node 110-1 is directly connected to memory 128-1.
[0011] A node 110-1 can communicate through the network 118 to a memory 128-1 through additional layers such as a Java layer (e.g., Java Virtual Machine (JVM)) 120-1, an OS 122-1, a file system 124-1, and a block device 126-1. A JVM 120-1 is an abstract computing machine that enables a computing node to run a Java program. The operating system 122-1 is system software that can manage computer hardware and/or software resources. A file system 124-1 can be file system code at a kernel level that is aware of an internal format used to store data, handles structures that represent the data (e.g., such as inodes), and is aware of how to read and/or write the data. The file system 124-1 can refer to a journaling file system that monitors changes not yet committed to the file system's main portion by recording intentions of such changes in a data structure known as a 'journal.' A block device 126-1 refers to a computer data storage device drive that supports reading and/or writing data in fixed-size blocks, sectors, and/or clusters, generally 512 bytes or a multiple thereof in size.
[0012] Data movement in this "share nothing" architecture, both in initial load and in shuffling, can rely on disk-based file systems, such as HDFS 116, to persist data, as this architecture uses disks to store persistent data. The disk-based file systems (e.g., HDFS 116-1, 116-2) can connect a node's (e.g., node 110-1's) file system (e.g., FS 124-1) to another node's (e.g., node 110-2's) file system (e.g., FS 124-2) using a Java-based (e.g., Java 120-1, 120-2) remote-access layer. The disk-based file systems use replication (commonly configured as triplication) to provide availability of data when a node crashes. The "share nothing" architecture moves large amounts of data for processing by the nodes 110 and transfers this data via a network 118, which can slow down the data transferring process. The data is typically written in a managed language such as Java and/or Scala that has a large amount of garbage collection, data copy, and/or processing overheads.
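For contrast, below is a minimal sketch of the kind of remote read that the "share nothing" path implies, written against the standard Hadoop FileSystem API. The cluster address and file path are hypothetical, and the sketch is illustrative only; it is not part of the disclosed system.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedNothingRead {
    public static void main(String[] args) throws Exception {
        // Cluster address and file path are hypothetical examples.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/input/part-00000"))) {
            byte[] buf = new byte[4096];
            // Any block that is not stored on this node is served by another
            // node over TCP/IP before it reaches the dataflow engine.
            int n = in.read(buf);
            System.out.println("read " + n + " bytes through the HDFS layers");
        }
    }
}
```

Every byte that is not on a local disk crosses the network 118 and passes through the Java-based HDFS layers before the dataflow engine can process it.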
[0013] While Figure 1 is described above as using a shared nothing architecture, a shared something architecture can also be used by inserting a number of components (e.g., JNI-FS 221-1 in Figure 2) and bypassing and/or avoiding usage of a number of components (e.g., HDFS 116-1 to 116-3, BD 126-1 to 126-3, and, in some examples, OS 122-1 to 122-3). The inserted number of components can be located within the system of Figure 1 but are not illustrated, for ease of description of the shared nothing operation. Similarly, the bypassed components of Figure 1 can be within Figure 2 even though they are bypassed and are not illustrated, for ease of reference of the shared something architecture.
[0014] Figure 2 illustrates a diagram of an example of a system for shared memory access consistent with the present disclosure. The system in Figure 2 can be a multi-compute node system. The system in Figure 2 illustrates a number of nodes ("Node 1," "Node 2," "Node 3") 210-1, 210-2, ..., 210-N that can be compute nodes, referenced herein as number of nodes 210. The number of nodes 210-1, 210-2, ..., 210-N can communicate with memory (e.g., non-volatile memory, illustrated as "MEM") 228. The memory 228 can be a number of memristors. In some examples, the memory 228 can be non-volatile shared memory that is made available and/or shared with the number of nodes 210 through an Ext4/DAX file system. The number of nodes 210 can transfer data from memory 228 through a number of communication layers. The communication layers can include applications 212-1, 212-2, 212-3, dataflow engines 214-1, 214-2, 214-3, Java interfaces 219-1, 219-2, 219-3, an OS 222, and a file system ("FS") 224.
[0015] Figure 2 illustrates a particular computing architecture referred to as a "shared something" architecture. In a "shared something" architecture, memory resources (e.g., "MEM" 228) can be shared by a number of nodes 210 that each connect directly to the memory 228. In this architecture, a particular compute node (e.g., node 210-1) can communicate with (e.g., send and receive data with) memory 228 without using a network (such as network 118 in Figure 1). For example, the particular node 210-1 would communicate through additional layers, including a FS 224 through JNI-fs 221-1, to connect with the memory 228. A file system 224 can include the Ext4 file system.
[0016] Data movement in this "share something" architecture, both in initial load and in shuffling, can transfer data without use of an HDFS (such as HDFS 116 in Figure 1). The "share something" architecture moves large amounts of data for processing by the nodes 210 without transferring data via a network (such as network 118 in Figure 1), which can speed up the data transferring process.
[0017] A node 210-1 of the number of nodes 210 can run an application 212-1 using a Java interface 219-1. The Java interface 219-1 can communicate with a Java Native Interface (JNI-fs) 221-1, which is referred to herein as a file system "shim." A file system "shim" refers to an interface that allows the Java interface 219-1 to interface directly with native code (e.g., C/C++ code) of the data stored in memory 228. The Java interface 219-1 can connect directly with a file system 224 without using additional Java. When possible, the Java interface 219-1 can connect directly to the memory 228 for data loads to be issued from and/or stored into memory 228, using the JNI-fs 221-1 as a shim to communicate directly with the memory 228. In some examples, the file system 224 can interact with an OS 222 in order to check for permission to access. In some examples, the file system 224 can load directly from the memories (e.g., memristors) 228 and bypass the OS 222 kernel. For example, the file system 224 can bypass the OS 222 for a read of an already opened file using mmap-like instructions.
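As a rough illustration of the shim idea, the following Java sketch declares native entry points that a C/C++ library could implement against an Ext4 + DAX mount. The class name, method signatures, and library name are assumptions for illustration and are not the interface disclosed here.

```java
// Hypothetical JNI "shim": Java declares native entry points whose bodies live
// in a C/C++ library that reads the DAX-backed file directly (e.g., via mmap).
public final class JniFsShim {
    static {
        System.loadLibrary("jnifs"); // library name is an assumption
    }

    private JniFsShim() {}

    // Open a file on the shared-memory file system; returns a native handle.
    public static native long open(String pathname);

    // Read 'length' bytes starting at 'byteOffset' into 'dst', performed in
    // native code so the data does not pass through managed Java I/O layers
    // or incur garbage-collection and copy overheads.
    public static native int readAt(long handle, long byteOffset, byte[] dst, int length);

    public static native void close(long handle);
}
```

A dataflow engine could then issue data loads such as JniFsShim.readAt(handle, byteOffset, buffer, buffer.length) without routing the data through additional Java layers or another node.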
[0018] Figure 3 illustrates a diagram of an example of a system for shared memory access consistent with the present disclosure. The system in Figure 3 can include a first multi-compute node system 331 (such as the multi-compute node system illustrated in Figure 2) sharing a first portion of memory 328-1 and a second multi-compute node system 333 sharing a second portion of memory 328-2. The first multi-compute node system 331 can include a software stack including applications ("App") 312-1, 312-2, 312-3, Java Native Interface shims (JNI-fs) 321-1, 321-2, 321-3, Java interfaces 320-1, 320-2, 320-3, a FS 324, operating systems 322-1, 322-2, 322-3, and shared memory 328-1. The second multi-compute node system can include a software stack including the same features (e.g., applications 312-4, 312-5, 312-6, JNI-fs 321-4, 321-5, 321-6, Java interfaces 320-4, 320-5, 320-6, file system 324, operating systems 322-4, 322-5, 322-6, and shared memory 328-2).
[0019] FS 324 can be shared by the first and second multi-compute node systems 331 and 333. A shared memory 328-1 can be shared through a Remote Direct Memory Access (RDMA) 335 with a shared memory 328-2. An RDMA 335 refers to a direct memory access from memory of one computing device into memory of another computing device without involving an operating system associated with either computing device. This can allow for high-throughput, low-latency networking.
[0020] Figures 4 and 5 illustrate examples of a system and a computing device 554 consistent with the present disclosure. Figure 4 illustrates a diagram of an example of a system for shared memory access consistent with the present disclosure. The system can include a database 442, a shared memory access system 444, and/or a number of engines (e.g., locate engine 446, access engine 448, load engine 450). The shared memory access system 444 can be in communication with the database 442 via a communication link, and can include the number of engines (e.g., locate engine 446, access engine 448, load engine 450). The shared memory access system 444 can include additional or fewer engines than are illustrated to perform the various functions described above in Figures 1-3.
[0021] The number of engines (e.g., locate engine 446, access engine 448, load engine 450) can include a combination of hardware and programming, but at least hardware, that is to perform functions described herein (e.g., locate a portion of data of a shared memory (e.g., a memristor memory) using a file pathname and a byte offset, access the located portion of the data from the shared memory, load the located portion of data into a compute node of a plurality of compute nodes using a Java Native Interface, etc.) stored in a memory resource (e.g., computer readable medium, machine readable medium, etc.) as well as hard-wired program (e.g., logic).
[0022] The locate engine 446 can include hardware and/or a combination of hardware and programming, but at least hardware, to locate a portion of data of a shared memory using a file pathname and a byte offset. The portion of data can be located on a memory (e.g., such as memory 228 in Figure 2). The memory (e.g., a memristor memory) can be shared by a number of nodes (e.g., number of nodes 210 in Figure 2). The pathname of the file and a byte offset can allow at least a node of the number of nodes to locate the file in the memory. The byte offset can indicate a location of a file in the memory at a particular position in the memory. The file pathname can identify the file to be accessed in the memory.
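One way to picture locating data by file pathname and byte offset is the standard Java NIO memory-mapping API, sketched below. The pathname, offset, and length are hypothetical values, and this is a plain-Java approximation of what a locate engine might do rather than the disclosed implementation.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class LocateByOffset {
    // Map 'length' bytes of the file at 'pathname' starting at 'byteOffset'.
    // On a DAX-capable file system the mapping can reference the shared
    // non-volatile memory without an intermediate block-device copy.
    static MappedByteBuffer locate(String pathname, long byteOffset, long length)
            throws IOException {
        Path path = Paths.get(pathname);
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, byteOffset, length);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pathname, offset, and length are illustrative; together they
        // identify where the portion of data sits in the shared memory.
        MappedByteBuffer portion = locate("/mnt/shared/dataset.bin", 4096L, 1024L);
        byte first = portion.get(0); // read without any network transfer
        System.out.println("first byte: " + first);
    }
}
```

Because the mapping references the file's pages directly, a node can reach the located portion without copying it through another node or across a network.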
[0023] The access engine 448 can include hardware and/or a combination of hardware and programming, but at least hardware, to access a located portion of the data from the shared memory. The file can be accessed without transferring data on a network. The file can be accessed without using TCP/IP based reads to access the data from another node as the data is shared across all of the number of nodes.
[0024] The load engine 450 can include hardware and/or a combination of hardware and programming, but at least hardware, to load a located portion of data into a compute node of a plurality of compute nodes using a Java Native Interface (JNI). The located portion of data can be loaded into the compute node without using an additional compute node of the plurality of compute nodes. [0025] Figure 5 illustrates a diagram of an example computing device 554 consistent with the present disclosure. The computing device 554 can utilize software, hardware, firmware, and/or logic to perform functions described herein.
[0026] The computing device 554 can be any combination of hardware and program instructions to share information. The hardware, for example, can include a processing resource 556 and/or a memory resource 570 (e.g., computer-readable medium (CRM), machine readable medium (MRM), database, etc.). A processing resource 556, as used herein, can include any number of processors capable of executing instructions stored by a memory resource 570.
[0027] Processing resource 556 may be implemented in a single device or distributed across multiple devices. The program instructions (e.g., computer readable instructions (CRI)) can include instructions stored on the memory resource 570 and executable by the processing resource 556 to implement a function (e.g., locate a portion of data of a shared memory using a file pathname and a byte offset, access the located portion of the data from the shared memory, load the located portion of data into a compute node of a plurality of compute nodes using a Java Native Interface, etc.).
[0028] The memory resource 570 can be in communication with a processing resource 556. A memory resource 570, as used herein, can include any number of memory components capable of storing instructions that can be executed by processing resource 556. Such memory resource 570 can be a non-transitory CRM or MRM.
Memory resource 570 may be integrated in a single device or distributed across multiple devices. Further, memory resource 570 may be fully or partially integrated in the same device as processing resource 556 or it may be separate but accessible to that device and processing resource 556. Thus, it is noted that the computing device 554 may be implemented on a participant device, on a server device, on a collection of server devices, and/or a combination of the participant device and the server device.
[0029] The memory resource 570 can be in communication with the processing resource 556 via a communication link (e.g., a path) 558. The communication link 558 can be local or remote to a machine (e.g., a computing device) associated with the processing resource 556. Examples of a local communication link 558 can include an electronic bus internal to a machine (e.g., a computing device) where the memory resource 570 is a volatile, non-volatile, fixed, and/or removable storage medium in communication with the processing resource 556 via the electronic bus.
[0030] A number of modules (e.g., locate module 572, access module 574, load module 576) can include CRI that when executed by the processing resource 556 can perform functions. The number of modules such as the locate module 572, the access module 574, and/or the load module 576 can be sub-modules of other modules. For example, the locate module 572 and the access module 574 can be sub-modules and/or contained within the same computing device. In another example, the number of modules (e.g., a locate module 572, an access module 574, a load module 576) can comprise individual modules at separate and distinct locations (e.g., CRM, etc.).
[0031] Each of the number of modules (e.g., locate module 572, access module 574, load module 576) can include instructions that when executed by the processing resource 556 can function as a corresponding engine as described herein. For example, the locate module 572 can include instructions that when executed by the processing resource 556 can function as the locate engine 446.
[0032] As used herein, "logic" is an alternative or additional processing resource to perform a particular action and/or function, etc., described herein, which includes hardware, e.g., various forms of transistor logic, application specific integrated circuits (ASICs), etc., as opposed to computer executable instructions, e.g., software, firmware, etc., stored in memory and executable by a processor. Further, as used herein, "a" or "a number of" something can refer to one or more such things. For example, "a number of widgets" can refer to one or more widgets.
[0033] The above specification, examples and data provide a description of the method and applications, and use of the system and method of the present disclosure. Since many examples can be made without departing from the spirit and scope of the system and method of the present disclosure, this specification merely sets forth some of the many possible example configurations and implementations.

Claims

What is claimed:
1. A system, comprising:
shared memory storing a plurality of shared data;
a plurality of compute nodes, wherein each of the plurality of compute nodes access the plurality of shared data directly independent of:
communicating through an additional node of the plurality of compute nodes; and transferring data of the plurality of data via a network.
2. The system of claim 1, wherein the plurality of compute nodes access the plurality of shared data using native machine code.
3. The system of claim 2, wherein the plurality of compute nodes access the plurality of shared data using a Java Native Interface (JNI) to read data using the native machine code.
4. The system of claim 2, wherein the plurality of compute nodes access the plurality of shared data without using managed Java code.
5. The system of claim 1, wherein the plurality of compute nodes access the plurality of shared data independent of using a Hadoop Distributed File System (HDFS).
6. A system comprising:
a pool of shared non-volatile memory;
a plurality of compute nodes with direct and equal access to the entirety of the pool of shared non-volatile memory, wherein each of the plurality of compute nodes access each portion of the pool of shared non-volatile memory:
independent of sending data over a network; and
using a shared file system.
7. The system of claim 6, comprising an operating system (OS) shared by the plurality of compute nodes.
8. The system of claim 7, wherein the operating system is bypassed by the plurality of compute nodes to access the pool of shared non-volatile memory.
9. The system of claim 8, wherein each of the processors execute instructions to run an application that uses a Java Native Interface to access the native code stored in the shared non-volatile memory.
10. The system of claim 9, wherein the native code stored in the shared non-volatile memory is accessed without using Java or a Java Virtual Machine.
11. The system of claim 6, comprising a non-transitory computer readable medium storing instructions executable by a processor for each of the plurality of compute nodes to run an application.
12. A non-transitory computer-readable medium storing instructions executable by a processor to:
locate a portion of data of a shared memory using a file pathname and a byte offset;
access the located portion of the data from the shared memory; and
load the located portion of data into a compute node of a plurality of compute nodes using a Java Native Interface (JNI), wherein the located portion of data is loaded into the compute node without using an additional compute node of the plurality of compute nodes.
13. The non-transitory computer-readable medium of claim 12, wherein the instructions are executable by the processor to load the located portion of the data into the compute node without accessing a network.
14. The non-transitory computer-readable medium of claim 12, wherein the instructions are executable by the processor to access the located portion of data by an additional compute node of the plurality of compute nodes without communicating through the compute node and without transferring data via a network.
15. The non-transitory computer-readable medium of claim 12, wherein the instructions are executable by the processor to load the located portion of the data without partitioning data on the compute node of the plurality of compute nodes.
PCT/US2016/016565 2016-02-04 2016-02-04 Shared memory access WO2017135953A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2016/016565 WO2017135953A1 (en) 2016-02-04 2016-02-04 Shared memory access

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2016/016565 WO2017135953A1 (en) 2016-02-04 2016-02-04 Shared memory access

Publications (1)

Publication Number Publication Date
WO2017135953A1 true WO2017135953A1 (en) 2017-08-10

Family

ID=59499956

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/016565 WO2017135953A1 (en) 2016-02-04 2016-02-04 Shared memory access

Country Status (1)

Country Link
WO (1) WO2017135953A1 (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070156907A1 (en) * 2005-12-30 2007-07-05 Galin Galchev Session handling based on shared session information
US20110264841A1 (en) * 2010-04-26 2011-10-27 International Business Machines Corporation Sharing of class data among virtual machine applications running on guests in virtualized environment using memory management facility
US20120216216A1 (en) * 2011-02-21 2012-08-23 Universidade Da Coruna Method and middleware for efficient messaging on clusters of multi-core processors
US20150220354A1 (en) * 2013-11-26 2015-08-06 Dynavisor, Inc. Dynamic I/O Virtualization
WO2015195079A1 (en) * 2014-06-16 2015-12-23 Hewlett-Packard Development Company, L.P. Virtual node deployments of cluster-based applications


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16889602

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16889602

Country of ref document: EP

Kind code of ref document: A1