The Anatomy of the Grid
|
Enabling Scalable Virtual Organizations *
|
Ian Foster ¶
|
Carl Kesselman §
|
Steven Tuecke
|
{foster, tuecke}@mcs.anl.gov, carl@isi.edu
|
Abstract
|
“Grid” computing has emerged as an important new field, distinguished
from conventional
distributed computing by its focus on large-scale resource sharing,
innovative applications, and,
in some cases, high-performance orientation. In this article, we
define this new field. First, we
review the “Grid problem,” which we define as flexible, secure, coordinated
resource sharing
among dynamic collections of individuals, institutions, and resources—what
we refer to as virtual
organizations. In such settings,
we encounter unique authentication, authorization, resource
access, resource discovery, and other challenges. It is this class
of problem that is addressed by
Grid technologies. Next, we present an extensible and open Grid architecture, in which
protocols, services, application programming interfaces, and software
development kits are
categorized according to their roles in enabling resource sharing.
We describe requirements that
we believe any such mechanisms must satisfy and we discuss the importance
of defining a
compact set of intergrid protocols
to enable interoperability among different Grid systems.
Finally, we discuss how Grid technologies relate to other contemporary
technologies, including
enterprise integration, application service provider, storage service
provider, and peer-to-peer
computing. We maintain that Grid concepts and technologies complement
and have much to
contribute to these other approaches.
|
1 Introduction
|
The term “the Grid” was coined in the mid-1990s to denote a proposed
distributed computing
infrastructure for advanced science and engineering [34]. Considerable
progress has since been
made on the construction of such an infrastructure (e.g., [10, 16,
46, 59]), but the term “Grid” has
also been conflated, at least in popular perception, to embrace everything
from advanced
networking to artificial intelligence. One might wonder whether the
term has any real substance
and meaning. Is there really a distinct “Grid problem” and hence
a need for new “Grid
technologies”? If so, what is the nature of these technologies, and
what is their domain of
applicability? While numerous groups have interest in Grid concepts
and share, to a significant
extent, a common vision of Grid architecture, we do not see consensus
on the answers to these
questions.
|
Our purpose in this article is to argue that the Grid concept is
indeed motivated by a real and
specific problem and that there is an emerging, well-defined Grid
technology base that addresses
significant aspects of this problem. In the process, we develop a
detailed architecture and
roadmap for current and future Grid technologies. Furthermore, we
assert that while Grid
technologies are currently distinct from other major technology trends,
such as Internet,
enterprise, distributed, and peer-to-peer computing, these other
trends can benefit significantly
from growing into the problem space addressed by Grid technologies.
|
|
¶
|
Mathematics and Computer
Science Division, Argonne National Laboratory, Argonne, IL 60439.
Department of Computer Science,
The University of Chicago, Chicago, IL 60657.
§Information Sciences Institute,
The University of Southern California, Marina del Rey, CA 90292.
|
* To appear: Intl J. Supercomputer Applications, 2001.
|
The real and specific problem that underlies the Grid concept is
coordinated resource sharing
and problem solving in dynamic, multi-institutional virtual organizations.
The sharing that we
are concerned with is not primarily file exchange but rather direct
access to computers, software,
data, and other resources, as is required by a range of collaborative
problem-solving and resource-
brokering strategies emerging in industry, science, and engineering.
This sharing is, necessarily,
highly controlled, with resource providers and consumers defining
clearly and carefully just what
is shared, who is allowed to share, and the conditions under which
sharing occurs. A set of
individuals and/or institutions defined by such sharing rules forms
what we call a virtual
organization (VO).
|
The following are examples of VOs: the application service providers,
storage service providers,
cycle providers, and consultants engaged by a car manufacturer to
perform scenario evaluation
during planning for a new factory; members of an industrial consortium
bidding on a new
aircraft; a crisis management team and the databases and simulation
systems that they use to plan
a response to an emergency situation; and members of a large, international,
multiyear high-
energy physics collaboration. Each of these examples represents an
approach to computing and
problem solving based on collaboration in computation- and data-rich
environments.
|
As these examples show, VOs vary tremendously in their purpose, scope,
size, duration,
structure, community, and sociology. Nevertheless, careful study
of underlying technology
requirements leads us to identify a broad set of common concerns
and requirements. In
particular, we see a need for highly flexible sharing relationships,
ranging from client-server to
peer-to-peer; for sophisticated and precise levels of control over
how shared resources are used,
including fine-grained and multi-stakeholder access control, delegation,
and application of local
and global policies; for sharing of varied resources, ranging from
programs, files, and data to
computers, sensors, and networks; and for diverse usage modes, ranging
from single user to
multi-user and from performance-sensitive to cost-sensitive and hence
embracing issues of quality
of service, scheduling, co-allocation, and accounting.
|
Current distributed computing technologies do not address the concerns
and requirements just
listed. For example, current Internet technologies address communication
and information
exchange among computers but do not provide integrated approaches
to the coordinated use of
resources at multiple sites for computation. Business-to-business
exchanges [57] focus on
information sharing (often via centralized servers). So do virtual
enterprise technologies,
although here sharing may eventually extend to applications and physical
devices (e.g., [8]).
Enterprise distributed computing technologies such as CORBA and Enterprise
Java enable
resource sharing within a single organization. The Open Group’s Distributed
Computing
Environment (DCE) supports secure resource sharing across sites,
but most VOs would find it too
burdensome and inflexible. Storage service providers (SSPs) and application
service providers
(ASPs) allow organizations to outsource storage and computing requirements
to other parties, but
only in constrained ways: for example, SSP resources are typically
linked to a customer via a
virtual private network (VPN). Emerging “distributed computing” companies
seek to harness
idle computers on an international scale [31] but, to date, support
only highly centralized access
to those resources. In summary, current technology either does not
accommodate the range of
resource types or does not provide the flexibility and control on
sharing relationships needed to
establish VOs.
|
It is here that Grid technologies enter the picture. Over the past
five years, research and
development efforts within the Grid community have produced protocols,
services, and tools that
address precisely the challenges that arise when we seek to build
scalable VOs. These
technologies include security solutions that support management of
credentials and policies when
computations span multiple institutions; resource management protocols
and services that support
secure remote access to computing and data resources and the co-allocation
of multiple resources;
information query protocols and services that provide configuration
and status information about
resources, organizations, and services; and data management services
that locate and transport
datasets between storage systems and applications.
|
Because of their focus on dynamic, cross-organizational sharing,
Grid technologies complement
rather than compete with existing distributed computing technologies.
For example, enterprise
distributed computing systems can use Grid technologies to achieve
resource sharing across
institutional boundaries; in the ASP/SSP space, Grid technologies
can be used to establish
dynamic markets for computing and storage resources, hence overcoming
the limitations of
current static configurations. We discuss the relationship between
Grids and these technologies
in more detail below.
|
In the rest of this article, we expand upon each of these points
in turn. Our objectives are to (1)
clarify the nature of VOs and Grid computing for those unfamiliar
with the area; (2) contribute to
the emergence of Grid computing as a discipline by establishing a
standard vocabulary and
defining an overall architectural framework; and (3) define clearly
how Grid technologies relate
to other technologies, explaining both why emerging technologies
do not yet solve the Grid
computing problem and how these technologies can benefit from Grid
technologies.
|
It is our belief that VOs have the potential to change dramatically
the way we use computers to
solve problems, much as the web has changed how we exchange information.
As the examples
presented here illustrate, the need to engage in collaborative processes
is fundamental to many
diverse disciplines and activities: it is not limited to science,
engineering and business activities.
It is because of this broad applicability of VO concepts that Grid
technology is important.
|
2 The Emergence of Virtual Organizations
|
Consider the following four scenarios:
|
1. A company needing to reach a decision on the placement of a new
factory invokes a
sophisticated financial
forecasting model from an ASP, providing it with access to
appropriate proprietary
historical data from a corporate database on storage systems
operated by an SSP. During
the decision-making meeting, what-if scenarios are run
collaboratively and interactively,
even though the division heads participating in the
decision are located in
different cities. The ASP itself contracts with a cycle provider for
additional “oomph” during
particularly demanding scenarios, requiring of course that
cycles meet desired security
and performance requirements.
|
2. An industrial consortium formed to develop a feasibility study
for a next-generation
supersonic aircraft undertakes
a highly accurate multidisciplinary simulation of the entire
aircraft. This simulation
integrates proprietary software components developed by
different participants,
with each component operating on that participant’s computers and
having access to appropriate
design databases and other data made available to the
consortium by its members.
|
3. A crisis management team responds to a chemical spill by using
local weather and soil
models to estimate the spread
of the spill, determining the impact based on population
location as well as geographic
features such as rivers and water supplies, creating a short-
term mitigation plan (perhaps
based on chemical reaction models), and tasking
emergency response personnel
by planning and coordinating evacuation, notifying
hospitals, and so forth.
|
4. Thousands of physicists at hundreds of laboratories and universities
worldwide come
together to design, create,
operate, and analyze the products of a major detector at CERN,
the European high-energy physics laboratory. During the analysis
phase, they pool their
computing, storage, and networking resources to create a “Data Grid”
capable of
analyzing petabytes of data [22, 44, 53].
|
These four examples differ in many respects: the number and type
of participants, the types of
activities, the duration and scale of the interaction, and the resources
being shared. But they also
have much in common, as discussed in the following (see also Figure
1).
|
In each case, a number of mutually distrustful participants with
varying degrees of prior
relationship (perhaps none at all) want to share resources in order
to perform some task.
Furthermore, sharing is about more than simply document exchange
(as in “virtual enterprises”
[18]): it can involve direct access to remote software, computers,
data, sensors, and other
resources. For example, members of a consortium may provide access
to specialized software
and data and/or pool their computational resources.
|
[Figure 1 labels. VO P: multidisciplinary design using programs & data at
multiple locations. VO Q: ray tracing using cycles provided by cycle sharing
consortium. Policies shown: “Participants in P can run program A”;
“Participants in P can run program B”; “Participants in P can read data D”;
“Participants in Q can use cycles if idle and budget not exceeded.”]
|
Figure 1: An actual organization can participate
in one or more VOs by sharing some or all of its
resources. We show three actual organizations (the ovals), and two
VOs: P, which links participants in an
aerospace design consortium, and Q, which links colleagues who have
agreed to share spare computing
cycles, for example to run ray tracing computations. The organization
on the left participates in P, the one
to the right participates in Q, and the third is a member of both
P and Q. The policies governing access to
resources (summarized in “quotes”) vary according to the actual organizations,
resources, and VOs
involved.
|
Resource sharing is conditional: each resource owner makes resources
available, subject to
constraints on when, where, and what can be done. For example, a
participant in VO P of Figure
1 might allow VO partners to invoke their simulation service only
for “simple” problems.
Resource consumers may also place constraints on properties of the
resources they are prepared
to work with. For example, a participant in VO Q might accept only
pooled computational
resources certified as “secure.” The implementation of such constraints
requires mechanisms for
expressing policies, for establishing the identity of a consumer
or resource (authentication), and
for determining whether an operation is consistent with applicable
sharing relationships
(authorization).
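|
As an illustration only, the policies quoted in Figure 1 can be written down
as data and checked by a small authorization function. The sketch below
assumes that authentication has already taken place (an Identity is taken to
be a proven principal), and every name in it (Identity, Policy, authorize,
the context keys "idle", "spent", and "budget") is hypothetical; we do not
prescribe a policy language here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List

Context = Dict[str, Any]  # hypothetical: current state of the resource


@dataclass(frozen=True)
class Identity:
    """A principal assumed to be authenticated, with its proven VO memberships."""
    name: str
    vos: FrozenSet[str]


@dataclass(frozen=True)
class Policy:
    """One sharing rule set by a resource owner (illustrative structure)."""
    vo: str
    operation: str
    resource: str
    # Optional condition on when the rule applies; defaults to "always".
    condition: Callable[[Context], bool] = lambda ctx: True


def authorize(who: Identity, operation: str, resource: str,
              policies: List[Policy], ctx: Context) -> bool:
    """Return True if at least one policy permits the requested operation."""
    return any(
        p.vo in who.vos
        and p.operation == operation
        and p.resource == resource
        and p.condition(ctx)
        for p in policies
    )


# The rules quoted in Figure 1, expressed as data.
POLICIES = [
    Policy("P", "run", "program A"),
    Policy("P", "read", "data D"),
    Policy("Q", "use", "cycles",
           lambda ctx: ctx["idle"] and ctx["spent"] <= ctx["budget"]),
]

alice = Identity("alice", frozenset({"P"}))
bob = Identity("bob", frozenset({"Q"}))
```

With these definitions, authorize(bob, "use", "cycles", POLICIES, {"idle":
True, "spent": 5, "budget": 10}) succeeds, while the same request fails when
the machine is busy or the budget is exceeded. The rule stays under the
control of the resource owner, who may change POLICIES at any time.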
|
Sharing relationships can vary dynamically over time, in terms of
the resources involved, the
nature of the access permitted, and the participants to whom access
is permitted. And these
relationships do not necessarily involve an explicitly named set
of individuals, but rather may be
defined implicitly by the policies that govern access to resources.
For example, an organization
might enable access by anyone who can demonstrate that they are a
“customer” or a “student.”
|
The dynamic nature of sharing relationships means that we require
mechanisms for discovering
and characterizing the nature of the relationships that exist at
a particular point in time. For
example, a new participant joining VO Q must be able to determine
what resources it is able to
access, the “quality” of these resources, and the policies that govern
access.
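|
A minimal sketch of such a discovery step follows, assuming a simple
in-memory directory. The record fields, the resource names, and the discover
function are all hypothetical and serve only to make the requirement
concrete; a real directory would be a distributed service queried over a
protocol.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ResourceRecord:
    """A hypothetical directory entry describing one shared resource."""
    name: str
    vos: FrozenSet[str]      # VOs the owner shares this resource with
    cpus: int                # one possible measure of "quality"
    certified_secure: bool   # another possible measure of "quality"
    policy: str              # human-readable summary of the access policy


# Illustrative directory contents; none of these resources are real.
DIRECTORY: List[ResourceRecord] = [
    ResourceRecord("sim-server", frozenset({"P"}), 64, True,
                   "run program A"),
    ResourceRecord("lab-cluster", frozenset({"P", "Q"}), 128, True,
                   "use cycles if idle"),
    ResourceRecord("desktop-pool", frozenset({"Q"}), 16, False,
                   "use cycles if idle and budget not exceeded"),
]


def discover(directory: List[ResourceRecord], vo: str,
             secure_only: bool = False) -> List[ResourceRecord]:
    """List the resources currently visible to members of the given VO."""
    return [
        r for r in directory
        if vo in r.vos and (r.certified_secure or not secure_only)
    ]
```

A new member of VO Q would call discover(DIRECTORY, "Q") to learn which
resources it may use, and could pass secure_only=True to keep only resources
certified as “secure.” Because the directory is queried at the time of use,
the answer tracks changes in the sharing relationships.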
|
Sharing relationships are often not simply client-server, but peer-to-peer:
providers can be
consumers, and sharing relationships can exist among any subset of
participants. Sharing
relationships may be combined to coordinate use across many resources,
each owned by different
organizations. For example, in VO Q, a computation started on one
pooled computational
resource may subsequently access data or initiate subcomputations
elsewhere. The ability to
delegate authority in controlled ways becomes important in such situations,
as do mechanisms for
coordinating operations across multiple resources (e.g., coscheduling).
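|
Controlled delegation can be sketched as a chain of credentials in which each
link carries no more rights than the one before it. The code below is an
assumption-laden toy: the Credential class, the right strings, and the
holder names are hypothetical, and nothing is cryptographically signed, so
it shows only the restriction rule and not a secure mechanism.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Credential:
    """A hypothetical, unsigned credential; real systems would sign each link."""
    holder: str
    rights: FrozenSet[str]
    parent: Optional["Credential"] = None

    def delegate(self, to: str, rights: Iterable[str]) -> "Credential":
        """Issue a credential to another party with a subset of our rights."""
        requested = frozenset(rights)
        if not requested <= self.rights:
            extra = sorted(requested - self.rights)
            raise PermissionError(f"cannot delegate rights not held: {extra}")
        return Credential(to, requested, parent=self)


def chain(cred: Optional[Credential]) -> List[str]:
    """Return the holders along the delegation chain, original owner first."""
    holders: List[str] = []
    while cred is not None:
        holders.append(cred.holder)
        cred = cred.parent
    return list(reversed(holders))


# A user starts a computation, which in turn starts a subcomputation elsewhere.
user = Credential("user", frozenset({"read:data", "run:job", "write:data"}))
job = user.delegate("job@siteA", {"read:data", "run:job"})
subjob = job.delegate("subjob@siteB", {"read:data"})
```

Here the subcomputation at the second site can read data on the user's
behalf but cannot start further jobs: subjob.delegate("x", {"run:job"})
raises PermissionError. Each resource along the way can inspect the chain to
decide whether to honor the request.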
|
The same resource may be used in different ways, depending on the
restrictions placed on the
sharing and the goal of the sharing. For example, a computer may
be used only to run a specific
piece of software in one sharing arrangement, while it may provide
generic compute cycles in
another. Because of the lack of a priori knowledge about how a resource
may be used,
performance metrics, expectations, and limitations (i.e., quality
of service) may be part of the
conditions placed on resource sharing or usage.
|
These characteristics and requirements define what we term a virtual organization, a concept
that
we believe is becoming fundamental to much of modern computing. VOs
enable disparate
groups of organizations and/or individuals to share resources in
a controlled fashion, so that
members may collaborate to achieve a shared goal.
|
3 The Nature of Grid Architecture
|
The establishment, management, and exploitation of dynamic, cross-organizational
VO sharing
relationships require new technology. We structure our discussion
of this technology in terms of
a Grid architecture that identifies fundamental system components, specifies the purpose
and
function of these components, and indicates how these components
interact with one another.
|
In defining a Grid architecture, we start from the perspective that
effective VO operation requires
that we be able to establish sharing relationships among any potential participants.
Interoperability is thus the central issue to be addressed. In a
networked environment,
interoperability means common protocols. Hence, our Grid architecture
is first and foremost a
protocol architecture, with protocols defining
the basic mechanisms by which VO users and
resources negotiate, establish, manage, and exploit sharing relationships.
A standards-based open
architecture facilitates extensibility, interoperability, portability,
and code sharing; standard
protocols make it easy to define standard services that provide enhanced
capabilities. We can
also construct Application Programming Interfaces and Software Development
Kits (see
Appendix for definitions) to provide the programming abstractions
required to create a usable
Grid. Together, this technology and architecture constitute what
is often termed middleware
(“the services needed to support a common set of applications in
a distributed network
environment” [3]), although we avoid that term here due to its vagueness.
We discuss each of
these points in the following.
|
Why is interoperability such a fundamental concern? At issue is our
need to ensure that sharing
relationships can be initiated among arbitrary parties, accommodating
new participants
dynamically, across different platforms, languages, and programming
environments. In this
context, mechanisms serve little purpose if they are not defined
and implemented so as to be
interoperable across organizational boundaries, operational policies,
and resource types. Without
interoperability, VO applications and participants are forced to
enter into bilateral sharing
arrangements, as there is no assurance that the mechanisms used between
any two parties will
extend to any other parties. Without such assurance, dynamic VO formation
is all but impossible,
and the types of VOs that can be formed are severely limited. Just
as the Web revolutionized
information sharing by providing a universal protocol and syntax
(HTTP and HTML) for
information exchange, so we require standard protocols and syntaxes
for general resource
sharing.
|
Why are protocols critical to interoperability? A protocol definition
specifies how distributed
system elements interact with one another in order to achieve a specified
behavior, and the
structure of the information exchanged during this interaction. This
focus on externals
(interactions) rather than internals (software, resource characteristics)
has important pragmatic
benefits. VOs tend to be fluid; hence, the mechanisms used to discover
resources, establish
identity, determine authorization, and initiate sharing must be flexible
and lightweight, so that
resource-sharing arrangements can be established and changed quickly.
Because VOs
complement rather than replace existing institutions, sharing mechanisms
cannot require
substantial changes to local policies and must allow individual institutions
to maintain ultimate
control over their own resources. Since protocols govern the interaction
between components,
and not the implementation of the components, local control is preserved.
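|
The point that a protocol fixes the messages and not the implementation can
be illustrated with a toy status query. The JSON message layout, the "status"
operation, and the two site classes below are assumptions made for this
sketch and do not correspond to any actual Grid protocol.

```python
import json
from typing import Dict, List, Tuple


def encode_request(op: str, resource: str) -> bytes:
    """The agreed request format: the only thing both sides must share."""
    return json.dumps({"op": op, "resource": resource}).encode("utf-8")


def encode_reply(ok: bool, value: str) -> bytes:
    """The agreed reply format."""
    return json.dumps({"ok": ok, "value": value}).encode("utf-8")


class SiteA:
    """One institution's implementation: state kept in a dictionary."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {"cluster1": "idle"}

    def handle(self, raw: bytes) -> bytes:
        req = json.loads(raw)
        if req["op"] == "status" and req["resource"] in self._status:
            return encode_reply(True, self._status[req["resource"]])
        return encode_reply(False, "unknown request")


class SiteB:
    """A different institution: different internals, same wire messages."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, str]] = [("farm7", "busy")]

    def handle(self, raw: bytes) -> bytes:
        req = json.loads(raw)
        if req["op"] == "status":
            for name, state in self._rows:
                if name == req["resource"]:
                    return encode_reply(True, state)
        return encode_reply(False, "unknown request")


def query_status(site, resource: str) -> dict:
    """A client that knows only the message format, not the site's internals."""
    return json.loads(site.handle(encode_request("status", resource)))
```

The client function works unchanged against either site, for example
query_status(SiteA(), "cluster1") and query_status(SiteB(), "farm7"), even
though the two sites store their state differently. Each site remains free
to change its internals and local policies without notifying the other.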
|
Why are services important? A service (see Appendix) is defined solely
by the protocol that it
speaks and the behaviors that it implements. The definition of standard
services—for access to
computation, access to data, resource discovery, coscheduling, data
replication, and so forth—
allows us to enhance the services offered to VO participants and
also to abstract away resource-
specific details that would otherwise hinder the development of VO
applications.
|
Why do we also consider Application Programming Interfaces (APIs)
and Software Development
Kits (SDKs)? There is, of course, more to VOs than interoperability,
protocols, and services.
Developers must be able to develop sophisticated applications in
complex and dynamic execution
environments. Users must be able to operate these applications. Application
robustness,
correctness, development costs, and maintenance costs are all important
concerns. Standard
abstractions, APIs, and SDKs can accelerate code development, enable
code sharing, and enhance
application portability. APIs and SDKs are an adjunct to, not an
alternative to, protocols.
Without standard protocols, interoperability can be achieved at the
API level only by using a
single implementation everywhere—infeasible in many interesting VOs—or
by having every
implementation know the details of every other implementation. (The
Jini approach [6] of
downloading protocol code to a remote site does not circumvent this
requirement.)
|
In summary, our approach to Grid architecture emphasizes the identification
and definition of
protocols and services, first; and APIs and SDKs, second.
|