* The basic commands and functions of SQL
* How to use SQL for data administration (to create tables, indexes, and views)
* How to use SQL for data manipulation (to add, modify, delete, and retrieve data)
* How to use SQL to query a database to extract useful information
Introduction to SQL Part I
* The relational DBMS is the standard for database management.
* The Structured Query Language, SQL, is the standard for working with them
* This chapter is an introduction to essential SQL.
SQL strengths
Covers both
*
Data definition
*
Data manipulation
SQL is relatively easy to learn.
ANSI prescribes a standard SQL.
SQL Weaknesses
*
Some eccentric notation
o
use of ' marks; in strings, rather than "
o
Wildcards: % instead of *
*
Some things are hard to do
*
Different Vendors implement different dialects
*
Not a good conceptual match to most programming language
*
Strictly DDL and DML no standard procedural language.
DB2 concepts
* DB2 consists of multiple "instances" on each server (we have one)
* Within each instance there are databases: we have two, SAMPLE and DBMS, and will be using DBMS
* Within the databases are schemas one for each user.
* Authorization for users is via the system (cs1 account). Your schema name and user name are the same as your username on CS1
Setup for demonstrations
1. Use the winsql program to connect to the datasource DBMS using your CS1 username and password
2. Perform the following commands to create the tables:
drop table vendor; drop table product; drop table customer;
create table vendor like CH06_SALESCO.vendor;
insert into vendor select * from CH06_SALESCO.vendor;
alter table vendor add primary key (v_code);
create table product like CH06_SALESCO.product;
insert into product select * from CH06_SALESCO.product;
alter table product add primary key (p_code)
add foreign key (v_code) references vendor on delete set null on update restrict;
Data Definition Commands
The Database Model
Simple Database -- PRODUCT and VENDOR tables
Each product is supplied by only a single vendor.
A vendor may supply many products.
Data Definition Commands
The Tables and Their Components
*
The VENDOR table contains vendors who are not referenced in the PRODUCT table. PRODUCT is optional to VENDOR.
*
Existing V_CODE values in the PRODUCT table must have a match in the VENDOR table.
*
A few products are supplied factory-direct, a few are made in-house, and a few may have been bought in a special warehouse sale. That is, a product is not necessarily supplied by a vendor. VENDOR is optional to PRODUCT.
Common SQL Datatypes
Data Type
SQL
Numeric
NUMBER(L,D)
DECIMAL(L,D)
INTEGER
SMALLINT
Character
CHAR (L)
VARCHAR (L)
Date
DATE
Data Definition Commands
Creating the Database Structure
This varies among databases. In DB2 there are instances
within each instance there are databases
within each database there are schemas (one for each user). DB2 users are the same as Operating System users.
Statements in DB2 referencing a table include schema (SELECT * FROM schema.tablename) (current schema are implicit).
Data Definition Commands
Creating Table Structures
CREATE TABLE
Thursday, November 6, 2008
Normalization
* What is Normalization?
* Why is its done?
* The normal forms 1NF, 2NF, 3NF, BCNF, 4NF
* Transforming normal forms
* E-R modeling and normalization
* Denormalization
Database Tables and Normalization
Normalization is a process for assigning attributes to entities to:
* Reduce data redundancies
o Help eliminate data anomalies
* Produce controlled redundancies to link tables
* No information is lost in normalization
* Result will be a database that can produce the same information as the original
Normalization Process
Normalizatin works through a series of stages called normal forms:
* First Normal form (1NF)
* Second normal form (2NF)
* Third normal form (3NF)
* etc (4th and 5th)
* 2NF is better than 1NF; 3NF is better than 2NF
* For most business database design purposes, 3NF is highest we need to go in the normalization process
* Highest level of normalization is not always most desirable
The Need for Normalization
Example: company that manages building projects
* Charges its clients by billing hours spent on each contract
* Hourly billing rate is dependent on employee�s position
* Periodically, a report is generated that contains information as follows
Sample Report
Table derived form Above
The Need for Normalization
* Structure of data set in Figure 5.1 does not handle data very well
* The table structure appears to work; report is generated with ease
* Unfortunately, the report may yield different results, depending on what data anomaly has occurred
Issues
* Table entries invite data inconsistencies
* Table displays potential data anomalies
o Update: Modifying JOB_CLASS
o Insertion: New Employee must be assigned project
o Deletion: If employee deleted, other vital data lost: if emp 103 leaves lose info on Elect Engineers
Repeating Group
Repeating group
Derives its name from the fact that a group of multiple (related) entries can exist for any single key attribute occurrence
* Relational table must not contain repeating groups
* Normalizing the table structure will reduce these data redundancies
* Normalization is three-step procedure
Converting to First Normal Form
A table in a relational database must be in 1NF.
* Repeating groups must be eliminated
* Primary key determined
o Uniquely identify attribute values (rows)
o All attributes dependent on primary key
o In example: Combination of PROJ_NUM and EMP_NUM
Dependencies
* Dependencies can be depicted with the help of a diagram
* Dependency diagram:
o Depicts all dependencies found within a given table structure
o Helpful in getting bird�s-eye view of all relationships among a table�s attributes
o Use makes it much less likely that an important dependency will be overlooked
* Desirable dependencies based on entire primary key
* Less desirable dependencies
Partial:
Based on part of composite primary key
Transitive:
One nonprime attribute depends on another nonprime attribute
Dependency Diagram
Fig 5.3: Dependency Diagram
1NF: Definition
* Tabular format in which:
o All key attributes are defined
o There are no repeating groups in the table
o All attributes are dependent on primary key
* All relational tables must satisfy 1NF requirements
* Some tables contain partial dependencies
o Dependencies based on only part of the primary key
o Sometimes used for performance reasons, but should be used with caution
* Still subject to data redundancies
Second Normal Form
1. Identify all key components
* Write each key component on separate line
* Write original key on last line
* Write dependent attributes after each key.
2. Each line will become a new table
Second Normal Form Conversion Results
Second Normal Form Defined
Table is in second normal form (2NF) if:
* It is in 1NF and
* It includes no partial dependencies:
* No attribute is dependent on only a portion of the primary key
Converting to Third Normal Form
* Resolve transitive dependencies (attributes dependent on non-key attributes)
* Create separate table for each transitive dependency
3NF Results
Boyce-Codd Normal Form
* Every determinant in the table is a candidate key
o Has same characteristics as primary key, but for some reason, not chosen to be primary key
* If a table contains only one candidate key, the 3NF and the BCNF are equivalent
* BCNF can be violated only if the table contains more than one candidate key
BCNF (cont)
* Most designers consider the Boyce-Codd normal form (BCNF) as a special case of 3NF
* A table is in 3NF if it is in 2NF and there are no transitive dependencies
o A table can be in 3NF and not be in BCNF
o A transitive dependency exists when one nonprime attribute is dependent on another nonprime attribute
o A nonkey attribute is the determinant of a key attribute
Table in 3nf but not BCNF
Decomposition to BCNF
Decomposition to BCNF
Fourth Normal Form
* Table is in 3NF
* Has no multiple sets of multivalued dependencies
Conversion to 4NF
* 4NF is largely academic if tables conform to the following two rules:
o All attributes are dependent on primary key but independent of each other
o No row contains two or more multivalued facts about an entity
Improving the Design
* Table structures are cleaned up to eliminate the troublesome initial partial and transitive dependencies
* Normalization cannot, by itself, be relied on to make good designs
* It is valuable because its use helps eliminate data redundancies
Improving the Design (cont)
The following changes were made:
* PK assignment
* Naming conventions
* Attribute atomicity
* Adding attributes
* Adding relationships
* Refining PKs
* Maintaining historical accuracy
* Dealing with derived attributes
Completed Database
Completed Database
Completed Database: Assign Table
5.6b: Assign Table
Completed Database: Employee
Final ERD for contracting company
Final ERD
Limitations on System Assigned Keys
* System-assigned primary key may not prevent confusing entries
* Data entries in Table 5.2 are inappropriate because they duplicate existing records
* Yet there has been no violation of either entity integrity or referential integrity
* Perhaps Job Description needs to be unique
Normalization and Database Design
* Normalization should be part of design process
* Make sure that proposed entities meet required normal form before table structures are created
* Many real-world databases have been improperly designed or burdened with anomalies if improperly modified during course of time
* You may be asked to redesign and modify existing databases
Normalization and Database Design (cont)
* E-R Diagram provides macro view, determines entities
* Normalization provides micro view of entities
o Focuses on characteristics of specific entities
o May yield additional entities
* Difficult to separate Normalization and ER diagramming
* Extra check: No attribute that is not a (primary/foreign) key should be repeated in the database (except to record historical data)
Denormalization
* Normalization is one of many database design goals
Normalized table requirements
o Additional processing
o Loss of system speed
* Normalization purity is difficult to sustain due to conflict in
o Design efficiency
o Information requirements
o Processing
* Do not be too quick to denormalize
Unnormalized Table Defects
* Data updates less efficient
* Indexing more cumbersome
* No simple strategies for creating views
Overnormalization
* This is done for performance reason frequently in distributed and clustered database systems
* Splits tables beyond the pont required for normal forms.
* In horizontal partitioning rows of a single logical table are split among several physical tables (e.g. geographically)
* In vertical partitioning a table is split vertically with commonly accessed columns in separate physical tables
* What is Normalization?
* Why is its done?
* The normal forms 1NF, 2NF, 3NF, BCNF, 4NF
* Transforming normal forms
* E-R modeling and normalization
* Denormalization
Database Tables and Normalization
Normalization is a process for assigning attributes to entities to:
* Reduce data redundancies
o Help eliminate data anomalies
* Produce controlled redundancies to link tables
* No information is lost in normalization
* Result will be a database that can produce the same information as the original
Normalization Process
Normalizatin works through a series of stages called normal forms:
* First Normal form (1NF)
* Second normal form (2NF)
* Third normal form (3NF)
* etc (4th and 5th)
* 2NF is better than 1NF; 3NF is better than 2NF
* For most business database design purposes, 3NF is highest we need to go in the normalization process
* Highest level of normalization is not always most desirable
The Need for Normalization
Example: company that manages building projects
* Charges its clients by billing hours spent on each contract
* Hourly billing rate is dependent on employee�s position
* Periodically, a report is generated that contains information as follows
Sample Report
Table derived form Above
The Need for Normalization
* Structure of data set in Figure 5.1 does not handle data very well
* The table structure appears to work; report is generated with ease
* Unfortunately, the report may yield different results, depending on what data anomaly has occurred
Issues
* Table entries invite data inconsistencies
* Table displays potential data anomalies
o Update: Modifying JOB_CLASS
o Insertion: New Employee must be assigned project
o Deletion: If employee deleted, other vital data lost: if emp 103 leaves lose info on Elect Engineers
Repeating Group
Repeating group
Derives its name from the fact that a group of multiple (related) entries can exist for any single key attribute occurrence
* Relational table must not contain repeating groups
* Normalizing the table structure will reduce these data redundancies
* Normalization is three-step procedure
Converting to First Normal Form
A table in a relational database must be in 1NF.
* Repeating groups must be eliminated
* Primary key determined
o Uniquely identify attribute values (rows)
o All attributes dependent on primary key
o In example: Combination of PROJ_NUM and EMP_NUM
Dependencies
* Dependencies can be depicted with the help of a diagram
* Dependency diagram:
o Depicts all dependencies found within a given table structure
o Helpful in getting bird�s-eye view of all relationships among a table�s attributes
o Use makes it much less likely that an important dependency will be overlooked
* Desirable dependencies based on entire primary key
* Less desirable dependencies
Partial:
Based on part of composite primary key
Transitive:
One nonprime attribute depends on another nonprime attribute
Dependency Diagram
Fig 5.3: Dependency Diagram
1NF: Definition
* Tabular format in which:
o All key attributes are defined
o There are no repeating groups in the table
o All attributes are dependent on primary key
* All relational tables must satisfy 1NF requirements
* Some tables contain partial dependencies
o Dependencies based on only part of the primary key
o Sometimes used for performance reasons, but should be used with caution
* Still subject to data redundancies
Second Normal Form
1. Identify all key components
* Write each key component on separate line
* Write original key on last line
* Write dependent attributes after each key.
2. Each line will become a new table
Second Normal Form Conversion Results
Second Normal Form Defined
Table is in second normal form (2NF) if:
* It is in 1NF and
* It includes no partial dependencies:
* No attribute is dependent on only a portion of the primary key
Converting to Third Normal Form
* Resolve transitive dependencies (attributes dependent on non-key attributes)
* Create separate table for each transitive dependency
3NF Results
Boyce-Codd Normal Form
* Every determinant in the table is a candidate key
o Has same characteristics as primary key, but for some reason, not chosen to be primary key
* If a table contains only one candidate key, the 3NF and the BCNF are equivalent
* BCNF can be violated only if the table contains more than one candidate key
BCNF (cont)
* Most designers consider the Boyce-Codd normal form (BCNF) as a special case of 3NF
* A table is in 3NF if it is in 2NF and there are no transitive dependencies
o A table can be in 3NF and not be in BCNF
o A transitive dependency exists when one nonprime attribute is dependent on another nonprime attribute
o A nonkey attribute is the determinant of a key attribute
Table in 3nf but not BCNF
Decomposition to BCNF
Decomposition to BCNF
Fourth Normal Form
* Table is in 3NF
* Has no multiple sets of multivalued dependencies
Conversion to 4NF
* 4NF is largely academic if tables conform to the following two rules:
o All attributes are dependent on primary key but independent of each other
o No row contains two or more multivalued facts about an entity
Improving the Design
* Table structures are cleaned up to eliminate the troublesome initial partial and transitive dependencies
* Normalization cannot, by itself, be relied on to make good designs
* It is valuable because its use helps eliminate data redundancies
Improving the Design (cont)
The following changes were made:
* PK assignment
* Naming conventions
* Attribute atomicity
* Adding attributes
* Adding relationships
* Refining PKs
* Maintaining historical accuracy
* Dealing with derived attributes
Completed Database
Completed Database
Completed Database: Assign Table
5.6b: Assign Table
Completed Database: Employee
Final ERD for contracting company
Final ERD
Limitations on System Assigned Keys
* System-assigned primary key may not prevent confusing entries
* Data entries in Table 5.2 are inappropriate because they duplicate existing records
* Yet there has been no violation of either entity integrity or referential integrity
* Perhaps Job Description needs to be unique
Normalization and Database Design
* Normalization should be part of design process
* Make sure that proposed entities meet required normal form before table structures are created
* Many real-world databases have been improperly designed or burdened with anomalies if improperly modified during course of time
* You may be asked to redesign and modify existing databases
Normalization and Database Design (cont)
* E-R Diagram provides macro view, determines entities
* Normalization provides micro view of entities
o Focuses on characteristics of specific entities
o May yield additional entities
* Difficult to separate Normalization and ER diagramming
* Extra check: No attribute that is not a (primary/foreign) key should be repeated in the database (except to record historical data)
Denormalization
* Normalization is one of many database design goals
Normalized table requirements
o Additional processing
o Loss of system speed
* Normalization purity is difficult to sustain due to conflict in
o Design efficiency
o Information requirements
o Processing
* Do not be too quick to denormalize
Unnormalized Table Defects
* Data updates less efficient
* Indexing more cumbersome
* No simple strategies for creating views
Overnormalization
* This is done for performance reason frequently in distributed and clustered database systems
* Splits tables beyond the pont required for normal forms.
* In horizontal partitioning rows of a single logical table are split among several physical tables (e.g. geographically)
* In vertical partitioning a table is split vertically with commonly accessed columns in separate physical tables
Entity Relationship Modeling
* How relationships between entities are defined and refined, and how such relationships are incorporated into the database design process
* How ERD components affect database design and implementation
* How to interpret the modeling symbols for the four most popular ER modeling tools
* That real-world database design often requires that you reconcile conflicting goals
Entity Relationship Model and Diagram
* ER model forms the basis of an ER diagram
* ERD represents the conceptual database as viewed by end user
* ERDs depict the ER model�s three main components:
o Entities
o Attributes
o Relationships
* Several different diagramming conventions
Entities
* Refers to the entity set and not to a single entity occurrence
* Corresponds to a table and not to a row in the relational environment
* In both the Chen and Crow�s Foot models, an entity is represented by a rectangle containing the entity�s name
* Entity name, a noun, is usually written in capital letters
Attributes
* Characteristics of entities
* Domain is set of possible values
* Primary keys underlined
Attributes (cont)
Attributes (cont)
Simple
Cannot be subdivided
Age, sex, GPA
Composite
Can be subdivided
Address: street city state zip
Single-valued
Has only a single value
Social security number
Multi-valued
Can have many values
Person may have several college degrees
Derived
Can be calculated from other information
Age can be derived from D.O.B.
Multivalued Attributes
Resolving Multivalued Attribute Problems
Although the conceptual model can handle multivalued attributes, you should not implement them in the relational DBMS
* Within original entity, create several new attributes, one for each of the original multivalued attribute�s components
o Can lead to major structural problems in the table
* Create a new entity composed of original multivalued attribute�s components
Creating New Attributes
Creating New Entity Set
Relationships
* Associations between entities
* Established by Business Rules
* Connected entities termed participants
* Connectivity describes relationship classification:
o 1:1, 1:M, M:N
* Cardinality
o Number of entity occurences associated with one occurence of related entity
Connectivity and Cardinality in an ERD
Relationship Strength
* Existence Dependent
o Entity's existence depends on existence of another related entities
o Existence-independent entities can exist apart from related entities
o Employee claims Child
Child is dependent on employee
* Weak (non-identifying)
o One entity is existence-independent on another
o PK of dependent entity doesn't contain PK component of parent entity
o Book is somewhat confused on this
* Strong (identifying)
o One entity is existence-dependent on another
o PK of related entity contains PK component of parent entity
Relationship Participation
* Optional
o Entity occurrence does not require a corresponding occurrence in related entity
o Shown by drawing a small circle on side of optional entity on ERD
* Mandatory
o Entity occurrence requires corresponding occurrence in related entity
o If no optionality symbol is shown on ERD, it is mandatory
Weak Entity
* Existence-dependent on another entity
* Has primary key that is partially or totally derived from parent entity
Weak Entity
Mandatory Class Course relationship
Optional Class Entity in Professor Teaches Class
Degree of Relationship
A relationships degree indicates the number of associated entities.
Implementation of a Ternary Relationship
Composite Entity
* Used to replace M:N relationships with 1:N relationships
* Bridge entities composed of primary keys of each entity needing connection
Entity Subtypes and Supertypes
Generalization Hierarchy
* Depicts relationships between higher-level supertype and lower-level subtype entities
* Supertype has shared attributes
* Subtypes have unique attributes
* Disjoint relationships
o Unique subtypes
o Non-overlapping
o Indicated with a `G'
* Overlapping subtypes use `Gs' Symbol
Nulls Created by Unique Attributes
Generalization Hierarchy: Disjoint
Generalization Hierarchy: Overlapping and Disjoint
Generalization
Supertype/Subtype relationship in an ERD
Comparison of ER Modeling Symbols
ER Modeling Symbols
Developing an E-R Diagram
* Iterative Process
1. Develop general narrative of organizational operations
2. Draw Basic E-R Model
3. Modify E-R model to incorporate newly discovered components/relationships
* Repeat until designers and users agree E-R model comple
Dealing with Conflicting Goals in Database Design
* Database must be designed to conform to design standards
* High-speed processing may require design compromises
* Quest for timely information may be the focus of database design
Other concerns:
o Security
o Performance
o Shared access
o Integrity
o Capabilities of actual DBMS
* How relationships between entities are defined and refined, and how such relationships are incorporated into the database design process
* How ERD components affect database design and implementation
* How to interpret the modeling symbols for the four most popular ER modeling tools
* That real-world database design often requires that you reconcile conflicting goals
Entity Relationship Model and Diagram
* ER model forms the basis of an ER diagram
* ERD represents the conceptual database as viewed by end user
* ERDs depict the ER model�s three main components:
o Entities
o Attributes
o Relationships
* Several different diagramming conventions
Entities
* Refers to the entity set and not to a single entity occurrence
* Corresponds to a table and not to a row in the relational environment
* In both the Chen and Crow�s Foot models, an entity is represented by a rectangle containing the entity�s name
* Entity name, a noun, is usually written in capital letters
Attributes
* Characteristics of entities
* Domain is set of possible values
* Primary keys underlined
Attributes (cont)
Attributes (cont)
Simple
Cannot be subdivided
Age, sex, GPA
Composite
Can be subdivided
Address: street city state zip
Single-valued
Has only a single value
Social security number
Multi-valued
Can have many values
Person may have several college degrees
Derived
Can be calculated from other information
Age can be derived from D.O.B.
Multivalued Attributes
Resolving Multivalued Attribute Problems
Although the conceptual model can handle multivalued attributes, you should not implement them in the relational DBMS
* Within original entity, create several new attributes, one for each of the original multivalued attribute�s components
o Can lead to major structural problems in the table
* Create a new entity composed of original multivalued attribute�s components
Creating New Attributes
Creating New Entity Set
Relationships
* Associations between entities
* Established by Business Rules
* Connected entities termed participants
* Connectivity describes relationship classification:
o 1:1, 1:M, M:N
* Cardinality
o Number of entity occurences associated with one occurence of related entity
Connectivity and Cardinality in an ERD
Relationship Strength
* Existence Dependent
o Entity's existence depends on existence of another related entities
o Existence-independent entities can exist apart from related entities
o Employee claims Child
Child is dependent on employee
* Weak (non-identifying)
o One entity is existence-independent on another
o PK of dependent entity doesn't contain PK component of parent entity
o Book is somewhat confused on this
* Strong (identifying)
o One entity is existence-dependent on another
o PK of related entity contains PK component of parent entity
Relationship Participation
* Optional
o Entity occurrence does not require a corresponding occurrence in related entity
o Shown by drawing a small circle on side of optional entity on ERD
* Mandatory
o Entity occurrence requires corresponding occurrence in related entity
o If no optionality symbol is shown on ERD, it is mandatory
Weak Entity
* Existence-dependent on another entity
* Has primary key that is partially or totally derived from parent entity
Weak Entity
Mandatory Class Course relationship
Optional Class Entity in Professor Teaches Class
Degree of Relationship
A relationships degree indicates the number of associated entities.
Implementation of a Ternary Relationship
Composite Entity
* Used to replace M:N relationships with 1:N relationships
* Bridge entities composed of primary keys of each entity needing connection
Entity Subtypes and Supertypes
Generalization Hierarchy
* Depicts relationships between higher-level supertype and lower-level subtype entities
* Supertype has shared attributes
* Subtypes have unique attributes
* Disjoint relationships
o Unique subtypes
o Non-overlapping
o Indicated with a `G'
* Overlapping subtypes use `Gs' Symbol
Nulls Created by Unique Attributes
Generalization Hierarchy: Disjoint
Generalization Hierarchy: Overlapping and Disjoint
Generalization
Supertype/Subtype relationship in an ERD
Comparison of ER Modeling Symbols
ER Modeling Symbols
Developing an E-R Diagram
* Iterative Process
1. Develop general narrative of organizational operations
2. Draw Basic E-R Model
3. Modify E-R model to incorporate newly discovered components/relationships
* Repeat until designers and users agree E-R model comple
Dealing with Conflicting Goals in Database Design
* Database must be designed to conform to design standards
* High-speed processing may require design compromises
* Quest for timely information may be the focus of database design
Other concerns:
o Security
o Performance
o Shared access
o Integrity
o Capabilities of actual DBMS
The Relational Database Model
* Tthe relational database model takes a logical view of data
* The relational model's basic components are entities, attributes, and relationships among entities
* How entities and their attributes are organized into tables
* About relational database operators, the data dictionary, and the system catalog
* How data redundancy is handled in the relational database model
* Why indexing is important
A logical View of Data
* Relational model
o Enables us to view data logically rather than physically
o Reminds us of simpler file concept of data storage or spreadsheet concept of table
* Table
o Two-dimensional structure composed of rows and columns
o Has advantages of structural and data independence
o Resembles a file from conceptual point of view
o Easier to understand than its hierarchical and network database predecessors
Tables
* Table also called a relation because the relational model's creator, Codd, used this term.
* A relation (and a relational DBMS's table) differs from a table in that there is no default order of rows or columns
o Note SQL does use a column order but this is a limitation of SQL, not the model.
* Each row (tuple) represents a single entity in an entity set
* Each column represents an attribute and each must have a distinct name
* Each row/column intersection (cell) represents a single atomic value
* All values in a column must have same data format
* Each column has a specific range of values known as a domain
* Each row must have some combination of attributes that uniquely identifies it (key)
Student Table Example
Student Table
Keys
* A key consists of one or more attributes that determine other attributes
* Primary key (PK) is an attribute (or a combination of attributes) that uniquely identifies any given entity (row)
* Key's role is based on determination
o If you know the value of attribute A, you can look up (determine) the value of attribute B
Types of Keys
Composite key
Key consisting of multiple attributes
Key attribute
Any attribute that is part of a key
Superkey
Any key that uniquely identifies each entity
The entire row (all attributes) is a superkey
Candidate key
A superkey without redundancies
Null Values
* Can represent
o An unknown attribute value
o A known, but missing, attribute value
o A 'not applicable' condition
* Not permitted in Primary key attributes
* Not the same as 0 or ""
* Can create problems in logic and using formulas
Controlled Redundancy
* Makes the relational database work
* Tables within the database share common attributes that enable us to link tables together
* Multiple occurrences of values in a table are not redundant when they are required to make the relationship work
* Uncontrolled Redundancy is unnecessary duplication of data
* In general any values duplicated between tables should be part of a Foreign Key -- Primary key relation (see below)
Foreign Keys
Foreign key (FK)
An attribute whose values match primary key values in the related table
Referential integrity
FK contains a value that refers to an existing valid tuple (row) in another relation
Secondary key
Key used strictly for data retrieval purposes
Keys (Summary)
Key Type Definition
Superkey Any attribute or combination of attributes that uniquely identifies a row in the table
Candiate Key aka Minimal Superkey. A superkey that does not contain a subset of attributes that is itself a superkey
Primary Key The candidate key selected to uniquely identify all rows
Cannot contain null values
Foreign Key An attribute (or combination of attributes) in one table that must either match the primary key of another table or be null
Secondary Key An attribute or combination of attributes used to make data retrieval more efficient.
Integrity Rules
Entity Integrity
* All entities are unique
* Each entity has unique primary key
Referential Integrity
* Foreign key must match primary key of referenced table or be null
* Impossible to delete row whose primary key has matching foreign key in another table
Relational Algebra
Key operators:
* SELECT
* PROJECT
* JOIN
Other operators
* INTERSECT
* UNION
* DIFFERENCE
* PRODUCT
* DIVIDE
SELECT
Subset of rows based on condition
Select
PROJECT
Subset of columns
PRODUCT
All possible pairings of two tables
Product
JOIN
* Combines information from two or more tables
* Real power behind the relational database, allowing the use of independent tables linked by common attributes
* Join is actually a combination of Product, Select, and Project
Fig 3-11 Join tables
Natural JOIN
Links tables by selecting rows with common values in common attribute(s) (typically equality on foreign keys).
Three stage process:
* Product creates one table
* Select yields appropriate rows
* Project yields single copy of each attribute to eliminate duplicate columns
Natural Join: Step 1 Product
Natural Join: Select
Natural Join Step 2: Select
Natural Join: Step 2 Select
Natural Join Step 3: Project
Natural Join Step 3: Project
Natural Join: Outcome
* Final outcome yields table that
o Does not include unmatched pairs
o Provides only copies of matches
* If no match is made between the table rows,
o the new table does not include the unmatched row
Other Joins
EquiJOIN
* Links tables based on equality condition that compares specified columns of tables
* Does not eliminate duplicate columns
* Join criteria must be explicitly defined
Theta JOIN
* EquiJOIN that compares specified columns of each table
Outer JOIN
* Matched pairs are retained
* Unmatched values in other tables left null
* Right and left
Intersect
Difference
Yields all rows in one table not found in the other table�that is, it subtracts one table from the other
Union
Divide
Requires user of single-column table and two-column table (I have never seen a DIVIDE)
Divide
Data Dictionary and System Catalog
Data Dictionary
* Metadata
* Detailed account of all tables in database
System Catalog
* System-created database
* Stores database characteristics and contents
* The database stores its own description
* Tables can be queried just like any other tables
* Automatically produces database documentation
Relationships within Relational Database
Classified by cardinalities of entities participating
* 1:M
o Relational modeling ideal
o Should be the norm in any relational database design
* M:N
o Must be avoided because they lead to data redundancies
o Split into a pari of 1:M Relationships
* 1:1
o Should be rare
The 1:1 Relationship
* One entity can be related to only one other entity, and vice versa
* Often means that entity components were not defined properly
* Could indicate that two entities actually belong in the same table
* Sometimes 1:1 relationships are appropriate to help avoid nulls
E-R Diagram
* Graphical representation of model
* Chen
* Crows Feet
ERD Symbols
* Rectangles represent entities
* Chen:
o diamonds represent relations
o numbers or m and n are cardinalities
* Crows Foot
o Lines represent relationships
o bar crossing line is "1" side
o 3 pronged crows foot is many side
Example 1:M Relationship
ERD diagram 1:M Relationship
Implementation of 1:M Relationship
M:N Relationships
M:N Relationship ERD
M:N Relationship
* Cannot be directly implemented in a Relational DBMS
Student and Class Tables
Linking Table
* Implementation of a composite entity
* Yields required M:N to 1:M conversion
* Composite entity table must contain at least the primary keys of original tables
* Linking table contains multiple occurrences of the foreign key values
* Additional attributes may be assigned as needed
Composite ERD
Composite Entities
Linking Tables: Implementation
Linking Tables
Data Redundancy Revisited
Foreign keys control and reduce redundancy.
It is not that there is no redundancy in a properly designed database: the redundancy there is is controlled.
Indexes
Note: I do not consider indexes part of the relational model. They are an implementation artifact to improve performance
* Arrangement used to logically access rows in a table
* Index key
o Index�s reference point
o Points to data location identified by the key
* Unique index
o Index in which the index key can only have one pointer value (row) associated with it
* Each index is associated with only one table
* Tthe relational database model takes a logical view of data
* The relational model's basic components are entities, attributes, and relationships among entities
* How entities and their attributes are organized into tables
* About relational database operators, the data dictionary, and the system catalog
* How data redundancy is handled in the relational database model
* Why indexing is important
A logical View of Data
* Relational model
o Enables us to view data logically rather than physically
o Reminds us of simpler file concept of data storage or spreadsheet concept of table
* Table
o Two-dimensional structure composed of rows and columns
o Has advantages of structural and data independence
o Resembles a file from conceptual point of view
o Easier to understand than its hierarchical and network database predecessors
Tables
* Table also called a relation because the relational model's creator, Codd, used this term.
* A relation (and a relational DBMS's table) differs from a table in that there is no default order of rows or columns
o Note SQL does use a column order but this is a limitation of SQL, not the model.
* Each row (tuple) represents a single entity in an entity set
* Each column represents an attribute and each must have a distinct name
* Each row/column intersection (cell) represents a single atomic value
* All values in a column must have same data format
* Each column has a specific range of values known as a domain
* Each row must have some combination of attributes that uniquely identifies it (key)
Student Table Example
Student Table
Keys
* A key consists of one or more attributes that determine other attributes
* Primary key (PK) is an attribute (or a combination of attributes) that uniquely identifies any given entity (row)
* Key's role is based on determination
o If you know the value of attribute A, you can look up (determine) the value of attribute B
Types of Keys
Composite key
Key consisting of multiple attributes
Key attribute
Any attribute that is part of a key
Superkey
Any key that uniquely identifies each entity
The entire row (all attributes) is a superkey
Candidate key
A superkey without redundancies
Null Values
* Can represent
o An unknown attribute value
o A known, but missing, attribute value
o A 'not applicable' condition
* Not permitted in Primary key attributes
* Not the same as 0 or ""
* Can create problems in logic and using formulas
Controlled Redundancy
* Makes the relational database work
* Tables within the database share common attributes that enable us to link tables together
* Multiple occurrences of values in a table are not redundant when they are required to make the relationship work
* Uncontrolled Redundancy is unnecessary duplication of data
* In general any values duplicated between tables should be part of a Foreign Key -- Primary key relation (see below)
Foreign Keys
Foreign key (FK)
An attribute whose values match primary key values in the related table
Referential integrity
FK contains a value that refers to an existing valid tuple (row) in another relation
Secondary key
Key used strictly for data retrieval purposes
Keys (Summary)
Key Type Definition
Superkey Any attribute or combination of attributes that uniquely identifies a row in the table
Candiate Key aka Minimal Superkey. A superkey that does not contain a subset of attributes that is itself a superkey
Primary Key The candidate key selected to uniquely identify all rows
Cannot contain null values
Foreign Key An attribute (or combination of attributes) in one table that must either match the primary key of another table or be null
Secondary Key An attribute or combination of attributes used to make data retrieval more efficient.
Integrity Rules
Entity Integrity
* All entities are unique
* Each entity has unique primary key
Referential Integrity
* Foreign key must match primary key of referenced table or be null
* Impossible to delete row whose primary key has matching foreign key in another table
Relational Algebra
Key operators:
* SELECT
* PROJECT
* JOIN
Other operators
* INTERSECT
* UNION
* DIFFERENCE
* PRODUCT
* DIVIDE
SELECT
Subset of rows based on condition
Select
PROJECT
Subset of columns
PRODUCT
All possible pairings of two tables
Product
JOIN
* Combines information from two or more tables
* Real power behind the relational database, allowing the use of independent tables linked by common attributes
* Join is actually a combination of Product, Select, and Project
Fig 3-11 Join tables
Natural JOIN
Links tables by selecting rows with common values in common attribute(s) (typically equality on foreign keys).
Three stage process:
* Product creates one table
* Select yields appropriate rows
* Project yields single copy of each attribute to eliminate duplicate columns
Natural Join: Step 1 Product
Natural Join: Select
Natural Join Step 2: Select
Natural Join: Step 2 Select
Natural Join Step 3: Project
Natural Join Step 3: Project
Natural Join: Outcome
* Final outcome yields table that
o Does not include unmatched pairs
o Provides only copies of matches
* If no match is made between the table rows,
o the new table does not include the unmatched row
Other Joins
EquiJOIN
* Links tables based on equality condition that compares specified columns of tables
* Does not eliminate duplicate columns
* Join criteria must be explicitly defined
Theta JOIN
* EquiJOIN that compares specified columns of each table
Outer JOIN
* Matched pairs are retained
* Unmatched values in other tables left null
* Right and left
Intersect
Difference
Yields all rows in one table not found in the other table�that is, it subtracts one table from the other
Union
Divide
Requires user of single-column table and two-column table (I have never seen a DIVIDE)
Divide
Data Dictionary and System Catalog
Data Dictionary
* Metadata
* Detailed account of all tables in database
System Catalog
* System-created database
* Stores database characteristics and contents
* The database stores its own description
* Tables can be queried just like any other tables
* Automatically produces database documentation
Relationships within Relational Database
Classified by cardinalities of entities participating
* 1:M
o Relational modeling ideal
o Should be the norm in any relational database design
* M:N
o Must be avoided because they lead to data redundancies
o Split into a pari of 1:M Relationships
* 1:1
o Should be rare
The 1:1 Relationship
* One entity can be related to only one other entity, and vice versa
* Often means that entity components were not defined properly
* Could indicate that two entities actually belong in the same table
* Sometimes 1:1 relationships are appropriate to help avoid nulls
E-R Diagram
* Graphical representation of model
* Chen
* Crows Feet
ERD Symbols
* Rectangles represent entities
* Chen:
o diamonds represent relations
o numbers or m and n are cardinalities
* Crows Foot
o Lines represent relationships
o bar crossing line is "1" side
o 3 pronged crows foot is many side
Example 1:M Relationship
ERD diagram 1:M Relationship
Implementation of 1:M Relationship
M:N Relationships
M:N Relationship ERD
M:N Relationship
* Cannot be directly implemented in a Relational DBMS
Student and Class Tables
Linking Table
* Implementation of a composite entity
* Yields required M:N to 1:M conversion
* Composite entity table must contain at least the primary keys of original tables
* Linking table contains multiple occurrences of the foreign key values
* Additional attributes may be assigned as needed
Composite ERD
Composite Entities
Linking Tables: Implementation
Linking Tables
Data Redundancy Revisited
Foreign keys control and reduce redundancy.
It is not that there is no redundancy in a properly designed database: the redundancy there is is controlled.
Indexes
Note: I do not consider indexes part of the relational model. They are an implementation artifact to improve performance
* Arrangement used to logically access rows in a table
* Index key
o Index�s reference point
o Points to data location identified by the key
* Unique index
o Index in which the index key can only have one pointer value (row) associated with it
* Each index is associated with only one table
Wednesday, November 5, 2008
Chapter 2: Data Models
Note There's minimal coverage of Hierarichcal and Network models as I consider them of historical importance only
Topics
* Business rules
* Data models
o About the basic data-modeling building blocks
o How the major data models evolved, and their advantages and disadvantages
o How data models can be classified by level of abstraction
Business Rules
* A policy, procedure, or principle within a specific organization's environment
* Applies to any organization that stores and uses data to generate information
* Description of operations that help to create and enforce actions within that organization's environment
* Sometimes are external to the organization
* Describe characteristics of the data as viewed by the company
Business Rules (cont)
* Must be rendered in writing
* Brief, precise, unambiguous
* Must be kept up to date
* Must be easy to understand and widely disseminated
Sources of Business Rules
* Company managers
* Policy makers
* Department managers
* Written documentation
o Procedures
o Standards
o Operations manuals
* Direct interviews with end users
Importance of Business Rules
* Promote creation of an accurate data model
* Standardize company's view of data
* Constitute a communications tool between users and designers
* Allow designer to understand the nature, role, and scope of data
* Allow designer to understand business processes
* Allow designer to develop appropriate relationship participation rules and constraints
The Importance of Data Models
* Good database design uses an appropriate data model as its foundation
* End-users have different views and needs for data
* Data model organizes data for various users
Data model
Relatively simple representation, usually graphical, of complex real-world data structures
Communications tool to facilitate interaction among the designer, the applications programmer, and the end user
Data Model Building Blocks
Entity
Anything about which data are to be collected and stored
Attribute
A characteristic of an entity (e.g. last name)
Relationship
an association among (two or more) entities
* One-to-many (1:M) relationship
* Many-to-many (M:N or M:M) relationship
* One-to-one (1:1) relationship
The Evolution of Data Models
* Hierarchical
* Network
* Relational
* Entity relationship
* Object oriented
Crucial Database Components
Schema
Conceptual organization of entire database as viewed by the database administrator
Subschema
Defines database portion "seen" by the application programs that actually produce the desired information from data contained within the database
Data Definition Language (DDL)
Define data characteristics and data structure in order to manipulate the data
DML Concepts
Schema Data Definition Language (DDL)
Enables database administrator to define schema components and relationships
Subschema DDL
Allows application programs to define database components that will be used
DML
Manipulates database contents
The Relational Model
* Developed by Codd (IBM) in 1970
* Considered ingenious but impractical in 1970
* Conceptually simple
* Computers lacked power to implement the relational model
* Today, microcomputers can run sophisticated relational database software
Relational Model: Basic Structure
* Relational Database Management System (RDBMS)
* Most important advantage of the RDBMS is its ability to let the user/designer operate in a human logical environment
Relation (Table)
Matrix consisting of a series of row/column intersections
Related to each other by sharing a common entity characteristic
Relational schema
Representation of relational database's entities, attributes within those entities, and relationships between those entities
Represented as DDL or Visually
Alternative Terminology
Relational Common Traditional
relation table file
tuple row record
attribute cell field
instance value value
Relational Schema
Relational Schema
Linking Relational Tables
Fig 2.4 Linking Relations
Advantages of the Relational Model
* Structural independence
* Improved conceptual simplicity
* Easier database design, implementation, management, and use
* Ad hoc query capability
* Powerful database management system
Disadvantages of the Relational Model
* Substantial hardware and system software overhead
* May not fit all business models
* Can facilitate poor design and implementation
* May promote "islands of information" problems
The Entity Relationship Model
* Widely accepted and adapted graphical tool for data modeling
* Introduced by Chen in 1976
* Graphical representation of entities and their relationships in a database structure
Basic Structure of the Entity Relationship Model
* Entity relationship diagram (ERD)
o Uses graphic representations to model database components
o Entity is mapped to a relational table
* Entity instance (or occurrence) is row in table
* Entity set is collection of like entities
* Connectivity labels types of relationships
Relationships: the Basic Chen ERD
Note: we will be using mostly CrowsFoot notation
Fig 2.6 Chen ERD
Crow's Foot ERD
Fig 2.7 Crow's Foot ERD
Advantages of E.R. Model
* Exceptional conceptual simplicity
* Visual representation
* Effective communication tool
* Integrated with the relational data model
Disadvantages of E.R. Model
* Limited constraint representation
* Limited relationship representation
* No data manipulation language
* Loss of information content
The Object Oriented Model
* Semantic data model (SDM) developed by Hammer and McLeod in 1981
* Modeled both data and their relationships in a single structure known as an object
* Basis of object oriented data model (OODM)
* OODM becomes the basis for the object oriented database management system (OODBMS)
Object Oriented Model (cont)
* Object is described by its factual content
o Like relational model's entity
* Includes information about relationships between facts within object and relationships with other objects
o Unlike relational model's entity
* Subsequent OODM development allowed an object to also contain operations (aka methods)
* Object becomes basic building block for autonomous structures
Developments that Boosted OODM
* Growing costs put a premium on code reusability
* Complex data types and system requirements became difficult to manage with a traditional RDBMS
* Became possible to support increasingly sophisticated transaction & information requirements
* Ever-increasing computing power made it possible to support the large computing overhead required
OODM: Basic Structure
* Object: abstraction of a real-world entity
* Attributes describe the properties of an object
* Objects that share similar characteristics are grouped in classes
* Classes are organized in a class hierarchy
* Inheritance is the ability of an object within the class hierarchy to inherit the attributes and methods of classes above it
Comparison of OO and ER Models
Fig 2.6 Comparison between OO and ER Model
OO Model Advantages
* Adds semantic content
* Visual presentation includes semantic content
* Database integrity
* Both structural and data independence
OO Model Disadvantages
* Slow pace of OODM standards development
* Complex navigational data access
* Steep learning curve
* High system overhead slows transactions
* Lack of market penetration
* Current relational DBMS's have many OO characteristics
Other Models: Extended Relational Data Model (ERDM)
* Semantic data model developed in response to increasing complexity of applications
* DBMS based on the ERDM often described as an object/relational database management system (O/RDBMS)
* Primarily geared to business applications
Date's Objection to ERDM
* Given proper support for domains, relational data models are quite capable of handling complex data
o Therefore, capability that is supposedly being extended is already there
* O/RDM label is not accurate because the relational data model's domain is not an object model structure
Data Models: A Summary
* Each new data model was intended to overcome the shortcomings of previous models
* Common characteristics:
o Conceptual simplicity without compromising the semantic completeness of the database
o Represent the real world as closely as possible
o Representation of real-world transformations (behavior) must be in compliance with consistency and integrity characteristics of any data model
Degrees of Abstraction
* Way of classifying data models
* Many processes begin at high level of abstraction and proceed to an ever-increasing level of detail
* Designing a usable database follows the same basic process
ANSI/SPARC Data Models
American National Standards Institute/Standards Planning and Requirements Committee (ANSI/SPARC)
Classified data models according to their degree of abstraction (1970s):
* Conceptual
* External
* Internal
Data Abstraction Levels
Data Abstraction Levels
The Conceptual Model
* Represents global view of the database
* Enterprise-wide representation of data as viewed by high-level managers
* Basis for identification and description of main data objects, avoiding details
* Most widely used conceptual model is the entity relationship (ER) model
Conceptual Model Tiny College
Advantages of Conceptual Model
* Provides a relatively easily understood macro level view of data environment
* Independent of both software and hardware
o Does not depend on the DBMS software used to implement the model
o Does not depend on the hardware used in the implementation of the model
o Changes in either the hardware or the DBMS software have no effect on the database design at the conceptual level
The Internal Model
* Representation of the database as �seen� by the DBMS
* Adapts the conceptual model to the DBMS
* Software dependent
* Hardware independent
The External Model
* End users� view of the data environment
* Requires that the modeler subdivide set of requirements and constraints into functional modules that can be examined within the framework of their external models
* Good design should:
o Consider such relationships between views
o Provide programmers with a set of restrictions that govern common entities
Division of Conceptual Model into External Models
Fig 2-13
Advantages of External Models
* Use of database subsets makes application program development much simpler
o Facilitates designer�s task by making it easier to identify specific data required to support each business unit�s operations
o Provides feedback about the conceptual model�s adequacy
* Creation of external models helps to ensure security constraints in the database design
External Models for Tiny College
The Physical Model
* Operates at lowest level of abstraction, describing the way data are saved on storage media such as disks or tapes
* Software and hardware dependent
* Requires that database designers have a detailed knowledge of the hardware and software used to implement database design
Levels of Data Abstraction
Summary
* A good DBMS will perform poorly with a poorly designed database
* A data model is a (relatively) simple abstraction of a complex real-world data-gathering environment
* Basic data modeling components are
o Entities
o Attributes
o Relationships
Note There's minimal coverage of Hierarichcal and Network models as I consider them of historical importance only
Topics
* Business rules
* Data models
o About the basic data-modeling building blocks
o How the major data models evolved, and their advantages and disadvantages
o How data models can be classified by level of abstraction
Business Rules
* A policy, procedure, or principle within a specific organization's environment
* Applies to any organization that stores and uses data to generate information
* Description of operations that help to create and enforce actions within that organization's environment
* Sometimes are external to the organization
* Describe characteristics of the data as viewed by the company
Business Rules (cont)
* Must be rendered in writing
* Brief, precise, unambiguous
* Must be kept up to date
* Must be easy to understand and widely disseminated
Sources of Business Rules
* Company managers
* Policy makers
* Department managers
* Written documentation
o Procedures
o Standards
o Operations manuals
* Direct interviews with end users
Importance of Business Rules
* Promote creation of an accurate data model
* Standardize company's view of data
* Constitute a communications tool between users and designers
* Allow designer to understand the nature, role, and scope of data
* Allow designer to understand business processes
* Allow designer to develop appropriate relationship participation rules and constraints
The Importance of Data Models
* Good database design uses an appropriate data model as its foundation
* End-users have different views and needs for data
* Data model organizes data for various users
Data model
Relatively simple representation, usually graphical, of complex real-world data structures
Communications tool to facilitate interaction among the designer, the applications programmer, and the end user
Data Model Building Blocks
Entity
Anything about which data are to be collected and stored
Attribute
A characteristic of an entity (e.g. last name)
Relationship
an association among (two or more) entities
* One-to-many (1:M) relationship
* Many-to-many (M:N or M:M) relationship
* One-to-one (1:1) relationship
The Evolution of Data Models
* Hierarchical
* Network
* Relational
* Entity relationship
* Object oriented
Crucial Database Components
Schema
Conceptual organization of entire database as viewed by the database administrator
Subschema
Defines database portion "seen" by the application programs that actually produce the desired information from data contained within the database
Data Definition Language (DDL)
Define data characteristics and data structure in order to manipulate the data
DML Concepts
Schema Data Definition Language (DDL)
Enables database administrator to define schema components and relationships
Subschema DDL
Allows application programs to define database components that will be used
DML
Manipulates database contents
The Relational Model
* Developed by Codd (IBM) in 1970
* Considered ingenious but impractical in 1970
* Conceptually simple
* Computers lacked power to implement the relational model
* Today, microcomputers can run sophisticated relational database software
Relational Model: Basic Structure
* Relational Database Management System (RDBMS)
* Most important advantage of the RDBMS is its ability to let the user/designer operate in a human logical environment
Relation (Table)
Matrix consisting of a series of row/column intersections
Related to each other by sharing a common entity characteristic
Relational schema
Representation of relational database's entities, attributes within those entities, and relationships between those entities
Represented as DDL or Visually
Alternative Terminology
Relational Common Traditional
relation table file
tuple row record
attribute cell field
instance value value
Relational Schema
Relational Schema
Linking Relational Tables
Fig 2.4 Linking Relations
Advantages of the Relational Model
* Structural independence
* Improved conceptual simplicity
* Easier database design, implementation, management, and use
* Ad hoc query capability
* Powerful database management system
Disadvantages of the Relational Model
* Substantial hardware and system software overhead
* May not fit all business models
* Can facilitate poor design and implementation
* May promote "islands of information" problems
The Entity Relationship Model
* Widely accepted and adapted graphical tool for data modeling
* Introduced by Chen in 1976
* Graphical representation of entities and their relationships in a database structure
Basic Structure of the Entity Relationship Model
* Entity relationship diagram (ERD)
o Uses graphic representations to model database components
o Entity is mapped to a relational table
* Entity instance (or occurrence) is row in table
* Entity set is collection of like entities
* Connectivity labels types of relationships
Relationships: the Basic Chen ERD
Note: we will be using mostly CrowsFoot notation
Fig 2.6 Chen ERD
Crow's Foot ERD
Fig 2.7 Crow's Foot ERD
Advantages of E.R. Model
* Exceptional conceptual simplicity
* Visual representation
* Effective communication tool
* Integrated with the relational data model
Disadvantages of E.R. Model
* Limited constraint representation
* Limited relationship representation
* No data manipulation language
* Loss of information content
The Object Oriented Model
* Semantic data model (SDM) developed by Hammer and McLeod in 1981
* Modeled both data and their relationships in a single structure known as an object
* Basis of object oriented data model (OODM)
* OODM becomes the basis for the object oriented database management system (OODBMS)
Object Oriented Model (cont)
* Object is described by its factual content
o Like relational model's entity
* Includes information about relationships between facts within object and relationships with other objects
o Unlike relational model's entity
* Subsequent OODM development allowed an object to also contain operations (aka methods)
* Object becomes basic building block for autonomous structures
Developments that Boosted OODM
* Growing costs put a premium on code reusability
* Complex data types and system requirements became difficult to manage with a traditional RDBMS
* Became possible to support increasingly sophisticated transaction & information requirements
* Ever-increasing computing power made it possible to support the large computing overhead required
OODM: Basic Structure
* Object: abstraction of a real-world entity
* Attributes describe the properties of an object
* Objects that share similar characteristics are grouped in classes
* Classes are organized in a class hierarchy
* Inheritance is the ability of an object within the class hierarchy to inherit the attributes and methods of classes above it
Comparison of OO and ER Models
Fig 2.6 Comparison between OO and ER Model
OO Model Advantages
* Adds semantic content
* Visual presentation includes semantic content
* Database integrity
* Both structural and data independence
OO Model Disadvantages
* Slow pace of OODM standards development
* Complex navigational data access
* Steep learning curve
* High system overhead slows transactions
* Lack of market penetration
* Current relational DBMS's have many OO characteristics
Other Models: Extended Relational Data Model (ERDM)
* Semantic data model developed in response to increasing complexity of applications
* DBMS based on the ERDM often described as an object/relational database management system (O/RDBMS)
* Primarily geared to business applications
Date's Objection to ERDM
* Given proper support for domains, relational data models are quite capable of handling complex data
o Therefore, capability that is supposedly being extended is already there
* O/RDM label is not accurate because the relational data model's domain is not an object model structure
Data Models: A Summary
* Each new data model was intended to overcome the shortcomings of previous models
* Common characteristics:
o Conceptual simplicity without compromising the semantic completeness of the database
o Represent the real world as closely as possible
o Representation of real-world transformations (behavior) must be in compliance with consistency and integrity characteristics of any data model
Degrees of Abstraction
* Way of classifying data models
* Many processes begin at high level of abstraction and proceed to an ever-increasing level of detail
* Designing a usable database follows the same basic process
ANSI/SPARC Data Models
American National Standards Institute/Standards Planning and Requirements Committee (ANSI/SPARC)
Classified data models according to their degree of abstraction (1970s):
* Conceptual
* External
* Internal
Data Abstraction Levels
Data Abstraction Levels
The Conceptual Model
* Represents global view of the database
* Enterprise-wide representation of data as viewed by high-level managers
* Basis for identification and description of main data objects, avoiding details
* Most widely used conceptual model is the entity relationship (ER) model
Conceptual Model Tiny College
Advantages of Conceptual Model
* Provides a relatively easily understood macro level view of data environment
* Independent of both software and hardware
o Does not depend on the DBMS software used to implement the model
o Does not depend on the hardware used in the implementation of the model
o Changes in either the hardware or the DBMS software have no effect on the database design at the conceptual level
The Internal Model
* Representation of the database as �seen� by the DBMS
* Adapts the conceptual model to the DBMS
* Software dependent
* Hardware independent
The External Model
* End users� view of the data environment
* Requires that the modeler subdivide set of requirements and constraints into functional modules that can be examined within the framework of their external models
* Good design should:
o Consider such relationships between views
o Provide programmers with a set of restrictions that govern common entities
Division of Conceptual Model into External Models
Fig 2-13
Advantages of External Models
* Use of database subsets makes application program development much simpler
o Facilitates designer�s task by making it easier to identify specific data required to support each business unit�s operations
o Provides feedback about the conceptual model�s adequacy
* Creation of external models helps to ensure security constraints in the database design
External Models for Tiny College
The Physical Model
* Operates at lowest level of abstraction, describing the way data are saved on storage media such as disks or tapes
* Software and hardware dependent
* Requires that database designers have a detailed knowledge of the hardware and software used to implement database design
Levels of Data Abstraction
Summary
* A good DBMS will perform poorly with a poorly designed database
* A data model is a (relatively) simple abstraction of a complex real-world data-gathering environment
* Basic data modeling components are
o Entities
o Attributes
o Relationships
Chapter 1
File Systems and Databases
- Database
- Persistent collection of data and
- Metadata (data about the characteristics of the data and relationships
of the data)
- Database Management System
- collection of programs to manage the database
- Manages and enforces database structure
- Includes interface to manipulate the data
- Allows data to be shared among multiple users
Data vs Information
- Data:
- Raw facts; building blocks of information
- Unprocessed information
- Raw facts; building blocks of information
- Information:
- Data processed to reveal meaning
- Data processed to reveal meaning
- Accurate, relevant, and timely information is key to good decision making
- Good decision making is key to survival in global environment
Importance of DBMS
- Makes data management more efficient and effective
- Query language allows quick answers to ad hoc queries
- Provides easier access to more and better-managed data
- Promotes an integrated view of organization�s operations
- Reduces the chance of inconsistent data
- Helps protect against loss of data
DBMS Manages Interface Between Data and Users
Database Design: Why Design is important
- The database is the foundation of the information system.
- Design should reflect the expected use
- Poor design results in unwanted redundancy
- Poor design leads to inconsistent data
- Poor design leads to poor performance
- Poor design leads to improper information systems operation
Historical Roots
- First business computer applications focused on clerical tasks
- Requests for information quickly followed
- File systems developed to address needs
- Data organized according to expected use
- Data Processing (DP) specialists computerized manual file systems
File Systems

File System Data Management
- Requires extensive programming, typically in Third Generation Language
(3GL)
- Leads to islands of information and data redundancy
- Difficult to make ad hoc queries to obtain information
- Difficult to maintain data integrity
Data and Structural Dependence
- Data characteristics are embodied in programs not stored with the data.
- Changes in data characteristics requires modifying programs
- Changes in file structures require modification of related programs
Data Redundancy
- Different and possibly conflicting versions of same data
- Results in problems during data:
- Modification (e.g. address changes)
- Insertion
- Deletion
- Modification (e.g. address changes)
- Data inconsistency: Lack of integrity
Database Systems
- Database consists of logically related data stored in a single repository
- Advantages over file system management approach:
- Eliminates inconsistency, data anomalies, data dependency, and structural
dependency problems
- Stores data structures, relationships, and access paths
- Eliminates inconsistency, data anomalies, data dependency, and structural
Database vs. File Systems
Database System Environment

Database System Types
- Scale
- Single User (desktop)
- Workgroup
- Enterprise
- Distributed or Federated
- Single User (desktop)
- Use
- Production/Transaction
- Decision Support/Data Warehouse
- Production/Transaction
Uses of Databases
- Transactional (or production):
- Supports a company�s day-to-day operations
- Supports a company�s day-to-day operations
- Data warehouse:
- Stores data used to generate information required to make tactical or strategic decisions
- Such decisions typically require �data massaging�
- Often used to store historical data
- Structure is quite different
- Stores data used to generate information required to make tactical or strategic decisions
DBMS Functions
- Metadata/Data Dictionary Management
- Data storage management
- Data transformation and presentation
- Security management and Multiuser access control
- Backup and recovery management
- Data integrity management
- Database language and application programming interfaces
- Database communication interfaces
Database Models
Collection of logical constructs used to represent data structure and relationships
- Conceptual Models: logical nature of data representation
- Implementation Models: how data are represented
Database Models
The hierarchical and network models are of historical interest only.
Database Models:
- Relational
- Entity-Relationship
- Object oriented
Relational Model
- Most common model
- Perceived by user as collection of tables containing data
- Actually has a formal definitin based on set theory
- Tables are a series of row/column intersections
- Tables related by sharing common entity characteristic(s)
Relational Database

Relational Database Model Advantages
- Structural independence
- Improved conceptual simplicity
- Easier database design, implementation, management, and use
- Ad hoc query capability with SQL (standard interface)
- Powerful database management system
Relational Database Model Disadvantages
- Substantial hardware and system software overhead
- Poor design and implementation is made easy
- Not a cure all: May promote "islands of information" problems
- SQL is not completely standardized. One DBMS is not a "drop in"
replacement for another.
- May have problems storing some types of data
Entity Relationship Database Model
- Primarily a database design tool.
- Complements the relational data model concepts
- Represented in an entity relationship diagram (ERD)
- Based on entities, attributes, and relationships
ER-Diagram

ER Model Advantages
- Conceptual simplicity
- Visual representation
- Effective communication tool
- Integrated with the relational database model
ER Model Disadvatages
- Limited constraint representation
- Limited relationship representation
- No data manipulation language
- Loss of information content
- May be overly complex for end users
Object-Oriented Model
- Objects or abstractions of real-world entities are stored
- Attributes describe properties
- Collection of similar objects is a class
- Methods represent real world actions of classes
- Classes are organized in a class hierarchy
- Objects inherit attributes and methods of classes above.
Object Oriented Model

OO Model Advantages
- Adds semantic context
- Structural and data independence
- May mesh well with Object Oriented Programming
OO Model Disadvantages
- Lack of standards in model
- Lack of standard manipulation languages
- Complex navigational data access
- Steep learning curve
- Poor performance
Thursday, September 11, 2008
Interview questions for Linux admin
Advantages/disadvantages of script vs compiled program.
Name a replacement for PHP/Perl/MySQL/Linux/Apache and show main differences. .Why have you choosen such a combination of products?
Differences between two last MySQL versions. Which one would you choose and when/why?
Main differences between Apache 1.x and 2.x. Why is 2.x not so popular? Which one would you choose and when/why?
Which Linux distros do you have experience with?
Which distro you prefer? Why?
Which tool would you use to update Debian / Slackware / RedHat / Mandrake / SuSE ?
You’re asked to write an Apache module. What would you do?
Which tool do you prefer for Apache log reports?
Your portfolio. (even a PHP guest book may work well)
What does ‘route’ command do?
Differences between ipchains and iptables.
What’s eth0, ppp0, wlan0, ttyS0, etc.
What are different directories in / for?
Partitioning scheme for new webserver. Why?
Name a replacement for PHP/Perl/MySQL/Linux/Apache and show main differences. .Why have you choosen such a combination of products?
Differences between two last MySQL versions. Which one would you choose and when/why?
Main differences between Apache 1.x and 2.x. Why is 2.x not so popular? Which one would you choose and when/why?
Which Linux distros do you have experience with?
Which distro you prefer? Why?
Which tool would you use to update Debian / Slackware / RedHat / Mandrake / SuSE ?
You’re asked to write an Apache module. What would you do?
Which tool do you prefer for Apache log reports?
Your portfolio. (even a PHP guest book may work well)
What does ‘route’ command do?
Differences between ipchains and iptables.
What’s eth0, ppp0, wlan0, ttyS0, etc.
What are different directories in / for?
Partitioning scheme for new webserver. Why?
Unix sysadmin interview questions
1. How would you make the following SQL statement run faster? SELECT * FROM TABLEA WHERE COL1=’A’ AND COL2=’B'; A: Make sure that COL1 and COL2 have indexes.Find out which condition will return less values and use that as the first conditonal.
2. What is Data Mining A: Data Minig is the process of sifting through extremeley large amounts of Data to find trends or relevent information.
3. Name the Seven layers in the OSI Model. A: Appication, Presentation, Session, Transport, Network, Data Link, Phyiscal
4. What is one way to view a unix network share on a Windows computer, within explorer A: NFS, The Unix computer can be running a NFS Server Daemon.
5. How would you find all the processes running on your computer. A: Unix, is ps -ef or ps -aux depending on version.
6. What is DHCP A: DHCP is a way to dynamically assign IP address to computers. Dyanmic Host Configuration Protocol
7. What is HTTP Tunneling A: HTTP Tunneling is a security method that encryptes packets traveling throught the internet. Only the intended reciepent should be able to decrypt the packets. Can be used to Create Virtual Private Networks. (VPN)
8. Scenario: You have 9 identical looking balls, however one ball is heavier than the others. You have two chances to use a balance. How do you find out which ball is the heaviest? A: Split into groups of three, randomly choose two groups and use balance on them. If one group is heavier, then discard the other 6 balls. If the two groups are the same weight. The heavier ball must be in the group that was not on the scale. Now randomly choose two balls and test on balance. If they are the same weight, the heaviest ball is on one that was not tested. Else the heaviest ball is already known from the balance
2. What is Data Mining A: Data Minig is the process of sifting through extremeley large amounts of Data to find trends or relevent information.
3. Name the Seven layers in the OSI Model. A: Appication, Presentation, Session, Transport, Network, Data Link, Phyiscal
4. What is one way to view a unix network share on a Windows computer, within explorer A: NFS, The Unix computer can be running a NFS Server Daemon.
5. How would you find all the processes running on your computer. A: Unix, is ps -ef or ps -aux depending on version.
6. What is DHCP A: DHCP is a way to dynamically assign IP address to computers. Dyanmic Host Configuration Protocol
7. What is HTTP Tunneling A: HTTP Tunneling is a security method that encryptes packets traveling throught the internet. Only the intended reciepent should be able to decrypt the packets. Can be used to Create Virtual Private Networks. (VPN)
8. Scenario: You have 9 identical looking balls, however one ball is heavier than the others. You have two chances to use a balance. How do you find out which ball is the heaviest? A: Split into groups of three, randomly choose two groups and use balance on them. If one group is heavier, then discard the other 6 balls. If the two groups are the same weight. The heavier ball must be in the group that was not on the scale. Now randomly choose two balls and test on balance. If they are the same weight, the heaviest ball is on one that was not tested. Else the heaviest ball is already known from the balance
Unix/Linux programming interview questions
1. What is the major advantage of a hash table? (Asked by Silicon Magic Corp. people) Answer: The major advantage of a hash table is its speed. Because the hash function is to take a range of key values and transform them into index values in such a way that the key values are distributed randomly across all the indices of a hash table.
2. What are the techniques that you use to handle the collisions in hash tables?(Asked by Silicon Magic Corp. people)
Answer: We can use two major techniques to handle the collisions. They are open addressing and separate chaining. In open addressing, data items that hash to a full array cell are placed in another cell in the array. In separate chaining, each array element consist of a linked list. All data items hashing to a given array index are inserted in that list.
3. In Unix OS, what is the file server? (Asked by Silicon Magic Corp. people)
Answer: The file server is a machine that shares its disk storage and files with other machines on the network.
4. What is NFS? What is its job?(Asked by Silicon Magic Corp. people)
Answer: NFS stands for Network File System. NFS enables filesystems physically residing on one computer system to be used by other computers in the network, appearing to users on the remote host as just another local disk.
5. What is CVS? List some useful CVS commands.(Asked by Silicon Magic Corp.people)
Anser: CVS is Concurrent Version System. It is the front end to the RCS revision control system which extends the notion of revision control from a collection of files in a single directory to a hierarchical collection of directories consisting of revision controlled files. These directories and files can be combined together to form a software release. There are some useful commands that are being used very often. They are
cvs checkout cvs update cvs add cvs remove cvs commit
2. What are the techniques that you use to handle the collisions in hash tables?(Asked by Silicon Magic Corp. people)
Answer: We can use two major techniques to handle the collisions. They are open addressing and separate chaining. In open addressing, data items that hash to a full array cell are placed in another cell in the array. In separate chaining, each array element consist of a linked list. All data items hashing to a given array index are inserted in that list.
3. In Unix OS, what is the file server? (Asked by Silicon Magic Corp. people)
Answer: The file server is a machine that shares its disk storage and files with other machines on the network.
4. What is NFS? What is its job?(Asked by Silicon Magic Corp. people)
Answer: NFS stands for Network File System. NFS enables filesystems physically residing on one computer system to be used by other computers in the network, appearing to users on the remote host as just another local disk.
5. What is CVS? List some useful CVS commands.(Asked by Silicon Magic Corp.people)
Anser: CVS is Concurrent Version System. It is the front end to the RCS revision control system which extends the notion of revision control from a collection of files in a single directory to a hierarchical collection of directories consisting of revision controlled files. These directories and files can be combined together to form a software release. There are some useful commands that are being used very often. They are
cvs checkout cvs update cvs add cvs remove cvs commit
Unix/Linux administration interview questions
1.What is LILO?
LILO stands for Linux boot loader. It will load the MBR, master boot record, into the memory, and tell the system which partition and hard drive to boot from.
2.What is the main advantage of creating links to a file instead of copies of the file?
A: The main advantage is not really that it saves disk space (though it does that too) but, rather, that a change of permissions on the file is applied to all the link access points. The link will show permissions of lrwxrwxrwx but that is for the link itself and not the access to the file to which the link points. Thus if you want to change the permissions for a command, such as su, you only have to do it on the original. With copies you have to find all of the copies and change permission on each of the copies.
3.Write a command to find all of the files which have been accessed within the last 30 days.
find / -type f -atime -30 > December.files
This command will find all the files under root, which is ‘/’, with file type is file. ‘-atime -30′ will give all the files accessed less than 30 days ago. And the output will put into a file call December.files.
4.What is the most graceful way to get to run level single user mode?
A: The most graceful way is to use the command init s.If you want to shut everything down before going to single user mode then do init 0 first and from the ok prompt do a boot -s.
5.What does the following command line produce? Explain each aspect of this line.
$ (date ; ps -ef awk ‘{print $1}’ sort uniq wc -l ) >> Activity.log
A: First let’s dissect the line: The date gives the date and time as the first command of the line, this is followed by the a list of all running processes in long form with UIDs listed first, this is the ps -ef. These are fed into the awk which filters out all but the UIDs; these UIDs are piped into sort for no discernible reason and then onto uniq (now we see the reason for the sort - uniq only works on sorted data - if the list is A, B, A, then A, B, A will be the output of uniq, but if it’s A, A, B then A, B is the output) which produces only one copy of each UID.
These UIDs are fed into wc -l which counts the lines - in this case the number of distinct UIDs running processes on the system. Finally the results of these two commands, the date and the wc -l, are appended to the file "Activity.log". Now to answer the question as to what this command line produces. This writes the date and time into the file Activity.log together with the number of distinct users who have processes running on the system at that time. If the file already exists, then these items are appended to the file, otherwise the file is created.
LILO stands for Linux boot loader. It will load the MBR, master boot record, into the memory, and tell the system which partition and hard drive to boot from.
2.What is the main advantage of creating links to a file instead of copies of the file?
A: The main advantage is not really that it saves disk space (though it does that too) but, rather, that a change of permissions on the file is applied to all the link access points. The link will show permissions of lrwxrwxrwx but that is for the link itself and not the access to the file to which the link points. Thus if you want to change the permissions for a command, such as su, you only have to do it on the original. With copies you have to find all of the copies and change permission on each of the copies.
3.Write a command to find all of the files which have been accessed within the last 30 days.
find / -type f -atime -30 > December.files
This command will find all the files under root, which is ‘/’, with file type is file. ‘-atime -30′ will give all the files accessed less than 30 days ago. And the output will put into a file call December.files.
4.What is the most graceful way to get to run level single user mode?
A: The most graceful way is to use the command init s.If you want to shut everything down before going to single user mode then do init 0 first and from the ok prompt do a boot -s.
5.What does the following command line produce? Explain each aspect of this line.
$ (date ; ps -ef awk ‘{print $1}’ sort uniq wc -l ) >> Activity.log
A: First let’s dissect the line: The date gives the date and time as the first command of the line, this is followed by the a list of all running processes in long form with UIDs listed first, this is the ps -ef. These are fed into the awk which filters out all but the UIDs; these UIDs are piped into sort for no discernible reason and then onto uniq (now we see the reason for the sort - uniq only works on sorted data - if the list is A, B, A, then A, B, A will be the output of uniq, but if it’s A, A, B then A, B is the output) which produces only one copy of each UID.
These UIDs are fed into wc -l which counts the lines - in this case the number of distinct UIDs running processes on the system. Finally the results of these two commands, the date and the wc -l, are appended to the file "Activity.log". Now to answer the question as to what this command line produces. This writes the date and time into the file Activity.log together with the number of distinct users who have processes running on the system at that time. If the file already exists, then these items are appended to the file, otherwise the file is created.
SQL Server interview questions
How do you read transaction logs?
How do you reset or reseed the IDENTITY column?
How do you persist objects, permissions in tempdb?
How do you simulate a deadlock for testing purposes?
How do you rename an SQL Server computer?
How do you run jobs from T-SQL?
How do you restore single tables from backup in SQL Server 7.0/2000? In SQL Server 6.5?
Where to get the latest MDAC from?
I forgot/lost the sa password. What do I do?
I have only the .mdf file backup and no SQL Server database backups. Can I get my database back into SQL Server?
How do you add a new column at a specific position (say at the beginning of the table or after the second column) using ALTER TABLE command?
How do you change or alter a user defined data type?
How do you rename an SQL Server 2000 instance?
How do you capture/redirect detailed deadlock information into the error logs?
How do you remotely administer SQL Server?
What are the effects of switching SQL Server from ‘Mixed mode’ to ‘Windows only’ authentication mode? What are the steps required, to not break existing applications?
Is there a command to list all the tables and their associated filegroups?
How do you ship the stored procedures, user defined functions (UDFs), triggers, views of my application, in an encrypted form to my clients/customers? How do you protect intellectual property?
How do you archive data from my tables? Is there a built-in command or tool for this?
How do you troubleshoot ODBC timeout expired errors experienced by applications accessing SQL Server databases?
How do you restart SQL Server service automatically at regular intervals?
What is the T-SQL equivalent of IIF (immediate if/ternary operator) function of other programming languages?
How do you programmatically find out when the SQL Server service started?
How do you get rid of the time part from the date returned by GETDATE function?
How do you upload images or binary files into SQL Server tables?
How do you run an SQL script file that is located on the disk, using T-SQL?
How do you get the complete error message from T-SQL while error handling?
How do you get the first day of the week, last day of the week and last day of the month using T-SQL date functions?
How do you pass a table name, column name etc. to the stored procedure so that I can dynamically select from a table?
Error inside a stored procedure is not being raised to my front-end applications using ADO. But I get the error when I run the procedure from Query Analyzer.
How do you suppress error messages in stored procedures/triggers etc. using T-SQL?
How do you save the output of a query/stored procedure to a text file?
How do you join tables from different databases?
How do you join tables from different servers?
How do you convert timestamp data to date data (datetime datatype)?
Can I invoke/instantiate COM objects from within stored procedures or triggers using T-SQL?
Oracle has a rownum to access rows of a table using row number or row id. Is there any equivalent for that in SQL Server? Or How do you generate output with row number in SQL Server?
How do you specify a network library like TCP/IP using ADO connect string?
How do you generate scripts for repetitive tasks like truncating all the tables in a database, changing owner of all the database objects, disabling constraints on all tables etc?
Is there a way to find out when a stored procedure was last updated?
How do you find out all the IDENTITY columns of all the tables in a given database?
How do you search the code of stored procedures?
How do you retrieve the generated GUID value of a newly inserted row? Is there an @@GUID, just like @@IDENTITY?
How do you reset or reseed the IDENTITY column?
How do you persist objects, permissions in tempdb?
How do you simulate a deadlock for testing purposes?
How do you rename an SQL Server computer?
How do you run jobs from T-SQL?
How do you restore single tables from backup in SQL Server 7.0/2000? In SQL Server 6.5?
Where to get the latest MDAC from?
I forgot/lost the sa password. What do I do?
I have only the .mdf file backup and no SQL Server database backups. Can I get my database back into SQL Server?
How do you add a new column at a specific position (say at the beginning of the table or after the second column) using ALTER TABLE command?
How do you change or alter a user defined data type?
How do you rename an SQL Server 2000 instance?
How do you capture/redirect detailed deadlock information into the error logs?
How do you remotely administer SQL Server?
What are the effects of switching SQL Server from ‘Mixed mode’ to ‘Windows only’ authentication mode? What are the steps required, to not break existing applications?
Is there a command to list all the tables and their associated filegroups?
How do you ship the stored procedures, user defined functions (UDFs), triggers, views of my application, in an encrypted form to my clients/customers? How do you protect intellectual property?
How do you archive data from my tables? Is there a built-in command or tool for this?
How do you troubleshoot ODBC timeout expired errors experienced by applications accessing SQL Server databases?
How do you restart SQL Server service automatically at regular intervals?
What is the T-SQL equivalent of IIF (immediate if/ternary operator) function of other programming languages?
How do you programmatically find out when the SQL Server service started?
How do you get rid of the time part from the date returned by GETDATE function?
How do you upload images or binary files into SQL Server tables?
How do you run an SQL script file that is located on the disk, using T-SQL?
How do you get the complete error message from T-SQL while error handling?
How do you get the first day of the week, last day of the week and last day of the month using T-SQL date functions?
How do you pass a table name, column name etc. to the stored procedure so that I can dynamically select from a table?
Error inside a stored procedure is not being raised to my front-end applications using ADO. But I get the error when I run the procedure from Query Analyzer.
How do you suppress error messages in stored procedures/triggers etc. using T-SQL?
How do you save the output of a query/stored procedure to a text file?
How do you join tables from different databases?
How do you join tables from different servers?
How do you convert timestamp data to date data (datetime datatype)?
Can I invoke/instantiate COM objects from within stored procedures or triggers using T-SQL?
Oracle has a rownum to access rows of a table using row number or row id. Is there any equivalent for that in SQL Server? Or How do you generate output with row number in SQL Server?
How do you specify a network library like TCP/IP using ADO connect string?
How do you generate scripts for repetitive tasks like truncating all the tables in a database, changing owner of all the database objects, disabling constraints on all tables etc?
Is there a way to find out when a stored procedure was last updated?
How do you find out all the IDENTITY columns of all the tables in a given database?
How do you search the code of stored procedures?
How do you retrieve the generated GUID value of a newly inserted row? Is there an @@GUID, just like @@IDENTITY?
SQL Server, DBA interview questions
Questions are categorized under the following sections, for your convenience:
Database design (8 questions)
SQL Server architecture (12 questions)
Database administration (13 questions)
Database programming (10 questions)
Database design
What is normalization? Explain different levels of normalization?
Check out the article Q100139 from Microsoft knowledge base and of course, there’s much more information available in the net. It’ll be a good idea to get a hold of any RDBMS fundamentals text book, especially the one by C. J. Date. Most of the times, it will be okay if you can explain till third normal form.
What is denormalization and when would you go for it?
As the name indicates, denormalization is the reverse process of normalization. It’s the controlled introduction of redundancy in to the database design. It helps improve the query performance as the number of joins could be reduced.
How do you implement one-to-one, one-to-many and many-to-many relationships while designing tables?
One-to-One relationship can be implemented as a single table and rarely as two tables with primary and foreign key relationships. One-to-Many relationships are implemented by splitting the data into two tables with primary key and foreign key relationships. Many-to-Many relationships are implemented using a junction table with the keys from both the tables forming the composite primary key of the junction table. It will be a good idea to read up a database designing fundamentals text book.
What’s the difference between a primary key and a unique key?
Both primary key and unique enforce uniqueness of the column on which they are defined. But by default primary key creates a clustered index on the column, where are unique creates a nonclustered index by default. Another major difference is that, primary key doesn’t allow NULLs, but unique key allows one NULL only.
What are user defined datatypes and when you should go for them?
User defined datatypes let you extend the base SQL Server datatypes by providing a descriptive name, and format to the database. Take for example, in your database, there is a column called Flight_Num which appears in many tables. In all these tables it should be varchar(8). In this case you could create a user defined datatype called Flight_num_type of varchar(8) and use it across all your tables. See sp_addtype, sp_droptype in books online.
What is bit datatype and what’s the information that can be stored inside a bit column?
Bit datatype is used to store boolean information like 1 or 0 (true or false). Untill SQL Server 6.5 bit datatype could hold either a 1 or 0 and there was no support for NULL. But from SQL Server 7.0 onwards, bit datatype can represent a third state, which is NULL.
Define candidate key, alternate key, composite key.
A candidate key is one that can identify each row of a table uniquely. Generally a candidate key becomes the primary key of the table. If the table has more than one candidate key, one of them will become the primary key, and the rest are called alternate keys. A key formed by combining at least two or more columns is called composite key.
What are defaults? Is there a column to which a default can’t be bound?
A default is a value that will be used by a column, if no value is supplied to that column while inserting data. IDENTITY columns and timestamp columns can’t have defaults bound to them. See CREATE DEFAULT in books online.
What is a transaction and what are ACID properties?
A transaction is a logical unit of work in which, all the steps must be performed or none. ACID stands for Atomicity, Consistency, Isolation, Durability. These are the properties of a transaction. For more information and explanation of these properties, see SQL Server books online or any RDBMS fundamentals text book. Explain different isolation levels An isolation level determines the degree of isolation of data between concurrent transactions. The default SQL Server isolation level is Read Committed. Here are the other isolation levels (in the ascending order of isolation): Read Uncommitted, Read Committed, Repeatable Read, Serializable. See SQL Server books online for an explanation of the isolation levels. Be sure to read about SET TRANSACTION ISOLATION LEVEL, which lets you customize the isolation level at the connection level. Read Committed - A transaction operating at the Read Committed level cannot see changes made by other transactions until those transactions are committed. At this level of isolation, dirty reads are not possible but nonrepeatable reads and phantoms are possible. Read Uncommitted - A transaction operating at the Read Uncommitted level can see uncommitted changes made by other transactions. At this level of isolation, dirty reads, nonrepeatable reads, and phantoms are all possible. Repeatable Read - A transaction operating at the Repeatable Read level is guaranteed not to see any changes made by other transactions in values it has already read. At this level of isolation, dirty reads and nonrepeatable reads are not possible but phantoms are possible. Serializable - A transaction operating at the Serializable level guarantees that all concurrent transactions interact only in ways that produce the same effect as if each transaction were entirely executed one after the other. At this isolation level, dirty reads, nonrepeatable reads, and phantoms are not possible.
CREATE INDEX myIndex ON myTable(myColumn)What type of Index will get created after executing the above statement?
Non-clustered index. Important thing to note: By default a clustered index gets created on the primary key, unless specified otherwise.
What’s the maximum size of a row?
8060 bytes. Don’t be surprised with questions like ‘what is the maximum number of columns per table’. 1024 columns per table. Check out SQL Server books online for the page titled: "Maximum Capacity Specifications". Explain Active/Active and Active/Passive cluster configurations Hopefully you have experience setting up cluster servers. But if you don’t, at least be familiar with the way clustering works and the two clusterning configurations Active/Active and Active/Passive. SQL Server books online has enough information on this topic and there is a good white paper available on Microsoft site. Explain the architecture of SQL Server This is a very important question and you better be able to answer it if consider yourself a DBA. SQL Server books online is the best place to read about SQL Server architecture. Read up the chapter dedicated to SQL Server Architecture.
What is lock escalation?
Lock escalation is the process of converting a lot of low level locks (like row locks, page locks) into higher level locks (like table locks). Every lock is a memory structure too many locks would mean, more memory being occupied by locks. To prevent this from happening, SQL Server escalates the many fine-grain locks to fewer coarse-grain locks. Lock escalation threshold was definable in SQL Server 6.5, but from SQL Server 7.0 onwards it’s dynamically managed by SQL Server.
What’s the difference between DELETE TABLE and TRUNCATE TABLE commands?
DELETE TABLE is a logged operation, so the deletion of each row gets logged in the transaction log, which makes it slow. TRUNCATE TABLE also deletes all the rows in a table, but it won’t log the deletion of each row, instead it logs the deallocation of the data pages of the table, which makes it faster. Of course, TRUNCATE TABLE can be rolled back. TRUNCATE TABLE is functionally identical to DELETE statement with no WHERE clause: both remove all rows in the table. But TRUNCATE TABLE is faster and uses fewer system and transaction log resources than DELETE. The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table’s data, and only the page deallocations are recorded in the transaction log. TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes and so on remain. The counter used by an identity for new rows is reset to the seed for the column. If you want to retain the identity counter, use DELETE instead. If you want to remove table definition and its data, use the DROP TABLE statement. You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint; instead, use DELETE statement without a WHERE clause. Because TRUNCATE TABLE is not logged, it cannot activate a trigger. TRUNCATE TABLE may not be used on tables participating in an indexed view
Explain the storage models of OLAP
Check out MOLAP, ROLAP and HOLAP in SQL Server books online for more infomation.
What are the new features introduced in SQL Server 2000 (or the latest release of SQL Server at the time of your interview)? What changed between the previous version of SQL Server and the current version?
This question is generally asked to see how current is your knowledge. Generally there is a section in the beginning of the books online titled "What’s New", which has all such information. Of course, reading just that is not enough, you should have tried those things to better answer the questions. Also check out the section titled "Backward Compatibility" in books online which talks about the changes that have taken place in the new version.
What are constraints? Explain different types of constraints.
Constraints enable the RDBMS enforce the integrity of the database automatically, without needing you to create triggers, rule or defaults. Types of constraints: NOT NULL, CHECK, UNIQUE, PRIMARY KEY, FOREIGN KEY. For an explanation of these constraints see books online for the pages titled: "Constraints" and "CREATE TABLE", "ALTER TABLE"
What is an index? What are the types of indexes? How many clustered indexes can be created on a table? I create a separate index on each column of a table. What are the advantages and disadvantages of this approach?
Indexes in SQL Server are similar to the indexes in books. They help SQL Server retrieve the data quicker. Indexes are of two types. Clustered indexes and non-clustered indexes. When you create a clustered index on a table, all the rows in the table are stored in the order of the clustered index key. So, there can be only one clustered index per table. Non-clustered indexes have their own storage separate from the table data storage. Non-clustered indexes are stored as B-tree structures (so do clustered indexes), with the leaf level nodes having the index key and it’s row locater. The row located could be the RID or the Clustered index key, depending up on the absence or presence of clustered index on the table. If you create an index on each column of a table, it improves the query performance, as the query optimizer can choose from all the existing indexes to come up with an efficient execution plan. At the same t ime, data modification operations (such as INSERT, UPDATE, DELETE) will become slow, as every time data changes in the table, all the indexes need to be updated. Another disadvantage is that, indexes need disk space, the more indexes you have, more disk space is used.
What is RAID and what are different types of RAID configurations?
RAID stands for Redundant Array of Inexpensive Disks, used to provide fault tolerance to database servers. There are six RAID levels 0 through 5 offering different levels of performance, fault tolerance. MSDN has some information about RAID levels and for detailed information, check out the RAID advisory board’s homepage
What are the steps you will take to improve performance of a poor performing query?
This is a very open ended question and there could be a lot of reasons behind the poor performance of a query. But some general issues that you could talk about would be: No indexes, table scans, missing or out of date statistics, blocking, excess recompilations of stored procedures, procedures and triggers without SET NOCOUNT ON, poorly written query with unnecessarily complicated joins, too much normalization, excess usage of cursors and temporary tables. Some of the tools/ways that help you troubleshooting performance problems are: SET SHOWPLAN_ALL ON, SET SHOWPLAN_TEXT ON, SET STATISTICS IO ON, SQL Server Profiler, Windows NT /2000 Performance monitor, Graphical execution plan in Query Analyzer. Download the white paper on performance tuning SQL Server from Microsoft web site. Don’t forget to check out sql-server-performance.com
What are the steps you will take, if you are tasked with securing an SQL Server?
Again this is another open ended question. Here are some things you could talk about: Preferring NT authentication, using server, databse and application roles to control access to the data, securing the physical database files using NTFS permissions, using an unguessable SA password, restricting physical access to the SQL Server, renaming the Administrator account on the SQL Server computer, disabling the Guest account, enabling auditing, using multiprotocol encryption, setting up SSL, setting up firewalls, isolating SQL Server from the web server etc. Read the white paper on SQL Server security from Microsoft website. Also check out My SQL Server security best practices
What is a deadlock and what is a live lock? How will you go about resolving deadlocks?
Deadlock is a situation when two processes, each having a lock on one piece of data, attempt to acquire a lock on the other’s piece. Each process would wait indefinitely for the other to release the lock, unless one of the user processes is terminated. SQL Server detects deadlocks and terminates one user’s process. A livelock is one, where a request for an exclusive lock is repeatedly denied because a series of overlapping shared locks keeps interfering. SQL Server detects the situation after four denials and refuses further shared locks. A livelock also occurs when read transactions monopolize a table or page, forcing a write transaction to wait indefinitely. Check out SET DEADLOCK_PRIORITY and "Minimizing Deadlocks" in SQL Server books online. Also check out the article Q169960 from Microsoft knowledge base.
What is blocking and how would you troubleshoot it?
Blocking happens when one connection from an application holds a lock and a second connection requires a conflicting lock type. This forces the second connection to wait, blocked on the first. Read up the following topics in SQL Server books online: Understanding and avoiding blocking, Coding efficient transactions. Explain CREATE DATABASE syntax Many of us are used to creating databases from the Enterprise Manager or by just issuing the command: CREATE DATABAE MyDB.
But what if you have to create a database with two filegroups, one on drive C and the other on drive D with log on drive E with an initial size of 600 MB and with a growth factor of 15%?
That’s why being a DBA you should be familiar with the CREATE DATABASE syntax. Check out SQL Server books online for more information.
How to restart SQL Server in single user mode? How to start SQL Server in minimal configuration mode?
SQL Server can be started from command line, using the SQLSERVR.EXE. This EXE has some very important parameters with which a DBA should be familiar with. -m is used for starting SQL Server in single user mode and -f is used to start the SQL Server in minimal configuration mode. Check out SQL Server books online for more parameters and their explanations.
As a part of your job, what are the DBCC commands that you commonly use for database maintenance?
DBCC CHECKDB, DBCC CHECKTABLE, DBCC CHECKCATALOG, DBCC CHECKALLOC, DBCC SHOWCONTIG, DBCC SHRINKDATABASE, DBCC SHRINKFILE etc. But there are a whole load of DBCC commands which are very useful for DBAs. Check out SQL Server books online for more information.
What are statistics, under what circumstances they go out of date, how do you update them?
Statistics determine the selectivity of the indexes. If an indexed column has unique values then the selectivity of that index is more, as opposed to an index with non-unique values. Query optimizer uses these indexes in determining whether to choose an index or not while executing a query. Some situations under which you should update statistics: 1) If there is significant change in the key values in the index 2) If a large amount of data in an indexed column has been added, changed, or removed (that is, if the distribution of key values has changed), or the table has been truncated using the TRUNCATE TABLE statement and then repopulated 3) Database is upgraded from a previous version. Look up SQL Server books online for the following commands: UPDATE STATISTICS, STATS_DATE, DBCC SHOW_STATISTICS, CREATE STATISTICS, DROP STATISTICS, sp_autostats, sp_createstats, sp_updatestats
What are the different ways of moving data/databases between servers and databases in SQL Server?
There are lots of options available, you have to choose your option depending upon your requirements. Some of the options you have are: BACKUP/RESTORE, dettaching and attaching databases, replication, DTS, BCP, logshipping, INSERT…SELECT, SELECT…INTO, creating INSERT scripts to generate data.
Explain different types of BACKUPs avaialabe in SQL Server? Given a particular scenario, how would you go about choosing a backup plan?
Types of backups you can create in SQL Sever 7.0+ are Full database backup, differential database backup, transaction log backup, filegroup backup. Check out the BACKUP and RESTORE commands in SQL Server books online. Be prepared to write the commands in your interview. Books online also has information on detailed backup/restore architecture and when one should go for a particular kind of backup.
What is database replication? What are the different types of replication you can set up in SQL Server?
Replication is the process of copying/moving data between databases on the same or different servers. SQL Server supports the following types of replication scenarios: � Snapshot replication � Transactional replication (with immediate updating subscribers, with queued updating subscribers) � Merge replication See SQL Server books online for indepth coverage on replication. Be prepared to explain how different replication agents function, what are the main system tables used in replication etc.
How to determine the service pack currently installed on SQL Server?
The global variable @@Version stores the build number of the sqlservr.exe, which is used to determine the service pack installed. To know more about this process visit SQL Server service packs and versions.
What are cursors? Explain different types of cursors. What are the disadvantages of cursors? How can you avoid cursors?
Cursors allow row-by-row processing of the resultsets. Types of cursors: Static, Dynamic, Forward-only, Keyset-driven. See books online for more information. Disadvantages of cursors: Each time you fetch a row from the cursor, it results in a network roundtrip, where as a normal SELECT query makes only one roundtrip, however large the resultset is. Cursors are also costly because they require more resources and temporary storage (results in more IO operations). Further, there are restrictions on the SELECT statements that can be used with some types of cursors. Most of the times, set based operations can be used instead of cursors. Here is an example: If you have to give a flat hike to your employees using the following criteria: Salary between 30000 and 40000 — 5000 hike Salary between 40000 and 55000 — 7000 hike Salary between 55000 and 65000 — 9000 hike. In this situation many developers tend to use a cursor, determine each employee’s salary and update his salary according to the above formula. But the same can be achieved by multiple update statements or can be combined in a single UPDATE statement as shown below:
UPDATE tbl_emp SET salary = CASE WHEN salary BETWEEN 30000 AND 40000 THEN salary + 5000 WHEN salary BETWEEN 40000 AND 55000 THEN salary + 7000 WHEN salary BETWEEN 55000 AND 65000 THEN salary + 10000 END
Another situation in which developers tend to use cursors: You need to call a stored procedure when a column in a particular row meets certain condition. You don’t have to use cursors for this. This can be achieved using WHILE loop, as long as there is a unique key to identify each row. For examples of using WHILE loop for row by row processing, check out the ‘My code library’ section of my site or search for WHILE. Write down the general syntax for a SELECT statements covering all the options. Here’s the basic syntax: (Also checkout SELECT in books online for advanced syntax).
SELECT select_list [INTO new_table_] FROM table_source [WHERE search_condition] [GROUP BY group_by_expression] [HAVING search_condition] [ORDER BY order_expression [ASC DESC] ]
What is a join and explain different types of joins.
Joins are used in queries to explain how different tables are related. Joins also let you select data from a table depending upon data from another table. Types of joins: INNER JOINs, OUTER JOINs, CROSS JOINs. OUTER JOINs are further classified as LEFT OUTER JOINS, RIGHT OUTER JOINS and FULL OUTER JOINS. For more information see pages from books online titled: "Join Fundamentals" and "Using Joins".
Can you have a nested transaction?
Yes, very much. Check out BEGIN TRAN, COMMIT, ROLLBACK, SAVE TRAN and @@TRANCOUNT
What is an extended stored procedure? Can you instantiate a COM object by using T-SQL?
An extended stored procedure is a function within a DLL (written in a programming language like C, C++ using Open Data Services (ODS) API) that can be called from T-SQL, just the way we call normal stored procedures using the EXEC statement. See books online to learn how to create extended stored procedures and how to add them to SQL Server. Yes, you can instantiate a COM (written in languages like VB, VC++) object from T-SQL by using sp_OACreate stored procedure. Also see books online for sp_OAMethod, sp_OAGetProperty, sp_OASetProperty, sp_OADestroy. For an example of creating a COM object in VB and calling it from T-SQL, see ‘My code library’ section of this site.
What is the system function to get the current user’s user id?
USER_ID(). Also check out other system functions like USER_NAME(), SYSTEM_USER, SESSION_USER, CURRENT_USER, USER, SUSER_SID(), HOST_NAME().
What are triggers? How many triggers you can have on a table? How to invoke a trigger on demand?
Triggers are special kind of stored procedures that get executed automatically when an INSERT, UPDATE or DELETE operation takes place on a table. In SQL Server 6.5 you could define only 3 triggers per table, one for INSERT, one for UPDATE and one for DELETE. From SQL Server 7.0 onwards, this restriction is gone, and you could create multiple triggers per each action. But in 7.0 there’s no way to control the order in which the triggers fire. In SQL Server 2000 you could specify which trigger fires first or fires last using sp_settriggerorder. Triggers can’t be invoked on demand. They get triggered only when an associated action (INSERT, UPDATE, DELETE) happens on the table on which they are defined. Triggers are generally used to implement business rules, auditing. Triggers can also be used to extend the referential integrity checks, but wherever possible, use constraints for this purpose, instead of triggers, as constraints are much faster. Till SQL Server 7.0, triggers fire only after the data modification operation happens. So in a way, they are called post triggers. But in SQL Server 2000 you could create pre triggers also. Search SQL Server 2000 books online for INSTEAD OF triggers. Also check out books online for ‘inserted table’, ‘deleted table’ and COLUMNS_UPDATED()
There is a trigger defined for INSERT operations on a table, in an OLTP system. The trigger is written to instantiate a COM object and pass the newly insterted rows to it for some custom processing. What do you think of this implementation? Can this be implemented better?
Instantiating COM objects is a time consuming process and since you are doing it from within a trigger, it slows down the data insertion process. Same is the case with sending emails from triggers. This scenario can be better implemented by logging all the necessary data into a separate table, and have a job which periodically checks this table and does the needful.
What is a self join? Explain it with an example.
Self join is just like any other join, except that two instances of the same table will be joined in the query. Here is an example: Employees table which contains rows for normal employees as well as managers. So, to find out the managers of all the employees, you need a self join.
CREATE TABLE emp ( empid int, mgrid int, empname char(10) )
INSERT emp SELECT 1,2,’Vyas’ INSERT emp SELECT 2,3,’Mohan’ INSERT emp SELECT 3,NULL,’Shobha’ INSERT emp SELECT 4,2,’Shridhar’ INSERT emp SELECT 5,2,’Sourabh’
SELECT t1.empname [Employee], t2.empname [Manager] FROM emp t1, emp t2 WHERE t1.mgrid = t2.empid Here’s an advanced query using a LEFT OUTER JOIN that even returns the employees without managers (super bosses)
SELECT t1.empname [Employee], COALESCE(t2.empname, ‘No manager’) [Manager] FROM emp t1 LEFT OUTER JOIN emp t2 ON t1.mgrid = t2.empid
Database design (8 questions)
SQL Server architecture (12 questions)
Database administration (13 questions)
Database programming (10 questions)
Database design
What is normalization? Explain different levels of normalization?
Check out the article Q100139 from Microsoft knowledge base and of course, there’s much more information available in the net. It’ll be a good idea to get a hold of any RDBMS fundamentals text book, especially the one by C. J. Date. Most of the times, it will be okay if you can explain till third normal form.
What is denormalization and when would you go for it?
As the name indicates, denormalization is the reverse process of normalization. It’s the controlled introduction of redundancy in to the database design. It helps improve the query performance as the number of joins could be reduced.
How do you implement one-to-one, one-to-many and many-to-many relationships while designing tables?
One-to-One relationship can be implemented as a single table and rarely as two tables with primary and foreign key relationships. One-to-Many relationships are implemented by splitting the data into two tables with primary key and foreign key relationships. Many-to-Many relationships are implemented using a junction table with the keys from both the tables forming the composite primary key of the junction table. It will be a good idea to read up a database designing fundamentals text book.
What’s the difference between a primary key and a unique key?
Both primary key and unique enforce uniqueness of the column on which they are defined. But by default primary key creates a clustered index on the column, where are unique creates a nonclustered index by default. Another major difference is that, primary key doesn’t allow NULLs, but unique key allows one NULL only.
What are user defined datatypes and when you should go for them?
User defined datatypes let you extend the base SQL Server datatypes by providing a descriptive name, and format to the database. Take for example, in your database, there is a column called Flight_Num which appears in many tables. In all these tables it should be varchar(8). In this case you could create a user defined datatype called Flight_num_type of varchar(8) and use it across all your tables. See sp_addtype, sp_droptype in books online.
What is bit datatype and what’s the information that can be stored inside a bit column?
Bit datatype is used to store boolean information like 1 or 0 (true or false). Untill SQL Server 6.5 bit datatype could hold either a 1 or 0 and there was no support for NULL. But from SQL Server 7.0 onwards, bit datatype can represent a third state, which is NULL.
Define candidate key, alternate key, composite key.
A candidate key is one that can identify each row of a table uniquely. Generally a candidate key becomes the primary key of the table. If the table has more than one candidate key, one of them will become the primary key, and the rest are called alternate keys. A key formed by combining at least two or more columns is called composite key.
What are defaults? Is there a column to which a default can’t be bound?
A default is a value that will be used by a column, if no value is supplied to that column while inserting data. IDENTITY columns and timestamp columns can’t have defaults bound to them. See CREATE DEFAULT in books online.
What is a transaction and what are ACID properties?
A transaction is a logical unit of work in which, all the steps must be performed or none. ACID stands for Atomicity, Consistency, Isolation, Durability. These are the properties of a transaction. For more information and explanation of these properties, see SQL Server books online or any RDBMS fundamentals text book. Explain different isolation levels An isolation level determines the degree of isolation of data between concurrent transactions. The default SQL Server isolation level is Read Committed. Here are the other isolation levels (in the ascending order of isolation): Read Uncommitted, Read Committed, Repeatable Read, Serializable. See SQL Server books online for an explanation of the isolation levels. Be sure to read about SET TRANSACTION ISOLATION LEVEL, which lets you customize the isolation level at the connection level. Read Committed - A transaction operating at the Read Committed level cannot see changes made by other transactions until those transactions are committed. At this level of isolation, dirty reads are not possible but nonrepeatable reads and phantoms are possible. Read Uncommitted - A transaction operating at the Read Uncommitted level can see uncommitted changes made by other transactions. At this level of isolation, dirty reads, nonrepeatable reads, and phantoms are all possible. Repeatable Read - A transaction operating at the Repeatable Read level is guaranteed not to see any changes made by other transactions in values it has already read. At this level of isolation, dirty reads and nonrepeatable reads are not possible but phantoms are possible. Serializable - A transaction operating at the Serializable level guarantees that all concurrent transactions interact only in ways that produce the same effect as if each transaction were entirely executed one after the other. At this isolation level, dirty reads, nonrepeatable reads, and phantoms are not possible.
CREATE INDEX myIndex ON myTable(myColumn)What type of Index will get created after executing the above statement?
Non-clustered index. Important thing to note: By default a clustered index gets created on the primary key, unless specified otherwise.
What’s the maximum size of a row?
8060 bytes. Don’t be surprised with questions like ‘what is the maximum number of columns per table’. 1024 columns per table. Check out SQL Server books online for the page titled: "Maximum Capacity Specifications". Explain Active/Active and Active/Passive cluster configurations Hopefully you have experience setting up cluster servers. But if you don’t, at least be familiar with the way clustering works and the two clusterning configurations Active/Active and Active/Passive. SQL Server books online has enough information on this topic and there is a good white paper available on Microsoft site. Explain the architecture of SQL Server This is a very important question and you better be able to answer it if consider yourself a DBA. SQL Server books online is the best place to read about SQL Server architecture. Read up the chapter dedicated to SQL Server Architecture.
What is lock escalation?
Lock escalation is the process of converting a lot of low level locks (like row locks, page locks) into higher level locks (like table locks). Every lock is a memory structure too many locks would mean, more memory being occupied by locks. To prevent this from happening, SQL Server escalates the many fine-grain locks to fewer coarse-grain locks. Lock escalation threshold was definable in SQL Server 6.5, but from SQL Server 7.0 onwards it’s dynamically managed by SQL Server.
What’s the difference between DELETE TABLE and TRUNCATE TABLE commands?
DELETE TABLE is a logged operation, so the deletion of each row gets logged in the transaction log, which makes it slow. TRUNCATE TABLE also deletes all the rows in a table, but it won’t log the deletion of each row, instead it logs the deallocation of the data pages of the table, which makes it faster. Of course, TRUNCATE TABLE can be rolled back. TRUNCATE TABLE is functionally identical to DELETE statement with no WHERE clause: both remove all rows in the table. But TRUNCATE TABLE is faster and uses fewer system and transaction log resources than DELETE. The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table’s data, and only the page deallocations are recorded in the transaction log. TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes and so on remain. The counter used by an identity for new rows is reset to the seed for the column. If you want to retain the identity counter, use DELETE instead. If you want to remove table definition and its data, use the DROP TABLE statement. You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint; instead, use DELETE statement without a WHERE clause. Because TRUNCATE TABLE is not logged, it cannot activate a trigger. TRUNCATE TABLE may not be used on tables participating in an indexed view
Explain the storage models of OLAP
Check out MOLAP, ROLAP and HOLAP in SQL Server books online for more infomation.
What are the new features introduced in SQL Server 2000 (or the latest release of SQL Server at the time of your interview)? What changed between the previous version of SQL Server and the current version?
This question is generally asked to see how current is your knowledge. Generally there is a section in the beginning of the books online titled "What’s New", which has all such information. Of course, reading just that is not enough, you should have tried those things to better answer the questions. Also check out the section titled "Backward Compatibility" in books online which talks about the changes that have taken place in the new version.
What are constraints? Explain different types of constraints.
Constraints enable the RDBMS enforce the integrity of the database automatically, without needing you to create triggers, rule or defaults. Types of constraints: NOT NULL, CHECK, UNIQUE, PRIMARY KEY, FOREIGN KEY. For an explanation of these constraints see books online for the pages titled: "Constraints" and "CREATE TABLE", "ALTER TABLE"
What is an index? What are the types of indexes? How many clustered indexes can be created on a table? I create a separate index on each column of a table. What are the advantages and disadvantages of this approach?
Indexes in SQL Server are similar to the indexes in books. They help SQL Server retrieve the data quicker. Indexes are of two types. Clustered indexes and non-clustered indexes. When you create a clustered index on a table, all the rows in the table are stored in the order of the clustered index key. So, there can be only one clustered index per table. Non-clustered indexes have their own storage separate from the table data storage. Non-clustered indexes are stored as B-tree structures (so do clustered indexes), with the leaf level nodes having the index key and it’s row locater. The row located could be the RID or the Clustered index key, depending up on the absence or presence of clustered index on the table. If you create an index on each column of a table, it improves the query performance, as the query optimizer can choose from all the existing indexes to come up with an efficient execution plan. At the same t ime, data modification operations (such as INSERT, UPDATE, DELETE) will become slow, as every time data changes in the table, all the indexes need to be updated. Another disadvantage is that, indexes need disk space, the more indexes you have, more disk space is used.
What is RAID and what are different types of RAID configurations?
RAID stands for Redundant Array of Inexpensive Disks, used to provide fault tolerance to database servers. There are six RAID levels 0 through 5 offering different levels of performance, fault tolerance. MSDN has some information about RAID levels and for detailed information, check out the RAID advisory board’s homepage
What are the steps you will take to improve performance of a poor performing query?
This is a very open ended question and there could be a lot of reasons behind the poor performance of a query. But some general issues that you could talk about would be: No indexes, table scans, missing or out of date statistics, blocking, excess recompilations of stored procedures, procedures and triggers without SET NOCOUNT ON, poorly written query with unnecessarily complicated joins, too much normalization, excess usage of cursors and temporary tables. Some of the tools/ways that help you troubleshooting performance problems are: SET SHOWPLAN_ALL ON, SET SHOWPLAN_TEXT ON, SET STATISTICS IO ON, SQL Server Profiler, Windows NT /2000 Performance monitor, Graphical execution plan in Query Analyzer. Download the white paper on performance tuning SQL Server from Microsoft web site. Don’t forget to check out sql-server-performance.com
What are the steps you will take, if you are tasked with securing an SQL Server?
Again this is another open ended question. Here are some things you could talk about: Preferring NT authentication, using server, databse and application roles to control access to the data, securing the physical database files using NTFS permissions, using an unguessable SA password, restricting physical access to the SQL Server, renaming the Administrator account on the SQL Server computer, disabling the Guest account, enabling auditing, using multiprotocol encryption, setting up SSL, setting up firewalls, isolating SQL Server from the web server etc. Read the white paper on SQL Server security from Microsoft website. Also check out My SQL Server security best practices
What is a deadlock and what is a live lock? How will you go about resolving deadlocks?
Deadlock is a situation when two processes, each having a lock on one piece of data, attempt to acquire a lock on the other’s piece. Each process would wait indefinitely for the other to release the lock, unless one of the user processes is terminated. SQL Server detects deadlocks and terminates one user’s process. A livelock is one, where a request for an exclusive lock is repeatedly denied because a series of overlapping shared locks keeps interfering. SQL Server detects the situation after four denials and refuses further shared locks. A livelock also occurs when read transactions monopolize a table or page, forcing a write transaction to wait indefinitely. Check out SET DEADLOCK_PRIORITY and "Minimizing Deadlocks" in SQL Server books online. Also check out the article Q169960 from Microsoft knowledge base.
What is blocking and how would you troubleshoot it?
Blocking happens when one connection from an application holds a lock and a second connection requires a conflicting lock type. This forces the second connection to wait, blocked on the first. Read up the following topics in SQL Server books online: Understanding and avoiding blocking, Coding efficient transactions. Explain CREATE DATABASE syntax Many of us are used to creating databases from the Enterprise Manager or by just issuing the command: CREATE DATABAE MyDB.
But what if you have to create a database with two filegroups, one on drive C and the other on drive D with log on drive E with an initial size of 600 MB and with a growth factor of 15%?
That’s why being a DBA you should be familiar with the CREATE DATABASE syntax. Check out SQL Server books online for more information.
How to restart SQL Server in single user mode? How to start SQL Server in minimal configuration mode?
SQL Server can be started from command line, using the SQLSERVR.EXE. This EXE has some very important parameters with which a DBA should be familiar with. -m is used for starting SQL Server in single user mode and -f is used to start the SQL Server in minimal configuration mode. Check out SQL Server books online for more parameters and their explanations.
As a part of your job, what are the DBCC commands that you commonly use for database maintenance?
DBCC CHECKDB, DBCC CHECKTABLE, DBCC CHECKCATALOG, DBCC CHECKALLOC, DBCC SHOWCONTIG, DBCC SHRINKDATABASE, DBCC SHRINKFILE etc. But there are a whole load of DBCC commands which are very useful for DBAs. Check out SQL Server books online for more information.
What are statistics, under what circumstances they go out of date, how do you update them?
Statistics determine the selectivity of the indexes. If an indexed column has unique values then the selectivity of that index is more, as opposed to an index with non-unique values. Query optimizer uses these indexes in determining whether to choose an index or not while executing a query. Some situations under which you should update statistics: 1) If there is significant change in the key values in the index 2) If a large amount of data in an indexed column has been added, changed, or removed (that is, if the distribution of key values has changed), or the table has been truncated using the TRUNCATE TABLE statement and then repopulated 3) Database is upgraded from a previous version. Look up SQL Server books online for the following commands: UPDATE STATISTICS, STATS_DATE, DBCC SHOW_STATISTICS, CREATE STATISTICS, DROP STATISTICS, sp_autostats, sp_createstats, sp_updatestats
What are the different ways of moving data/databases between servers and databases in SQL Server?
There are lots of options available, you have to choose your option depending upon your requirements. Some of the options you have are: BACKUP/RESTORE, dettaching and attaching databases, replication, DTS, BCP, logshipping, INSERT…SELECT, SELECT…INTO, creating INSERT scripts to generate data.
Explain different types of BACKUPs avaialabe in SQL Server? Given a particular scenario, how would you go about choosing a backup plan?
Types of backups you can create in SQL Sever 7.0+ are Full database backup, differential database backup, transaction log backup, filegroup backup. Check out the BACKUP and RESTORE commands in SQL Server books online. Be prepared to write the commands in your interview. Books online also has information on detailed backup/restore architecture and when one should go for a particular kind of backup.
What is database replication? What are the different types of replication you can set up in SQL Server?
Replication is the process of copying/moving data between databases on the same or different servers. SQL Server supports the following types of replication scenarios: � Snapshot replication � Transactional replication (with immediate updating subscribers, with queued updating subscribers) � Merge replication See SQL Server books online for indepth coverage on replication. Be prepared to explain how different replication agents function, what are the main system tables used in replication etc.
How to determine the service pack currently installed on SQL Server?
The global variable @@Version stores the build number of the sqlservr.exe, which is used to determine the service pack installed. To know more about this process visit SQL Server service packs and versions.
What are cursors? Explain different types of cursors. What are the disadvantages of cursors? How can you avoid cursors?
Cursors allow row-by-row processing of the resultsets. Types of cursors: Static, Dynamic, Forward-only, Keyset-driven. See books online for more information. Disadvantages of cursors: Each time you fetch a row from the cursor, it results in a network roundtrip, where as a normal SELECT query makes only one roundtrip, however large the resultset is. Cursors are also costly because they require more resources and temporary storage (results in more IO operations). Further, there are restrictions on the SELECT statements that can be used with some types of cursors. Most of the times, set based operations can be used instead of cursors. Here is an example: If you have to give a flat hike to your employees using the following criteria: Salary between 30000 and 40000 — 5000 hike Salary between 40000 and 55000 — 7000 hike Salary between 55000 and 65000 — 9000 hike. In this situation many developers tend to use a cursor, determine each employee’s salary and update his salary according to the above formula. But the same can be achieved by multiple update statements or can be combined in a single UPDATE statement as shown below:
UPDATE tbl_emp SET salary = CASE WHEN salary BETWEEN 30000 AND 40000 THEN salary + 5000 WHEN salary BETWEEN 40000 AND 55000 THEN salary + 7000 WHEN salary BETWEEN 55000 AND 65000 THEN salary + 10000 END
Another situation in which developers tend to use cursors: You need to call a stored procedure when a column in a particular row meets certain condition. You don’t have to use cursors for this. This can be achieved using WHILE loop, as long as there is a unique key to identify each row. For examples of using WHILE loop for row by row processing, check out the ‘My code library’ section of my site or search for WHILE. Write down the general syntax for a SELECT statements covering all the options. Here’s the basic syntax: (Also checkout SELECT in books online for advanced syntax).
SELECT select_list [INTO new_table_] FROM table_source [WHERE search_condition] [GROUP BY group_by_expression] [HAVING search_condition] [ORDER BY order_expression [ASC DESC] ]
What is a join and explain different types of joins.
Joins are used in queries to explain how different tables are related. Joins also let you select data from a table depending upon data from another table. Types of joins: INNER JOINs, OUTER JOINs, CROSS JOINs. OUTER JOINs are further classified as LEFT OUTER JOINS, RIGHT OUTER JOINS and FULL OUTER JOINS. For more information see pages from books online titled: "Join Fundamentals" and "Using Joins".
Can you have a nested transaction?
Yes, very much. Check out BEGIN TRAN, COMMIT, ROLLBACK, SAVE TRAN and @@TRANCOUNT
What is an extended stored procedure? Can you instantiate a COM object by using T-SQL?
An extended stored procedure is a function within a DLL (written in a programming language like C, C++ using Open Data Services (ODS) API) that can be called from T-SQL, just the way we call normal stored procedures using the EXEC statement. See books online to learn how to create extended stored procedures and how to add them to SQL Server. Yes, you can instantiate a COM (written in languages like VB, VC++) object from T-SQL by using sp_OACreate stored procedure. Also see books online for sp_OAMethod, sp_OAGetProperty, sp_OASetProperty, sp_OADestroy. For an example of creating a COM object in VB and calling it from T-SQL, see ‘My code library’ section of this site.
What is the system function to get the current user’s user id?
USER_ID(). Also check out other system functions like USER_NAME(), SYSTEM_USER, SESSION_USER, CURRENT_USER, USER, SUSER_SID(), HOST_NAME().
What are triggers? How many triggers you can have on a table? How to invoke a trigger on demand?
Triggers are special kind of stored procedures that get executed automatically when an INSERT, UPDATE or DELETE operation takes place on a table. In SQL Server 6.5 you could define only 3 triggers per table, one for INSERT, one for UPDATE and one for DELETE. From SQL Server 7.0 onwards, this restriction is gone, and you could create multiple triggers per each action. But in 7.0 there’s no way to control the order in which the triggers fire. In SQL Server 2000 you could specify which trigger fires first or fires last using sp_settriggerorder. Triggers can’t be invoked on demand. They get triggered only when an associated action (INSERT, UPDATE, DELETE) happens on the table on which they are defined. Triggers are generally used to implement business rules, auditing. Triggers can also be used to extend the referential integrity checks, but wherever possible, use constraints for this purpose, instead of triggers, as constraints are much faster. Till SQL Server 7.0, triggers fire only after the data modification operation happens. So in a way, they are called post triggers. But in SQL Server 2000 you could create pre triggers also. Search SQL Server 2000 books online for INSTEAD OF triggers. Also check out books online for ‘inserted table’, ‘deleted table’ and COLUMNS_UPDATED()
There is a trigger defined for INSERT operations on a table, in an OLTP system. The trigger is written to instantiate a COM object and pass the newly insterted rows to it for some custom processing. What do you think of this implementation? Can this be implemented better?
Instantiating COM objects is a time consuming process and since you are doing it from within a trigger, it slows down the data insertion process. Same is the case with sending emails from triggers. This scenario can be better implemented by logging all the necessary data into a separate table, and have a job which periodically checks this table and does the needful.
What is a self join? Explain it with an example.
Self join is just like any other join, except that two instances of the same table will be joined in the query. Here is an example: Employees table which contains rows for normal employees as well as managers. So, to find out the managers of all the employees, you need a self join.
CREATE TABLE emp ( empid int, mgrid int, empname char(10) )
INSERT emp SELECT 1,2,’Vyas’ INSERT emp SELECT 2,3,’Mohan’ INSERT emp SELECT 3,NULL,’Shobha’ INSERT emp SELECT 4,2,’Shridhar’ INSERT emp SELECT 5,2,’Sourabh’
SELECT t1.empname [Employee], t2.empname [Manager] FROM emp t1, emp t2 WHERE t1.mgrid = t2.empid Here’s an advanced query using a LEFT OUTER JOIN that even returns the employees without managers (super bosses)
SELECT t1.empname [Employee], COALESCE(t2.empname, ‘No manager’) [Manager] FROM emp t1 LEFT OUTER JOIN emp t2 ON t1.mgrid = t2.empid
Subscribe to:
Posts (Atom)

