Welcome back to Peking University MOOC Bioinformatics: Introduction and Methods.
This video is supplementary learning materials 1: brief introduction to databases.
I’m Chen Xie from National Institute of Biological Sciences, Beijing.
Database is the collection of data.
Database management system (DBMS) is the collection of interrelated data and a set of programs to access those data.
DBMS provides a efficient, reliable, convenient and safe multi-user storage of and access to massive amounts of persistent data.
At first, the main way to keep the information on a computer is to store it in files.
However, file-processing system has lots of disadvantages: data redundancy and inconsistency;
difficulty in accessing data, data isolation, difficulty in satisfying consistency constraints,
difficulty in ensuring database consistency, concurrent access by multiple users and security problems.
DBMS can solve all the above problems.
Underlying the structure of a database is the data model,
which is a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.
The data models includes four different categories: relational model, entity-relationship model,
object-based data model, and semistructured data model.
The relational model is the most widely used data model, and will be further introduced in the next slides.
In relational model, database is a set of named relations or tables; Each relation has a set of named attributes, or columns;
Each tuple, or row, has a value for each attribute;
Each attribute has a type or domain.
This is an example for relational model. Table Pathways is shown, which contains four columns: pid, db, id, name,
which represent the pathway ID in back-end database, the name of the original database, and the ID and name in that database separately.
Each row represents the information of a pathway.
For example, the first row means a pathway, whose pid is 21, is from KEGG PATHWAY
with ID hsa03013 and name RNA transport. Each column has a specific type. For example, the type of the first column is INTEGER.
An important concept in relational model is the key.
A key can be an column whose value is unique in each row. For example, pid in Table Pathways.
A key can also be a set of columns whose combined values are unique. For example, (pid, gid) in Table PathwayGenes.
We use keys to identify specific rows, build index on key for efficiency, and refer a table to another one in the database.
When it comes to reference between tables, we wish to ensure that a value that appears in one table for a given set of columns
also appears in a certain set of columns in another table, which is called referential integrity.
For example, the pid value in Table PathwayGenes must appear in the pid column of Table Pathways.
Database modifications can cause violations of referential integrity.
When a referential-integrity constraint is violated, the normal procedure is to reject the action that caused the violation
A database system provides a data-definition language to
specify the database schema and a data-manipulation language to express database queries and updates.
In practice, the data-definition and data-manipulation languages are not two separate languages;
instead they simply form parts of a single database language, such as the widely used SQL language.
SQL provides a rich DDL that allows one to define tables, integrity constraints, assertions, etc.
For instance, the SQL DDL statement on this slide defines table Pathways.
The SQL query language is nonprocedural. A query takes as input several tables and always returns a single table.
Here are two examples of SQL queries
that find the name of the pathway with pid 21 and the name of all pathways having the gene with gid hsa:1017
Main open source database softwares include MySQL, SQLite, and PostgreSQL.
This video is only a brief introduction of basic concepts in database,
and the references are Database System Concepts, and a great MOOC Introduction to Databases.
If you are interested in further learning database, please check these two resources. Thank you very much!