Posts in the 'Computer Engineering/DataBase' category

Computer Engineering/DataBase

NOSQL 2013.01.21
Key Concepts of the Relational Model 2013.01.14
Database Theory 2013.01.14

NOSQL

2013. 1. 21. 14:20

ONE SIZE DOES NOT FIT ALL

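Traditional RDBMS
– Relational data model, ER diagrams, SQL
– Schema, normalization, data integrity ...
– Transactions, ACID properties ...
– Concurrency control, two-phase locking protocol (2PLP) ...
– Recovery, logging ...
Problems with traditional DBMSs
– Scalability: can Oracle be installed and managed on 10,000 machines?
– Performance: can Oracle handle more than 10,000 updates per second?
– Schema: what if the data has no fixed schema?
– If reliability is not required, could it be faster?
– If persistence is not required, could it be easier to use?
– If a complex data model is not required, could it be simpler?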

NoSQL

• "Not only SQL"

• A new kind of data storage/management system, different from a traditional RDBMS

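Pros and cons of NoSQL

• Pros
– Flexible schema
– Quick and easy installation/management
– Massive scalability
– Relaxed consistency → high performance & availability

• Cons
– No standard query language comparable to SQL → a programming model is used instead
– Relaxed consistency → ACID is not guaranteed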

Types of NoSQL

• MapReduce / Hadoop
– Large-scale data analysis (OLAP)
– Google: MapReduce + Google File System
– Apache: Hadoop + Hadoop Distributed File System

• Key-Value
– High-performance parallel hashing
– Frequent updates, fast response times (OLTP)
– Dynamo (Amazon), Cassandra (Facebook/Apache), BigTable (Google), HBase (Facebook/Apache)

• Document
– Key + document (semi-structured/unstructured data such as XML or JSON)
– MongoDB, CouchDB (Apache), SimpleDB (Amazon)

• Graph
– Node/Edge, RDF, Semantic Web
– e.g. Neo4j, Allegro

MapReduce

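• Map: split the problem into sub-problems and solve them in a distributed way
– map(item) → <key, value>

• Reduce: combine the results for each sub-problem
– reduce(key, <list of values>) → value

• Example: counting how often each word occurs in a book
– Map: hand each child one page and have them report <word, count> pairs written on sticky notes.
– Reduce: collect the sticky notes by A/B/C/D/E... and add up the numbers in each group.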
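The word-count example can be sketched in a few lines of Python. This is a single-process illustration only: the function names (map_page, shuffle, reduce_counts, word_count) and the sample pages are made up for this sketch, and a real framework such as Hadoop would run the map and reduce calls on many machines.

```python
from collections import defaultdict

def map_page(page_text):
    """Map: emit a (word, 1) pair for every word on one page."""
    return [(word.lower(), 1) for word in page_text.split()]

def shuffle(pairs):
    """Group the emitted values by key (done by the framework between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(word, counts):
    """Reduce: add up every count reported for one word."""
    return sum(counts)

def word_count(pages):
    # Each page could be mapped by a different worker; here it is a plain loop.
    emitted = [pair for page in pages for pair in map_page(page)]
    return {word: reduce_counts(word, counts)
            for word, counts in shuffle(emitted).items()}

pages = ["the cat sat", "the dog sat", "the end"]  # hypothetical three-page "book"
print(word_count(pages))
```

The shuffle step plays the role of sorting the sticky notes into piles before anyone starts adding.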

MapReduce frameworks

• Apache Hive (Facebook/Netflix)
– Hadoop + data warehouse
– Supports HiveQL (an SQL-like query language)

• Apache Pig (Yahoo)
– Pig Latin: High-level language for Hadoop

• Apache ZooKeeper
– Configuration sharing, event handling, and distributed coordination in distributed environments
– open source centralized configuration service and naming registry for large distributed systems

Key/Value

• Simple data model: (key, value)

• Simple operations: put, get, update, delete

• Advantages
– Efficiency: fast processing
– Scalability: servers can easily be added as needed
– Fault-tolerance: data replication
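A minimal sketch of the four operations, with replication reduced to its simplest form. It assumes each "server" is just a Python dict inside one process. The class name KeyValueStore and the write-to-all-replicas policy are illustrative choices, not how any particular product works.

```python
class KeyValueStore:
    """Toy in-memory key-value store that copies every write to all replicas."""

    def __init__(self, replicas=3):
        # Each replica is a plain dict standing in for a separate server.
        self.replicas = [dict() for _ in range(replicas)]

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def get(self, key):
        # Serve the read from the first replica that still has the key.
        for replica in self.replicas:
            if key in replica:
                return replica[key]
        return None

    def update(self, key, value):
        if not any(key in replica for replica in self.replicas):
            raise KeyError(key)
        self.put(key, value)

    def delete(self, key):
        for replica in self.replicas:
            replica.pop(key, None)

store = KeyValueStore()
store.put("user:1", "Kim")
store.update("user:1", "Lee")
store.replicas[0].clear()    # simulate losing one server
print(store.get("user:1"))   # the value survives on another replica
```

Because every operation touches a single key, no joins or multi-row transactions are needed, which is what keeps such stores fast and easy to spread across servers.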

DHT: Distributed Hash Tables

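• Hash
– put(key, value)
– value ← get(key)

• Distributed
– Data is spread across arbitrary nodes
– Nodes can be added and removed freely

• Representative implementation: Chord
– Ring topology
– m-bit keys and node IDs
– Routing table of size O(log N)
– Data is guaranteed to be reached within O(log N) hops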
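The ring idea can be sketched as follows. This is a simplified, single-process model in the spirit of Chord: keys and node names are hashed to m-bit identifiers, and each key is stored on the first node clockwise from its identifier. It does not implement Chord's finger tables, so it shows placement and node joins only, not O(log N) routing. M = 16, the Ring class, and the node names are assumptions made for the example.

```python
import bisect
import hashlib

M = 16  # bits in the identifier space (illustrative choice)

def ring_id(name):
    """Hash a string to an m-bit identifier on the ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

class Ring:
    def __init__(self):
        self.node_ids = []   # sorted node identifiers
        self.data = {}       # node id -> {key: value} held by that node

    def successor(self, ident):
        """First node at or clockwise after ident, wrapping around the ring."""
        i = bisect.bisect_left(self.node_ids, ident)
        return self.node_ids[i % len(self.node_ids)]

    def add_node(self, name):
        nid = ring_id(name)
        if nid in self.data:          # identifier already taken
            return nid
        bisect.insort(self.node_ids, nid)
        self.data[nid] = {}
        # Only the next node clockwise can hold keys that now belong to nid.
        nxt = self.successor((nid + 1) % (2 ** M))
        if nxt != nid:
            for key in list(self.data[nxt]):
                if self.successor(ring_id(key)) == nid:
                    self.data[nid][key] = self.data[nxt].pop(key)
        return nid

    def put(self, key, value):
        self.data[self.successor(ring_id(key))][key] = value

    def get(self, key):
        return self.data[self.successor(ring_id(key))].get(key)

ring = Ring()
for name in ["node-a", "node-b", "node-c"]:
    ring.add_node(name)
ring.put("apple", 1)
ring.add_node("node-d")      # a join moves only the keys node-d now owns
print(ring.get("apple"))
```

The point of the ring layout is visible in add_node: a joining node takes keys from exactly one neighbour, so the rest of the system is undisturbed.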

BSP (Bulk Synchronous Parallel)

• A computation model for parallel computing

Pregel: A System for Large-Scale Graph Processing, SIGMOD 2010 by Google
Apache Hama: http://hama.apache.org/
Apache Giraph: http://incubator.apache.org/giraph/

NewSQL (?)

VoltDB by Michael Stonebraker
– Automatic partitioning across a shared nothing server cluster
– Main memory data architecture
– Elimination of multi-threading and locking overhead
– Built-in High Availability using synchronous multi-master replication
– Review of VoltDB's stored procedure interface
Vertica (HP) by Michael Stonebraker
– Real-Time Loading & Querying
– Advanced In-Database Analytics
– Columnar Storage & Execution
– Aggressive Data Compression
– Scale-Out MPP Architecture
– Automatic High Availability
– Optimizer, Execution Engine & Workload Management
– Native BI, ETL, & Hadoop/MapReduce Integration

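Analytics-only DBMS

Traditional DBMS	Analytics-only DBMS
Concurrency control	No concurrency
Row-based storage	Column-based storage
Centralized server	Parallel/distributed processing
Strong transaction level	Weakened transaction level
Disk-based	Main-memory-based

Reference: Key Features of EXASOL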

Relational database management system
Standard hardware cluster
In-memory query processing
Massively parallel data processing
Column by column storage
Intelligent and innovative compression algorithms
Self-learning and self-optimising system
Simple integration thanks to standard interfaces

Column-Oriented DBMS

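• Data is stored column by column

Customer ID	Name	Address	Age	Phone
CE1	Park Hyun-min	Seoul	24	....
CE2	Lee Kang-sun	Daejeon	20	....
CE3	Hong Gil-dong	Seoul	18	....

• Benefits
    – Well suited to analysis over individual columns
    – Compresses well → easier to process in main memory
    – Favors parallel processing (SIMD: Single Instruction, Multiple Data)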
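To make the first two benefits concrete, here is a small sketch using the ID, address, and age values from the table above. The names and phone numbers are left out. The column names, the helper functions, and the choice of run-length encoding as the compression scheme are assumptions for illustration. Real column stores use several encodings.

```python
# Row-oriented layout: one tuple per customer.
rows = [
    ("CE1", "Seoul", 24),
    ("CE2", "Daejeon", 20),
    ("CE3", "Seoul", 18),
]

# Column-oriented layout: one list per attribute.
columns = {
    "customer_id": [r[0] for r in rows],
    "address": [r[1] for r in rows],
    "age": [r[2] for r in rows],
}

def average(values):
    return sum(values) / len(values)

def run_length_encode(values):
    """Compress consecutive repeats into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

# An aggregate over one attribute reads only that column's list.
print(average(columns["age"]))

# Values of the same type sit next to each other, so repeats compress well.
print(run_length_encode(sorted(columns["address"])))
```

In the row layout, the same average would have to step over every customer's ID and address just to reach the age field.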


Key Concepts of the Relational Model

2013. 1. 14. 16:45

Domain (type): the set of values an attribute may take
Attribute (column)
Tuple (row, record): set of values for attributes
Relation (table): set of tuples

• Database: set of relations

Schema & Instance

• Schema
– the logical structure of the database
– analogous to the type information of a variable in a program
– Physical schema: database design at the physical level
– Logical schema: database design at the logical level

• Instance
– actual contents at a particular point in time
– analogous to the value of a variable

NULL

• special value for "unknown" or "undefined"
• Not the same as the number 0, the empty string "", and so on
• Every domain includes the NULL value
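A quick way to see that NULL is not 0 or the empty string is to ask a SQL engine. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The helper name scalar is made up for this example. SQL NULL comes back to Python as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a query that returns one value and hand that value back."""
    return conn.execute(sql).fetchone()[0]

# Comparing anything with NULL yields NULL ("unknown"), not true or false.
print(scalar("SELECT NULL = 0"))      # None
print(scalar("SELECT NULL = ''"))     # None
print(scalar("SELECT NULL = NULL"))   # None

# IS NULL is the test that actually detects a missing value.
print(scalar("SELECT NULL IS NULL"))  # 1
```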

Key

• Key: a set of attributes used to distinguish tuples
– A relation cannot contain two identical tuples

• Superkey
– A set of attributes whose values uniquely identify each tuple in a relation

• Candidate Key
– A superkey that is minimal
– Minimal: removing any one attribute means it is no longer a key

• Primary Key (PK)
– One of the candidate keys (chosen when the relation is defined)
– Entity Integrity: cannot be NULL

• Foreign Key (FK)
– An attribute that references another relation
– It need not be a key in the referencing relation, but it is the primary key of the referenced relation.
– Referential Integrity: its value must either exist among the PK values of the referenced relation or be NULL
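Entity integrity and referential integrity can be tried out with Python's built-in sqlite3 module. The department/employee tables and the try_insert helper are hypothetical examples, not part of the original notes. Two SQLite-specific details are assumed here: foreign keys are only checked after PRAGMA foreign_keys = ON, and NOT NULL is written out on the primary key columns so that NULL keys are rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE department (
        dept_id TEXT PRIMARY KEY NOT NULL,
        name    TEXT
    );
    CREATE TABLE employee (
        emp_id  TEXT PRIMARY KEY NOT NULL,
        dept_id TEXT REFERENCES department(dept_id)  -- foreign key, may be NULL
    );
    INSERT INTO department VALUES ('D1', 'Sales');
""")

def try_insert(emp_id, dept_id):
    """Insert one employee; report whether the DBMS accepted the row."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO employee VALUES (?, ?)", (emp_id, dept_id))
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

results = [
    try_insert("E1", "D1"),   # FK value exists in department
    try_insert("E2", None),   # NULL FK is allowed
    try_insert("E3", "D9"),   # no such department: referential integrity
    try_insert("E1", "D1"),   # duplicate PK: key uniqueness
    try_insert(None, "D1"),   # NULL PK: entity integrity
]
print(results)  # ['ok', 'ok', 'rejected', 'rejected', 'rejected']
```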


Database Theory

2013. 1. 14. 14:24

DATABASE ?

- A collection of operational data that is integrated and stored so that the various application systems of an organization can share it

What is a DBMS?

A computer system for managing databases
– Enterprise-wide information management
– A collection of related data
– A set of programs that access the data
– An environment for efficient and convenient use
Examples of DBMS applications:
– Banking: all transactions
– Airlines: reservations, schedules
– Universities: registration, grades
– Sales: customers, products, purchases
– Online retailers: order tracking, customized recommendations
– Manufacturing: production, inventory, orders, supply chain
– Human resources: employee records, salaries, tax deductions

DBMS characteristics

- Massive

- Persistent

- Safe

- Multi-user

- Convenient

- Efficient

- Reliable

Key People

- DBMS implementer

- Database designer

- Database application developer

- Database administrator

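Problems with file systems

• Data redundancy and consistency problems
– Multiple file formats, duplication of information in different files

• Difficulty of accessing data
– A separate program has to be written for each task
– Each task may need its own access method

• Data dependency
– The data format and access method are tied to the program code.
– Changing the program, or the format or kind of data, becomes practically impossible

• Data isolation
– What if several programs modify the data at the same time?
– One modification can affect other operations

• Atomicity of updates
– What if the system fails in the middle of a sequence of operations?
– Example: during a transfer, money has left my account, but the power goes out before it reaches the other account.

• Concurrency control
– When several sequences of operations run at the same time, can correct execution be guaranteed? (a consistency problem)
– Example: two people try to withdraw money from the same account at the same time.

• Data integrity
– Integrity constraints (e.g. account balance > 0) are written into the program code
– This complicates the program code and makes it harder to maintain
– Changing or adding constraints is difficult.

• Security
– Security is hard to guarantee: many files, many access paths, many programs in use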
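A DBMS addresses the atomicity and integrity problems by moving them out of application code. The sketch below, using Python's built-in sqlite3 module, declares the balance > 0 rule once as a CHECK constraint and wraps the transfer in a transaction. The account table, the starting balances, and the fail_midway flag that simulates the power failure are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (
        id      TEXT PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance > 0)  -- integrity rule lives in the DB
    );
    INSERT INTO account VALUES ('A', 100), ('B', 50);
""")

def transfer(src, dst, amount, fail_midway=False):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            if fail_midway:
                raise RuntimeError("simulated power failure")
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
        return True
    except (RuntimeError, sqlite3.IntegrityError):
        return False

def balances():
    return dict(conn.execute("SELECT id, balance FROM account"))

print(transfer("A", "B", 30, fail_midway=True), balances())  # withdrawal is undone
print(transfer("A", "B", 500), balances())                   # CHECK rejects overdraft
print(transfer("A", "B", 30), balances())                    # normal transfer
```

With plain files, each program that touches the accounts would have to re-implement both the rollback logic and the balance check.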

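• DBMS benchmark site: http://www.tpc.org
– Performance benchmarks for database systems
– DBMS + H/W System + Middle Ware System ...
– Categories

• TPC-C: transaction processing systems (OLTP)
– Top result as of January 2013: about 10 billion KRW, 30 million transactions per minute

• TPC-H: decision support systems (OLAP)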
