苏州至达州汽车:Perspectives - Google Megastore: The Data Eng...

来源:百度文库 编辑:九乡新闻网 时间:2024/04/19 14:36:30
Google Megastore: The Data Engine Behind GAE

Megastore is the data engine supporting the Google Application Engine. It’s a scalable structured data store providing full ACID semantics within partitions but lower consistency guarantees across partitions.

 

I wrote up some notes on it back in 2008 Under the Covers of the App Engine Datastore and posted Phil Bernstein’s excellent notes from a 2008 SIGMOD talk: Google Megastore. But there has been remarkably little written about this datastore over the intervening couple of years until this year’s CIDR conference papers were posted. CIDR 2011 includes Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

 

My rough notes from the paper:

·         Megastore is built upon BigTable

·         Bigtable supports fault-tolerant storage within a single datacenter

·         Synchronous replication based upon Paxos and optimized for long distance inter-datacenter links

·         Partitioned into a vast space of small databases each with its own replicated log

·         Each log stored across a Paxos cluster

·         Because they are so aggressively partitioned, each Paxos group only has to accept logs for operations on a small partition. However, the design does serialize updates on each partition

·         3 billion writes and 20 billion read transactions per day

·         Support for consistency unusual for a NoSQL database but driven by (what I believe to be) the correct belief that inconsistent updates make many applications difficult to write (see I Love Eventual Consistency but …)

·         Data Model:

·         The data model is declared in a strongly typed schema

·         There are potentially many tables per schema

·         There are potentially many entities per table

·         There are potentially many strongly typed properties per entity

·         Repeating properties are allowed

·         Tables can be arranged hierarchically where child tables point to root tables

·         Megastore tables are either entity group root tables or child tables

·         The root table and all child tables are stored in the same entity group

·         Secondary indexes are supported

·         Local secondary indexes index a specific entity group and are maintained consistently

·         Global secondary indexes index across entity groups are asynchronously updates and eventually consistent

·         Repeated indexes: supports indexing repeated values (e.g. photo tags)

·         Inline indexes provide a way to denormalize data from source entities into a related target entity as a virtual repeated column.

·         There are physical storage hints:

·         “IN TABLE” directs Megastore to store two tables in the same underlying BigTable

·         “SCATTER” attribute prepends a 2 byte hash to each key to cool hot spots on tables with monotonically increasing values like dates (e.g. a history table).

·         “STORING” clause on an index supports index-only-access by redundantly storing additional data in an index. This avoids the double access often required of doing a secondary index lookup to find the correct entity and then selecting the correct properties from that entity through a second table access. By pulling values up into the secondary index, the base table doesn’t need to be accessed to obtain these properties.

·         3 levels of read consistency:

·         Current: Last committed value

·         Snapshot: Value as of start of the read transaction

·         Inconsistent reads: used for cross entity group reads

·         Update path:

·         Transaction writes its mutations to the entity groups write-ahead log and then apply the mutations to the data (write ahead logging).

·         Write transaction always begins with a current read to determine the next available log position. The commit operation gathers mutations into a log entry, assigns an increasing timestamp, and appends to log which is maintained using paxos.

·         Update rates within a entity group are seriously limited by:

·         When there is log contention, one wins and the rest fail and must be retried.

·         Paxos only accepts a very limited update rate (order 10^2 updates per second).

·         Paper reports that “limiting updates within an entity group to a few writes per second per entity group yields insignificant write conflicts”

·         Implication: programmers must shard aggressively to get even moderate update rates and consistent update across shards is only supported using two phase commit which is not recommended.

·         Cross entity group updates are supported by:

·         Two-phase commit with the fragility that it brings

·         Queueing and asynchronously applying the changes

·         Excellent support for backup and redundancy:

·         Synchronous replication to protect against media failure

·         Snapshots and incremental log backups

 

Overall, an excellent paper with lots of detail on a nicely executed storage system. Supporting consistent read and full ACID update semantics is impressive although the limitation of not being able to update an entity group at more than a “few per second” is limiting.

 

The paper: http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf.

 

Thanks to Zhu Han, Reto Kramer, and Chris Newcombe for all sending this paper my way.