libfoedus-core
FOEDUS Core Library
foedus::storage::masstree::MasstreePartitioner Class Referencefinal

Partitioner for a masstree storage. More...

Detailed Description

Partitioner for a masstree storage.

The partitioner for masstree simply designs n pairs of a low-key and node.

Low-key
Low-key marks the inclusive beginning of the partition with arbitrary length. For example,
  • (low-key: zero-length string, node=0)
  • (low-key: "e", node=1)
  • (low-key: "k", node=2)
  • (low-key: "r", node=1)
  • (low-key: "w", node=2)
This means there are 5 partitions placed in 3 nodes. "abc" would be in node 0, "z" would be in node 2, etc. We so far don't do KeySlice optimization used in the transactional-processing side. We might do that later, but let's see if log gleaner is fast enough without fancy optimization.
Design policy
Not suprisingly, the partition design logic is more complex than array and sequential. But, we still want to make partitioner/composer as simple as possible. The compromise is as follows.
One slice as partition key
For efficiency and simplicity we always use one KeySlice (8-bytes) as low-key. We initially explored arbitrary length partitiong keys, but it causes too much overhead and complexity. By using one KeySlice, all comparison is just an integer comparison and fixed-sized.
Enumerating partitions
With this simplification, partition design just checks the root page of first layer. If it's a border page, everything to node 0. If it's an intermediate page, we use separators in it as partitions. Stupidly simple. For each pointer, we determine the owner node simply based on the last updater of the pointed page, which is a rough statistics in the page header (no correctness guaranteed).
Expected issues
The scheme above is so simple and easy to implement/maintain. Of course the simplicity has its price. If all keys start with a common 8-bytes, we are screwed. If pointer-distributions are skewed, eg one pointer leads to billion records while other pointers lead to just one record, it will result in one node receiving almost all records. Nevertheless, the branch-page/partition framework above is flexible enough to address the issue later. We just need advanced algorithm to enumerate branch pages and determine owners. Let's keep it simple for now, and work on this later.
Note
This is a private implementation-details of Masstree Storage, thus file name ends with _impl. Do not include this header from a client program. There is no case client program needs to access this internal class.

Definition at line 98 of file masstree_partitioner_impl.hpp.

#include <masstree_partitioner_impl.hpp>

Public Member Functions

 MasstreePartitioner (Partitioner *parent)
 MasstreePartitioner methods. More...
 
ErrorStack design_partition (const Partitioner::DesignPartitionArguments &args)
 
bool is_partitionable () const
 
void partition_batch (const Partitioner::PartitionBatchArguments &args) const
 
void sort_batch (const Partitioner::SortBatchArguments &args) const
 

Friends

std::ostream & operator<< (std::ostream &o, const MasstreePartitioner &v)
 

Constructor & Destructor Documentation

foedus::storage::masstree::MasstreePartitioner::MasstreePartitioner ( Partitioner parent)
explicit

MasstreePartitioner methods.

Definition at line 53 of file masstree_partitioner_impl.cpp.

References foedus::storage::PartitionerMetadata::locate_data(), and foedus::storage::PartitionerMetadata::valid_.

54  : engine_(parent->get_engine()),
55  id_(parent->get_storage_id()),
56  metadata_(PartitionerMetadata::get_metadata(engine_, id_)) {
57  if (metadata_->valid_) {
58  data_ = reinterpret_cast<MasstreePartitionerData*>(metadata_->locate_data(engine_));
59  } else {
60  data_ = nullptr;
61  }
62 }
bool valid_
Whether this partitioner information (metadata+data) has been constructed.
void * locate_data(Engine *engine)
Returns the partitioner data pointed from this metadata.
Definition: partitioner.cpp:51
static PartitionerMetadata * get_metadata(Engine *engine, StorageId id)
Returns the shared memory for the given storage ID.
Definition: partitioner.cpp:38

Here is the call graph for this function:

Member Function Documentation

ErrorStack foedus::storage::masstree::MasstreePartitioner::design_partition ( const Partitioner::DesignPartitionArguments args)

Definition at line 64 of file masstree_partitioner_impl.cpp.

References foedus::storage::PartitionerMetadata::allocate_data(), ASSERT_ND, foedus::memory::AlignedMemory::assure_capacity(), CHECK_ERROR, foedus::storage::extract_numa_node_from_snapshot_pointer(), foedus::memory::AlignedMemory::get_block(), foedus::Attachable< CONTROL_BLOCK >::get_control_block(), foedus::memory::EngineMemory::get_global_volatile_page_resolver(), foedus::Engine::get_memory_manager(), foedus::Engine::get_soc_count(), foedus::storage::masstree::MasstreePage::is_border(), foedus::storage::masstree::MasstreeIntermediatePointerIterator::is_valid(), foedus::storage::masstree::kInfimumSlice, foedus::storage::kPageSize, foedus::kRetOk, foedus::storage::PartitionerMetadata::locate_data(), foedus::storage::PartitionerMetadata::mutex_, foedus::cache::SnapshotFileSet::read_page(), foedus::memory::GlobalVolatilePageResolver::resolve_offset(), foedus::storage::masstree::MasstreeStorageControlBlock::root_page_pointer_, foedus::storage::Partitioner::DesignPartitionArguments::snapshot_files_, foedus::storage::DualPagePointer::snapshot_pointer_, foedus::storage::PartitionerMetadata::valid_, foedus::storage::DualPagePointer::volatile_pointer_, foedus::storage::Partitioner::DesignPartitionArguments::work_memory_, and WRAP_ERROR_CODE.

Referenced by foedus::storage::Partitioner::design_partition().

65  {
66  MasstreeStorage storage(engine_, id_);
67  MasstreeStorageControlBlock* control_block = storage.get_control_block();
68  const memory::GlobalVolatilePageResolver& resolver
70 
71  // Read current volatile version and previous snapshot version of the root page
72  WRAP_ERROR_CODE(args.work_memory_->assure_capacity(kPageSize * 2));
73  Page* buffers = reinterpret_cast<Page*>(args.work_memory_->get_block());
74  MasstreeIntermediatePage* vol = reinterpret_cast<MasstreeIntermediatePage*>(buffers);
75  MasstreeIntermediatePage* snp = reinterpret_cast<MasstreeIntermediatePage*>(buffers + 1);
76  SnapshotPagePointer snapshot_page_id = control_block->root_page_pointer_.snapshot_pointer_;
77  MasstreeIntermediatePage* root_volatile = reinterpret_cast<MasstreeIntermediatePage*>(
78  resolver.resolve_offset(control_block->root_page_pointer_.volatile_pointer_));
79  CHECK_ERROR(read_page_safe(root_volatile, vol));
80  if (snapshot_page_id != 0) {
81  WRAP_ERROR_CODE(args.snapshot_files_->read_page(snapshot_page_id, snp));
82  }
83 
84  soc::SharedMutexScope scope(&metadata_->mutex_); // protect this metadata
85  if (metadata_->valid_) {
86  // someone has already initialized this??
87  LOG(FATAL) << "Masstree-" << id_ << " partition already designed??:" << *this;
88  }
89  WRAP_ERROR_CODE(metadata_->allocate_data(engine_, &scope, sizeof(MasstreePartitionerData)));
90  data_ = reinterpret_cast<MasstreePartitionerData*>(metadata_->locate_data(engine_));
91 
92  ASSERT_ND(!vol->is_border());
93  if (engine_->get_soc_count() == 1U) {
94  // no partitioning needed
95  data_->partition_count_ = 1;
96  data_->low_keys_[0] = kInfimumSlice;
97  data_->partitions_[0] = 0;
98  } else {
99  // simply the separators in root page is the partition keys.
100  // if we already have a snapshot page, we use the same partition keys, though
101  // assigned nodes might be changed
102  data_->partition_count_ = 0;
103  if (snapshot_page_id == 0) {
104  CHECK_ERROR(design_partition_first(vol));
105  } else {
106  // So far same assignment as previous snapshot.
107  // TODO(Hideaki) check current volatile pages to change the assignments.
108  for (MasstreeIntermediatePointerIterator it(snp); it.is_valid(); it.next()) {
109  data_->low_keys_[data_->partition_count_] = it.get_low_key();
110  SnapshotPagePointer pointer = it.get_pointer().snapshot_pointer_;
111  uint16_t assignment = extract_numa_node_from_snapshot_pointer(pointer);
112  data_->partitions_[data_->partition_count_] = assignment;
113  ++data_->partition_count_;
114  }
115  }
116  }
117 
118  metadata_->valid_ = true;
119  LOG(INFO) << "Masstree-" << id_ << std::endl << " partitions:" << *this;
120  return kRetOk;
121 }
bool valid_
Whether this partitioner information (metadata+data) has been constructed.
const KeySlice kInfimumSlice
const GlobalVolatilePageResolver & get_global_volatile_page_resolver() const
Returns the page resolver to convert volatile page ID to page pointer.
soc::SocId get_soc_count() const
Shorthand for get_options().thread_.group_count_.
Definition: engine.cpp:74
uint64_t SnapshotPagePointer
Page ID of a snapshot page.
Definition: storage_id.hpp:79
ErrorCode allocate_data(Engine *engine, soc::SharedMutexScope *locked, uint32_t data_size)
Allocates a patitioner data in shared memory of the given size.
Definition: partitioner.cpp:61
soc::SharedMutex mutex_
Serialize concurrent initialization of this partitioner.
uint8_t extract_numa_node_from_snapshot_pointer(SnapshotPagePointer pointer)
Definition: storage_id.hpp:95
#define CHECK_ERROR(x)
This macro calls x and checks its returned value.
void * locate_data(Engine *engine)
Returns the partitioner data pointed from this metadata.
Definition: partitioner.cpp:51
const ErrorStack kRetOk
Normal return value for no-error case.
#define ASSERT_ND(x)
A warning-free wrapper macro of assert() that has no performance effect in release mode even when 'x'...
Definition: assert_nd.hpp:72
#define WRAP_ERROR_CODE(x)
Same as CHECK_ERROR(x) except it receives only an error code, thus more efficient.
memory::EngineMemory * get_memory_manager() const
See Memory Manager.
Definition: engine.cpp:50
const uint16_t kPageSize
A constant defining the page size (in bytes) of both snapshot pages and volatile pages.
Definition: storage_id.hpp:45

Here is the call graph for this function:

Here is the caller graph for this function:

bool foedus::storage::masstree::MasstreePartitioner::is_partitionable ( ) const

Definition at line 280 of file masstree_partitioner_impl.cpp.

References foedus::storage::masstree::MasstreePartitionerData::partition_count_.

Referenced by foedus::storage::Partitioner::is_partitionable(), and partition_batch().

280  {
281  return data_->partition_count_ > 1U;
282 }

Here is the caller graph for this function:

void foedus::storage::masstree::MasstreePartitioner::partition_batch ( const Partitioner::PartitionBatchArguments args) const

Definition at line 284 of file masstree_partitioner_impl.cpp.

References foedus::debugging::RdtscWatch::elapsed(), foedus::storage::masstree::MasstreePartitionerData::find_partition(), foedus::storage::masstree::MasstreeCommonLogType::get_key(), is_partitionable(), foedus::storage::masstree::MasstreeCommonLogType::key_length_, foedus::storage::Partitioner::PartitionBatchArguments::log_buffer_, foedus::storage::Partitioner::PartitionBatchArguments::log_positions_, foedus::storage::Partitioner::PartitionBatchArguments::logs_count_, foedus::storage::masstree::MasstreePartitionerData::partition_count_, foedus::storage::masstree::resolve_log(), foedus::storage::Partitioner::PartitionBatchArguments::results_, and foedus::debugging::RdtscWatch::stop().

Referenced by foedus::storage::Partitioner::partition_batch().

284  {
285  debugging::RdtscWatch stop_watch;
286  if (!is_partitionable()) {
287  std::memset(args.results_, 0, sizeof(PartitionId) * args.logs_count_);
288  return;
289  }
290 
291  for (uint32_t i = 0; i < args.logs_count_; ++i) {
292  const MasstreeCommonLogType* rec = resolve_log(args.log_buffer_, args.log_positions_[i]);
293  uint16_t key_length = rec->key_length_;
294  const char* key = rec->get_key();
295  args.results_[i] = data_->find_partition(key, key_length);
296  }
297  stop_watch.stop();
298  VLOG(0) << "Masstree-:" << id_ << " took " << stop_watch.elapsed() << "cycles"
299  << " to partition " << args.logs_count_ << " entries. #partitions=" << data_->partition_count_;
300  // if these binary searches are too costly, let's optimize them.
301 }
thread::ThreadGroupId PartitionId
As partition=NUMA node, this is just a synonym of foedus::thread::ThreadGroupId.
Definition: storage_id.hpp:65
uint16_t find_partition(const char *key, uint16_t key_length) const
Returns the partition (node ID) that should contain the key.
const MasstreeCommonLogType * resolve_log(const snapshot::LogBuffer &log_buffer, snapshot::BufferPosition pos)

Here is the call graph for this function:

Here is the caller graph for this function:

void foedus::storage::masstree::MasstreePartitioner::sort_batch ( const Partitioner::SortBatchArguments args) const

Definition at line 489 of file masstree_partitioner_impl.cpp.

References foedus::storage::Partitioner::SortBatchArguments::logs_count_, foedus::storage::Partitioner::SortBatchArguments::longest_key_length_, foedus::storage::Partitioner::SortBatchArguments::shortest_key_length_, and foedus::storage::Partitioner::SortBatchArguments::written_count_.

Referenced by foedus::storage::Partitioner::sort_batch().

489  {
490  if (args.logs_count_ == 0) {
491  *args.written_count_ = 0;
492  return;
493  }
494 
495  if (args.longest_key_length_ == sizeof(KeySlice)
496  && args.shortest_key_length_ == sizeof(KeySlice)) {
497  sort_batch_8bytes(args);
498  } else {
499  sort_batch_general(args);
500  }
501 }
uint64_t KeySlice
Each key slice is an 8-byte integer.

Here is the caller graph for this function:

Friends And Related Function Documentation

std::ostream& operator<< ( std::ostream &  o,
const MasstreePartitioner v 
)
friend

Definition at line 503 of file masstree_partitioner_impl.cpp.

503  {
504  o << "<MasstreePartitioner>";
505  if (v.data_) {
506  o << "<partition_count_>" << v.data_->partition_count_ << "</partition_count_>"
507  << "<partitions>";
508  for (uint16_t i = 0; i < v.data_->partition_count_; ++i) {
509  o << std::endl << " <partition node=\"" << v.data_->partitions_[i] << "\">"
510  << assorted::Hex(v.data_->low_keys_[i], 16)
511  << "</partition>";
512  }
513  o << "</partitions>";
514  } else {
515  o << "Not yet designed";
516  }
517  o << "</MasstreePartitioner>";
518  return o;
519 }

The documentation for this class was generated from the following files: