libfoedus-core
FOEDUS Core Library
foedus::storage::array::ArrayPartitioner Class Referencefinal

Partitioner for an array storage. More...

Detailed Description

Partitioner for an array storage.

There are a few options to implement partitioning for an array with trade-offs between simplicity/efficiency and accuracy/flexibility.

Current policy
So far our choice prefers simplicity/efficiency. We split the whole range of the array into kInteriorFanout buckets and assign the partition based on who currently holds the page under the root page. Designing this policy is extremely simple; we just take a look at the root page of this storage and sees the volatile pointer's NUMA node.
Balancing policy
We so far balance the partition assignments so that no partitition receives more than average buckets where average is buckets/partitions. The excessive bucket is given to needy ones that do not have enough buckets.
Limitations of current policy
Of course this simple policy has some issue. One issue is that if the root page has direct children fewer than the number of partitions, some partition does not receive any bucket even if there are many more indirect children. That doesn't happen so often, though. We outputs warnings if this happens.
Alternative policy
Another choice we considered was a vector of ArrayRange in an arbitrary length over which we do binary search. However, this is more expensive. For a simple data structure like array, it might not pay off.
Note
This is a private implementation-details of Array Storage, thus file name ends with _impl. Do not include this header from a client program. There is no case client program needs to access this internal class.

Definition at line 72 of file array_partitioner_impl.hpp.

#include <array_partitioner_impl.hpp>

Public Member Functions

 ArrayPartitioner (Partitioner *parent)
 
ErrorStack design_partition (const Partitioner::DesignPartitionArguments &args)
 
bool is_partitionable () const
 
void partition_batch (const Partitioner::PartitionBatchArguments &args) const
 
void sort_batch (const Partitioner::SortBatchArguments &args) const
 
ArrayOffset get_array_size () const
 
uint8_t get_array_levels () const
 
const PartitionIdget_bucket_owners () const
 

Friends

std::ostream & operator<< (std::ostream &o, const ArrayPartitioner &v)
 

Constructor & Destructor Documentation

foedus::storage::array::ArrayPartitioner::ArrayPartitioner ( Partitioner parent)
explicit

Definition at line 43 of file array_partitioner_impl.cpp.

References ASSERT_ND, foedus::soc::SharedMutex::is_initialized(), foedus::storage::PartitionerMetadata::locate_data(), foedus::storage::PartitionerMetadata::mutex_, and foedus::storage::PartitionerMetadata::valid_.

44  : engine_(parent->get_engine()),
45  id_(parent->get_storage_id()),
46  metadata_(PartitionerMetadata::get_metadata(engine_, id_)) {
47  ASSERT_ND(metadata_->mutex_.is_initialized());
48  if (metadata_->valid_) {
49  data_ = reinterpret_cast<ArrayPartitionerData*>(metadata_->locate_data(engine_));
50  } else {
51  data_ = nullptr;
52  }
53 }
bool valid_
Whether this partitioner information (metadata+data) has been constructed.
soc::SharedMutex mutex_
Serialize concurrent initialization of this partitioner.
void * locate_data(Engine *engine)
Returns the partitioner data pointed from this metadata.
Definition: partitioner.cpp:51
#define ASSERT_ND(x)
A warning-free wrapper macro of assert() that has no performance effect in release mode even when 'x'...
Definition: assert_nd.hpp:72
bool is_initialized() const
static PartitionerMetadata * get_metadata(Engine *engine, StorageId id)
Returns the shared memory for the given storage ID.
Definition: partitioner.cpp:38

Here is the call graph for this function:

Member Function Documentation

ErrorStack foedus::storage::array::ArrayPartitioner::design_partition ( const Partitioner::DesignPartitionArguments args)

Definition at line 73 of file array_partitioner_impl.cpp.

References foedus::storage::PartitionerMetadata::allocate_data(), foedus::storage::array::ArrayPartitionerData::array_levels_, ASSERT_ND, foedus::storage::Storage< CONTROL_BLOCK >::exists(), foedus::storage::extract_numa_node_from_snapshot_pointer(), foedus::storage::array::ArrayStorage::get_array_size(), foedus::Attachable< CONTROL_BLOCK >::get_control_block(), foedus::memory::EngineMemory::get_global_volatile_page_resolver(), foedus::storage::array::ArrayPage::get_interior_record(), foedus::storage::array::ArrayStorage::get_levels(), foedus::Engine::get_memory_manager(), foedus::storage::VolatilePagePointer::get_numa_node(), foedus::Engine::get_options(), foedus::Engine::get_soc_count(), foedus::thread::ThreadOptions::group_count_, foedus::soc::SharedMutex::is_initialized(), foedus::storage::array::ArrayPage::is_leaf(), foedus::storage::VolatilePagePointer::is_null(), foedus::storage::array::kInteriorFanout, foedus::kRetOk, foedus::storage::PartitionerMetadata::locate_data(), foedus::storage::PartitionerMetadata::mutex_, foedus::memory::GlobalVolatilePageResolver::resolve_offset(), foedus::storage::DualPagePointer::snapshot_pointer_, foedus::EngineOptions::thread_, foedus::storage::PartitionerMetadata::valid_, foedus::storage::DualPagePointer::volatile_pointer_, and WRAP_ERROR_CODE.

Referenced by foedus::storage::Partitioner::design_partition().

74  {
75  ASSERT_ND(metadata_->mutex_.is_initialized());
76  ASSERT_ND(data_ == nullptr);
77  ArrayStorage storage(engine_, id_);
78  ASSERT_ND(storage.exists());
79  ArrayStorageControlBlock* control_block = storage.get_control_block();
80 
81  soc::SharedMutexScope mutex_scope(&metadata_->mutex_);
82  ASSERT_ND(!metadata_->valid_);
83  WRAP_ERROR_CODE(metadata_->allocate_data(engine_, &mutex_scope, sizeof(ArrayPartitionerData)));
84  data_ = reinterpret_cast<ArrayPartitionerData*>(metadata_->locate_data(engine_));
85 
86  data_->array_levels_ = storage.get_levels();
87  data_->array_size_ = storage.get_array_size();
88 
89  if (storage.get_levels() == 1U || engine_->get_soc_count() == 1U) {
90  // No partitioning needed.
91  data_->bucket_owners_[0] = 0;
92  data_->partitionable_ = false;
93  data_->bucket_size_ = data_->array_size_;
94  metadata_->valid_ = true;
95  return kRetOk;
96  }
97 
98  data_->partitionable_ = true;
99  ASSERT_ND(storage.get_levels() >= 2U);
100 
101  // bucket size is interval that corresponds to a direct child of root.
102  // eg) levels==2 : leaf, levels==3: leaf*kInteriorFanout, ...
103  data_->bucket_size_ = control_block->route_finder_.get_records_in_leaf();
104  for (uint32_t level = 1; level < storage.get_levels() - 1U; ++level) {
105  data_->bucket_size_ *= kInteriorFanout;
106  }
107 
108  const memory::GlobalVolatilePageResolver& resolver
110  // root page is guaranteed to have volatile version.
111  ArrayPage* root_page = reinterpret_cast<ArrayPage*>(
112  resolver.resolve_offset(control_block->root_page_pointer_.volatile_pointer_));
113  ASSERT_ND(!root_page->is_leaf());
114 
115  // how many direct children does this root page have?
116  uint16_t direct_children = storage.get_array_size() / data_->bucket_size_ + 1U;
117  if (storage.get_array_size() % data_->bucket_size_ != 0) {
118  ++direct_children;
119  }
120 
121  // do we have enough direct children? if not, some partition will not receive buckets.
122  // Although it's not a critical error, let's log it as an error.
123  uint16_t total_partitions = engine_->get_options().thread_.group_count_;
124  ASSERT_ND(total_partitions > 1U); // if not, why we are coming here. it's a waste.
125 
126  if (direct_children < total_partitions) {
127  LOG(ERROR) << "Warning-like error: This array doesn't have enough direct children in root"
128  " page to assign partitions. #partitions=" << total_partitions << ", #direct children="
129  << direct_children << ". array=" << storage;
130  }
131 
132  // two paths. first path simply sees volatile/snapshot pointer and determines owner.
133  // second path addresses excessive assignments, off loading them to needy ones.
134  std::vector<uint16_t> counts(total_partitions, 0);
135  const uint16_t excessive_count = (direct_children / total_partitions) + 1;
136  std::vector<uint16_t> excessive_children;
137  for (uint16_t child = 0; child < direct_children; ++child) {
138  const DualPagePointer &pointer = root_page->get_interior_record(child);
139  PartitionId partition;
140  if (!pointer.volatile_pointer_.is_null()) {
141  partition = pointer.volatile_pointer_.get_numa_node();
142  } else {
143  // if no volatile page, see snapshot page owner.
144  partition = extract_numa_node_from_snapshot_pointer(pointer.snapshot_pointer_);
145  // this ignores the case where neither snapshot/volatile page is there.
146  // however, as we create all pages at ArrayStorage::create(), this so far never happens.
147  }
148  ASSERT_ND(partition < total_partitions);
149  if (counts[partition] >= excessive_count) {
150  excessive_children.push_back(child);
151  } else {
152  ++counts[partition];
153  data_->bucket_owners_[child] = partition;
154  }
155  }
156 
157  // just add it to the one with least assignments.
158  // a stupid loop, but this part won't be a bottleneck (only 250 elements).
159  for (uint16_t child : excessive_children) {
160  PartitionId most_needy = 0;
161  for (PartitionId partition = 1; partition < total_partitions; ++partition) {
162  if (counts[partition] < counts[most_needy]) {
163  most_needy = partition;
164  }
165  }
166 
167  ++counts[most_needy];
168  data_->bucket_owners_[child] = most_needy;
169  }
170 
171  metadata_->valid_ = true;
172  return kRetOk;
173 }
bool valid_
Whether this partitioner information (metadata+data) has been constructed.
thread::ThreadGroupId PartitionId
As partition=NUMA node, this is just a synonym of foedus::thread::ThreadGroupId.
Definition: storage_id.hpp:65
const GlobalVolatilePageResolver & get_global_volatile_page_resolver() const
Returns the page resolver to convert volatile page ID to page pointer.
const EngineOptions & get_options() const
Definition: engine.cpp:39
soc::SocId get_soc_count() const
Shorthand for get_options().thread_.group_count_.
Definition: engine.cpp:74
ErrorCode allocate_data(Engine *engine, soc::SharedMutexScope *locked, uint32_t data_size)
Allocates a patitioner data in shared memory of the given size.
Definition: partitioner.cpp:61
soc::SharedMutex mutex_
Serialize concurrent initialization of this partitioner.
uint8_t extract_numa_node_from_snapshot_pointer(SnapshotPagePointer pointer)
Definition: storage_id.hpp:95
uint16_t group_count_
Number of ThreadGroup in the engine.
thread::ThreadOptions thread_
void * locate_data(Engine *engine)
Returns the partitioner data pointed from this metadata.
Definition: partitioner.cpp:51
const ErrorStack kRetOk
Normal return value for no-error case.
const uint16_t kInteriorFanout
Max number of entries in an interior page of array storage.
Definition: array_id.hpp:110
#define ASSERT_ND(x)
A warning-free wrapper macro of assert() that has no performance effect in release mode even when 'x'...
Definition: assert_nd.hpp:72
#define WRAP_ERROR_CODE(x)
Same as CHECK_ERROR(x) except it receives only an error code, thus more efficient.
bool is_initialized() const
memory::EngineMemory * get_memory_manager() const
See Memory Manager.
Definition: engine.cpp:50

Here is the call graph for this function:

Here is the caller graph for this function:

uint8_t foedus::storage::array::ArrayPartitioner::get_array_levels ( ) const

Definition at line 60 of file array_partitioner_impl.cpp.

References foedus::storage::array::ArrayPartitionerData::array_levels_, and ASSERT_ND.

60  {
61  ASSERT_ND(data_);
62  return data_->array_levels_;
63 }
#define ASSERT_ND(x)
A warning-free wrapper macro of assert() that has no performance effect in release mode even when 'x'...
Definition: assert_nd.hpp:72
ArrayOffset foedus::storage::array::ArrayPartitioner::get_array_size ( ) const

Definition at line 55 of file array_partitioner_impl.cpp.

References foedus::storage::array::ArrayPartitionerData::array_size_, and ASSERT_ND.

Referenced by partition_batch().

55  {
56  ASSERT_ND(data_);
57  return data_->array_size_;
58 }
ArrayOffset array_size_
Size of the entire array.
#define ASSERT_ND(x)
A warning-free wrapper macro of assert() that has no performance effect in release mode even when 'x'...
Definition: assert_nd.hpp:72

Here is the caller graph for this function:

const PartitionId * foedus::storage::array::ArrayPartitioner::get_bucket_owners ( ) const

Definition at line 64 of file array_partitioner_impl.cpp.

References ASSERT_ND, and foedus::storage::array::ArrayPartitionerData::bucket_owners_.

Referenced by partition_batch().

64  {
65  ASSERT_ND(data_);
66  return data_->bucket_owners_;
67 }
PartitionId bucket_owners_[kInteriorFanout]
partition of each bucket.
#define ASSERT_ND(x)
A warning-free wrapper macro of assert() that has no performance effect in release mode even when 'x'...
Definition: assert_nd.hpp:72

Here is the caller graph for this function:

bool foedus::storage::array::ArrayPartitioner::is_partitionable ( ) const

Definition at line 68 of file array_partitioner_impl.cpp.

References ASSERT_ND, and foedus::storage::array::ArrayPartitionerData::partitionable_.

Referenced by foedus::storage::Partitioner::is_partitionable(), and partition_batch().

68  {
69  ASSERT_ND(data_);
70  return data_->partitionable_;
71 }
bool partitionable_
if false, every record goes to node-0.
#define ASSERT_ND(x)
A warning-free wrapper macro of assert() that has no performance effect in release mode even when 'x'...
Definition: assert_nd.hpp:72

Here is the caller graph for this function:

void foedus::storage::array::ArrayPartitioner::partition_batch ( const Partitioner::PartitionBatchArguments args) const

Definition at line 175 of file array_partitioner_impl.cpp.

References ASSERT_ND, foedus::storage::array::ArrayPartitionerData::bucket_size_, foedus::assorted::ConstDiv::div64(), get_array_size(), get_bucket_owners(), foedus::log::BaseLogType::header_, is_partitionable(), foedus::storage::array::kInteriorFanout, foedus::log::kLogCodeArrayIncrement, foedus::log::kLogCodeArrayOverwrite, foedus::storage::Partitioner::PartitionBatchArguments::log_buffer_, foedus::storage::Partitioner::PartitionBatchArguments::log_positions_, foedus::log::LogHeader::log_type_code_, foedus::storage::Partitioner::PartitionBatchArguments::logs_count_, foedus::storage::array::ArrayCommonUpdateLogType::offset_, foedus::snapshot::LogBuffer::resolve(), foedus::storage::Partitioner::PartitionBatchArguments::results_, and foedus::log::LogHeader::storage_id_.

Referenced by foedus::storage::Partitioner::partition_batch().

175  {
176  if (!is_partitionable()) {
177  std::memset(args.results_, 0, sizeof(PartitionId) * args.logs_count_);
178  return;
179  }
180 
181  ASSERT_ND(data_->bucket_size_ > 0);
182  assorted::ConstDiv bucket_size_div(data_->bucket_size_);
183  for (uint32_t i = 0; i < args.logs_count_; ++i) {
184  const ArrayCommonUpdateLogType *log = reinterpret_cast<const ArrayCommonUpdateLogType*>(
185  args.log_buffer_.resolve(args.log_positions_[i]));
186  ASSERT_ND(log->header_.log_type_code_ == log::kLogCodeArrayOverwrite
187  || log->header_.log_type_code_ == log::kLogCodeArrayIncrement);
188  ASSERT_ND(log->header_.storage_id_ == id_);
189  ASSERT_ND(log->offset_ < get_array_size());
190  uint64_t bucket = bucket_size_div.div64(log->offset_);
191  ASSERT_ND(bucket < kInteriorFanout);
192  args.results_[i] = get_bucket_owners()[bucket];
193  }
194 }
thread::ThreadGroupId PartitionId
As partition=NUMA node, this is just a synonym of foedus::thread::ThreadGroupId.
Definition: storage_id.hpp:65
ArrayOffset bucket_size_
bucket = offset / bucket_size_.
0x0023 : foedus::storage::array::ArrayIncrementLogType .
Definition: log_type.hpp:116
0x0022 : foedus::storage::array::ArrayOverwriteLogType .
Definition: log_type.hpp:115
const uint16_t kInteriorFanout
Max number of entries in an interior page of array storage.
Definition: array_id.hpp:110
#define ASSERT_ND(x)
A warning-free wrapper macro of assert() that has no performance effect in release mode even when 'x'...
Definition: assert_nd.hpp:72

Here is the call graph for this function:

Here is the caller graph for this function:

void foedus::storage::array::ArrayPartitioner::sort_batch ( const Partitioner::SortBatchArguments args) const

Definition at line 312 of file array_partitioner_impl.cpp.

References ASSERT_ND, foedus::memory::AlignedMemory::assure_capacity(), foedus::storage::array::compact_logs(), foedus::debugging::StopWatch::elapsed_ms(), foedus::memory::AlignedMemory::get_block(), foedus::storage::Partitioner::SortBatchArguments::logs_count_, foedus::storage::array::prepare_sort_entries(), foedus::debugging::StopWatch::stop(), foedus::storage::Partitioner::SortBatchArguments::work_memory_, and foedus::storage::Partitioner::SortBatchArguments::written_count_.

Referenced by foedus::storage::Partitioner::sort_batch().

312  {
313  if (args.logs_count_ == 0) {
314  *args.written_count_ = 0;
315  return;
316  }
317 
318  // we so far sort them in one path.
319  // to save memory, we could do multi-path merge-sort.
320  // however, in reality each log has many bytes, so log_count is not that big.
321  args.work_memory_->assure_capacity(sizeof(SortEntry) * args.logs_count_);
322 
323  debugging::StopWatch stop_watch_entire;
324 
325  ASSERT_ND(sizeof(SortEntry) == 16U);
326  SortEntry* entries = reinterpret_cast<SortEntry*>(args.work_memory_->get_block());
327  prepare_sort_entries(args, entries);
328 
329  debugging::StopWatch stop_watch;
330  // Gave up non-gcc support because of aarch64 support. yes, we can also assume __uint128_t.
331  // CPU profile of partition_array_perf: 50% (introsort_loop) + 7% (other inlined).
332  std::sort(
333  reinterpret_cast<__uint128_t*>(entries),
334  reinterpret_cast<__uint128_t*>(entries + args.logs_count_));
335  stop_watch.stop();
336  VLOG(0) << "Sorted " << args.logs_count_ << " log entries in " << stop_watch.elapsed_ms() << "ms";
337 
338  uint32_t result_count = compact_logs(args, entries);
339 
340  stop_watch_entire.stop();
341  VLOG(0) << "Array-" << id_ << " sort_batch() done in " << stop_watch_entire.elapsed_ms()
342  << "ms for " << args.logs_count_ << " log entries, compacted them to"
343  << result_count << " log entries";
344  *args.written_count_ = result_count;
345 }
uint32_t compact_logs(const Partitioner::SortBatchArguments &args, SortEntry *entries)
subroutine of sort_batch
void prepare_sort_entries(const Partitioner::SortBatchArguments &args, SortEntry *entries)
subroutine of sort_batch
#define ASSERT_ND(x)
A warning-free wrapper macro of assert() that has no performance effect in release mode even when 'x'...
Definition: assert_nd.hpp:72

Here is the call graph for this function:

Here is the caller graph for this function:

Friends And Related Function Documentation

std::ostream& operator<< ( std::ostream &  o,
const ArrayPartitioner v 
)
friend

Definition at line 347 of file array_partitioner_impl.cpp.

347  {
348  o << "<ArrayPartitioner>";
349  if (v.data_) {
350  o << "<array_size_>" << v.data_->array_size_ << "</array_size_>"
351  << "<bucket_size_>" << v.data_->bucket_size_ << "</bucket_size_>";
352  for (uint16_t i = 0; i < kInteriorFanout; ++i) {
353  o << "<range bucket=\"" << i << "\" partition=\"" << v.data_->bucket_owners_[i] << "\" />";
354  }
355  } else {
356  o << "Not yet designed";
357  }
358  o << "</ArrayPartitioner>";
359  return o;
360 }
const uint16_t kInteriorFanout
Max number of entries in an interior page of array storage.
Definition: array_id.hpp:110

The documentation for this class was generated from the following files: