libfoedus-core
FOEDUS Core Library
foedus::storage::masstree::SplitBorder Struct Reference [final]

A system transaction to split a border page in Master-Tree.

Detailed Description

A system transaction to split a border page in Master-Tree.

See also
Physical-only, short-living system transactions.

When a border page becomes full or close to full, we split the page into two border pages. The new pages are placed as tentative foster twins of the page.

This does nothing and returns kErrorCodeOk in the following cases:

  • The page turns out to be already split.

Locks taken in this sysxct (in order of taking):

  • Page-lock of the target page.
  • Record-lock of all records in the target page (in canonical order).

Provided it is invoked after the enclosing user transaction has released its own locks, there is no chance of deadlock or even of any conditional lock acquisition. max_retries=2 should be enough in run_nested_sysxct().
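For context, a caller typically wraps this functor in the thread's nested-sysxct interface. The following is a minimal sketch, not taken from the FOEDUS sources; the variables context (a thread::Thread*), full_page, and slice are assumed to be in scope, and the exact signature of run_nested_sysxct() should be checked against thread::Thread.

// Hedged sketch: launching SplitBorder as a nested system transaction when a
// border page has no room left. `context`, `full_page`, and `slice` are
// assumed to exist in the caller; error handling is abbreviated.
foedus::storage::masstree::SplitBorder split_functor(context, full_page, slice);
foedus::ErrorCode code = context->run_nested_sysxct(&split_functor, 2);  // max_retries=2
if (code != foedus::kErrorCodeOk) {
  return code;  // give up; the caller usually aborts/retries the user transaction
}
// Afterwards the caller re-traverses the tree: the old page is now "moved" and
// the foster twins hold the records.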

Definition at line 51 of file masstree_split_impl.hpp.

#include <masstree_split_impl.hpp>


Classes

struct  SplitStrategy
 

Public Member Functions

 SplitBorder (thread::Thread *context, MasstreeBorderPage *target, KeySlice trigger, bool disable_no_record_split=false, bool piggyback_reserve=false, KeyLength piggyback_remainder_length=0, PayloadLength piggyback_payload_count=0, const void *piggyback_suffix=nullptr)
 
virtual ErrorCode run (xct::SysxctWorkspace *sysxct_workspace) override
 Border node's Split.
 
void decide_strategy (SplitStrategy *out) const
 Subroutine to decide how we will split this page.
 
ErrorCode lock_existing_records (xct::SysxctWorkspace *sysxct_workspace)
 Subroutine to lock existing records in target_.
 
void migrate_records (KeySlice inclusive_from, KeySlice inclusive_to, MasstreeBorderPage *dest) const
 Subroutine to construct a new page.
 

Public Attributes

thread::Thread *const context_
 Thread context.
 
MasstreeBorderPage *const target_
 The page to split.
 
const KeySlice trigger_
 The key that triggered this split.
 
const bool disable_no_record_split_
 If true, we never do no-record-split (NRS).
 
const bool piggyback_reserve_
 An optimization to also make room for a record.
 
const KeyLength piggyback_remainder_length_
 
const PayloadLength piggyback_payload_count_
 
const void * piggyback_suffix_
 

Constructor & Destructor Documentation

foedus::storage::masstree::SplitBorder::SplitBorder ( thread::Thread *  context,
MasstreeBorderPage *  target,
KeySlice  trigger,
bool  disable_no_record_split = false,
bool  piggyback_reserve = false,
KeyLength  piggyback_remainder_length = 0,
PayloadLength  piggyback_payload_count = 0,
const void *  piggyback_suffix = nullptr 
)
inline

Definition at line 82 of file masstree_split_impl.hpp.

  : xct::SysxctFunctor(),
    context_(context),
    target_(target),
    trigger_(trigger),
    disable_no_record_split_(disable_no_record_split),
    piggyback_reserve_(piggyback_reserve),
    piggyback_remainder_length_(piggyback_remainder_length),
    piggyback_payload_count_(piggyback_payload_count),
    piggyback_suffix_(piggyback_suffix) {
  }

Member Function Documentation

void foedus::storage::masstree::SplitBorder::decide_strategy ( SplitBorder::SplitStrategy *  out) const

Subroutine to decide how we will split this page.

Definition at line 146 of file masstree_split_impl.cpp.

References ASSERT_ND, disable_no_record_split_, foedus::storage::masstree::MasstreePage::get_key_count(), foedus::storage::masstree::MasstreeBorderPage::get_slice(), foedus::storage::masstree::MasstreeBorderPage::is_consecutive_inserts(), foedus::storage::masstree::MasstreePage::is_locked(), foedus::storage::masstree::kInfimumSlice, foedus::storage::masstree::SplitBorder::SplitStrategy::largest_slice_, foedus::storage::masstree::SplitBorder::SplitStrategy::mid_slice_, foedus::storage::masstree::SplitBorder::SplitStrategy::no_record_split_, foedus::storage::masstree::SplitBorder::SplitStrategy::original_key_count_, foedus::storage::masstree::SplitBorder::SplitStrategy::smallest_slice_, target_, trigger_, and foedus::assorted::UniformRandom::uniform_within().

Referenced by run().

{
  ASSERT_ND(target_->is_locked());
  const SlotIndex key_count = target_->get_key_count();
  ASSERT_ND(key_count > 0);
  out->original_key_count_ = key_count;
  out->no_record_split_ = false;
  out->smallest_slice_ = target_->get_slice(0);
  out->largest_slice_ = target_->get_slice(0);

  // if consecutive_inserts_, we are already sure about the key distributions, so easy.
  if (target_->is_consecutive_inserts()) {
    out->largest_slice_ = target_->get_slice(key_count - 1);
    if (!disable_no_record_split_ && trigger_ > out->largest_slice_) {
      out->no_record_split_ = true;
      DVLOG(1) << "Obviously no record split. key_count=" << static_cast<int>(key_count);
      out->mid_slice_ = out->largest_slice_ + 1;
    } else {
      if (disable_no_record_split_ && trigger_ > out->largest_slice_) {
        DVLOG(1) << "No-record split was possible, but disable_no_record_split specified."
          << " simply splitting in half...";
      }
      DVLOG(1) << "Breaks a sequential page. key_count=" << static_cast<int>(key_count);
      out->mid_slice_ = target_->get_slice(key_count / 2);
    }
    return;
  }

  for (SlotIndex i = 1; i < key_count; ++i) {
    const KeySlice this_slice = target_->get_slice(i);
    out->smallest_slice_ = std::min<KeySlice>(this_slice, out->smallest_slice_);
    out->largest_slice_ = std::max<KeySlice>(this_slice, out->largest_slice_);
  }

  ASSERT_ND(key_count >= 2U);  // because it's not consecutive, there must be at least 2 records.

  {
    // even if not, there is another easy case where two "tides" mix in this page;
    // one tide from left sequentially inserts keys while another tide from right also sequentially
    // inserts keys that are larger than left tide. This usually happens at the boundary of
    // two largely independent partitions (eg multiple threads inserting keys of their partition).
    // In that case, we should cleanly separate the two tides by picking the smallest key from
    // right-tide as the separator.
    KeySlice tides_max[2];
    KeySlice second_tide_min = kInfimumSlice;
    bool first_tide_broken = false;
    bool both_tides_broken = false;
    tides_max[0] = target_->get_slice(0);
    // for example, consider the following case:
    //   1 2 32 33 3 4 34 x
    // There are two tides 1- and 32-. We detect them as follows.
    // We initially consider 1,2,32,33 as the first tide because they are sequential.
    // Then, "3" breaks the first tide. We then consider 1- and 32- as the two tides.
    // If x breaks the tide again, we give up.
    for (SlotIndex i = 1; i < key_count; ++i) {
      // look for "tide breaker" that is smaller than the max of the tide.
      // as soon as we found two of them (meaning 3 tides or more), we give up.
      KeySlice slice = target_->get_slice(i);
      if (!first_tide_broken) {
        if (slice >= tides_max[0]) {
          tides_max[0] = slice;
          continue;  // ok!
        } else {
          // let's find where a second tide starts.
          first_tide_broken = true;
          SlotIndex first_breaker;
          for (first_breaker = 0; first_breaker < i; ++first_breaker) {
            const KeySlice breaker_slice = target_->get_slice(first_breaker);
            if (breaker_slice > slice) {
              break;
            }
          }
          ASSERT_ND(first_breaker < i);
          tides_max[0] = slice;
          ASSERT_ND(second_tide_min == kInfimumSlice);
          second_tide_min = target_->get_slice(first_breaker);
          tides_max[1] = target_->get_slice(i - 1);
          ASSERT_ND(tides_max[0] < tides_max[1]);
          ASSERT_ND(tides_max[0] < second_tide_min);
          ASSERT_ND(second_tide_min <= tides_max[1]);
        }
      } else {
        if (slice < second_tide_min && slice >= tides_max[0]) {
          tides_max[0] = slice;
          continue;  // fine, in the first tide
        } else if (slice >= tides_max[1]) {
          tides_max[1] = slice;  // okay, in the second tide
        } else {
          DVLOG(2) << "Oops, third tide. not the easy case";
          both_tides_broken = true;
          break;
        }
      }
    }

    // Already sorted? (seems consecutive_inserts_ has some false positives)
    if (!first_tide_broken) {
      if (!disable_no_record_split_ && trigger_ > out->largest_slice_) {
        out->no_record_split_ = true;
        DVLOG(1) << "Obviously no record split. key_count=" << static_cast<int>(key_count);
        out->mid_slice_ = out->largest_slice_ + 1;
      } else {
        if (disable_no_record_split_ && trigger_ > out->largest_slice_) {
          DVLOG(1) << "No-record split was possible, but disable_no_record_split specified."
            << " simply splitting in half...";
        }
        DVLOG(1) << "Breaks a sequential page. key_count=" << static_cast<int>(key_count);
        out->mid_slice_ = target_->get_slice(key_count / 2);
      }
      return;
    }

    ASSERT_ND(first_tide_broken);
    if (!both_tides_broken) {
      DVLOG(0) << "Yay, figured out two-tides meeting in a page.";
      out->mid_slice_ = second_tide_min;
      return;
    }
  }

  // now we have to pick separator. as we don't sort in-page, this is approximate median selection.
  // there are a few smart algorithms out there, but we don't need that much accuracy.
  // just randomly pick a few. good enough.
  assorted::UniformRandom uniform_random(12345);
  const SlotIndex kSamples = 7;
  KeySlice choices[kSamples];
  for (uint8_t i = 0; i < kSamples; ++i) {
    choices[i] = target_->get_slice(uniform_random.uniform_within(0, key_count - 1));
  }
  std::sort(choices, choices + kSamples);
  out->mid_slice_ = choices[kSamples / 2];

  // scan through again to make sure the new separator is not used multiple times as key.
  // this is required for the invariant "same slices must be in same page"
  while (true) {
    bool observed = false;
    bool retry = false;
    for (SlotIndex i = 0; i < key_count; ++i) {
      const KeySlice this_slice = target_->get_slice(i);
      if (this_slice == out->mid_slice_) {
        if (observed) {
          // the key appeared twice! let's try another slice.
          ++out->mid_slice_;
          retry = true;
          break;
        } else {
          observed = true;
        }
      }
    }
    if (retry) {
      continue;
    } else {
      break;
    }
  }
}
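To illustrate the fallback path above in isolation: when neither the sequential-insert case nor the two-tides case applies, the separator is chosen as an approximate median by sorting a handful of randomly sampled slices. The snippet below is a self-contained sketch of that idea in plain C++, not FOEDUS code; it assumes a non-empty input vector.

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Standalone sketch of the sampling-based separator choice: sample a few
// slices, sort the samples, and take their median as an approximate median.
uint64_t approximate_median(const std::vector<uint64_t>& slices) {
  std::mt19937_64 rng(12345);  // fixed seed, mirroring decide_strategy()
  std::uniform_int_distribution<size_t> pick(0, slices.size() - 1);
  constexpr size_t kSamples = 7;
  uint64_t choices[kSamples];
  for (size_t i = 0; i < kSamples; ++i) {
    choices[i] = slices[pick(rng)];
  }
  std::sort(choices, choices + kSamples);
  return choices[kSamples / 2];
}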

ErrorCode foedus::storage::masstree::SplitBorder::lock_existing_records ( xct::SysxctWorkspace *  sysxct_workspace)

Subroutine to lock existing records in target_.

Definition at line 304 of file masstree_split_impl.cpp.

References ASSERT_ND, CHECK_ERROR_CODE, context_, foedus::debugging::RdtscWatch::elapsed(), foedus::storage::masstree::MasstreePage::get_key_count(), foedus::storage::masstree::MasstreeBorderPage::get_owner_id(), foedus::thread::Thread::get_thread_id(), foedus::storage::masstree::MasstreePage::header(), foedus::xct::RwLockableXctId::is_keylocked(), foedus::storage::masstree::MasstreePage::is_locked(), foedus::storage::masstree::kBorderPageMaxSlots, foedus::kErrorCodeOk, foedus::storage::PageHeader::page_id_, foedus::debugging::RdtscWatch::stop(), foedus::storage::PageHeader::storage_id_, foedus::thread::Thread::sysxct_batch_record_locks(), and target_.

Referenced by run().

{
  debugging::RdtscWatch watch;  // check how expensive this is
  ASSERT_ND(target_->is_locked());
  const SlotIndex key_count = target_->get_key_count();
  ASSERT_ND(key_count > 0);

  // We use the batched interface. It internally sorts, but has better performance if
  // we provide an already-sorted input. Remember that slots grow backwards,
  // so larger indexes have smaller lock IDs.
  xct::RwLockableXctId* record_locks[kBorderPageMaxSlots];
  for (SlotIndex i = 0; i < key_count; ++i) {
    record_locks[i] = target_->get_owner_id(key_count - 1U - i);  // larger indexes first
  }

  VolatilePagePointer page_id(target_->header().page_id_);
  CHECK_ERROR_CODE(context_->sysxct_batch_record_locks(
    sysxct_workspace,
    page_id,
    key_count,
    record_locks));

#ifndef NDEBUG
  for (SlotIndex i = 0; i < key_count; ++i) {
    xct::RwLockableXctId* owner_id = target_->get_owner_id(i);
    ASSERT_ND(owner_id->is_keylocked());
  }
#endif  // NDEBUG
  watch.stop();
  DVLOG(1) << "Costed " << watch.elapsed() << " cycles to lock all of "
    << static_cast<int>(key_count) << " records while splitting";
  if (watch.elapsed() > (1ULL << 26)) {
    // if we see this often, we have to optimize this somehow.
    LOG(WARNING) << "wait, wait, it costed " << watch.elapsed() << " cycles to lock all of "
      << static_cast<int>(key_count) << " records while splitting!! that's a lot! storage="
      << target_->header().storage_id_
      << ", thread ID=" << context_->get_thread_id();
  }

  return kErrorCodeOk;
}
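The reversal key_count - 1U - i deserves a small illustration. Because slots grow backwards from the end of the page, a larger slot index corresponds to a smaller address, hence a smaller lock ID. The toy layout below is an assumption made purely for illustration, not the actual MasstreeBorderPage layout; it only shows why iterating indexes in reverse yields locks in ascending address order, which is what the batched interface prefers.

#include <cstddef>
#include <cstdint>

// Toy page whose slots grow backwards from the end of a fixed-size data area.
struct ToySlot { uint64_t tid; };
struct ToyPage {
  alignas(8) char data_[4096];
  ToySlot* slot(size_t index) {  // slot 0 sits at the very end, slot 1 just before it, ...
    return reinterpret_cast<ToySlot*>(data_ + sizeof(data_)) - 1 - index;
  }
};

// Enumerate lock addresses in ascending order by walking indexes in reverse,
// analogous to record_locks[i] = target_->get_owner_id(key_count - 1U - i).
void collect_sorted(ToyPage* page, size_t key_count, ToySlot** out) {
  for (size_t i = 0; i < key_count; ++i) {
    out[i] = page->slot(key_count - 1U - i);  // addresses strictly increase with i
  }
}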

void foedus::storage::masstree::SplitBorder::migrate_records ( KeySlice  inclusive_from,
KeySlice  inclusive_to,
MasstreeBorderPage *  dest 
) const

Subroutine to construct a new page.

Definition at line 345 of file masstree_split_impl.cpp.

References foedus::assorted::align8(), ASSERT_ND, foedus::storage::masstree::calculate_suffix_length(), foedus::storage::masstree::MasstreePage::get_key_count(), foedus::storage::masstree::MasstreeBorderPage::get_new_slot(), foedus::storage::masstree::MasstreeBorderPage::get_record_from_offset(), foedus::storage::masstree::MasstreePage::is_locked(), foedus::storage::masstree::kInitiallyNextLayer, foedus::storage::masstree::kMaxKeyLength, foedus::storage::masstree::kSupremumSlice, foedus::storage::masstree::MasstreePage::set_key_count(), foedus::storage::masstree::MasstreeBorderPage::set_slice(), target_, and foedus::storage::masstree::MasstreeBorderPage::to_record_length().

Referenced by run().

{
  ASSERT_ND(target_->is_locked());
  const auto& copy_from = *target_;
  const SlotIndex key_count = target_->get_key_count();
  ASSERT_ND(dest->get_key_count() == 0);
  dest->next_offset_ = 0;
  SlotIndex migrated_count = 0;
  DataOffset unused_space = sizeof(dest->data_);
  bool sofar_consecutive = true;
  KeySlice prev_slice = kSupremumSlice;
  KeyLength prev_remainder = kMaxKeyLength;

  // Simply iterate over and memcpy one-by-one.
  // We previously did a bit more complex thing to copy as many records as
  // possible in one memcpy, but not worth it with the new page layout.
  // We will keep an eye on the cost of this method, and optimize when it becomes bottleneck.
  for (SlotIndex i = 0; i < key_count; ++i) {
    const KeySlice from_slice = copy_from.get_slice(i);
    if (from_slice >= inclusive_from && from_slice <= inclusive_to) {
      // move this record.
      auto* to_slot = dest->get_new_slot(migrated_count);
      const auto* from_slot = copy_from.get_slot(i);
      ASSERT_ND(from_slot->tid_.is_keylocked());
      const KeyLength from_remainder = from_slot->remainder_length_;
      const KeyLength from_suffix = calculate_suffix_length(from_remainder);
      const PayloadLength payload = from_slot->lengthes_.components.payload_length_;
      const KeyLength to_remainder
        = to_slot->tid_.xct_id_.is_next_layer() ? kInitiallyNextLayer : from_remainder;
      const KeyLength to_suffix = calculate_suffix_length(to_remainder);
      if (to_remainder != from_remainder) {
        ASSERT_ND(to_remainder == kInitiallyNextLayer);
        ASSERT_ND(from_remainder != kInitiallyNextLayer && from_remainder <= kMaxKeyLength);
        DVLOG(2) << "the old record is now a next-layer record, this new record can be initially"
          " a next-layer, saving space for suffixes. from_remainder=" << from_remainder;
      }

      dest->set_slice(migrated_count, from_slice);
      to_slot->tid_.xct_id_ = from_slot->tid_.xct_id_;
      to_slot->tid_.lock_.reset();
      to_slot->remainder_length_ = to_remainder;
      to_slot->lengthes_.components.payload_length_ = payload;
      // offset/physical_length set later

      if (sofar_consecutive && migrated_count > 0) {
        if (prev_slice > from_slice
          || (prev_slice == from_slice && prev_remainder > from_remainder)) {
          sofar_consecutive = false;
        }
      }
      prev_slice = from_slice;
      prev_remainder = to_remainder;

      // we might shrink the physical record size.
      const DataOffset record_length = MasstreeBorderPage::to_record_length(to_remainder, payload);
      ASSERT_ND(record_length % 8 == 0);
      ASSERT_ND(record_length <= from_slot->lengthes_.components.physical_record_length_);
      to_slot->lengthes_.components.physical_record_length_ = record_length;
      to_slot->lengthes_.components.offset_ = dest->next_offset_;
      to_slot->original_physical_record_length_ = record_length;
      to_slot->original_offset_ = dest->next_offset_;
      dest->next_offset_ += record_length;
      unused_space -= record_length - sizeof(*to_slot);

      // Copy the record. We want to do it in one memcpy if possible.
      // Be careful on the case where suffix length has changed (kInitiallyNextLayer case)
      if (record_length > 0) {
        char* to_record = dest->get_record_from_offset(to_slot->lengthes_.components.offset_);
        if (from_suffix != to_suffix) {
          ASSERT_ND(to_remainder == kInitiallyNextLayer);
          ASSERT_ND(from_remainder != kInitiallyNextLayer && from_remainder <= kMaxKeyLength);
          ASSERT_ND(to_suffix == 0);
          // Skip suffix part and copy only the payload.
          std::memcpy(
            to_record,
            copy_from.get_record_payload(i),
            assorted::align8(payload));
        } else {
          // Copy suffix (if exists) and payload together.
          std::memcpy(to_record, copy_from.get_record(i), record_length);
        }
      }

      ++migrated_count;
      dest->set_key_count(migrated_count);
    }
  }

  dest->consecutive_inserts_ = sofar_consecutive;
}
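A small aside on the sizes involved: the physical record length is always a multiple of 8 bytes (see the ASSERT_ND above), presumably because the key-suffix and payload areas are each rounded up to 8-byte boundaries. The helper below is a hedged sketch of that arithmetic only; it is not the actual MasstreeBorderPage::to_record_length() implementation, which takes the remainder length and derives the suffix internally.

#include <cstdint>

// 8-byte round-up, the same idea as foedus::assorted::align8().
inline uint16_t round_up_8(uint16_t value) {
  return static_cast<uint16_t>((value + 7U) & ~7U);
}

// Hedged sketch of sizing a physical record from suffix and payload lengths.
inline uint16_t toy_record_length(uint16_t suffix_length, uint16_t payload_length) {
  return round_up_8(suffix_length) + round_up_8(payload_length);  // always a multiple of 8
}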

ErrorCode foedus::storage::masstree::SplitBorder::run ( xct::SysxctWorkspace *  sysxct_workspace)
override virtual

Border node's Split.

Implements foedus::xct::SysxctFunctor.

Definition at line 39 of file masstree_split_impl.cpp.

References ASSERT_ND, CHECK_ERROR_CODE, context_, decide_strategy(), disable_no_record_split_, foedus::thread::GrabFreeVolatilePagesScope::dispatch(), foedus::storage::masstree::MasstreePage::get_high_fence(), foedus::storage::masstree::MasstreePage::get_key_count(), foedus::xct::RwLockableXctId::get_key_lock(), foedus::storage::masstree::MasstreePage::get_layer(), foedus::thread::Thread::get_local_volatile_page_resolver(), foedus::storage::masstree::MasstreePage::get_low_fence(), foedus::storage::masstree::MasstreeBorderPage::get_next_offset(), foedus::thread::Thread::get_numa_node(), foedus::storage::masstree::MasstreeBorderPage::get_owner_id(), foedus::storage::masstree::MasstreePage::get_version_address(), foedus::thread::GrabFreeVolatilePagesScope::grab(), foedus::storage::masstree::MasstreePage::has_foster_child(), foedus::storage::masstree::MasstreePage::header(), foedus::storage::masstree::MasstreePage::install_foster_twin(), foedus::storage::masstree::MasstreeBorderPage::is_consecutive_inserts(), foedus::storage::masstree::MasstreePage::is_empty_range(), foedus::xct::RwLockableXctId::is_keylocked(), foedus::kErrorCodeOk, foedus::storage::masstree::SplitBorder::SplitStrategy::largest_slice_, lock_existing_records(), foedus::assorted::memory_fence_release(), foedus::storage::masstree::SplitBorder::SplitStrategy::mid_slice_, migrate_records(), foedus::storage::masstree::SplitBorder::SplitStrategy::no_record_split_, foedus::xct::McsRwLock::reset(), foedus::storage::VolatilePagePointer::set(), foedus::storage::PageVersion::set_moved(), foedus::xct::XctId::set_moved(), foedus::storage::masstree::SplitBorder::SplitStrategy::smallest_slice_, foedus::storage::PageHeader::snapshot_, foedus::storage::PageHeader::storage_id_, foedus::thread::Thread::sysxct_page_lock(), target_, and foedus::xct::RwLockableXctId::xct_id_.

{
  ASSERT_ND(!target_->header().snapshot_);
  ASSERT_ND(!target_->is_empty_range());

  debugging::RdtscWatch watch;
  DVLOG(1) << "Splitting a page... ";

  // First, lock the page. The page's lock state is before all the records in the page,
  // so we can simply lock it first.
  CHECK_ERROR_CODE(context_->sysxct_page_lock(sysxct_workspace, reinterpret_cast<Page*>(target_)));

  // The lock involves atomic operation, so now all we see are finalized.
  if (target_->has_foster_child()) {
    DVLOG(0) << "Interesting. the page has been already split";
    return kErrorCodeOk;
  }
  if (target_->get_key_count() <= 1U) {
    DVLOG(0) << "This page has too few records. Can't split it";
    return kErrorCodeOk;
  }

  const SlotIndex key_count = target_->get_key_count();

  // 2 free volatile pages needed.
  // foster-minor/major (will be placed in successful case)
  memory::PagePoolOffset offsets[2];
  thread::GrabFreeVolatilePagesScope free_pages_scope(context_, offsets);
  CHECK_ERROR_CODE(free_pages_scope.grab(2));
  const auto& resolver = context_->get_local_volatile_page_resolver();

  SplitStrategy strategy;  // small. just place it on stack
  decide_strategy(&strategy);
  ASSERT_ND(target_->get_low_fence() <= strategy.mid_slice_);
  ASSERT_ND(strategy.mid_slice_ <= target_->get_high_fence());
  MasstreeBorderPage* twin[2];
  VolatilePagePointer new_page_ids[2];
  for (int i = 0; i < 2; ++i) {
    twin[i] = reinterpret_cast<MasstreeBorderPage*>(resolver.resolve_offset_newpage(offsets[i]));
    new_page_ids[i].set(context_->get_numa_node(), offsets[i]);
    twin[i]->initialize_volatile_page(
      target_->header().storage_id_,
      new_page_ids[i],
      target_->get_layer(),
      i == 0 ? target_->get_low_fence() : strategy.mid_slice_,    // low-fence
      i == 0 ? strategy.mid_slice_ : target_->get_high_fence());  // high-fence
  }

  // lock all records
  CHECK_ERROR_CODE(lock_existing_records(sysxct_workspace));

  if (strategy.no_record_split_) {
    ASSERT_ND(!disable_no_record_split_);
    // in this case, we can move all records in one memcpy.
    // well, actually two : one for slices and another for data.
    std::memcpy(twin[0]->slices_, target_->slices_, sizeof(KeySlice) * key_count);
    std::memcpy(twin[0]->data_, target_->data_, sizeof(target_->data_));
    twin[0]->set_key_count(key_count);
    twin[1]->set_key_count(0);
    twin[0]->consecutive_inserts_ = target_->is_consecutive_inserts();
    twin[1]->consecutive_inserts_ = true;
    twin[0]->next_offset_ = target_->get_next_offset();
    twin[1]->next_offset_ = 0;
    for (SlotIndex i = 0; i < key_count; ++i) {
      xct::RwLockableXctId* owner_id = twin[0]->get_owner_id(i);
      ASSERT_ND(owner_id->is_keylocked());
      owner_id->get_key_lock()->reset();  // no race
    }
  } else {
    migrate_records(
      strategy.smallest_slice_,
      strategy.mid_slice_ - 1,  // to make it inclusive
      twin[0]);
    migrate_records(
      strategy.mid_slice_,
      strategy.largest_slice_,  // this is inclusive (to avoid supremum hassles)
      twin[1]);
  }

  // Now we will install the new pages. **From now on no error-return allowed**
  assorted::memory_fence_release();
  // We install pointers to the pages AFTER we initialize the pages.
  target_->install_foster_twin(new_page_ids[0], new_page_ids[1], strategy.mid_slice_);
  free_pages_scope.dispatch(0);
  free_pages_scope.dispatch(1);
  assorted::memory_fence_release();

  // invoking set_moved is the point we announce all of these changes. take fence to make it right
  target_->get_version_address()->set_moved();

  // set the "moved" bit so that concurrent transactions
  // check foster-twin for read-set/write-set checks.
  for (SlotIndex i = 0; i < key_count; ++i) {
    xct::RwLockableXctId* owner_id = target_->get_owner_id(i);
    owner_id->xct_id_.set_moved();
  }

  assorted::memory_fence_release();

  watch.stop();
  DVLOG(1) << "Costed " << watch.elapsed() << " cycles to split a page. original page physical"
    << " record count: " << static_cast<int>(key_count)
    << "->" << static_cast<int>(twin[0]->get_key_count())
    << " + " << static_cast<int>(twin[1]->get_key_count());
  return kErrorCodeOk;
}
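The fence/install ordering near the end of run() follows a standard publish pattern: fully initialize the foster twins, issue a release fence, install the pointers, and only then set the "moved" bits that concurrent transactions check. A generic, self-contained sketch of that pattern in standard C++ (not FOEDUS code) looks like this:

#include <atomic>

struct Node { int payload; };
std::atomic<Node*> published{nullptr};

// Writer: initialize first, release-fence, then announce.
void publish(Node* fresh) {
  fresh->payload = 42;                                  // initialize the new object
  std::atomic_thread_fence(std::memory_order_release);  // like memory_fence_release()
  published.store(fresh, std::memory_order_relaxed);    // announce to readers
}

// Reader: observe the announcement, acquire-fence, then dereference safely.
int consume() {
  Node* node = published.load(std::memory_order_relaxed);
  if (node == nullptr) { return -1; }
  std::atomic_thread_fence(std::memory_order_acquire);
  return node->payload;
}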

Member Data Documentation

thread::Thread* const foedus::storage::masstree::SplitBorder::context_

Thread context.

Definition at line 53 of file masstree_split_impl.hpp.

Referenced by lock_existing_records(), and run().

const bool foedus::storage::masstree::SplitBorder::disable_no_record_split_

If true, we never do no-record-split (NRS).

This is useful for example when we want to make room for record-expansion. Otherwise, we get stuck when the record-expansion causes a page-split that is eligible for NRS.

Definition at line 66 of file masstree_split_impl.hpp.

Referenced by decide_strategy(), and run().

const PayloadLength foedus::storage::masstree::SplitBorder::piggyback_payload_count_

Definition at line 79 of file masstree_split_impl.hpp.

const KeyLength foedus::storage::masstree::SplitBorder::piggyback_remainder_length_

Definition at line 78 of file masstree_split_impl.hpp.

const bool foedus::storage::masstree::SplitBorder::piggyback_reserve_

An optimization to also make room for a record.

The piggyback_* members give whether to do so, the remainder key length, the payload length, and the key suffix. In this case, trigger_ implicitly serves as the slice for piggyback_reserve_.

This optimization is best-effort. The caller must check afterwards whether the space is actually reserved. For example, a concurrent thread might have newly reserved a (might be too small) space for the key right before the call.
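For illustration, a hedged sketch of constructing the sysxct with the piggyback reservation enabled. The variable names are assumptions, the run_nested_sysxct() signature should be checked against thread::Thread, and the caller must still verify the reservation afterwards.

// `context`, `page`, `slice`, `remainder_length`, `payload_length`, and `suffix`
// are assumed to be in scope in the caller.
foedus::storage::masstree::SplitBorder split(
  context,
  page,
  slice,              // trigger_: implicitly also the slice of the record to reserve
  false,              // disable_no_record_split (default)
  true,               // piggyback_reserve
  remainder_length,   // piggyback_remainder_length
  payload_length,     // piggyback_payload_count
  suffix);            // piggyback_suffix
foedus::ErrorCode code = context->run_nested_sysxct(&split, 2);
// Best-effort: even on kErrorCodeOk, re-search the foster twins to confirm the
// record slot actually exists with enough space before using it.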

Definition at line 77 of file masstree_split_impl.hpp.

const void* foedus::storage::masstree::SplitBorder::piggyback_suffix_

Definition at line 80 of file masstree_split_impl.hpp.

MasstreeBorderPage* const foedus::storage::masstree::SplitBorder::target_

The page to split.

Precondition
!header_.snapshot_ (splits happen only to volatile pages)

Definition at line 58 of file masstree_split_impl.hpp.

Referenced by decide_strategy(), lock_existing_records(), migrate_records(), and run().

const KeySlice foedus::storage::masstree::SplitBorder::trigger_

The key that triggered this split.

A hint for NRS

Definition at line 60 of file masstree_split_impl.hpp.

Referenced by decide_strategy().


The documentation for this struct was generated from the following files:
  • masstree_split_impl.hpp
  • masstree_split_impl.cpp