diff --git a/src/main/asciidoc/_chapters/architecture.adoc b/src/main/asciidoc/_chapters/architecture.adoc index 073c86b15a33..6c4ec4f30f75 100644 --- a/src/main/asciidoc/_chapters/architecture.adoc +++ b/src/main/asciidoc/_chapters/architecture.adoc @@ -101,7 +101,7 @@ The `hbase:meta` table structure is as follows: .Values -* `info:regioninfo` (serialized link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html[HRegionInfo] instance for this region) +* `info:regioninfo` (serialized link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/RegionInfo.html[RegionInfo] instance for this region) * `info:server` (server:port of the RegionServer containing this region) * `info:serverstartcode` (start-time of the RegionServer process containing this region) @@ -119,7 +119,7 @@ If a region has both an empty start and an empty end key, it is the only region ==== In the (hopefully unlikely) event that programmatic processing of catalog metadata -is required, see the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/RegionInfo.html#parseFrom-byte:A-[RegionInfo.parseFrom] utility. +is required, see the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/RegionInfo.html#parseFrom(byte%5B%5D)[RegionInfo.parseFrom] utility. [[arch.catalog.startup]] === Startup Sequencing @@ -141,7 +141,7 @@ Should a region be reassigned either by the master load balancer or because a Re See <> for more information about the impact of the Master on HBase Client communication. -Administrative functions are done via an instance of link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html[Admin] +Administrative functions are done via an instance of link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Admin.html[Admin] [[client.connections]] === Cluster Connections @@ -161,8 +161,8 @@ See the link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/pac ==== API before HBase 1.0.0 -Instances of `HTable` are the way to interact with an HBase cluster earlier than 1.0.0. _link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table] instances are not thread-safe_. Only one thread can use an instance of Table at any given time. -When creating Table instances, it is advisable to use the same link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration] instance. +Instances of `HTable` are the way to interact with an HBase cluster earlier than 1.0.0. _link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html[Table] instances are not thread-safe_. Only one thread can use an instance of Table at any given time. +When creating Table instances, it is advisable to use the same link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration] instance. This will ensure sharing of ZooKeeper and socket instances to the RegionServers which is usually what you want. For example, this is preferred: @@ -183,7 +183,7 @@ HBaseConfiguration conf2 = HBaseConfiguration.create(); HTable table2 = new HTable(conf2, "myTable"); ---- -For more information about how connections are handled in the HBase client, see link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ConnectionFactory.html[ConnectionFactory]. 
+For more information about how connections are handled in the HBase client, see link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/ConnectionFactory.html[ConnectionFactory]. [[client.connection.pooling]] ===== Connection Pooling @@ -207,19 +207,19 @@ try (Connection connection = ConnectionFactory.createConnection(conf); [WARNING] ==== Previous versions of this guide discussed `HTablePool`, which was deprecated in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by link:https://issues.apache.org/jira/browse/HBASE-6580[HBASE-6580], or `HConnection`, which is deprecated in HBase 1.0 by `Connection`. -Please use link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Connection.html[Connection] instead. +Please use link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Connection.html[Connection] instead. ==== [[client.writebuffer]] === WriteBuffer and Batch Methods -In HBase 1.0 and later, link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/HTable.html[HTable] is deprecated in favor of link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table]. `Table` does not use autoflush. To do buffered writes, use the BufferedMutator class. +In HBase 1.0 and later, link:https://hbase.apache.org/1.4/devapidocs/org/apache/hadoop/hbase/client/HTable.html[HTable] is deprecated in favor of link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html[Table]. `Table` does not use autoflush. To do buffered writes, use the BufferedMutator class. -In HBase 2.0 and later, link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/HTable.html[HTable] does not use BufferedMutator to execute the ``Put`` operation. Refer to link:https://issues.apache.org/jira/browse/HBASE-18500[HBASE-18500] for more information. +In HBase 2.0 and later, link:https://hbase.apache.org/2.6/devapidocs/org/apache/hadoop/hbase/client/HTable.html[HTable] does not use BufferedMutator to execute the ``Put`` operation. Refer to link:https://issues.apache.org/jira/browse/HBASE-18500[HBASE-18500] for more information. For additional information on write durability, review the link:/acid-semantics.html[ACID semantics] page. -For fine-grained control of batching of ``Put``s or ``Delete``s, see the link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#batch-java.util.List-java.lang.Object:A-[batch] methods on Table. +For fine-grained control of batching of ``Put``s or ``Delete``s, see the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html#batch(java.util.List,java.lang.Object%5B%5D)[batch] methods on Table. [[async.client]] === Asynchronous Client === @@ -486,7 +486,7 @@ your own risk. [[client.filter]] == Client Request Filters -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get] and link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instances can be optionally configured with link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html[filters] which are applied on the RegionServer. +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Get.html[Get] and link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instances can be optionally configured with link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/Filter.html[filters] which are applied on the RegionServer. 
Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups of Filter functionality. @@ -498,7 +498,7 @@ Structural Filters contain other Filters. [[client.filter.structural.fl]] ==== FilterList -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html[FilterList] represents a list of Filters with a relationship of `FilterList.Operator.MUST_PASS_ALL` or `FilterList.Operator.MUST_PASS_ONE` between the Filters. +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/FilterList.html[FilterList] represents a list of Filters with a relationship of `FilterList.Operator.MUST_PASS_ALL` or `FilterList.Operator.MUST_PASS_ONE` between the Filters. The following example shows an 'or' between two Filters (checking for either 'my value' or 'my other value' on the same attribute). [source,java] @@ -528,7 +528,7 @@ scan.setFilter(list); ==== SingleColumnValueFilter A SingleColumnValueFilter (see: -https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html) +https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html) can be used to test column values for equivalence (`CompareOperaor.EQUAL`), inequality (`CompareOperaor.NOT_EQUAL`), or ranges (e.g., `CompareOperaor.GREATER`). The following is an example of testing equivalence of a column to a String value "my value"... @@ -588,7 +588,7 @@ These Comparators are used in concert with other Filters, such as <>. -J Mohamed Zahoor goes into some more detail on the Master Architecture in this blog posting, link:http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/[HBase HMaster Architecture ]. +J Mohamed Zahoor goes into some more detail on the Master Architecture in this blog posting, link:https://web.archive.org/web/20191211053128/http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/[HBase HMaster Architecture]. [[master.startup]] === Startup Behavior @@ -1129,7 +1129,7 @@ If the BucketCache is deployed in off-heap mode, this memory is not managed by t This is why you'd use BucketCache in pre-2.0.0, so your latencies are less erratic, to mitigate GCs and heap fragmentation, and so you can safely use more memory. See Nick Dimiduk's link:http://www.n10k.com/blog/blockcache-101/[BlockCache 101] for comparisons running on-heap vs off-heap tests. -Also see link:https://people.apache.org/~stack/bc/[Comparing BlockCache Deploys] which finds that if your dataset fits inside your LruBlockCache deploy, use it otherwise if you are experiencing cache churn (or you want your cache to exist beyond the vagaries of java GC), use BucketCache. +Also see link:https://web.archive.org/web/20231109025243/http://people.apache.org/~stack/bc/[Comparing BlockCache Deploys] which finds that if your dataset fits inside your LruBlockCache deploy, use it otherwise if you are experiencing cache churn (or you want your cache to exist beyond the vagaries of java GC), use BucketCache. + In pre-2.0.0, one can configure the BucketCache so it receives the `victim` of an LruBlockCache eviction. @@ -1239,7 +1239,7 @@ The value allocated by `MaxDirectMemorySize` must not exceed physical RAM, and i You can see how much memory -- on-heap and off-heap/direct -- a RegionServer is configured to use and how much it is using at any one time by looking at the _Server Metrics: Memory_ tab in the UI. It can also be gotten via JMX. 
In particular the direct memory currently used by the server can be found on the `java.nio.type=BufferPool,name=direct` bean. -Terracotta has a link:http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options[good write up] on using off-heap memory in Java. +Terracotta has a link:https://web.archive.org/web/20170907032911/http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options[good write up] on using off-heap memory in Java. It is for their product BigMemory but a lot of the issues noted apply in general to any attempt at going off-heap. Check it out. ==== diff --git a/src/main/asciidoc/_chapters/case_studies.adoc b/src/main/asciidoc/_chapters/case_studies.adoc index b021aa204bf7..96b6f2e07d51 100644 --- a/src/main/asciidoc/_chapters/case_studies.adoc +++ b/src/main/asciidoc/_chapters/case_studies.adoc @@ -160,7 +160,7 @@ Investigation results of a self-described "we're not sure what's wrong, but it s === Case Study #3 (Performance Research 2010)) Investigation results of general cluster performance from 2010. -Although this research is on an older version of the codebase, this writeup is still very useful in terms of approach. http://hstack.org/hbase-performance-testing/ +Although this research is on an older version of the codebase, this writeup is still very useful in terms of approach. https://web.archive.org/web/20180503124332/http://hstack.org/hbase-performance-testing/ [[casestudies.max.transfer.threads]] === Case Study #4 (max.transfer.threads Config) diff --git a/src/main/asciidoc/_chapters/community.adoc b/src/main/asciidoc/_chapters/community.adoc index 5dfc42a77066..7a4b00152313 100644 --- a/src/main/asciidoc/_chapters/community.adoc +++ b/src/main/asciidoc/_chapters/community.adoc @@ -75,7 +75,7 @@ We also are currently in violation of this basic tenet -- replication at least k === Release Managers Each maintained release branch has a release manager, who volunteers to coordinate new features and bug fixes are backported to that release. -The release managers are link:https://hbase.apache.org/team-list.html[committers]. +The release managers are link:https://hbase.apache.org/team.html[committers]. If you would like your feature or bug fix to be included in a given release, communicate with that release manager. If this list goes out of date or you can't reach the listed person, reach out to someone else on the list. diff --git a/src/main/asciidoc/_chapters/configuration.adoc b/src/main/asciidoc/_chapters/configuration.adoc index 8ae6824cb220..a990863900f1 100644 --- a/src/main/asciidoc/_chapters/configuration.adoc +++ b/src/main/asciidoc/_chapters/configuration.adoc @@ -593,7 +593,7 @@ Pseudo-distributed mode can run against the local filesystem or it can run again the _Hadoop Distributed File System_ (HDFS). Fully-distributed mode can ONLY run on HDFS. See the Hadoop link:https://hadoop.apache.org/docs/current/[documentation] for how to set up HDFS. A good walk-through for setting up HDFS on Hadoop 2 can be found at -http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide. +https://web.archive.org/web/20221007121526/https://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/. 
[[pseudo]] ==== Pseudo-distributed @@ -910,7 +910,7 @@ dependency when connecting to a cluster: ==== Java client configuration The configuration used by a Java client is kept in an -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration] +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration] instance. The factory method on HBaseConfiguration, `HBaseConfiguration.create();`, on invocation, will read @@ -1227,7 +1227,7 @@ major compactions. See the entry for `hbase.hregion.majorcompaction` in the ==== Major compactions are absolutely necessary for StoreFile clean-up. Do not disable them altogether. You can run major compactions manually via the HBase shell or via the -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact-org.apache.hadoop.hbase.TableName-[Admin API]. +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact(org.apache.hadoop.hbase.TableName)[Admin API]. ==== For more information about compactions and the compaction file selection process, see @@ -1264,7 +1264,7 @@ idea on the size you need by surveying RegionServer UIs; you'll see index block the top of the webpage). [[nagles]] -==== link:http://en.wikipedia.org/wiki/Nagle's_algorithm[Nagle's] or the small package problem +==== link:https://en.wikipedia.org/wiki/Nagle%27s_algorithm[Nagle's] or the small package problem If a big 40ms or so occasional delay is seen in operations against HBase, try the Nagles' setting. For example, see the user mailing list thread, diff --git a/src/main/asciidoc/_chapters/cp.adoc b/src/main/asciidoc/_chapters/cp.adoc index 54733a792036..285afeb633f3 100644 --- a/src/main/asciidoc/_chapters/cp.adoc +++ b/src/main/asciidoc/_chapters/cp.adoc @@ -61,7 +61,7 @@ coprocessor can severely degrade cluster performance and stability. In HBase, you fetch data using a `Get` or `Scan`, whereas in an RDBMS you use a SQL query. In order to fetch only the relevant data, you filter it using a HBase -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html[Filter] +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/Filter.html[Filter] , whereas in an RDBMS you use a `WHERE` predicate. After fetching the data, you perform computations on it. This paradigm works well @@ -112,7 +112,7 @@ using HBase Shell. For more details see <>. transparently. The framework API is provided in the -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/coprocessor/package-summary.html[coprocessor] +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/package-summary.html[coprocessor] package. == Types of Coprocessors @@ -121,8 +121,8 @@ package. Observer coprocessors are triggered either before or after a specific event occurs. Observers that happen before an event use methods that start with a `pre` prefix, -such as link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html#prePut-org.apache.hadoop.hbase.coprocessor.ObserverContext-org.apache.hadoop.hbase.client.Put-org.apache.hadoop.hbase.wal.WALEdit-org.apache.hadoop.hbase.client.Durability-[`prePut`]. 
Observers that happen just after an event override methods that start -with a `post` prefix, such as link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html#postPut-org.apache.hadoop.hbase.coprocessor.ObserverContext-org.apache.hadoop.hbase.client.Put-org.apache.hadoop.hbase.wal.WALEdit-org.apache.hadoop.hbase.client.Durability-[`postPut`]. +such as link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html#prePut(org.apache.hadoop.hbase.coprocessor.ObserverContext,org.apache.hadoop.hbase.client.Put,org.apache.hadoop.hbase.wal.WALEdit)[`prePut`]. Observers that happen just after an event override methods that start +with a `post` prefix, such as link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/coprocessor/RegionObserver.html#postPut(org.apache.hadoop.hbase.coprocessor.ObserverContext,org.apache.hadoop.hbase.client.Put,org.apache.hadoop.hbase.wal.WALEdit)[`postPut`]. ==== Use Cases for Observer Coprocessors @@ -178,7 +178,7 @@ average or summation for an entire table which spans hundreds of regions. In contrast to observer coprocessors, where your code is run transparently, endpoint coprocessors must be explicitly invoked using the -link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/AsyncTable.html#coprocessorService-java.util.function.Function-org.apache.hadoop.hbase.client.ServiceCaller-byte:A-[CoprocessorService()] +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/AsyncTable.html#coprocessorService(java.util.function.Function,org.apache.hadoop.hbase.client.ServiceCaller,byte%5B%5D)[CoprocessorService()] method available in link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/AsyncTable.html[AsyncTable]. @@ -197,7 +197,7 @@ our async implementation). Since coprocessor is an advanced feature, we believe it is OK for coprocessor users to instead switch over to use `AsyncTable`. There is a lightweight -link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Connection.html#toAsyncConnection--[toAsyncConnection] +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Connection.html#toAsyncConnection()[toAsyncConnection] method to get an `AsyncConnection` from `Connection` if needed. ==== diff --git a/src/main/asciidoc/_chapters/datamodel.adoc b/src/main/asciidoc/_chapters/datamodel.adoc index 2e6070dbc90b..e4e488f1835a 100644 --- a/src/main/asciidoc/_chapters/datamodel.adoc +++ b/src/main/asciidoc/_chapters/datamodel.adoc @@ -287,21 +287,21 @@ Cell content is uninterpreted bytes == Data Model Operations The four primary data model operations are Get, Put, Scan, and Delete. -Operations are applied via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table] instances. +Operations are applied via link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html[Table] instances. === Get -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get] returns attributes for a specified row. -Gets are executed via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#get-org.apache.hadoop.hbase.client.Get-[Table.get] +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Get.html[Get] returns attributes for a specified row. 
+Gets are executed via link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html#get(org.apache.hadoop.hbase.client.Get)[Table.get] === Put -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html[Put] either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists). Puts are executed via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#put-org.apache.hadoop.hbase.client.Put-[Table.put] (non-writeBuffer) or link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#batch-java.util.List-java.lang.Object:A-[Table.batch] (non-writeBuffer) +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Put.html[Put] either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists). Puts are executed via link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html#put(org.apache.hadoop.hbase.client.Put)[Table.put] (non-writeBuffer) or link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html#batch(java.util.List,java.lang.Object%5B%5D)[Table.batch] (non-writeBuffer) [[scan]] === Scans -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] allow iteration over multiple rows for specified attributes. +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] allow iteration over multiple rows for specified attributes. The following is an example of a Scan on a Table instance. Assume that a table is populated with rows with keys "row1", "row2", "row3", and then another set of rows with the keys "abc1", "abc2", and "abc3". The following example shows how to set a Scan instance to return the rows beginning with "row". @@ -328,12 +328,12 @@ try { } ---- -Note that generally the easiest way to specify a specific stop point for a scan is by using the link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/InclusiveStopFilter.html[InclusiveStopFilter] class. +Note that generally the easiest way to specify a specific stop point for a scan is by using the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/InclusiveStopFilter.html[InclusiveStopFilter] class. === Delete -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html[Delete] removes a row from a table. -Deletes are executed via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete-org.apache.hadoop.hbase.client.Delete-[Table.delete]. +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Delete.html[Delete] removes a row from a table. +Deletes are executed via link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html#delete(org.apache.hadoop.hbase.client.Delete)[Table.delete]. HBase does not modify data in place, and so deletes are handled by creating new markers called _tombstones_. These tombstones, along with the dead values, are cleaned up on major compactions. @@ -358,7 +358,7 @@ In particular: * It is OK to write cells in a non-increasing version order. Below we describe how the version dimension in HBase currently works. -See link:https://issues.apache.org/jira/browse/HBASE-2406[HBASE-2406] for discussion of HBase versions. link:https://www.ngdata.com/bending-time-in-hbase/[Bending time in HBase] makes for a good read on the version, or time, dimension in HBase. 
+See link:https://issues.apache.org/jira/browse/HBASE-2406[HBASE-2406] for discussion of HBase versions. link:https://web.archive.org/web/20160909085951/https://www.ngdata.com/bending-time-in-hbase/[Bending time in HBase] makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation _Overwriting values at existing timestamps_ mentioned in the article no longer holds in HBase. @@ -373,7 +373,7 @@ Prior to HBase 0.96, the default number of versions kept was `3`, but in 0.96 an .Modify the Maximum Number of Versions for a Column Family ==== This example uses HBase Shell to keep a maximum of 5 versions of all columns in column family `f1`. -You could also use link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor]. +You could also use link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/ColumnFamilyDescriptorBuilder.html[ColumnFamilyDescriptorBuilder]. ---- hbase> alter ‘t1′, NAME => ‘f1′, VERSIONS => 5 @@ -385,7 +385,7 @@ hbase> alter ‘t1′, NAME => ‘f1′, VERSIONS => 5 You can also specify the minimum number of versions to store per column family. By default, this is set to 0, which means the feature is disabled. The following example sets the minimum number of versions on all columns in column family `f1` to `2`, via HBase Shell. -You could also use link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor]. +You could also use link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/ColumnFamilyDescriptorBuilder.html[ColumnFamilyDescriptorBuilder]. ---- hbase> alter ‘t1′, NAME => ‘f1′, MIN_VERSIONS => 2 @@ -403,12 +403,12 @@ In this section we look at the behavior of the version dimension for each of the ==== Get/Scan Gets are implemented on top of Scans. -The below discussion of link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get] applies equally to link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scans]. +The below discussion of link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Get.html[Get] applies equally to link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Scan.html[Scans]. By default, i.e. if you specify no explicit version, when doing a `get`, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways: -* to return more than one version, see link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setMaxVersions--[Get.setMaxVersions()] -* to return versions other than the latest, see link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setTimeRange-long-long-[Get.setTimeRange()] +* to return more than one version, see link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Get.html#readVersions(int)[Get.readVersions(int)] +* to return versions other than the latest, see link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Get.html#setTimeRange(long,long)[Get.setTimeRange(long,long)] + To retrieve the latest version that is less than or equal to a given value, thus giving the 'latest' state of the record at a certain point in time, just use a range from 0 to the desired version and set the max versions to 1. 
@@ -529,7 +529,7 @@ Rather, a so-called _tombstone_ is written, which will mask the deleted values. When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted. -For an informative discussion on how deletes and versioning interact, see the thread link:http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/28421[Put w/timestamp -> Deleteall -> Put w/ timestamp fails] up on the user mailing list. +For an informative discussion on how deletes and versioning interact, see the thread link:https://lists.apache.org/thread/g6s0fkx74hbmc0pplnf5r3gq5xn4vkyt[Put w/timestamp -> Deleteall -> Put w/ timestamp fails] up on the user mailing list. Also see <> for more information on the internal KeyValue format. @@ -597,7 +597,7 @@ _...create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case - anymore..._ (See _Garbage Collection_ in link:https://www.ngdata.com/bending-time-in-hbase/[Bending time in HBase].) + anymore..._ (See _Garbage Collection_ in link:https://web.archive.org/web/20160909085951/https://www.ngdata.com/bending-time-in-hbase/[Bending time in HBase].) [[dm.sort]] == Sort Order diff --git a/src/main/asciidoc/_chapters/developer.adoc b/src/main/asciidoc/_chapters/developer.adoc index 0af9a42044ba..0844bafd4868 100644 --- a/src/main/asciidoc/_chapters/developer.adoc +++ b/src/main/asciidoc/_chapters/developer.adoc @@ -837,7 +837,7 @@ looking good, now is the time to tag the release candidate (You always remove the tag if you need to redo). To tag, do what follows substituting in the version appropriate to your build. All tags should be signed tags; i.e. pass the _-s_ option (See -link:http://https://git-scm.com/book/id/v2/Git-Tools-Signing-Your-Work[Signing Your Work] +link:https://git-scm.com/book/id/v2/Git-Tools-Signing-Your-Work[Signing Your Work] for how to set up your git environment for signing). [source,bourne] @@ -1208,7 +1208,7 @@ For example, the tests that cover the shell commands for altering tables are con mvn clean test -pl hbase-shell -Dshell.test=/AdminAlterTableTest/ ---- -You may also use a link:http://docs.ruby-doc.com/docs/ProgrammingRuby/html/language.html#UJ[Ruby Regular Expression +You may also use a link:https://docs.ruby-lang.org/en/master/syntax/literals_rdoc.html#label-Regexp+Literals[Ruby Regular Expression literal] (in the `/pattern/` style) to select a set of test cases. You can run all of the HBase admin related tests, including both the normal administration and the security administration, with the command: diff --git a/src/main/asciidoc/_chapters/mapreduce.adoc b/src/main/asciidoc/_chapters/mapreduce.adoc index bba8cc92b941..33f9e7c95429 100644 --- a/src/main/asciidoc/_chapters/mapreduce.adoc +++ b/src/main/asciidoc/_chapters/mapreduce.adoc @@ -30,12 +30,10 @@ Apache MapReduce is a software framework used to analyze large amounts of data. It is provided by link:https://hadoop.apache.org/[Apache Hadoop]. MapReduce itself is out of the scope of this document. 
 A good place to get started with MapReduce is https://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html.
-MapReduce version 2 (MR2)is now part of link:https://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/[YARN].
+MapReduce version 2 (MR2) is now part of link:https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/[YARN].
 This chapter discusses specific configuration steps you need to take to use MapReduce on data within HBase.
-In addition, it discusses other interactions and issues between HBase and MapReduce
-jobs. Finally, it discusses <>, an
-link:http://www.cascading.org/[alternative API] for MapReduce.
+In addition, it discusses other interactions and issues between HBase and MapReduce jobs.
 .`mapred` and `mapreduce`
 [NOTE]
 ====
@@ -70,7 +68,7 @@ job runner letting hbase utility pick out from the full-on classpath what it nee
 MapReduce job configuration (See the source at `TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job)` for how this is done).
-The following example runs the bundled HBase link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] MapReduce job against a table named `usertable`.
+The following example runs the bundled HBase link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] MapReduce job against a table named `usertable`.
 It sets into `HADOOP_CLASSPATH` the jars hbase needs to run in an MapReduce context (including configuration files such as hbase-site.xml). Be sure to use the correct version of the HBase JAR for your system; replace the VERSION string in the below command line w/ the version of your local hbase install. The backticks (``` symbols) cause the shell to execute the sub-commands, setting the output of `hbase classpath` into `HADOOP_CLASSPATH`.
@@ -229,7 +227,7 @@ If you think of the scan as a shovel, a bigger cache setting is analogous to a b
 The list of priorities mentioned above allows you to set a reasonable default, and override it for specific operations.
-See the API documentation for link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] for more details.
+See the API documentation for link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] for more details.
 == Bundled HBase MapReduce Jobs
@@ -259,10 +257,10 @@ $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar rowcou
 == HBase as a MapReduce Job Data Source and Data Sink
-HBase can be used as a data source, link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat], and data sink, link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat] or link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html[MultiTableOutputFormat], for MapReduce jobs.
-Writing MapReduce jobs that read or write HBase, it is advisable to subclass link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html[TableMapper] and/or link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html[TableReducer].
-See the do-nothing pass-through classes link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html[IdentityTableMapper] and link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html[IdentityTableReducer] for basic usage.
-For a more involved example, see link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] or review the `org.apache.hadoop.hbase.mapreduce.TestTableMapReduce` unit test.
+HBase can be used as a data source, link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat], and data sink, link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat] or link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html[MultiTableOutputFormat], for MapReduce jobs.
+When writing MapReduce jobs that read or write HBase, it is advisable to subclass link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html[TableMapper] and/or link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html[TableReducer].
+See the do-nothing pass-through classes link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html[IdentityTableMapper] and link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html[IdentityTableReducer] for basic usage.
+For a more involved example, see link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] or review the `org.apache.hadoop.hbase.mapreduce.TestTableMapReduce` unit test.
 If you run MapReduce jobs that use HBase as source or sink, need to specify source and sink table and column names in your configuration.
@@ -275,7 +273,7 @@ On insert, HBase 'sorts' so there is no point double-sorting (and shuffling data
 If you do not need the Reduce, your map might emit counts of records processed for reporting at the end of the job, or set the number of Reduces to zero and use TableOutputFormat. If running the Reduce step makes sense in your case, you should typically use multiple reducers so that load is spread across the HBase cluster.
-A new HBase partitioner, the link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html[HRegionPartitioner], can run as many reducers the number of existing regions.
+A new HBase partitioner, the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html[HRegionPartitioner], can run as many reducers as there are existing regions.
 The HRegionPartitioner is suitable when your table is large and your upload will not greatly alter the number of existing regions upon completion. Otherwise use the default partitioner.
@@ -286,7 +284,7 @@ For more on how this mechanism works, see <>.
 == RowCounter Example
-The included link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] MapReduce job uses `TableInputFormat` and does a count of all rows in the specified table.
+The included link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] MapReduce job uses `TableInputFormat` and does a count of all rows in the specified table.
 To run it, use the following command:
 [source,bash]
 ----
@@ -306,13 +304,13 @@ If you have classpath errors, see <>.
[[splitter.default]] === The Default HBase MapReduce Splitter -When link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat] is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. +When link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat] is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless of how many column families are selected in the Scan. [[splitter.custom]] === Custom Splitters -For those interested in implementing custom splitters, see the method `getSplits` in link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html[TableInputFormatBase]. +For those interested in implementing custom splitters, see the method `getSplits` in link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html[TableInputFormatBase]. That is where the logic for map-task assignment resides. [[mapreduce.example]] @@ -352,7 +350,7 @@ if (!b) { } ---- -...and the mapper instance would extend link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html[TableMapper]... +...and the mapper instance would extend link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html[TableMapper]... [source,java] ---- @@ -400,7 +398,7 @@ if (!b) { } ---- -An explanation is required of what `TableMapReduceUtil` is doing, especially with the reducer. link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat] is being used as the outputFormat class, and several parameters are being set on the config (e.g., `TableOutputFormat.OUTPUT_TABLE`), as well as setting the reducer output key to `ImmutableBytesWritable` and reducer value to `Writable`. +An explanation is required of what `TableMapReduceUtil` is doing, especially with the reducer. link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat] is being used as the outputFormat class, and several parameters are being set on the config (e.g., `TableOutputFormat.OUTPUT_TABLE`), as well as setting the reducer output key to `ImmutableBytesWritable` and reducer value to `Writable`. These could be set by the programmer on the job and conf, but `TableMapReduceUtil` tries to make things easier. The following is the example mapper, which will create a `Put` and matching the input `Result` and emit it. @@ -640,50 +638,3 @@ This can either be done on a per-Job basis through properties, or on the entire Especially for longer running jobs, speculative execution will create duplicate map-tasks which will double-write your data to HBase; this is probably not what you want. See <> for more information. - -[[cascading]] -== Cascading - -link:http://www.cascading.org/[Cascading] is an alternative API for MapReduce, which -actually uses MapReduce, but allows you to write your MapReduce code in a simplified -way. - -The following example shows a Cascading `Flow` which "sinks" data into an HBase cluster. The same -`hBaseTap` API could be used to "source" data as well. 
- -[source, java] ----- -// read data from the default filesystem -// emits two fields: "offset" and "line" -Tap source = new Hfs( new TextLine(), inputFileLhs ); - -// store data in an HBase cluster -// accepts fields "num", "lower", and "upper" -// will automatically scope incoming fields to their proper familyname, "left" or "right" -Fields keyFields = new Fields( "num" ); -String[] familyNames = {"left", "right"}; -Fields[] valueFields = new Fields[] {new Fields( "lower" ), new Fields( "upper" ) }; -Tap hBaseTap = new HBaseTap( "multitable", new HBaseScheme( keyFields, familyNames, valueFields ), SinkMode.REPLACE ); - -// a simple pipe assembly to parse the input into fields -// a real app would likely chain multiple Pipes together for more complex processing -Pipe parsePipe = new Each( "insert", new Fields( "line" ), new RegexSplitter( new Fields( "num", "lower", "upper" ), " " ) ); - -// "plan" a cluster executable Flow -// this connects the source Tap and hBaseTap (the sink Tap) to the parsePipe -Flow parseFlow = new FlowConnector( properties ).connect( source, hBaseTap, parsePipe ); - -// start the flow, and block until complete -parseFlow.complete(); - -// open an iterator on the HBase table we stuffed data into -TupleEntryIterator iterator = parseFlow.openSink(); - -while(iterator.hasNext()) - { - // print out each tuple from HBase - System.out.println( "iterator.next() = " + iterator.next() ); - } - -iterator.close(); ----- diff --git a/src/main/asciidoc/_chapters/ops_mgt.adoc b/src/main/asciidoc/_chapters/ops_mgt.adoc index 609b22980f29..f6cce798dec7 100644 --- a/src/main/asciidoc/_chapters/ops_mgt.adoc +++ b/src/main/asciidoc/_chapters/ops_mgt.adoc @@ -653,7 +653,7 @@ and doPuts to false would give same effect as setting dryrun to true. [NOTE] ==== "doDeletes/doPuts" were only added by -link:https://jira.apache.org/jira/browse/HBASE-20305[HBASE-20305], so these may not be available on +link:https://issues.apache.org/jira/browse/HBASE-20305[HBASE-20305], so these may not be available on all released versions. For major 1.x versions, minimum minor release including it is *1.4.10*. For major 2.x versions, minimum minor release including it is *2.1.5*. @@ -687,7 +687,7 @@ which does not give any meaningful result. [NOTE] ==== Often, remote clusters may be deployed on different Kerberos Realms. -link:https://jira.apache.org/jira/browse/HBASE-20586[HBASE-20586] added SyncTable support for +link:https://issues.apache.org/jira/browse/HBASE-20586[HBASE-20586] added SyncTable support for cross realm authentication, allowing a SyncTable process running on target cluster to connect to source cluster and read both HashTable output files and the given HBase table when performing the required comparisons. @@ -969,7 +969,7 @@ For performance also consider the following options: [[rowcounter]] === RowCounter -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] is a mapreduce job to count all the rows of a table. +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] is a mapreduce job to count all the rows of a table. This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency. It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit. 
It is possible to limit the time range of data to be scanned by using the `--starttime=[starttime]` and `--endtime=[endtime]` flags. @@ -986,7 +986,7 @@ For performance consider to use `-Dhbase.client.scanner.caching=100` and `-Dmapr [[cellcounter]] === CellCounter -HBase ships another diagnostic mapreduce job called link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/CellCounter.html[CellCounter]. +HBase ships another diagnostic mapreduce job called link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/CellCounter.html[CellCounter]. Like RowCounter, it gathers more fine-grained statistics about your table. The statistics gathered by CellCounter are more fine-grained and include: @@ -1330,7 +1330,7 @@ $ bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool [[ops.regionmgt.majorcompact]] === Major Compaction -Major compactions can be requested via the HBase shell or link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact-org.apache.hadoop.hbase.TableName-[Admin.majorCompact]. +Major compactions can be requested via the HBase shell or link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact(org.apache.hadoop.hbase.TableName)[Admin.majorCompact]. Note: major compactions do NOT do region merges. See <> for more information about compactions. @@ -2415,7 +2415,7 @@ A single WAL edit goes through several steps in order to be replicated to a slav . The edit is tagged with the master's UUID and added to a buffer. When the buffer is filled, or the reader reaches the end of the file, the buffer is sent to a random region server on the slave cluster. . The region server reads the edits sequentially and separates them into buffers, one buffer per table. - After all edits are read, each buffer is flushed using link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table], HBase's normal client. + After all edits are read, each buffer is flushed using link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html[Table], HBase's normal client. The master's UUID and the UUIDs of slaves which have already consumed the data are preserved in the edits they are applied, in order to prevent replication loops. . In the master, the offset for the WAL that is currently being replicated is registered in ZooKeeper. @@ -3269,7 +3269,7 @@ SNAPSHOT SIZE There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has pros and cons. -For additional information, see link:http://blog.sematext.com/2011/03/11/hbase-backup-options/[HBase Backup +For additional information, see link:https://web.archive.org/web/20160110232448/http://blog.sematext.com/2011/03/11/hbase-backup-options/[HBase Backup Options] over on the Sematext Blog. [[ops.backup.fullshutdown]] @@ -3686,7 +3686,7 @@ See <>, <> and elsewhere (TODO: Generally less regions makes for a smoother running cluster (you can always manually split the big regions later (if necessary) to spread the data, or request load, over the cluster); 20-200 regions per RS is a reasonable range. The number of regions cannot be configured directly (unless you go for fully <>); adjust the region size to achieve the target region size given table size. 
-When configuring regions for multiple tables, note that most region settings can be set on a per-table basis via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor], as well as shell commands. +When configuring regions for multiple tables, note that most region settings can be set on a per-table basis via link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/TableDescriptorBuilder.html[TableDescriptorBuilder], as well as shell commands. These settings will override the ones in `hbase-site.xml`. That is useful if your tables have different workloads/use cases. @@ -4107,7 +4107,7 @@ when the schema includes timestamp in the rowkey, as it will automatically merge away regions whose contents have expired. (The bulk of the below detail was copied wholesale from the blog by Romil Choksi at -link:https://community.hortonworks.com/articles/54987/hbase-region-normalizer.html[HBase Region Normalizer]). +link:https://community.cloudera.com/t5/Community-Articles/HBase-Region-Normalizer/ta-p/247266[HBase Region Normalizer]). The Region Normalizer is feature available since HBase-1.2. It runs a set of pre-calculated merge/split actions to resize regions that are either too diff --git a/src/main/asciidoc/_chapters/performance.adoc b/src/main/asciidoc/_chapters/performance.adoc index d7f18f59c1b3..170c047c139d 100644 --- a/src/main/asciidoc/_chapters/performance.adoc +++ b/src/main/asciidoc/_chapters/performance.adoc @@ -318,7 +318,7 @@ See also <> for compression caveats. [[schema.regionsize]] === Table RegionSize -The regionsize can be set on a per-table basis via `setFileSize` on link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor] in the event where certain tables require different regionsizes than the configured default regionsize. +The regionsize can be set on a per-table basis via `setMaxFileSize` on link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/TableDescriptorBuilder.html[TableDescriptorBuilder] in the event where certain tables require different regionsizes than the configured default regionsize. See <> for more information. @@ -370,7 +370,7 @@ Bloom filters are enabled on a Column Family. You can do this by using the setBloomFilterType method of HColumnDescriptor or using the HBase API. Valid values are `NONE`, `ROW` (default), or `ROWCOL`. See <> for more information on `ROW` versus `ROWCOL`. -See also the API documentation for link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor]. +See also the API documentation for link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/ColumnFamilyDescriptorBuilder.html[ColumnFamilyDescriptorBuilder]. The following example creates a table and enables a ROWCOL Bloom filter on the `colfam1` column family. @@ -429,7 +429,7 @@ The blocksize can be configured for each ColumnFamily in a table, and defaults t Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting indexes should be roughly halved). -See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor] and <>for more information. +See link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/ColumnFamilyDescriptorBuilder.html[ColumnFamilyDescriptorBuilder] and <>for more information. 
[[cf.in.memory]] === In-Memory ColumnFamilies @@ -438,7 +438,7 @@ ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the <>, but it is not a guarantee that the entire table will be in memory. -See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor] for more information. +See link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/ColumnFamilyDescriptorBuilder.html[ColumnFamilyDescriptorBuilder] for more information. [[perf.compression]] === Compression @@ -547,7 +547,7 @@ If deferred log flush is used, WAL edits are kept in memory until the flush peri The benefit is aggregated and asynchronous `WAL`- writes, but the potential downside is that if the RegionServer goes down the yet-to-be-flushed edits are lost. This is safer, however, than not using WAL at all with Puts. -Deferred log flush can be configured on tables via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor]. +Deferred log flush can be configured on tables via link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/TableDescriptorBuilder.html[TableDescriptorBuilder]. The default value of `hbase.regionserver.optionallogflushinterval` is 1000ms. [[perf.hbase.client.putwal]] @@ -572,7 +572,7 @@ There is a utility `HTableUtil` currently on MASTER that does this, but you can [[perf.hbase.write.mr.reducer]] === MapReduce: Skip The Reducer -When writing a lot of data to an HBase table from a MR job (e.g., with link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]), and specifically where Puts are being emitted from the Mapper, skip the Reducer step. +When writing a lot of data to an HBase table from a MR job (e.g., with link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]), and specifically where Puts are being emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase. @@ -585,7 +585,7 @@ If all your data is being written to one region at a time, then re-read the sect Also, if you are pre-splitting regions and all your data is _still_ winding up in a single region even though your keys aren't monotonically increasing, confirm that your keyspace actually works with the split strategy. There are a variety of reasons that regions may appear "well split" but won't work with your data. -As the HBase client communicates directly with the RegionServers, this can be obtained via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/RegionLocator.html#getRegionLocation-byte:A-[RegionLocator.getRegionLocation]. +As the HBase client communicates directly with the RegionServers, this can be obtained via link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/RegionLocator.html#getRegionLocation(byte%5B%5D)[RegionLocator.getRegionLocation]. See <>, as well as <> @@ -597,7 +597,7 @@ The mailing list can help if you are having performance issues. 
[[perf.hbase.client.caching]] === Scan Caching -If HBase is used as an input source for a MapReduce job, for example, make sure that the input link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instance to the MapReduce job has `setCaching` set to something greater than the default (which is 1). Using the default value means that the map-task will make call back to the region-server for every record processed. +If HBase is used as an input source for a MapReduce job, for example, make sure that the input link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instance to the MapReduce job has `setCaching` set to something greater than the default (which is 1). Using the default value means that the map-task will make call back to the region-server for every record processed. Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit to have the cache value be large because it costs more in memory for both client and RegionServer, so bigger isn't always better. @@ -646,7 +646,7 @@ For MapReduce jobs that use HBase tables as a source, if there a pattern where t === Close ResultScanners This isn't so much about improving performance but rather _avoiding_ performance problems. -If you forget to close link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ResultScanner.html[ResultScanners] you can cause problems on the RegionServers. +If you forget to close link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/ResultScanner.html[ResultScanners] you can cause problems on the RegionServers. Always have ResultScanner processing enclosed in try/catch blocks. [source,java] @@ -666,7 +666,7 @@ table.close(); [[perf.hbase.client.blockcache]] === Block Cache -link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instances can be set to use the block cache in the RegionServer via the `setCacheBlocks` method. +link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instances can be set to use the block cache in the RegionServer via the `setCacheBlocks` method. For input Scans to MapReduce jobs, this should be `false`. For frequently accessed rows, it is advisable to use the block cache. @@ -676,8 +676,8 @@ See <> [[perf.hbase.client.rowkeyonly]] === Optimal Loading of Row Keys -When performing a table link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[scan] where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a `MUST_PASS_ALL` operator to the scanner using `setFilter`. -The filter list should include both a link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html[FirstKeyOnlyFilter] and a link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html[KeyOnlyFilter]. +When performing a table link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Scan.html[scan] where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a `MUST_PASS_ALL` operator to the scanner using `setFilter`. +The filter list should include both a link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html[FirstKeyOnlyFilter] and a link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html[KeyOnlyFilter]. 
Using this filter combination will result in a worst case scenario of a RegionServer reading a single value from disk and minimal network traffic to the client for a single row. [[perf.hbase.read.dist]] @@ -815,7 +815,7 @@ In this case, special care must be taken to regularly perform major compactions As is documented in <>, marking rows as deleted creates additional StoreFiles which then need to be processed on reads. Tombstones only get cleaned up with major compactions. -See also <> and link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact-org.apache.hadoop.hbase.TableName-[Admin.majorCompact]. +See also <> and link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact(org.apache.hadoop.hbase.TableName)[Admin.majorCompact]. [[perf.deleting.rpc]] === Delete RPC Behavior @@ -824,7 +824,7 @@ Be aware that `Table.delete(Delete)` doesn't use the writeBuffer. It will execute an RegionServer RPC with each invocation. For a large number of deletes, consider `Table.delete(List)`. -See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete-org.apache.hadoop.hbase.client.Delete-[hbase.client.Delete] +See link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Table.html#delete(org.apache.hadoop.hbase.client.Delete)[hbase.client.Delete] [[perf.hdfs]] == HDFS diff --git a/src/main/asciidoc/_chapters/rpc.adoc b/src/main/asciidoc/_chapters/rpc.adoc index 1926c6fa116d..4df337e7089c 100644 --- a/src/main/asciidoc/_chapters/rpc.adoc +++ b/src/main/asciidoc/_chapters/rpc.adoc @@ -70,7 +70,7 @@ Optionally, Cells(KeyValues) can be passed outside of protobufs in follow-behind (because link:https://docs.google.com/document/d/1WEtrq-JTIUhlnlnvA0oYRLp0F8MKpEBeBSCFcQiacdw/edit#[we can't protobuf megabytes of KeyValues] or Cells). These CellBlocks are encoded and optionally compressed. For more detail on the protobufs involved, see the -link:https://github.com/apache/hbase/blob/master/hbase-protocol/src/main/protobuf/RPC.proto[RPC.proto] file in master. +link:https://github.com/apache/hbase/blob/master/hbase-protocol-shaded/src/main/protobuf/rpc/RPC.proto[RPC.proto] file in master. ==== Connection Setup diff --git a/src/main/asciidoc/_chapters/schema_design.adoc b/src/main/asciidoc/_chapters/schema_design.adoc index e2ab0586e256..f6310bd20f43 100644 --- a/src/main/asciidoc/_chapters/schema_design.adoc +++ b/src/main/asciidoc/_chapters/schema_design.adoc @@ -47,7 +47,7 @@ See also Robert Yokota's link:https://blogs.apache.org/hbase/entry/hbase-applica [[schema.creation]] == Schema Creation -HBase schemas can be created or updated using the <> or by using link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html[Admin] in the Java API. +HBase schemas can be created or updated using the <> or by using link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Admin.html[Admin] in the Java API. Tables must be disabled when making ColumnFamily modifications, for example: @@ -339,7 +339,7 @@ This is the main trade-off. ==== link:https://issues.apache.org/jira/browse/HBASE-4811[HBASE-4811] implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. -See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed-boolean-[Scan.setReversed()] for more information. 
+See link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed(boolean)[Scan.setReversed()] for more information. ==== A common problem in database processing is quickly finding the most recent version of a value. @@ -444,14 +444,14 @@ It is not recommended setting the number of max versions to an exceedingly high [[schema.minversions]] === Minimum Number of Versions -Like maximum number of row versions, the minimum number of row versions to keep is configured per column family via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor]. +Like maximum number of row versions, the minimum number of row versions to keep is configured per column family via link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/ColumnFamilyDescriptorBuilder.html[ColumnFamilyDescriptorBuilder]. The default for min versions is 0, which means the feature is disabled. The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the number of row versions parameter to allow configurations such as "keep the last T minutes worth of data, at most N versions, _but keep at least M versions around_" (where M is the value for minimum number of row versions, M