
HBase handles basically two kinds of files: one is used for the write-ahead log and the other for the actual data storage.
The files are primarily handled by the HRegionServers, but in certain scenarios even the HMaster has to perform low-level file operations.
You may also notice that the actual files are in fact divided up into smaller blocks when stored within the Hadoop Distributed Filesystem (HDFS).
This is also one of the areas where you can configure the system to handle larger or smaller data better.
The general flow is that a new client first contacts the ZooKeeper quorum (a separate cluster of ZooKeeper nodes) to find a particular row key.
It does so by retrieving from ZooKeeper the server name (i.e. host name) that hosts the -ROOT- region.
With that information it can query that server to get the server that hosts the .META. table.
Both of these details are cached and only looked up once.
Lastly it can query the .META. server and retrieve the server that has the row the client is looking for.
- Once it has been told where the row resides, i.e. in what region, it caches this information as well and contacts the HRegionServer hosting that region directly.
- So over time the client builds up a pretty complete picture of where to get rows from without needing to query the .META. server again.
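- To make that flow concrete, here is a minimal client sketch in Java, assuming the classic HTable API that matches the classes mentioned in this post; the table name “docs” and family “cache” are taken from the descriptor example further below, while the row and qualifier are placeholders. All of the -ROOT-/.META. resolution and caching described above happens inside the client library:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientLookupExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath; the ZooKeeper quorum comes from there.
    Configuration conf = HBaseConfiguration.create();

    // Opening the table triggers the region lookups described above;
    // the resolved region locations are cached inside the client.
    HTable table = new HTable(conf, "docs");

    // Gets for rows in already-known regions go straight to the hosting
    // HRegionServer without consulting the catalog tables again.
    Result result = table.get(new Get(Bytes.toBytes("row-1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("cache"), Bytes.toBytes("content"))));

    table.close();
  }
}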
- Note: The HMaster is responsible for assigning the regions to each HRegionServer when you start HBase. This also includes the “special” -ROOT- and .META. tables.
- Next, when the HRegionServer opens the region, it creates a corresponding HRegion object. When the HRegion is “opened” it sets up a Store instance for each HColumnFamily of every table, as defined by the user beforehand.
- Each of the Store instances can in turn have one or more StoreFile instances, which are lightweight wrappers around the actual storage file called HFile.
- An HRegion also has a MemStore and an HLog instance.
- So how is data written to the actual storage?
- The client issues an HTable.put(Put) request to the HRegionServer, which hands the details to the matching HRegion instance.
- The first step is to decide whether the data should first be written to the “Write-Ahead Log” (WAL), represented by the HLog class.
- The decision is based on the flag set by the client using the Put.writeToWAL(boolean) method (see the sketch after this list).
- The WAL is a standard Hadoop SequenceFile (although it is currently being discussed whether that should be changed to a more HBase-suitable file format) and it stores HLogKeys. These keys contain a sequential number as well as the actual data and are used to replay not-yet-persisted data after a server crash.
- Once the data is written (or not) to the WAL it is placed in the MemStore.
- At the same time it is checked whether the MemStore is full, and in that case a flush to disk is requested.
- When that request is served by a separate thread in the HRegionServer, it writes the data to an HFile located in HDFS.
- It also saves the last written sequence number so the system knows what was persisted so far.
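- As referenced above, here is a minimal sketch of a client-side Put that toggles the WAL flag, assuming the 0.9x-era Java API (where the setter is Put.setWriteToWAL; newer client versions express the same choice through setDurability). The “docs”/“cache” names reuse the descriptor example below; row, qualifier and value are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalFlagExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "docs");

    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("cache"), Bytes.toBytes("content"), Bytes.toBytes("hello"));

    // Skipping the WAL trades durability for write throughput: if the region
    // server crashes before the MemStore is flushed, this edit is lost.
    put.setWriteToWAL(false);

    table.put(put);
    table.close();
  }
}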
- HFile
- So we are now at a very low level of HBase’s architecture. HFiles are the actual storage files, specifically created to serve one purpose: store HBase’s data fast and efficiently. They are apparently based on Hadoop’s TFile and mimic the SSTable format used in Google’s Bigtable architecture. The previous use of Hadoop’s MapFiles in HBase proved not to perform well enough. So what do the files look like?
- The files have a variable length; the only fixed blocks are the FileInfo and Trailer blocks. As the picture shows, it is the Trailer that has the pointers to the other blocks, and it is written at the end of persisting the data to the file, finalizing the now immutable data store. The Index blocks record the offsets of the Data and Meta blocks. Both the Data and the Meta blocks are actually optional, but you would most likely always find data in a data store file.
- How is the block size configured? It is driven solely by the HColumnDescriptor, which in turn is specified at table creation time by the user or defaults to reasonable standard values. Here is an example as shown in the master’s web-based interface:
- {NAME => ‘docs’, FAMILIES => [{NAME => ‘cache’, COMPRESSION => ‘NONE’, VERSIONS => ‘3’, TTL => ‘2147483647’, BLOCKSIZE => ‘65536’, IN_MEMORY => ‘false’, BLOCKCACHE => ‘false’}, {NAME => ‘contents’, COMPRESSION => ‘NONE’, VERSIONS => ‘3’, TTL => ‘2147483647’, BLOCKSIZE => ‘65536’, IN_MEMORY => ‘false’, BLOCKCACHE => ‘false’}, …
- The default is “64KB” (or 65,536 bytes). Here is what the HFile JavaDoc explains:
- “Minimum block size. We recommend a setting of minimum block size between 8KB to 1MB for general usage. Larger block size is preferred if files are primarily for sequential access. However, it would lead to inefficient random access (because there are more data to decompress). Smaller blocks are good for random access, but require more memory to hold the block index, and may be slower to create (because we must flush the compressor stream at the conclusion of each data block, which leads to an FS I/O flush). Further, due to the internal caching in Compression codec, the smallest possible block size would be around 20KB-30KB.”
- So each block with its prefixed “magic” header contains either plain or compressed data.
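- Assuming the same 0.9x-era admin API, here is a small sketch of setting a non-default BLOCKSIZE at table-creation time via HColumnDescriptor. The 32KB value and the “docs”/“cache” names are only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Smaller blocks favour random reads, larger blocks favour sequential scans.
    HColumnDescriptor family = new HColumnDescriptor("cache");
    family.setBlocksize(32 * 1024);           // 32KB instead of the 64KB default

    HTableDescriptor table = new HTableDescriptor("docs");
    table.addFamily(family);
    admin.createTable(table);
    admin.close();
  }
}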
- KeyValues
- In essence each KeyValue in the HFile is simply a low-level byte array that allows for “zero-copy” access to the data, even with lazy or custom parsing if necessary. How are the instances arranged?
- The structure starts with two fixed-length numbers indicating the size of the key and the value part. With that info you can offset into the array to, for example, get direct access to the value, ignoring the key – if you know what you are doing. Otherwise you can get the required information from the key part. Once parsed into a KeyValue object you have getters to access the details.
- Note: One thing to watch out for is the difference between KeyValue.getKey() and KeyValue.getRow(). I think for me the confusion arose from referring to “row keys” as the primary key to get a row out of HBase. That would be the latter of the two methods, i.e. KeyValue.getRow(). The former simply returns the complete byte array part representing the raw “key” as colored and labeled in the diagram.
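- A tiny sketch that illustrates the difference; the constructor and getters shown exist on the classic KeyValue class, and the contents are placeholders:

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueExample {
  public static void main(String[] args) {
    KeyValue kv = new KeyValue(
        Bytes.toBytes("row-1"),        // row key
        Bytes.toBytes("cache"),        // column family
        Bytes.toBytes("content"),      // qualifier
        Bytes.toBytes("hello"));       // value

    // getRow() returns only the user-visible row key ...
    System.out.println("row: " + Bytes.toString(kv.getRow()));
    // ... while getKey() returns the complete internal key
    // (row + family + qualifier + timestamp + type) as raw bytes.
    System.out.println("key length: " + kv.getKey().length);
    System.out.println("value: " + Bytes.toString(kv.getValue()));
  }
}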
- What is the MemStore?
Ans: The MemStore is a write buffer where HBase accumulates data in memory before a permanent write.
Its contents are flushed to disk to form an HFile when the MemStore fills up.
It doesn’t write to an existing HFile but instead forms a new file on every flush. There is one MemStore per column family per region. (The flush threshold is defined by the system-wide property
hbase.hregion.memstore.flush.size in hbase-site.xml.)
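The property normally lives in hbase-site.xml, but for illustration here is a sketch of reading (or overriding) it through the client Configuration. The 64MB fallback used here is only an assumed default, since the shipped default has changed across versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushSizeExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Fallback of 64MB is assumed here; check your hbase-default.xml for the real default.
    long flushSize = conf.getLong("hbase.hregion.memstore.flush.size", 64L * 1024 * 1024);
    System.out.println("MemStore flush threshold: " + flushSize + " bytes");
  }
}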
- When you persist data in an HBase row, to which two places does HBase write the data to ensure durability?
Ans: HBase receives the command and persists the change, or throws an exception if the write fails.
When a write is made, by default, it goes into two places:
a. the write-ahead log (WAL), also referred to as the HLog
b. the MemStore
HBase records the write in both places by default in order to maintain data durability. Only after the change is written to and confirmed in both places is the write considered complete.
- What is an HFile?
Ans: The HFile is the underlying storage format for HBase.
HFiles belong to a column family and a column family can have multiple HFiles.
But a single HFile can’t have data for multiple column families.
- How does HBase handle write failures?
Ans: Failures are common in large distributed systems, and HBase is no exception.
Imagine that the server hosting a MemStore that has not yet been flushed crashes. You’ll lose the data that was in memory but not yet persisted. HBase safeguards against that by writing to the WAL before the write completes. Every server that’s part of the
HBase cluster keeps a WAL to record changes as they happen. The WAL is a file on the underlying file system. A write isn’t considered successful until the new WAL entry is successfully written. This guarantee makes HBase as durable as the file system backing it. Most of the time, HBase is backed by the Hadoop Distributed Filesystem (HDFS). If HBase goes down, the data that was not yet flushed from the MemStore to the HFile can be recovered by replaying the WAL.
Why use HBase?
- High-capacity storage system
- Distributed design to cater to large tables
- Column-oriented store
- Horizontally scalable
- High performance & availability
- The base goal of HBase is billions of rows, millions of columns, and thousands of versions
- Unlike HDFS (Hadoop Distributed File System), it supports random, real-time CRUD operations
- What are the key components of HBase?
- ZooKeeper: does the coordination work between the client and the HBase Master
- HBase Master: monitors the RegionServers
- RegionServer: monitors the regions
- Region: contains the in-memory data store (MemStore) and HFiles
- Catalog tables: consist of -ROOT- and .META.
- How many operational commands are there in HBase?
- There are five types of operational commands (a minimal Java sketch follows the list):
- Get
- Put
- Delete
- Scan
- Increment
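- As noted above, here is a minimal Java sketch touching all five operations, assuming the classic HTable API and the hypothetical “docs” table with column family “cache”:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CrudExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "docs");
    byte[] cf = Bytes.toBytes("cache");

    // Put
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(cf, Bytes.toBytes("content"), Bytes.toBytes("hello"));
    table.put(put);

    // Get
    Result result = table.get(new Get(Bytes.toBytes("row-1")));
    System.out.println(Bytes.toString(result.getValue(cf, Bytes.toBytes("content"))));

    // Scan
    ResultScanner scanner = table.getScanner(new Scan());
    for (Result row : scanner) {
      System.out.println(Bytes.toString(row.getRow()));
    }
    scanner.close();

    // Increment (the target cell must hold a binary-encoded long)
    table.incrementColumnValue(Bytes.toBytes("row-1"), cf, Bytes.toBytes("hits"), 1L);

    // Delete
    table.delete(new Delete(Bytes.toBytes("row-1")));

    table.close();
  }
}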
- What are the WAL and HLog in HBase?
- The WAL (Write-Ahead Log) is similar to the MySQL BIN log; it records all the changes that occur to the data. It is a standard Hadoop SequenceFile and it stores HLogKeys. These keys consist of a sequential number as well as the actual data and are used to replay not-yet-persisted data after a server crash. So, in case of server failure, the WAL works as a lifeline and recovers the lost data.
- When should you use HBase?
- Data size is huge: when you have millions or billions of records to operate on
- Complete redesign: when you are moving from an RDBMS to HBase, consider it a complete redesign rather than merely a port
- SQL-less commands: make sure you can live without RDBMS features such as transactions, inner joins, and typed columns
- Infrastructure investment: you need to have a large enough cluster for HBase to be really useful
- In HBase, what are column families?
- Column families comprise the basic unit of physical storage in HBase, to which features like compression are applied.
- Explain what the row key is.
- The row key is defined by the application. As the combined key is prefixed by the row key, it enables the application to define the desired sort order. It also allows logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server.
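- A small illustrative sketch of building such a compound row key, borrowing the projectcode/pagename/date layout and the f:c1 column used in the Hive-on-HBase example later in this post; the helper name is hypothetical:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyExample {
  // Builds a compound row key such as "en/Main_Page/20080101-100000" so that all
  // cells for one project and page sort together and are co-located.
  static Put pageViewPut(String projectCode, String pageName, String timestamp, long views) {
    byte[] rowKey = Bytes.toBytes(projectCode + "/" + pageName + "/" + timestamp);
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("f"), Bytes.toBytes("c1"), Bytes.toBytes(Long.toString(views)));
    return put;
  }

  public static void main(String[] args) {
    Put put = pageViewPut("en", "Main_Page", "20080101-100000", 42);
    System.out.println(Bytes.toString(put.getRow()));
  }
}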
- Explain deletion in HBase. What are the three types of tombstone markers in HBase?
- When you delete a cell in HBase, the data is not actually deleted; instead a tombstone marker is set, making the deleted cells invisible. Deleted data is actually removed during compactions.
- There are three types of tombstone markers:
- Version delete marker: marks a single version of a column for deletion
- Column delete marker: marks all versions of a column for deletion
- Family delete marker: marks all columns of a column family for deletion
- How does HBase actually delete a row?
- In HBase, whatever you write is first stored in memory and then flushed to disk, and these files on disk are immutable barring compaction. During the deletion process, major compactions process delete markers while minor compactions don’t. A normal delete results in a delete tombstone marker; the deleted data it represents is removed during compaction.
- Also, if you delete data and then add more data with an earlier timestamp than the tombstone’s timestamp, subsequent Gets may be masked by the delete/tombstone marker, and hence you will not receive the inserted value until after the major compaction.
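- A sketch of that masking behaviour using explicit timestamps with the classic client API; the table, family, and timestamp values are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "docs");
    byte[] row = Bytes.toBytes("row-1");
    byte[] cf = Bytes.toBytes("cache");
    byte[] col = Bytes.toBytes("content");

    // Delete all versions of the column up to timestamp 2000 (writes a tombstone).
    Delete delete = new Delete(row);
    delete.deleteColumns(cf, col, 2000L);
    table.delete(delete);

    // Write a value with an *earlier* timestamp than the tombstone.
    Put put = new Put(row);
    put.add(cf, col, 1000L, Bytes.toBytes("old value"));
    table.put(put);

    // The tombstone masks the earlier-timestamped value, so this Get returns
    // nothing until a major compaction removes the delete marker.
    Result result = table.get(new Get(row));
    System.out.println("value found: " + !result.isEmpty());

    table.close();
  }
}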
- What happens if you alter the block size of a column family on an already occupied database?
- When you alter the block size of a column family, new data is written with the new block size while the old data remains within the old block size. During compaction, old data will take on the new block size. New files, as they are flushed, use the new block size, whereas existing data continues to be read correctly. All data is transformed to the new block size after the next major compaction.
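- A sketch of such an alteration with the 0.9x-era HBaseAdmin API, followed by the major compaction that rewrites old files with the new block size. Note that modifyColumn replaces the whole family descriptor, so attributes you do not set explicitly fall back to defaults; the “docs”/“cache” names and the 32KB value are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class AlterBlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // New descriptor for the existing 'cache' family with a 32KB block size.
    HColumnDescriptor family = new HColumnDescriptor("cache");
    family.setBlocksize(32 * 1024);

    admin.disableTable("docs");
    admin.modifyColumn("docs", family);   // existing HFiles keep the old block size
    admin.enableTable("docs");

    // Only after a major compaction is all data rewritten with the new block size.
    admin.majorCompact("docs");
    admin.close();
  }
}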
Mention the difference between HBase and a relational database.
| HBase | Relational Database |
| It is schema-less | It is a schema-based database |
| It is a column-oriented data store | It is a row-oriented data store |
| It is used to store de-normalized data | It is used to store normalized data |
| It contains sparsely populated tables | It contains thin tables |
| Automated partitioning is done in HBase | There is no such provision or built-in support for partitioning |
Hive On HBASE
Grab some data and register it in Hive
$ mkdir pagecounts ; cd pagecounts
$ for x in {0..9} ; do wget "http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-01/pagecounts-20080101-100000.gz"; done
$ hadoop fs -copyFromLocal $(pwd) ./
==============================================
Data will look like this
$ zcat pagecounts-20080101-100000.gz
language code, page name, number of page views, size of the page in bytes
=================================================
Use a DDL script that looks like this.
$ cat 00_pagecounts.ddl
-- define an external table over raw pagecounts data
-- run this command in Hive
CREATE TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/test/Amar/pagecounts';
==================================================
select count(*) from pagecounts;
3343693
zcat * | wc -l
21303940
=====================================================
Transform the schema for HBase
The next step is to transform the raw data into a schema that makes sense for HBase. In our case, we’ll create a schema that allows us to calculate aggregate summaries of pages according to their titles. To do this, we want all the data for a single page grouped together. We’ll manage that by creating a Hive view that represents our target HBase schema. Here’s the DDL.
CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes) AS
SELECT concat_ws('/',
projectcode,
concat_ws('/',
pagename,
regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})\\..*$', 1))),
pageviews, bytes
FROM pagecounts;
The SELECT statement uses Hive to build a compound rowkey for HBase. It concatenates the project code, page name, and date, joined by the '/' character. A handy trick: it uses a simple regex to extract the date from the source file names. Run it now.
================================================
Make sure it works by querying Hive for a subset of the data.
$ hive -e "SELECT * FROM pgc WHERE rowkey LIKE 'en/q%' LIMIT 10;"
===================================================
Register the HBase table
Now that we have a dataset in Hive, it’s time to introduce HBase. The first step is to register our HBase table in Hive so that we can interact with it using Hive queries. That means another DDL statement. Here’s what it looks like.
CREATE TABLE IF NOT EXISTS pagecounts_hbase (rowkey STRING, pageviews STRING, bytes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'pagecounts');
This statement will tell Hive to go create an HBase table named pagecounts with the single column family f. It registers that HBase table in the Hive metastore by the name pagecounts_hbase with 3 columns: rowkey, pageviews, and bytes. The SerDe property hbase.columns.mapping makes the association from Hive column to HBase column. It says the Hive column rowkey is mapped to the HBase table’s rowkey, the Hive column pageviews to the HBase column f:c1, and bytes to the HBase column f:c2. To keep the example simple, we have Hive treat all these columns as the STRING type.
In order to use the HBase library, we need to make the HBase jars and configuration available to the local Hive process. Do that by specifying a value for the HADOOP_CLASSPATH environment variable before executing the statement.
===========================================================
Populate the HBase table
Now it’s time to write data to HBase. This is done using a regular Hive INSERT statement, sourcing data from the view with SELECT. There’s one more bit of administration we need to take care of though. This INSERT statement will run a mapreduce job that writes data to HBase. That means we need to tell Hive to ship the HBase jars and dependencies with the job.
FROM pgc INSERT INTO TABLE pagecounts_hbase SELECT pgc.* WHERE rowkey LIKE 'en/q%' LIMIT 10;
==========================================
Query data from HBase-land
40 seconds later, you now have data in HBase. Let’s have a look using the HBase shell.
scan 'pagecounts'
=================================
Verify the data from Hive
The HBase table remains available to you in the Hive world; Hive’s HBaseStorageHandler works both ways, after all.
Note that this command expects HADOOP_CLASSPATH to still be set, and HIVE_AUX_JARS_PATH as well if your query is complex.
SELECT * from pagecounts_hbase;
===============================================================
If I update the HBase table pagecounts:
put 'pagecounts','en/q:Special:Search/Jazz/20080101-100000','f:c1','2'
get 'pagecounts','en/q:Special:Search/Jazz/20080101-100000'
====================================
HBase shell commands are mainly categorized into 6 parts
====================================
1) General HBase shell commands
| status | Show cluster status. Can be ‘summary’, ‘simple’, or ‘detailed’. The default is ‘summary’. Usage:
hbase> status
hbase> status ‘simple’
hbase> status ‘summary’
hbase> status ‘detailed’ |
| version | Output this HBase version. Usage:
hbase> version |
| whoami | Show the current hbase user. Usage:
hbase> whoami |
2) Tables Management commands
| alter | Alter column family schema; pass table name and a dictionary specifying new column family schema. Dictionaries are described on the main help command output. Dictionary must include name of column family to alter.For example, to change or add the ‘f1’ column family in table ‘t1’ from current value to keep a maximum of 5 cell VERSIONS, do:hbase> alter ‘t1’, NAME => ‘f1’, VERSIONS => 5You can operate on several column families:hbase> alter ‘t1’, ‘f1’, {NAME => ‘f2’, IN_MEMORY => true}, {NAME => ‘f3’, VERSIONS => 5} To delete the ‘f1’ column family in table ‘t1’, use one of:hbase> alter ‘t1’, NAME => ‘f1’, METHOD => ‘delete’ You can also change table-scope attributes like MAX_FILESIZE, READONLY, hbase> alter ‘t1’, MAX_FILESIZE => ‘134217728’ You can add a table coprocessor by setting a table coprocessor attribute: hbase> alter ‘t1’, Since you can have multiple coprocessors configured for a table, a The coprocessor attribute must match the pattern below in order for [coprocessor jar file location] | class name | [priority] | [arguments] You can also set configuration settings specific to this table or column family: hbase> alter ‘t1’, CONFIGURATION => {‘hbase.hregion.scan.loadColumnFamiliesOnDemand’ => ‘true’} You can also remove a table-scope attribute: hbase> alter ‘t1’, METHOD => ‘table_att_unset’, NAME => ‘MAX_FILESIZE’ hbase> alter ‘t1’, METHOD => ‘table_att_unset’, NAME => ‘coprocessor$1’ There could be more than one alteration in one command: hbase> alter ‘t1’, { NAME => ‘f1’, VERSIONS => 3 }, |
| create | Create table; pass table name, a dictionary of specifications per column family, and optionally a dictionary of table configuration.hbase> create ‘t1’, {NAME => ‘f1’, VERSIONS => 5} hbase> create ‘t1’, {NAME => ‘f1’}, {NAME => ‘f2’}, {NAME => ‘f3’} hbase> # The above in shorthand would be the following: hbase> create ‘t1’, ‘f1’, ‘f2’, ‘f3’ hbase> create ‘t1’, {NAME => ‘f1’, VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true} hbase> create ‘t1’, {NAME => ‘f1’, CONFIGURATION => {‘hbase.hstore.blockingStoreFiles’ => ’10’}}Table configuration options can be put at the end. |
| describe | Describe the named table.
hbase> describe ‘t1’ |
| disable | Start disable of named table
hbase> disable ‘t1’ |
| disable_all | Disable all of tables matching the given regex
hbase> disable_all ‘t.*’ |
| is_disabled | Verifies whether the named table is disabled
hbase> is_disabled ‘t1’ |
| drop | Drop the named table. Table must first be disabled
hbase> drop ‘t1’ |
| drop_all | Drop all of the tables matching the given regex
hbase> drop_all ‘t.*’ |
| enable | Start enable of named table
hbase> enable ‘t1’ |
| enable_all | Enable all of the tables matching the given regex
hbase> enable_all ‘t.*’ |
| is_enabled | Verifies whether the named table is enabled
hbase> is_enabled ‘t1’ |
| exists | Does the named table exist
hbase> exists ‘t1’ |
| list | List all tables in hbase. Optional regular expression parameter could be used to filter the outputhbase> list hbase> list ‘abc.*’ |
| show_filters | Show all the filters in hbase.
hbase> show_filters |
| alter_status | Get the status of the alter command. Indicates the number of regions of the table that have received the updated schema. Pass the table name.
hbase> alter_status ‘t1’ |
| alter_async | Alter column family schema, does not wait for all regions to receive the schema changes. Pass table name and a dictionary specifying new column family schema. Dictionaries are described on the main help command output. Dictionary must include name of column family to alter.To change or add the ‘f1’ column family in table ‘t1’ from defaults to instead keep a maximum of 5 cell VERSIONS, do:hbase> alter_async ‘t1’, NAME => ‘f1’, VERSIONS => 5To delete the ‘f1’ column family in table ‘t1’, do:hbase> alter_async ‘t1’, NAME => ‘f1’, METHOD => ‘delete’or a shorter version:hbase> alter_async ‘t1’, ‘delete’ => ‘f1’You can also change table-scope attributes like MAX_FILESIZE MEMSTORE_FLUSHSIZE, READONLY, and DEFERRED_LOG_FLUSH. For example, to change the max size of a family to 128MB, do: hbase> alter ‘t1’, METHOD => ‘table_att’, MAX_FILESIZE => ‘134217728’ There could be more than one alteration in one command: hbase> alter ‘t1’, {NAME => ‘f1’}, {NAME => ‘f2’, METHOD => ‘delete’} To check if all the regions have been updated, use alter_status <table_name> |
3) Data Manipulation commands
| count | Count the number of rows in a table. Return value is the number of rows. This operation may take a LONG time (Run ‘$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount’ to run a counting mapreduce job). Current count is shown every 1000 rows by default. Count interval may be optionally specified. Scan caching is enabled on count scans by default. Default cache size is 10 rows. If your rows are small in size, you may want to increase this parameter. Examples:hbase> count ‘t1’ hbase> count ‘t1’, INTERVAL => 100000 hbase> count ‘t1’, CACHE => 1000 hbase> count ‘t1’, INTERVAL => 10, CACHE => 1000The same commands also can be run on a table reference. Suppose you had a reference t to table ‘t1’, the corresponding commands would be:hbase> t.count hbase> t.count INTERVAL => 100000 hbase> t.count CACHE => 1000 hbase> t.count INTERVAL => 10, CACHE => 1000 |
| delete | Put a delete cell value at specified table/row/column and optionally timestamp coordinates. Deletes must match the deleted cell’s coordinates exactly. When scanning, a delete cell suppresses older versions. To delete a cell from ‘t1’ at row ‘r1’ under column ‘c1’ marked with the time ‘ts1’, do:hbase> delete ‘t1’, ‘r1’, ‘c1’, ts1The same command can also be run on a table reference. Suppose you had a reference t to table ‘t1’, the corresponding command would be:hbase> t.delete ‘r1’, ‘c1’, ts1 |
| deleteall | Delete all cells in a given row; pass a table name, row, and optionally a column and timestamp. Examples:hbase> deleteall ‘t1’, ‘r1’ hbase> deleteall ‘t1’, ‘r1’, ‘c1’ hbase> deleteall ‘t1’, ‘r1’, ‘c1’, ts1The same commands also can be run on a table reference. Suppose you had a reference t to table ‘t1’, the corresponding command would be:hbase> t.deleteall ‘r1’ hbase> t.deleteall ‘r1’, ‘c1’ hbase> t.deleteall ‘r1’, ‘c1’, ts1 |
| get | Get row or cell contents; pass table name, row, and optionally a dictionary of column(s), timestamp, timerange and versions. Examples:hbase> get ‘t1’, ‘r1’ hbase> get ‘t1’, ‘r1’, {TIMERANGE => [ts1, ts2]} hbase> get ‘t1’, ‘r1’, {COLUMN => ‘c1’} hbase> get ‘t1’, ‘r1’, {COLUMN => [‘c1’, ‘c2’, ‘c3’]} hbase> get ‘t1’, ‘r1’, {COLUMN => ‘c1’, TIMESTAMP => ts1} hbase> get ‘t1’, ‘r1’, {COLUMN => ‘c1’, TIMERANGE => [ts1, ts2], VERSIONS => 4} hbase> get ‘t1’, ‘r1’, {COLUMN => ‘c1’, TIMESTAMP => ts1, VERSIONS => 4} hbase> get ‘t1’, ‘r1’, {FILTER => “ValueFilter(=, ‘binary:abc’)”} hbase> get ‘t1’, ‘r1’, ‘c1’ hbase> get ‘t1’, ‘r1’, ‘c1’, ‘c2’ hbase> get ‘t1’, ‘r1’, [‘c1’, ‘c2’]Besides the default ‘toStringBinary’ format, ‘get’ also supports custom formatting by column. A user can define a FORMATTER by adding it to the column name in the get specification. The FORMATTER can be stipulated:1. either as a org.apache.hadoop.hbase.util.Bytes method name (e.g, toInt, toString) 2. or as a custom class followed by method name: e.g. ‘c(MyFormatterClass).format’.Example formatting cf:qualifier1 and cf:qualifier2 both as Integers: hbase> get ‘t1’, ‘r1’ {COLUMN => [‘cf:qualifier1:toInt’, ‘cf:qualifier2:c(org.apache.hadoop.hbase.util.Bytes).toInt’] }Note that you can specify a FORMATTER by column only (cf:qualifer). You cannot specify a FORMATTER for all columns of a column family.The same commands also can be run on a reference to a table (obtained via get_table or create_table). Suppose you had a reference t to table ‘t1’, the corresponding commands would be: hbase> t.get ‘r1’ |
| get_counter | Return a counter cell value at specified table/row/column coordinates. A counter cell should be managed with the atomic increment function of HBase and the data should be binary encoded. Example:hbase> get_counter ‘t1’, ‘r1’, ‘c1’The same commands also can be run on a table reference. Suppose you had a reference t to table ‘t1’, the corresponding command would be:hbase> t.get_counter ‘r1’, ‘c1’ |
| incr | Increments a cell ‘value’ at specified table/row/column coordinates. To increment a cell value in table ‘t1’ at row ‘r1’ under column ‘c1’ by 1 (can be omitted) or 10 do:hbase> incr ‘t1’, ‘r1’, ‘c1’ hbase> incr ‘t1’, ‘r1’, ‘c1’, 1 hbase> incr ‘t1’, ‘r1’, ‘c1’, 10The same commands also can be run on a table reference. Suppose you had a reference t to table ‘t1’, the corresponding command would be:hbase> t.incr ‘r1’, ‘c1’ hbase> t.incr ‘r1’, ‘c1’, 1 hbase> t.incr ‘r1’, ‘c1’, 10 |
| put | Put a cell ‘value’ at specified table/row/column and optionally timestamp coordinates. To put a cell value into table ‘t1’ at row ‘r1’ under column ‘c1’ marked with the time ‘ts1’, do:hbase> put ‘t1’, ‘r1’, ‘c1’, ‘value’, ts1The same commands also can be run on a table reference. Suppose you had a reference t to table ‘t1’, the corresponding command would be:hbase> t.put ‘r1’, ‘c1’, ‘value’, ts1 |
| scan | Scan a table; pass table name and optionally a dictionary of scanner specifications. Scanner specifications may include one or more of: TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH, or COLUMNS, CACHEIf no columns are specified, all columns will be scanned. To scan all members of a column family, leave the qualifier empty as in ‘col_family:’.The filter can be specified in two ways: 1. Using a filterString – more information on this is available in the Filter Language document attached to the HBASE-4176 JIRA 2. Using the entire package name of the filter.Some examples:hbase> scan ‘.META.’ hbase> scan ‘.META.’, {COLUMNS => ‘info:regioninfo’} hbase> scan ‘t1’, {COLUMNS => [‘c1’, ‘c2’], LIMIT => 10, STARTROW => ‘xyz’} hbase> scan ‘t1’, {COLUMNS => ‘c1’, TIMERANGE => [1303668804, 1303668904]} hbase> scan ‘t1’, {FILTER => “(PrefixFilter (‘row2’) AND (QualifierFilter (>=, ‘binary:xyz’))) AND (TimestampsFilter ( 123, 456))”} hbase> scan ‘t1’, {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}For experts, there is an additional option — CACHE_BLOCKS — which switches block caching for the scanner on (true) or off (false). By default it is enabled. Examples:hbase> scan ‘t1’, {COLUMNS => [‘c1’, ‘c2’], CACHE_BLOCKS => false}Also for experts, there is an advanced option — RAW — which instructs the scanner to return all cells (including delete markers and uncollected deleted cells). This option cannot be combined with requesting specific COLUMNS. Disabled by default. Example:hbase> scan ‘t1’, {RAW => true, VERSIONS => 10} Besides the default ‘toStringBinary’ format, ‘scan’ supports custom formatting 1. either as a org.apache.hadoop.hbase.util.Bytes method name (e.g, toInt, toString) Example formatting cf:qualifier1 and cf:qualifier2 both as Integers: Note that you can specify a FORMATTER by column only (cf:qualifer). You cannot Scan can also be used directly from a table, by first getting a reference to a hbase> t = get_table ‘t’ Note in the above situation, you can still provide all the filtering, columns, |
| truncate | Disables, drops and recreates the specified table. Examples: hbase> truncate ‘t1’ |
4) HBase surgery tools
| assign | Assign a region. Use with caution. If region already assigned, this command will do a force reassign. For experts only. Examples: hbase> assign ‘REGION_NAME’ |
| balancer | Trigger the cluster balancer. Returns true if balancer ran and was able to tell the region servers to unassign all the regions to balance (the re-assignment itself is async). Otherwise false (Will not run if regions in transition). Examples: hbase> balancer |
| balance_switch | Enable/Disable balancer. Returns previous balancer state. Examples:hbase> balance_switch true hbase> balance_switch false |
| close_region | Close a single region. Ask the master to close a region out on the cluster or if ‘SERVER_NAME’ is supplied, ask the designated hosting regionserver to close the region directly. Closing a region, the master expects ‘REGIONNAME’ to be a fully qualified region name. When asking the hosting regionserver to directly close a region, you pass the regions’ encoded name only. A region name looks like this:TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396.The trailing period is part of the regionserver name. A region’s encoded name is the hash at the end of a region name; e.g. 527db22f95c8a9e0116f0cc13c680396 (without the period). A ‘SERVER_NAME’ is its host, port plus startcode. For example: host187.example.com,60020,1289493121758 (find servername in master ui or when you do detailed status in shell). This command will end up running close on the region hosting regionserver. The close is done without the master’s involvement (It will not know of the close). Once closed, region will stay closed. Use assign to reopen/reassign. Use unassign or move to assign the region elsewhere on cluster. Use with caution. For experts only. Examples:hbase> close_region ‘REGIONNAME’ hbase> close_region ‘REGIONNAME’, ‘SERVER_NAME’ |
| compact | Compact all regions in passed table or pass a region row to compact an individual region. You can also compact a single column family within a region. Examples: Compact all regions in a table: hbase> compact ‘t1’ Compact an entire region: hbase> compact ‘r1’ Compact only a column family within a region: hbase> compact ‘r1’, ‘c1’ Compact a column family within a table: hbase> compact ‘t1’, ‘c1’ |
| flush | Flush all regions in passed table or pass a region row to flush an individual region. For example:hbase> flush ‘TABLENAME’ hbase> flush ‘REGIONNAME’ |
| major_compact | Run major compaction on passed table or pass a region row to major compact an individual region. To compact a single column family within a region specify the region name followed by the column family name. Examples: Compact all regions in a table: hbase> major_compact ‘t1’ Compact an entire region: hbase> major_compact ‘r1’ Compact a single column family within a region: hbase> major_compact ‘r1’, ‘c1’ Compact a single column family within a table: hbase> major_compact ‘t1’, ‘c1’ |
| move | Move a region. Optionally specify target regionserver else we choose one at random. NOTE: You pass the encoded region name, not the region name so this command is a little different to the others. The encoded region name is the hash suffix on region names: e.g. if the region name were TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396. then the encoded region name portion is 527db22f95c8a9e0116f0cc13c680396 A server name is its host, port plus startcode. For example: host187.example.com,60020,1289493121758 Examples:hbase> move ‘ENCODED_REGIONNAME’ hbase> move ‘ENCODED_REGIONNAME’, ‘SERVER_NAME’ |
| split | Split entire table or pass a region to split individual region. With the second parameter, you can specify an explicit split key for the region. Examples: split ‘tableName’ split ‘regionName’ # format: ‘tableName,startKey,id’ split ‘tableName’, ‘splitKey’ split ‘regionName’, ‘splitKey’ |
| unassign | Unassign a region. Unassign will close region in current location and then reopen it again. Pass ‘true’ to force the unassignment (‘force’ will clear all in-memory state in master before the reassign. If results in double assignment use hbck -fix to resolve. To be used by experts). Use with caution. For expert use only. Examples:hbase> unassign ‘REGIONNAME’ hbase> unassign ‘REGIONNAME’, true |
| hlog_roll | Roll the log writer. That is, start writing log messages to a new file. The name of the regionserver should be given as the parameter. A ‘server_name’ is the host, port plus startcode of a regionserver. For example: host187.example.com,60020,1289493121758 (find servername in master ui or when you do detailed status in shell)hbase>hlog_roll |
| zk_dump | Dump status of HBase cluster as seen by ZooKeeper. Example: hbase>zk_dump |
5) Cluster replication tools
| add_peer | Add a peer cluster to replicate to, the id must be a short and the cluster key is composed like this: hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent This gives a full path for HBase to connect to another cluster. Examples:hbase> add_peer ‘1’, “server1.cie.com:2181:/hbase” hbase> add_peer ‘2’, “zk1,zk2,zk3:2182:/hbase-prod” |
| remove_peer | Stops the specified replication stream and deletes all the meta information kept about it. Examples:hbase> remove_peer ‘1’ |
| list_peers | List all replication peer clusters. hbase> list_peers |
| enable_peer | Restarts the replication to the specified peer cluster, continuing from where it was disabled.Examples:hbase> enable_peer ‘1’ |
| disable_peer | Stops the replication stream to the specified cluster, but still keeps track of new edits to replicate.Examples:hbase> disable_peer ‘1’ |
| start_replication | Restarts all the replication features. The state in which each stream starts in is undetermined. WARNING: start/stop replication is only meant to be used in critical load situations. Examples:hbase> start_replication |
| stop_replication | Stops all the replication features. The state in which each stream stops in is undetermined. WARNING: start/stop replication is only meant to be used in critical load situations. Examples:hbase> stop_replication |
6) Security tools
| grant | Grant users specific rights. Syntax: grant <user>, <permissions> [, <table> [, <column family> [, <column qualifier>]]]. Permissions is either zero or more letters from the set “RWXCA”: READ(‘R’), WRITE(‘W’), EXEC(‘X’), CREATE(‘C’), ADMIN(‘A’). For example:hbase> grant ‘bobsmith’, ‘RWXCA’ hbase> grant ‘bobsmith’, ‘RW’, ‘t1’, ‘f1’, ‘col1’ |
| revoke | Revoke a user’s access rights. Syntax: revoke <user> [, <table> [, <column family> [, <column qualifier>]]]. For example:hbase> revoke ‘bobsmith’, ‘t1’, ‘f1’, ‘col1’ |
| user_permission | Show all permissions for the particular user. Syntax: user_permission [<table>]. For example:hbase> user_permission hbase> user_permission ‘table1’ |