Resolving Hadoop Problems on Kerberized CDH 5.X

On a Kerberized CDH cluster, I could not run any hadoop commands from the command line, even with a valid Kerberos ticket.

So with a valid ticket, this would fail:
hadoop fs -ls /
WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Here is what I learned and how I ended up resolving the problem. I have linked to the Cloudera documentation for the current version where possible, but some of it seems to exist only for older versions.

Please note that my problem came down to a configuration issue: Kerberos itself and Cloudera Manager were both installed correctly. Many of the reports I ran across while searching for answers came down to Kerberos or Hadoop being installed incorrectly. In my case both Hadoop and Kerberos were functional, but they were not configured to work together properly.

TL;DR

MAKE SURE YOU HAVE A TICKET

Run klist as the user you are trying to execute the hadoop command as.

sudo su - myuser
klist

If you don’t have a ticket, it will print something like this (the file name ends in the numeric uid of the user):

klist: Credentials cache file '/tmp/krb5cc_0' not found

If you run a hadoop command without a ticket, you will get the GSS INITIATE FAILED error by design:
WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

In other words, that is not an install problem. If this is your situation, take a look at http://www.roguelynn.com/words/explain-like-im-5-kerberos/ . For other troubleshooting of Kerberos in general, check out https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html
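
If you want to script the ticket check described above, klist has a -s flag that prints nothing and reports through its exit status whether the credentials cache is readable and unexpired. The following is only a sketch; the helper name require_ticket is something I made up for illustration and is not part of Hadoop or Kerberos.

```shell
# Sketch: refuse to go further unless the current user has a readable,
# unexpired ticket cache. require_ticket is a hypothetical helper name.
require_ticket() {
    # klist -s is silent; exit status 0 means the cache exists and is not expired.
    if klist -s; then
        return 0
    fi
    echo "No valid Kerberos ticket for $(id -un); run kinit first." >&2
    return 1
}

# Example use (commented out so that sourcing this file runs nothing):
# require_ticket && hadoop fs -ls /
```

If require_ticket succeeds and the hadoop command still fails with GSS INITIATE FAILED, the cause is one of the configuration issues covered in the rest of this post.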


CDH Default HDFS User and Group Restrictions

A default install of Cloudera places user and group restrictions on the execution of hadoop commands, including a specific ban on certain users (more on page 57 of http://www.cloudera.com/documentation/enterprise/5-6-x/PDF/cloudera-security.pdf).
Several configuration properties control this.

Specifically for user hdfs, make sure you have removed hdfs from the banned.users configuration property if you are trying to use it to execute hadoop commands. Note that banned.users is read from container-executor.cfg (YARN) or taskcontroller.cfg (MRv1), not from hdfs-site.xml.

1) Unprivileged User and Write Permissions

The Cloudera-recommended way to execute Hadoop commands is to create an unprivileged user and a matching principal instead of using the hdfs user. A gotcha is that this user also needs its own home directory under /user in HDFS. Without one, commands that write there can fail with the WRITE permission denied error.
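
Creating that home directory can be sketched as below. This is an illustration under assumptions, not the exact commands from the Cloudera article: make_hdfs_home is a hypothetical helper and myuser is a placeholder. On a Kerberized cluster the hdfs superuser needs its own valid ticket before these commands will succeed.

```shell
# Sketch: give an unprivileged user a home directory in HDFS.
# make_hdfs_home is a hypothetical helper; "myuser" below is a placeholder.
make_hdfs_home() {
    local user="$1"
    if [ -z "$user" ]; then
        echo "usage: make_hdfs_home USER" >&2
        return 2
    fi
    # Create /user/USER as the hdfs superuser, then hand ownership to USER.
    sudo -u hdfs hadoop fs -mkdir -p "/user/${user}" &&
        sudo -u hdfs hadoop fs -chown "${user}" "/user/${user}"
}

# Example use:
# make_hdfs_home myuser
```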

Cloudera Knowledge Article

http://community.cloudera.com/t5/CDH-Manual-Installation/How-to-resolve-quot-Permission-denied-quot-errors-in-CDH/ta-p/36141

2) Datanode Ports and Data Directory Permissions
Another related issue is that Cloudera sets the permissions on the dfs.datanode.data.dir directories to 750 on a non-kerberized cluster, but requires 700 on a kerberized cluster. With the wrong directory permissions set, the Kerberos install will fail. The DataNode ports must also be set to values below 1024; the recommended values are 1004 for the DataNode port and 1006 for the HTTP port.
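
To check the directory permissions on each DataNode host, something like the following sketch may help. check_dir_mode is a hypothetical helper and /data/dn is a placeholder path; substitute each directory listed in your dfs.datanode.data.dir setting. It assumes GNU stat, as shipped on CentOS/RHEL and Debian/Ubuntu.

```shell
# Sketch: compare a directory's permission bits with the expected mode.
# check_dir_mode is a hypothetical helper; /data/dn is a placeholder path.
check_dir_mode() {
    local dir="$1" expected="$2" actual
    # GNU stat prints the octal permission bits with the %a format.
    actual=$(stat -c '%a' "$dir") || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK: $dir is $actual"
    else
        echo "MISMATCH: $dir is $actual, expected $expected"
        return 1
    fi
}

# Example use on a Kerberized cluster:
# check_dir_mode /data/dn 700
```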

Datanode Directory

http://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hdfs_cluster_deploy.html

Datanode Ports

http://www.cloudera.com/documentation/archive/manager/4-x/4-7-2/Configuring-Hadoop-Security-with-Cloudera-Manager/cmchs_enable_security_s9.html

3) Service Specific Configuration Tasks

On page 60 of the CDH security doc, there are steps to kerberize Hadoop services. Make sure you did these!

MapReduce

sudo -u hdfs hadoop fs -chown mapred:hadoop ${mapred.system.dir}

HBase

sudo -u hdfs hadoop fs -chown -R hbase ${hbase.rootdir}

Hive

sudo -u hdfs hadoop fs -chown hive /user/hive

YARN

rm -rf ${yarn.nodemanager.local-dirs}/usercache/*

All of these steps EXCEPT for the YARN one can happen at any time. The step for YARN must happen after the Kerberos installation because it removes the user cache left over from non-kerberized YARN runs. When you run a MapReduce job after the Kerberos install, it should repopulate this directory with the Kerberized user cache data.

YARN User Cache
http://stackoverflow.com/questions/29397509/yarn-application-exited-with-exitcode-1000-not-able-to-initialize-user-directo

Kerberos Principal Issues

1) Short Name Rules Mapping
Kerberos principals are “mapped” to the OS-level service users. For example, hdfs/WHATEVER@REALM maps to the service user ‘hdfs’ in your operating system only because of a name mapping rule set in Hadoop's core-site.xml (the hadoop.security.auth_to_local property). Without name mapping, Hadoop wouldn’t know which user is authenticated by which principal.

If you are using a principal that should map to hdfs, make sure the principal name resolves correctly to hdfs according to these Hadoop rules.

Good
(has a name mapping rule by default)

  • hdfs@REALM
  • hdfs/_HOST@REALM

Bad
(no name mapping rule by default)

  • hdfs-TAG@REALM

The “bad” example will not work unless you add a rule to accommodate it.
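
To see how a principal actually resolves under the rules currently in core-site.xml, Hadoop ships a small class, org.apache.hadoop.security.HadoopKerberosName, that can be run from the command line. The wrapper below is a sketch; show_short_name is a hypothetical helper name, the principals are placeholders, and the rule in the trailing comment is an untested illustration rather than something from my cluster.

```shell
# Sketch: ask Hadoop which short name a principal maps to under the
# auth_to_local rules currently configured in core-site.xml.
# show_short_name is a hypothetical helper; the principals are placeholders.
show_short_name() {
    # HadoopKerberosName applies the configured rules to each argument
    # and prints the resulting short name.
    hadoop org.apache.hadoop.security.HadoopKerberosName "$@"
}

# Example use:
# show_short_name hdfs@REALM hdfs-TAG@REALM
#
# If hdfs-TAG@REALM should act as hdfs, a rule along these lines could be
# added to hadoop.security.auth_to_local ahead of DEFAULT (an untested
# illustration; adjust the principal and realm to your own):
#   RULE:[1:$1@$0](hdfs-TAG@REALM)s/.*/hdfs/
```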

Name Rules Mapping
http://www.cloudera.com/documentation/archive/cdh/4-x/4-5-0/CDH4-Security-Guide/cdh4sg_topic_19.html

2) Keytab and Principal Key Version Numbers Must Match
The Key Version Number (KVNO) is the version of the key that is actively being used. Think of a house key: once you change the lock on the door, the old key is no longer any good. Both the keytab and the principal have a KVNO, and the two must match.

By default, when you use ktadd or xst to export a principal to a keytab, kadmin generates a new random key for the principal and increments its KVNO. The new keytab carries the new number, but any keytab exported earlier still holds the old key and version. So you can end up accidentally creating a mismatch.

Use -norandkey when exporting a principal to a keytab to keep the existing key and avoid creating a KVNO mismatch. This option is only available in kadmin.local, not in remote kadmin.

In general, whenever you have principal or authentication issues, make sure to check that the KVNO of the principal and keytab match:
Principal
kadmin.local -q 'getprinc myprincipalname'

Keytab
klist -kte mykeytab
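
The two checks above can be combined into a rough comparison. This is a sketch with assumptions: both helper names are hypothetical, and the parsing assumes MIT Kerberos output in which getprinc prints lines starting with "Key: vno N," and klist -k prints rows of the form "KVNO Principal". Verify against the output of your own version before relying on it.

```shell
# Sketch: pull the KVNOs for one principal out of the KDC and out of a
# keytab so they can be compared. Helper names are hypothetical, and the
# parsing assumes MIT Kerberos output formats.
principal_kvnos() {
    # Run on the KDC host; prints each distinct key version number.
    kadmin.local -q "getprinc $1" |
        awk '/^Key: vno/ { sub(",", "", $3); print $3 }' | sort -u
}

keytab_kvnos() {
    # $1 = keytab file, $2 = full principal name (including @REALM).
    klist -k "$1" | awk -v p="$2" '$2 == p { print $1 }' | sort -u
}

# Example use (myprincipalname@REALM is a placeholder):
# if [ "$(principal_kvnos myprincipalname@REALM)" = "$(keytab_kvnos mykeytab myprincipalname@REALM)" ]; then
#     echo "KVNO match"
# else
#     echo "KVNO mismatch"
# fi
```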

Creating Principals
http://www.cloudera.com/documentation/archive/cdh/4-x/4-3-0/CDH4-Security-Guide/cdh4sg_topic_3_4.html

Security Jars and JAVA Home

1) Java Version Mismatch with JCE Jars
Hadoop needs the Java security JCE Unlimited Strength jars installed in order to use AES-256 encryption with Kerberos. Both Hadoop and Kerberos need to have access to these jars. This is easy to miss because you can think you have the security jars installed when you really don’t.

JCE Configurations to Check

  • the jars are the right version – the correct security jars are bundled with Java, but if you install them after the fact you have to make sure the version of the jars corresponds to the version of Java or you will continue to get errors.
    To troubleshoot, check the md5sum hash of the JCE jars from a brand new download of the same exact JDK that you’re using against the md5sum hash of the ones on the Kerberos server.
  • the jars are in the right location ( JAVA_HOME/jre/lib/security )
  • Hadoop is configured to look for them in the right place. Check that /etc/hadoop/conf/hadoop-env.sh has an export statement that points JAVA_HOME at the correct Java install location.

If Hadoop has JAVA_HOME set incorrectly, it will fail with GSS INITIATE FAILED. If the jars are not in the right location, Kerberos won’t find them and will give an error that it doesn’t support the AES-256 encryption type (UNSUPPORTED ENCTYPE).
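
The md5sum comparison from the checklist above can be sketched as a small function. same_jce_jars is a hypothetical helper; each argument is a jre/lib/security directory, for example the one under your JAVA_HOME and the one from a freshly unpacked copy of the same JDK. The two jar names are the standard JCE policy files.

```shell
# Sketch: compare the JCE policy jars in two Java installs by checksum.
# same_jce_jars is a hypothetical helper; both arguments are
# jre/lib/security directories.
same_jce_jars() {
    local a="$1" b="$2" jar sum_a sum_b status=0
    for jar in local_policy.jar US_export_policy.jar; do
        # Reading from stdin keeps the file path out of the md5sum output.
        sum_a=$(md5sum < "$a/$jar") || return 2
        sum_b=$(md5sum < "$b/$jar") || return 2
        if [ "$sum_a" = "$sum_b" ]; then
            echo "same: $jar"
        else
            echo "DIFFERENT: $jar"
            status=1
        fi
    done
    return "$status"
}

# Example use (the second path is a placeholder for a fresh JDK download):
# same_jce_jars "$JAVA_HOME/jre/lib/security" /tmp/fresh-jdk/jre/lib/security
```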

Cloudera with JCE Jars
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/cm_sg_s2_jce_policy.html

Troubleshooting JCE Jars
https://community.cloudera.com/t5/Cloudera-Manager-Installation/Problem-with-Kerberos-amp-user-hdfs/td-p/6809

Ticket Renewal with JDK 6 and MIT Kerberos 1.8.1 and Higher

Cloudera has an issue documented at http://www.cloudera.com/documentation/archive/cdh/3-x/3u6/CDH3-Security-Guide/cdh3sg_topic_14_2.html in which tickets must be renewed before hadoop commands can be issued. This only happens with Oracle JDK 6 Update 26 or earlier and package version 1.8.1 or higher of the MIT Kerberos distribution. To check the package, do an rpm -qa | grep krb5 on CentOS/RHEL or aptitude search krb5 -F "%c %p %d %V" on Debian/Ubuntu.

The workaround given by Cloudera is to do a regular kinit as you would, then do a kinit -R to force the ticket to be renewed.
kinit -kt mykeytab myprincipal
kinit -R

And finally, the issue I actually had which I could not find documented anywhere …

Configuration Files and Ticket Caching


There are two important configuration files for Kerberos, the krb5.conf and the kdc.conf. The krb5.conf configures the Kerberos libraries and clients (realms and library defaults), while the kdc.conf configures the krb5kdc service and the KDC database. My problem was that the krb5.conf file had a property:
default_ccache_name = KEYRING:persistent:%{uid}

This set my cache name to KEYRING:persistent followed by the user's uid (explained at https://web.mit.edu/kerberos/krb5-1.13/doc/basic/ccache_def.html). When I did a kinit, it created the ticket in /tmp because the cache name was being set elsewhere as /tmp. Cloudera services authenticate using files generated at runtime in /var/run/cloudera-scm-agent/process, and these all export the cache name environment variable (KRB5CCNAME) before doing their kinit. That’s why Cloudera could obtain tickets but my hadoop user couldn’t.

The solution was to remove the line from krb5.conf that set default_ccache_name and allow kinit to store credentials in /tmp, which is the MIT Kerberos default value DEFCCNAME (documented at https://web.mit.edu/kerberos/krb5-1.13/doc/mitK5defaults.html#paths).
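
To check whether your own krb5.conf overrides the default cache location, a sketch like the following may be useful. find_ccache_override is a hypothetical helper, and /etc/krb5.conf is the usual location but may differ on your system.

```shell
# Sketch: report whether a krb5.conf overrides the default credential cache.
# find_ccache_override is a hypothetical helper; /etc/krb5.conf is assumed
# as the default path.
find_ccache_override() {
    local conf="${1:-/etc/krb5.conf}"
    # Match an uncommented default_ccache_name line, ignoring indentation.
    # grep exits 0 when such a line exists and 1 when it does not.
    grep -E '^[[:space:]]*default_ccache_name[[:space:]]*=' "$conf"
}

# Example use:
# if find_ccache_override; then
#     echo "default_ccache_name is set; run klist to see the cache in use"
# fi
```

The first line of klist output names the cache that is actually in use, so comparing it with the setting found here shows whether the two agree.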

Liked this post and want to hear more? Follow me at https://twitter.com/saranicole and connect at https://www.linkedin.com/in/sarastreeter

Cloudera and Kerberos installation guides

Step-by-Step
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_sg_intro_kerb.html
Advanced troubleshooting
http://www.cloudera.com/documentation/enterprise/5-6-x/PDF/cloudera-security.pdf , starting on page 48