Brazos Cluster News
New Nodes Online - October 28, 2016
Six nodes have been added to the Brazos Cluster by our stakeholders, increasing our effective core count by nearly 10%. The nodes are based on the latest Intel processors and are housed in a dense Supermicro blade enclosure. Here are some of the highlights of the new nodes:
Supermicro Blade System Chassis
SBE-720E, 7U rack height.
- Two 2500W power supplies
- Integrated Mellanox FDR Infiniband switch
- Integrated Ethernet Switch, 1/10GbE
3 Supermicro SBI-7228R-T2F Processor TwinBlades, 2 nodes per blade, 6 nodes total.
- Dual Intel Xeon E5-2658v4 "Broadwell" processors, 2.3GHz, 14 cores, 28 threads, 35MB cache, 9.6GT/s, 105W
- 128GB DDR4-2400 ECC Memory
- FDR Infiniband adapter
- 500GB SATA disk
These nodes have been added to the existing stakeholder and background partitions. All 56 threads on each node can be scheduled for processing. For the large amount of throughput computing done on the Brazos Cluster, our tests have shown that scheduling all of the threads provides better overall throughput than scheduling only the physical cores. A new mpi-core28 partition has been created for MPI users.
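As a sketch, an MPI job targeting the new partition with one task per physical core might be submitted like this (the script name and node count are placeholder assumptions):

```shell
# Hypothetical submission to the new mpi-core28 partition:
# two of the new nodes, one MPI task per physical core (28 per node)
sbatch --partition=mpi-core28 --nodes=2 --ntasks-per-node=28 my_mpi_job.sh
```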
Our SLURM information page has been updated to reflect the addition of the new nodes.
There are 7 empty TwinBlade slots available in the new Supermicro chassis for expansion up to an additional 14 nodes, 392 cores, and 784 threads. Contact Brazos-Help //AT// listserv.tamu.edu if you are interested in purchasing additional nodes for the cluster.
Batch Scheduler Changes - March 2, 2016
On March 2, 2016 the batch scheduler for Brazos will have some configuration changes applied. The changes are outlined below.
- Adding stakeholder and stakeholder-4g partitions
- Disabling hepx and idhmc partitions
- Jobs submitted to the hepx and idhmc partitions will be reassigned to the stakeholder partition
- Limit MPI jobs to an aggregate of 1536 CPUs
- Limit serial partition to 50 running jobs
Stakeholders are encouraged to use the new stakeholder and stakeholder-4g partitions. The stakeholder partition contains nodes with a 2GB/CPU ratio and the stakeholder-4g partition contains nodes with a 4GB/CPU ratio. If a stakeholder wishes to use either, they can submit a job with --partition=stakeholder,stakeholder-4g and SLURM will use the first available partition.
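A minimal job script using the partition list might look like the following sketch (the job name, resource requests, and program are placeholder assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=example                      # hypothetical job name
#SBATCH --partition=stakeholder,stakeholder-4g  # SLURM uses the first partition that can start the job
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

./my_program   # placeholder for your application
```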
Cluster Maintenance - January 6, 2016
On January 6, 2016 the Brazos Cluster will be offline for maintenance. The maintenance period is expected to last from 9:00AM on January 6th to 5:00PM on January 6th. Announcements will be sent out if the maintenance period has to be extended. During the maintenance period the login nodes will be offline for long periods of time. Cluster jobs submitted before the maintenance period begins will remained queued in a pending state until the maintenance is complete.
A reservation has been added to SLURM to ensure no jobs will be running when the maintenance period begins. If you wish for jobs to run before the maintenance begins, they must have a --time value short enough that their expected end time falls before 9:00AM on January 6, 2016.
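For example, a job submitted at 9:00PM on January 5th would need a --time limit of 12 hours or less to be eligible to start ahead of the reservation (the script name is a placeholder):

```shell
# This job must be able to finish before 9:00AM on January 6th
sbatch --time=12:00:00 myjob.sh
```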
Changes taking place during the scheduled maintenance:
- Upgrade SLURM from 14.03.10 to 15.08.6
- Upgrade FhGFS-2014.01.r16 to BeeGFS-2015.03-r7
- Upgrade Globus GridFTP from Globus Toolkit 5.2 to Globus Toolkit 6
- Upgrade ZFS on all storage systems
- Update all systems from CentOS 6.5 to CentOS 6.7
- Update all OSG systems to OSG 3.2.32
SLURM 15.08.6: The SLURM upgrade will bring a number of new features. We will be switching to the FAIR_TREE algorithm for FairShare priority calculations. A few highlights of the changes:
- sbatch job arrays can limit the number of simultaneously running tasks in the job array.
- Example of limiting an array to run 4 jobs simultaneously: --array=0-15%4
- scontrol job operations accept comma delimited list of job IDs. Applies to job update, hold, release, suspend, resume, requeue, and requeuehold operations
- Support for job dependencies joined with OR operator (e.g. "--depend=afterok:123?afternotok:124")
- sbatch option --exclusive=user adds support for a compute node to be allocated to multiple jobs, but restricted to a single user.
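As a sketch of the OR-style dependency, using sbatch's --parsable flag to capture the job IDs (the script names are placeholders):

```shell
# Submit two jobs, then a third that starts when the first succeeds
# OR the second fails (--parsable makes sbatch print just the job ID)
jid1=$(sbatch --parsable step1.sh)
jid2=$(sbatch --parsable step2.sh)
sbatch --depend=afterok:${jid1}?afternotok:${jid2} final_step.sh
```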
FhGFS to BeeGFS: The upgrade of FhGFS-2014.01.r16 to BeeGFS-2015.03-r7 will result in the command fhgfs-ctl being renamed to beegfs-ctl.
Brazos Account Renewal Period - September 1, 2015 - September 30, 2015
An account renewal period will begin on September 1, 2015. All accounts on Brazos must be renewed by September 30, 2015 in order to continue accessing the cluster. If your account was created in the last two months, you do not need to renew your account. Account renewals will be good for one year.
An email will go out to all active Brazos accounts that require renewal. If you do not receive an email, you may also go to the Brazos Account Management Application to view any pending renewals. Please ensure that you update your contact information so that future renewal notices reach you in a timely manner.
If you have any questions or concerns, please email brazos-help //AT// listserv.tamu.edu.
Failure to renew by September 30, 2015
The following actions will be taken for all accounts that have not submitted a renewal by September 30, 2015:
- The account will no longer be able to log into Brazos
- Files in $HOME and $SCRATCH will be deleted
If you no longer need your Brazos Account
If you no longer require your Brazos account, please take the following actions.
- Complete the account renewal form and state that you no longer need your account.
- Back up all files you wish to save. This must be done before October 1, 2015.
SLURM changes and /fdata expansion - February 2, 2015
The cluster will be offline to apply changes to SLURM, expand /fdata, and perform an update to the FhGFS software that provides /fdata.
- The grid partition is being removed.
- The background partition will contain only 2GB/CPU nodes.
- The background-4g partition will contain only 4GB/CPU nodes.
The 7th storage server will be added to /fdata bringing the total capacity of /fdata to approximately 241TB.
The FhGFS software that provides /fdata is being upgraded from version 2014.01.r9 to 2014.01.r12.
SLURM changes to job prioritization - December 16, 2014
We will be updating the SLURM configuration to alter how jobs are prioritized based on size and length.
By the "size" of a job, we mean the number of CPUs requested for the job. In the examples below, it's the product of the --nodes parameter and the --ntasks-per-node parameter.
By the "length" of a job, we mean the maximum requested walltime for the job. In the examples below, it's the --time parameter.
The size and length, taken together, are used to compute the JobSize -- one of the multiple factors that goes into SLURM's determination of a job's priority.
Below is how the job size priority will work:
- Requested size being equal, shorter jobs (i.e., those with less "length") will have higher priority.
- Requested length being equal, bigger jobs (i.e., those with more "size") will have higher priority.
Setting the --time parameter from your best estimate of the needed wall time (plus a safety margin), rather than relying on the partition default, benefits a job in two ways: it increases the job's priority, and it makes the job more likely to be scheduled earlier by SLURM's "backfill" algorithm.
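For instance, if a 16-CPU job usually completes in about two hours, a request along these lines gives SLURM an accurate length while keeping a safety margin (the script name is a placeholder):

```shell
# 2 nodes x 8 tasks = 16 CPUs; ~2h typical runtime plus a 30-minute margin
sbatch --nodes=2 --ntasks-per-node=8 --time=02:30:00 myjob.sh
```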
SLURM Increased virtual memory limits - December 5, 2014
The virtual memory limits on all jobs have been increased from 100% to 200% of a job's requested memory. This means that if your job is allocated 2000MB of memory, you may use up to 4000MB of virtual memory without your job being killed. If you previously inflated memory requests because jobs were killed for exceeding virtual memory limits, you should now be able to cut those requests in half.
SLURM updates and maintenance - Beginning Nov. 11
Beginning November 11th we will be performing a rolling update to SLURM as well as applying some configuration changes. There is no planned outage except login node sessions will have to be terminated to remount /home and /apps with new settings.
The SLURM master control server and accounting database will be upgraded first, then all compute nodes will be upgraded. The compute node upgrades will take place once the nodes are idle and will result in some nodes being offline for a few minutes while they update and apply configuration changes that address issues with nodes hanging after a job completes.
We will also be reconfiguring the /home and /apps mount points to use NFSv3. Currently we are using NFSv4 and have experienced issues with that implementation. When the change is applied to the login nodes all active login sessions will have to be terminated in order to remount /home. Compute nodes will have this change applied during their configuration update mentioned above.
Cluster "EL6" Upgrade
After many months of preparation the Brazos Cluster Upgrade to "EL6" is finally here. The "EL6" (Enterprise Linux 6) upgrade refers to the software overhaul being performed, which includes upgrading the Operating System used on Brazos.
Migration to upgraded cluster
Below is a list of the key changes taking place during Cluster Upgrade. See EL6 Migration for full details of this upgrade and how it may affect usage of Brazos.
- Upgrade operating system from CentOS 5 to CentOS 6
- Login node name changes
- Replacing Torque/Maui with SLURM
- Using Lmod to interact with /apps
- Website update and new account registration/management web portal
- New dedicated storage server for /home and /apps
- Upgrades to the FhGFS/BeeGFS high-performance parallel filesystem that provides /fdata
Maintenance - October 2, 2014
The /fdata filesystem will be brought offline for scheduled maintenance the morning of Thursday, October 2nd. We expect this maintenance period to last a couple of hours.
This maintenance requires that all jobs be stopped and /fdata unmounted from all systems. The batch scheduler will be configured to avoid starting any jobs that cannot be completed by 5PM on October 2nd. This will be accomplished using a reservation in the batch scheduler. Any jobs that are suspended or running when the maintenance period begins will have to be killed.
Any new jobs submitted that will run into the maintenance window will remain queued until after the maintenance. If you want your job to run before the maintenance window, specify a "--time=HH:MM:SS" (SLURM) limit on your job short enough so that it will finish before 5PM on October 2nd.
We will send out a reminder next week before the maintenance window and another announcement once the maintenance is complete.
SLURM to Replace Torque/Maui
In the not-too-distant future we will be switching from Torque to SLURM for managing our batch jobs. SLURM is a far more robust and efficient scheduler than the Torque/Maui combination we currently use. We'll send an announcement when we determine a schedule for the transition. Interested early testers should contact brazos-help //AT// listserv.tamu.edu.
Batch Queue Changes
Because the Brazos cluster has become very busy recently, we will be implementing changes to the aglife and pete queues, effective Friday, March 21, at 10am.
The changes for the aglife and pete queues include:
- Change from per-node to per-core scheduling.
Because many of the jobs run in these queues use a single CPU core, this change will make job scheduling and cluster utilization more efficient. If you are running single-core jobs, you may request a single node and a single core using the -l nodes=1:ppn=1 option. This allows the scheduler to place 8 independent jobs onto a single compute node, subject to the default memory limit described below. If you have been starting 8 background tasks within a single job, you can now submit these as 8 separate jobs. Of course, you can still request 2-8 processors per node (ppn) if you prefer, but please be sure to utilize all of the processors that you request.
- A default memory limit of 1900MB (1.9GB) per job.
If you need more than this, you can specify -l mem=2600mb (or some other value) when you submit your job. Because a number of our compute nodes have 2GB of memory per core, requesting more than 2GB will leave some CPU cores unutilized, which is fine if your application actually uses the requested memory.
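Putting the two changes together, a single-core job with a larger-than-default memory request could be submitted as follows (queue and script names are illustrative):

```shell
# One core on one node, with 2600MB of memory instead of the 1900MB default
qsub -q aglife -l nodes=1:ppn=1 -l mem=2600mb myjob.pbs
```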
If you have any questions, or if you are a non-aglife/non-pete Brazos stakeholder and want to switch to per-core scheduling, please send e-mail to brazos-help //AT// listserv.tamu.edu.
Four more 32-core nodes with 128GB RAM have been added to the "ib" partition. This brings our total core count to 2,976.
We have started preparations for upgrading the cluster's operating system from CentOS 5 to Scientific Linux 6. These are essentially "white box" distributions of RedHat Enterprise Linux. During the deployment process, we will take some Brazos nodes offline for testing. There will be a significant cleanout of the software packages, with only the newest stable release being available. We will also be evaluating the SLURM resource manager as a replacement for Torque & Maui. We will post progress updates to this page as well as the BRAZOS-ANNOUNCE mailing list.
ACML 5.3.0 Installed
The latest version of AMD's Core Math Library has been installed. This offers a full implementation of BLAS levels 1, 2, & 3, a full suite of LAPACK routines, FFTs in single, double, single-complex, and double-complex, and random number generators in single- and double-precision. Please see our local documentation and AMD's ACML page for more information.
E-mail renewal notices were sent to all active users on September 4. These notices contain a link to a form for you to renew your account. The form asks for a Statement of Use. We ask that you provide a detailed description of your past usage of the Brazos Cluster as well as your future plans. For your convenience your original research description is shown; however, do not simply copy/paste this description into the Statement of Use, as this does not tell us what you have been doing nor what you plan to do in the future.
DUE DATE: October 1, 2012
We will close accounts and delete files on October 2 if we do not hear from you. If you are reading this and have not received a renewal notice, please contact brazos-help //at// listserv.tamu.edu immediately. Group leaders: verify that everyone in your group has received the notice and responds by October 1.
Texas A&M is a member of SURAgrid, which in turn is a Virtual Organization within the Open Science Grid. If you are interested in grid-enabling your application to run concurrently on multiple sites within SURAgrid please contact Steve Johnson, steve //at// isc.tamu.edu. Good candidates for deployment to SURAgrid include projects using R, Octave, and other single-core or single-node applications.
The deployment of the Fraunhofer parallel filesystem has thus far been a success, and we will make it available for broader usage soon. MPI users who want to make use of 3-way and 6-way I/O striping should contact brazos-help //at// listserv.tamu.edu; some Fraunhofer directories have been configured for this purpose.
Thanks to all who cleaned files out of the /data filesystem.
We've had a couple of emergency maintenance periods in the past few weeks to address issues with the /data fileserver. The problems appear to be fixed. There will be another unscheduled maintenance event for the Fraunhofer /fdata storage in the next few days to replace a failed system disk in one of the storage servers. We plan to do this "live" but I/O to /fdata will hang while the storage server is offline.
Six new 32-core nodes with 128GB RAM have been added to the "ib" partition. This brings our total core count to 2,848.
R-2.15.1 has been installed on the Brazos cluster: module load r/2.15.1/intel/64. This R was compiled with the commercial Intel compilers and linked against the Intel MKL optimized math libraries (BLAS and LAPACK). It is 20% to 400%+ faster than the previous non-Intel-compiled R-2.14.1. The more linear-algebra-intensive the code, the greater the speed improvement.
By default R-2.15.1 will only use one CPU thread. To make more threads available for R to use, you can use the MKL_NUM_THREADS environment variable:
module load r/2.15.1/intel/64
export MKL_NUM_THREADS=4             # use a max of 4 CPU cores; 'unset MKL_NUM_THREADS' to use all cores
R CMD BATCH source1.r source1.out &  # node has 8 CPU cores total, so we
R CMD BATCH source2.r source2.out &  # start two R processes using a max of 4 CPUs each
# R CMD BATCH source3.r source3.out &  # maybe starting 3+ R processes would work well (test!)
wait
Using more than one CPU thread is generally only beneficial for linear algebra intensive R code and may not be advisable if multiple R jobs are run in parallel.
The older Brazos R versions 2.11.1 and 2.12.2 will be removed but can be made available to users who need them. Please contact us if you still need access to the older R modules.
32-Core Nodes Available
Eight nodes, c0533-c0540, have 32 AMD Opteron 6136 (Interlagos) cores running at 2.4GHz and 64GB of DDR3 memory. These nodes are available in the "ib" partition and the queues that have access to that partition. If you want to use all 32 cores for a single job, submit your job with the ppn=32 node limit.
qsub -q iamcs -l nodes=4:ppn=32 myMPIjob.pbs
qsub -q background -l nodes=1:ppn=32 myjob.pbs
Even if you submit a job with ppn=8, your job may still end up on a 32-core node. You can examine /proc/cpuinfo to determine the number of cores and have your job use more than 8 cores:
ncores=`grep -c processor /proc/cpuinfo`
# use this $ncores variable in your script
Additional IAMCS Nodes
Six additional IAMCS nodes have arrived. These nodes have 32 AMD Opteron 6212 (Bulldozer) cores running at 2.8GHz, 128GB of DDR3-1600MHz RAM, and an Infiniband interface. The nodes will be available in July 2012 in the "ib" partition. See the batch information page for queues with access to this partition.
Hurr Login Node Upgraded
The hurr login node has been upgraded to a new system with 16 AMD Opteron 6212 cores running at 2.8GHz and 64GB of DDR3-1600 RAM. It also has new Infiniband and 10Gb Ethernet cards. Use care when compiling with the PGI 12.5 compilers (see below), as code generated for the AMD 6212 Bulldozer processor may not run on the compute nodes.
Colorado Graphics Node
A graphics front end named "colorado" has been added to the cluster. Like the hurr login node, colorado has 16 AMD 6212 cores and 64GB of RAM. It also has an NVIDIA Tesla 2070 GPU for remote graphics and CUDA computations. The software used for remote graphics is VirtualGL, which enables graphics- and data-intensive applications such as ParaView, VTK, and VisIt to run on colorado and be displayed back to your desktop or laptop. Please e-mail BRAZOS-HELP //at// listserv.tamu.edu for more information.
Fraunhofer Filesystem Testing Nearing Completion
Testing of the Fraunhofer Filesystem (FhGFS) has been underway since early May. FhGFS will be replacing the Gluster Filesystem, which proved to be far too problematic in a production environment. Preliminary results show that performance is several times better than GlusterFS, and it has been very stable. A high-performance metadata server is on order. When it arrives, we'll give it about a week of testing and benchmarking before making the filesystem available to all users under the /fdata directory. This filesystem will initially have a limited amount of storage available. We will eventually fold the current 33TB /hdata fileserver into /fdata.
A new background queue, bgsc, has been created for running single-core jobs. This queue will stack multiple jobs onto a single node. Like the background queue, bgsc jobs are preemptible. Also, instead of submitting directly to the bgsc queue, we ask that users submit to the bgscrt "routing queue" in front of it. The current soft run-limit for this queue is 256 jobs with a hard limit of 320 jobs. We will adjust these values and will consider per-user limits based on usage and demand for this queue. Use the following command to submit your single-core background jobs:
qsub -q bgscrt -l nodes=1:ppn=1 -l mem=1900mb myjob.pbs
PGI 12.5 Compilers Installed
See the Compilers page for usage information.
Idle nodes are just like wasting water
The cluster has been very busy this summer. This is fantastic, as we're getting good science done every day! However, please be mindful of other users when running both parallel and serial jobs. If you run a parallel job, ensure that it uses all of the nodes that you have requested. If there are idle nodes we will terminate the job, because these nodes could be used by other users waiting in the queues. If you have multiple single-core jobs please try to combine these into a single job or use the bgscrt queue described above. If you need assistance please e-mail BRAZOS-HELP //at// listserv.tamu.edu.