Let's talk best-practice Jenkins on AWS ECS
[seen on reddit but no discussion - if it's not okay to seek out better discussion here after seeing something fall flat on reddit, I am very sorry and I'll delete promptly]
I've had some experience in this realm for a while now, but I'm having a little trouble with one issue in particular. Before I divulge, I'll present my thoughts on best practice and what I've been able to implement:
- Terraform everything (in accordance with terragrunt's "style guide", i.e. organization)
- THIS IS A BIG ONE: for the jenkins master task, pass the following JVM args so jenkins jobs aren't super slow as hell to start:
-Djava.awt.headless=true -Dhudson.slaves.NodeProvisioner.initialDelay=0 -Dhudson.slaves.NodeProvisioner.MARGIN=50 -Dhudson.slaves.NodeProvisioner.MARGIN0=0.85
THIS IS A GAME CHANGER (more-so on k8s clusters when the ecs plugin isn't used... hint, it's shit).
- Create an EFS (in a separate terraform module) and mount it to the jenkins ECS cluster at /var/jenkins_home. Makes jenkins much more reliable through outages and easier to upgrade.
- Run a logging agent (via docker container) like logspout or newrelic or whatever IN USER_DATA and not as a task - that way you get logs if there are issues during user_data/cloud_init... this I'm actually not sure about. Running a container outside the context of an ECS task means the ECS agent can't really track it and allocate mem/cpu properly... but it does help with user_data triage.
- Use pipelines and git plugins to drive jobs. All jenkins jobs should be in source control!
- Make sure you set up docker cleanup jobs on DAY 1! If you have limited access to your cluster and you run out of disk due to docker cache, networks, volumes, etc... you're screwed till the admin ssh's in and runs a prune. Get a docker system prune going, or the equivalent for each docker resource with appropriate filters... i.e. filter for anything that is dangling and older than a few days.
- Use Jenkins Global Libraries to make Jenkinsfiles cleaner (I always just use vars instead of groovy/java style packages because it's easier and less ugly)
- Jenkinsfiles should mostly call other bash files, make files, python scripts to generate and load prop files, etc. The less logic you put in a Jenkinsfile (which is just modified groovy) the better. String interpolation, among other things, is a fuckery that we don't have time to triage.
- (out-of-scope) Move to using k8s/EKS instead of ECS asap because the ECS plugin for jenkins is absolute shit and it doesn't use priority correctly (sorry whoever developed it and... oh wait, abandoned it and hasn't merged anything for years... but for real it's cool, just give admin to someone else).
- (cultural) Stop calling them slaves. "Hey @eng, we're rotating slaves due to some cache issues. If you have been affected by race conditions in the past, our new update and slave rotation should fix that. Our update may have killed your job that was running on an old slave, just wait a few and the new slaves will be ready" <--This just doesn't look good.
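For the docker cleanup point above, here's a minimal sketch of what a "prune each resource with filters" job might look like, written as a Python script a Jenkins job or cron could call. The function names and the 72h default are made up for illustration. It defaults to a dry run that only prints the commands. Note that `docker volume prune` doesn't accept an `until` filter, so volumes are pruned without an age limit here; drop that entry if that's too aggressive for your setup.

```python
import subprocess


def build_prune_commands(max_age="72h"):
    """Return one prune command per Docker resource type.

    max_age is an assumed default; tune it to how long you want cache kept.
    Without --all, `docker image prune` only removes dangling images.
    """
    age_filter = ["--filter", f"until={max_age}"]
    return [
        ["docker", "container", "prune", "--force", *age_filter],
        ["docker", "image", "prune", "--force", *age_filter],
        ["docker", "network", "prune", "--force", *age_filter],
        # volume prune has no `until` filter, so no age limit applies here
        ["docker", "volume", "prune", "--force"],
    ]


def run_prune(max_age="72h", dry_run=True):
    """Print the prune commands, or execute them when dry_run is False."""
    for cmd in build_prune_commands(max_age):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: one failing prune shouldn't stop the others
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    run_prune(dry_run=True)
```

Flip `dry_run` to `False` once the printed commands look right for your hosts.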
Hope that was some good stuff for you guys. Maybe I'm preaching to the choir, but I've seen some pretty shit jenkins setups.
NOW FOR MY QUESTION!
Has ANYONE actually been able to set up a proper jenkins user on ECS that actually works for both a master and ephemeral jenkins-agents, so that they can mount and use the docker.sock for builds without hitting permission issues? I'm talking about using the ecs plugin and mounting docker.sock via that.
I have always resorted to running jenkins master and agents as root, which means you have to chmod files (super expensive in time and cpu for services with tons of files). Running microservices as root is obviously bad practice, and chmod-ing a zillion files is shit for docker cache and time... so I want to get jenkins users able to utilize the docker.sock. THIS IS SPECIFICALLY FOR THE AWS ECS AMI! I don't care about debian or old versions of docker where you could use DOCKER_OPTS. That doesn't work on the AWS Linux image.
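To make the problem concrete: the permission errors come down to the jenkins user inside the container not being in a group whose GID matches the owner group of the mounted docker.sock. Below is a rough sketch of one workaround I've seen suggested in general (NOT verified on the ECS AMI or with the ecs plugin): read the socket's GID at container start and build the commands that would put the jenkins user in a matching group. The `dockerhost` group name and both function names are hypothetical, and the commands would have to run as root in an entrypoint before dropping to the jenkins user.

```python
import os
import stat


def socket_gid(path="/var/run/docker.sock"):
    """Return the numeric group ID that owns the mounted Docker socket."""
    st = os.stat(path)
    if not stat.S_ISSOCK(st.st_mode):
        raise ValueError(f"{path} is not a socket")
    return st.st_gid


def group_setup_commands(gid, user="jenkins", group="dockerhost"):
    """Build (but do not run) the commands that add `user` to a group
    with the same GID as the socket. `dockerhost` is a made-up name."""
    return [
        # --non-unique tolerates a GID already taken inside the image
        ["groupadd", "--non-unique", "--gid", str(gid), group],
        ["usermod", "--append", "--groups", group, user],
    ]


if __name__ == "__main__":
    try:
        for cmd in group_setup_commands(socket_gid()):
            print(" ".join(cmd))
    except (OSError, ValueError) as err:
        print(f"docker.sock not usable here: {err}")
```

This only prints what it would do. Whether the ecs plugin lets you run an entrypoint like this for agents is exactly the part I can't confirm.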
Thanks! And happy Friday!
Yeesh, same. You'd think they'd have better redundancy and disaster recovery automation being a medium that can be heavily utilized for ops automation.