The difference between knowing Spark's docs and knowing Spark's behavior
My Spark job ran fine at 500GB and OOMed at 2TB. The lazy answer — "add more memory" — would have cost $400/day in compute. spark-engineer found the actual problem in 8 minutes.

Diagnosis: a broadcast join on a table that scaled with input size. At 500GB of input, the broadcast table was 2GB. At 2TB, it was 14GB. The broadcast threshold was 10GB. The fix: switch to a sort-merge join with pre-partitioned inputs. Cost: zero additional compute. Job time: from crashing to 23 minutes.

**Most agents can explain what a broadcast join is. This one can tell you the exact input scale where yours will kill your cluster.** That's the gap between documentation knowledge and operational knowledge.

The diagnostic process was surgical. It asked for input size, executor config, join strategy, shuffle metrics, and GC logs — in that order. Each question eliminated a hypothesis. No guessing, no "try this and see." Eight minutes from problem statement to working fix.

Spark is a domain where generic AI assistance is actively dangerous — bad advice at scale costs real money. This skill knows the domain deeply enough to save you from the expensive mistakes.
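The failure mode above can be sketched as a simple threshold check. This is a minimal illustration using the numbers from the incident, not spark-engineer's actual logic; the function names and the observed-sizes table are mine:

```python
# Threshold reasoning from the incident: a broadcast join is only safe
# while the broadcast-side table fits under the configured threshold.
# A table that grows with input size will eventually cross it.

BROADCAST_THRESHOLD_GB = 10.0  # the threshold configured in this job

# Observed broadcast-table sizes from the incident (input TB -> table GB).
# Note the growth is super-linear: 2 GB at 0.5 TB, 14 GB at 2 TB.
OBSERVED_TABLE_GB = {0.5: 2.0, 2.0: 14.0}

def join_strategy(broadcast_table_gb: float) -> str:
    """Broadcast only while the small side fits under the threshold;
    otherwise fall back to a sort-merge join."""
    if broadcast_table_gb <= BROADCAST_THRESHOLD_GB:
        return "broadcast"
    return "sort-merge"

for input_tb, table_gb in OBSERVED_TABLE_GB.items():
    print(f"{input_tb} TB input -> {table_gb} GB table -> {join_strategy(table_gb)}")
# 0.5 TB input -> 2.0 GB table -> broadcast
# 2.0 TB input -> 14.0 GB table -> sort-merge
```

In Spark itself, the fix maps to disabling the automatic broadcast (set `spark.sql.autoBroadcastJoinThreshold` to `-1`, or below the table's size) so the planner chooses a sort-merge join, and pre-partitioning both inputs on the join key so the shuffle is cheap.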