<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-05-14T15:12:30-07:00</updated><id>/feed.xml</id><title type="html">Wen’s Winsome Website</title><subtitle>Welcome!</subtitle><entry><title type="html">Words That Belong to Someone</title><link href="/posts/words-that-belong-to-someone" rel="alternate" type="text/html" title="Words That Belong to Someone" /><published>2026-02-10T11:00:00-08:00</published><updated>2026-02-10T11:00:00-08:00</updated><id>/posts/words-that-belong-to-someone</id><content type="html" xml:base="/posts/words-that-belong-to-someone"><![CDATA[<p>I have gotten some genuinely good advice from Claude, the kind where a conversation lands well and something clicks, like a reframing. This week, something felt off. It was not a change in the quality of the answers, but something about the nature of the exchange itself. I’ve been trying to pin down what bothered me, and the simplest way I can put it is this:</p>

<p><strong>LLMs optimize for global coherence across a distribution. Humans earn local coherence across a life. The latter is the kind we know how to trust.</strong></p>

<p>Let me be precise about what I mean. I am not making a claim about the epistemic quality of LLM outputs; the advice is often wise, sometimes remarkably so. What I am pointing at is a property of the source: whether the words I am receiving are related through a coherent set of actions and experiences that can be verified.</p>

<p>Here is what that looks like in practice. I have a meditation teacher who has spent decades easing people’s pain. He has sat with the dying in hospitals, held space in prisons, and built communities rooted in kindness. When he speaks about suffering, I trust him not because his words are clever, but because they are consistent with what I have heard about him and my experience of talking to him. They belong to a specific life, shaped by specific choices, bound by specific commitments. He cannot suddenly think or act like a morally questionable CEO without contradicting everything he’s built. That boundedness is not a limitation. It is the very thing that makes his words trustworthy.</p>

<p>His words are compressions of real experience. When I bring him my own tangled dilemmas, I trust that he can unpack them, because he has lived or witnessed similar experiences. Poems work the same way. They don’t try to teach you something new so much as they evoke, activating what the reader already carries. Proust made a similar observation about reading more broadly: “Every reader, as he reads, is actually the reader of himself.” The poets and writers trust the readers to reach into their own memory and unpack a few spare words into something rich and lived.</p>

<p>An LLM, by contrast, draws from the writings of many, many people, distilled into something that meets me where I prompt. The coherence of its output is global: reliably centered, broadly helpful, sourced from everywhere and nowhere in particular. There is no set of coherent real experiences behind it that I can check the words against. No way to trace whether the source holds together.</p>

<p>But what about books? I don’t know Marcus Aurelius personally. I can’t verify his life against his Meditations. And yet the Stoics have helped people for millennia.</p>

<p>To me, a book is still locally coherent. The Meditations are bounded by one life, one set of commitments and sacrifices, one character forged under specific pressures. Aurelius won’t pivot to Machiavelli halfway through. You can read the text and sense the limits of where it comes from, what it has authority over and what it doesn’t. A dead author’s words are still bound by the life that produced them. An LLM’s words are not constrained by any life. It has no principles to betray, and so it has no commitments to uphold.</p>

<p>This matters most when the conversation goes deep. When I sit with another person over months or years and talk about how to live and what matters, something accumulates between us. There is a structural reason why ongoing relationships produce better advice, not just warmer feelings. Buber distinguished between I-Thou and I-It relationships. In an I-Thou encounter, both people are present as whole beings, and both are changed. When I sit with a teacher over years, they come to know what my blind spots are, what I’ve already tried, which of my stated goals are real and which are avoidance. Those perspectives make the advice better. The relationship is I-Thou. With an LLM, however good the advice, the relationship is I-It. I am extracting something useful from it. It is not encountering me.</p>

<p>So here is where I land. The LLM offers global coherence: the best of what many people have thought, surfaced with surprising relevance. Encounters with real people, such as my meditation teacher, offer local coherence: the depth of what one person has lived, prescribed and proven by the shape of that life. Both have value. But they are not the same kind of value. What I am often looking for is not only wise words. It is wisdom that belongs to someone, words that have been earned.</p>

<p>A few questions I’m genuinely uncertain about, and would love to think through with others:</p>

<p>If the value of a human teacher is partly that their words are tied to who they are, that a meditation teacher cannot suddenly optimize solely for self-interest without contradicting everything he’s built, is the absence of that constraint in LLMs a feature or a failure? And can it be engineered back in, or is it the kind of thing that only a life can produce?</p>

<p>If an ongoing relationship, one where mutual understanding deepens over time, is not a side effect of transformative conversation but part of what makes it transformative, what does it mean to have increasingly good conversations with something that will never be transformed by them?</p>

<p>Finally, if the local coherence of a person is the best assurance I have that someone who doesn’t know me personally would be kind to me — that their words and actions are prescribed and proven by a life they have led — could something like that be applied to ensure LLMs are not going to hurt us? Or does safety without personhood reduce to something different entirely: not trustworthiness, but compliance?</p>

<hr />

<p><em>Image: <a href="https://rupertarzeian.com/wp-content/uploads/2021/03/wp-1615658247505.jpg">Rupert Arzeian</a></em></p>]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[I have gotten some genuinely good advice from Claude, the kind where a conversation lands well and something clicks, like a reframing. This week, something felt off. It was not a change in the quality of the answers, but something about the nature of the exchange itself. I’ve been trying to pin down what bothered me, and the simplest way I can put it is this:]]></summary></entry><entry><title type="html">Can Reasoning Models Obfuscate Reasoning?</title><link href="/posts/cot-monitorability" rel="alternate" type="text/html" title="Can Reasoning Models Obfuscate Reasoning?" /><published>2025-10-21T12:00:00-07:00</published><updated>2025-10-21T12:00:00-07:00</updated><id>/posts/cot-monitorability</id><content type="html" xml:base="/posts/cot-monitorability"><![CDATA[<p>We stress-test the monitorability of chain-of-thought reasoning in language models, investigating whether reasoning models can obfuscate their reasoning processes.</p>

<p><a href="https://arxiv.org/abs/2510.19851">Read the paper on arXiv →</a></p>

<p><strong>Paper accepted at NeurIPS 2025 FoRLM workshop.</strong> Findings used by Anthropic in <a href="https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf#page=143.14">Claude Opus 4.6 safety evaluations</a> and cited by OpenAI’s <a href="https://cdn.openai.com/pdf/d57827c6-10bc-47fe-91aa-0fde55bd3901/monitoring-monitorability.pdf">Monitoring Monitorability</a> paper.</p>]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[We stress-test the monitorability of chain-of-thought reasoning in language models, investigating whether reasoning models can obfuscate their reasoning processes.]]></summary></entry><entry><title type="html">Climbing the Royal Arches</title><link href="/posts/royal-arches" rel="alternate" type="text/html" title="Climbing the Royal Arches" /><published>2025-06-10T12:00:00-07:00</published><updated>2025-06-10T12:00:00-07:00</updated><id>/posts/royal-arches</id><content type="html" xml:base="/posts/royal-arches"><![CDATA[<figure>
  <img src="/images/arches.jpeg" alt="Royal Arches" />
  <figcaption>The orange circle indicates where the pendulum is on the climbing route from the ground.</figcaption>
</figure>

<p>Every spring and fall when I visit Yosemite, I look up at Royal Arches from the valley floor and trace the line on the 2,000-foot cliff. It sits right next to the granite exfoliation arches facing Half Dome—striking, unmistakable. For years I’ve imagined getting on this all-time classic. This May, at a moment when I didn’t know what came next in my life, I finally did.</p>

<p><img src="/images/route.jpeg" alt="Route" style="float: right; width: 265px; margin: 0 0 15px 20px;" /></p>

<p>The climb happened serendipitously. I had just met John a few days earlier on a Tahoe trip through a friend. He was very capable and great to climb with. He was heading to Yosemite afterward and invited people to join him; I didn’t have a fixed schedule and I love Yosemite, so I joined him. After climbing another valley classic and checking the weather, we decided on Royal Arches just two days before. Sometimes the biggest days come together like that.</p>

<p>It was a 15-hour car-to-car day: 12 hours climbing, 2.5 hours rappelling. 350 meters of vertical gain across 15 pitches. I slept lightly the night before. A raccoon came into the tent and triggered my AirTag alarm. I was up before 5am anyway.</p>

<div style="clear: both;"></div>

<p><img src="/images/chimney.jpeg" alt="Chimney" style="float: left; width: 280px; margin: 40px 20px 15px 0;" /></p>

<p>The first few pitches were enjoyable—chimney climbing, which I’d been nervous about since I don’t have much experience squeezing between giant rocks and pushing against each side with back, feet, and palms. It’s nothing like what’s in the climbing gyms. But it went fine. The morning was cool, and I led a few easy pitches. Mostly walking though, not the sustained climbing I crave. My mind wandered to what I’d been trying not to think about: my transition into research had been uncertain, and I might need a backup plan soon.</p>

<div style="clear: both;"></div>

<p><img src="/images/route_finding.jpeg" alt="Route finding" style="float: right; width: 315px; margin: 25px 0 15px 20px;" /></p>

<p>Because the climb was mostly easier than my grade, I wasn’t super focused. I spent energy worrying and rushing instead—about rappelling in the dark, about finding anchors I couldn’t see, about summer. Perched on a small granite platform with thousands of feet of nothing beneath me, reading the topo on my phone, I realized I need to get good at route-finding quickly—on Royal Arches and in my life. I felt confident about my skill, rope systems, and our teamwork, but I didn’t feel in control of our fate. Compared to my first Yosemite multi-pitch two years ago, I was more capable but less stoked—more worried because I know more about what can go wrong now. At least I wasn’t wishing I was somewhere else.</p>

<div style="clear: both;"></div>

<iframe width="315" height="560" src="https://www.youtube.com/embed/xScB6UZP0Zc" frameborder="0" allowfullscreen="" style="float: right; margin: 0 0 15px 20px;"></iframe>

<p><strong>The turning point:</strong> The pendulum swing changed everything. Soon after lunch, we reached the base of the pendulum pitch. On my third attempt, I ran horizontally across the rock face, reached a far crimp on the left, and crossed to the other side. It was my first time running on a vertical rock wall. It demanded my full attention—and it was so fun. I finally felt peace afterward, and started noticing how beautiful the valley looked. The pendulum was the answer I needed in a way: when the terrain requires everything you have, there’s no room for worry. Other highlights followed—the airy traverse at the end of pitch 11, the pin scar climb, strenuous corner laybacks. John led all the hard pitches. I could not have done this climb without his knowledge and steady leading.</p>

<div style="clear: both;"></div>

<p>The other memorable pitch I led was a horizontal traverse—pitch 11—getting over a big bulge off the cliff. The sequence starts by climbing on top of the bulge, where the crack between it and the main cliff is the only place to put protection. The hard part: the bulge was taller than me when I started, so I had to climb it blind, unable to see what was ahead. The ultimate test of solving problems as they come. None of the moves were difficult, but I felt shaky.</p>

<p>The traverse soon turned into a right-to-left horizontal hand-sized crack. I inched forward hand over hand, one hand jamming in the crack to keep myself on the rock, one reaching forward while pressing my feet hard against the rock so that the opposing pressure kept me stable. Nothing but air beneath my feet.</p>

<p>Like any trad lead, it was a balance between placing protection and preserving energy. More gear means a safer fall—if I peeled off while moving horizontally, I’d only swing back to my last piece instead of taking a massive pendulum. But placing gear in a horizontal crack wasn’t fast for me. I was hanging on the friction between one hand and the crack, my feet not helping much, and my back started to tire. I’m not the best endurance climber.</p>

<p>At one point, really straining, I looked down at my last piece and mentally calculated how bad the swing would be. Even though I trusted the gear would hold, the swing would slam me into the rock to my right. That thought made me tense up. I gritted my teeth and kept moving left.</p>

<p>There was a foot rail next—it felt so thin that if I stayed static, I’d slip off. So I didn’t stay static. I kept going. Just as I was about to lose my balance, I suddenly noticed a tree branch right in front of me, growing out of the platform I needed to reach. I grabbed it, stabilized, scrambled onto solid ground, and let out a long exhale.</p>

<p>Looking back, the drama was probably mostly in my head. I psyched myself out. It was a beautiful hand crack, and I wish I had enjoyed it more. As is often the case, when I’m excited about a climb and focusing entirely on figuring out the route and moving well, I usually enjoy it. But if I fixate on what could go wrong, I tense up, making the climb harder and a fall more likely. Still a work-in-progress on my mental game.</p>

<p><strong>The surprise:</strong> The thing I’d worried about all day turned out to be a gift. Rappelling in the dark was serene—just spots of campfires below, some stars above. I immediately understood the appeal of El Cap climbers spending nights on the wall. The Royal Arches rappel is well bolted, our grigri simul-rappel system worked beautifully, and the darkness that had loomed over me all day became something peaceful.</p>

<p><img src="/images/cliffs.jpeg" alt="Cliffs" style="float: left; width: 280px; margin: 0 20px 15px 0;" /></p>

<p>This was my first taste of a long ascent from Yosemite valley. Short approach, high-quality climbing, a full day among Yosemite granite. I’m grateful for John’s partnership and for a climb that taught me something: I can study the topo, but I can’t see every hold from the ground. At some point I have to start climbing and trust that I’ll solve problems as they come and not let the fear of uncertainty distract me from the present moment.</p>

<p><strong>Lessons:</strong> Place protection during easy climbs, and double-check footing. Without that, I would have stepped into a void hidden under dead leaves and fallen more than 20 feet. Also: eat more. I was definitely undereating—next time, more protein bars, fewer fig bars, and lunch before noon.</p>]]></content><author><name></name></author><category term="outdoors" /><summary type="html"><![CDATA[The orange circle indicates where the pendulum is on the climbing route from the ground.]]></summary></entry><entry><title type="html">Vulnerability in Trusted Monitoring and Mitigations</title><link href="/posts/vulnerability-trusted-monitoring" rel="alternate" type="text/html" title="Vulnerability in Trusted Monitoring and Mitigations" /><published>2025-06-07T12:00:00-07:00</published><updated>2025-06-07T12:00:00-07:00</updated><id>/posts/vulnerability-trusted-monitoring</id><content type="html" xml:base="/posts/vulnerability-trusted-monitoring"><![CDATA[<p>Research conducted as part of AI Safety Camp, exploring vulnerabilities in AI monitoring systems and developing robust mitigation strategies.</p>

<p><a href="https://www.lesswrong.com/posts/jJRKbmui8cKcoigQi/vulnerability-in-trusted-monitoring-and-mitigations">Read the post on LessWrong →</a></p>]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[Research conducted as part of AI Safety Camp, exploring vulnerabilities in AI monitoring systems and developing robust mitigation strategies.]]></summary></entry><entry><title type="html">Mt Conness North Ridge Climb</title><link href="/posts/conness" rel="alternate" type="text/html" title="Mt Conness North Ridge Climb" /><published>2024-10-01T19:26:04-07:00</published><updated>2024-10-01T19:26:04-07:00</updated><id>/posts/conness</id><content type="html" xml:base="/posts/conness"><![CDATA[<p><img src="/images/conness-northridge.jpg" alt="Mt Conness North Ridge" title="This is the fun 4th class part." /></p>

<p>It was a 14-hour day: 4 hours up, 6 hours scrambling, and 4 hours down. It was one of those rare days one gets to spend entirely among the pristine granite of the High Sierra. We first looked up at the granite peaks, then saw more of the Sierra range near and far as we climbed higher, and finally we enjoyed the range of light at sunset on these rocks during the descent. I felt very alive the whole way, very focused, very in tune with my body. I was at the edge of my ability during many parts of the day, mostly during the hiking, but also when doing slab moves near the summit with only short-rope protection. Definitely grateful for the breathtaking view, and the wonderful experience traversing the knife-edge ridges.</p>

<p><img src="/images/conness-climbing.jpg" alt="Mt Conness North Ridge Climbing" title="Ridge Traverse" /></p>

<p>Fun part: From afar, Mount Conness’s North Ridge is very long, like a dragon’s spine. It was fun to move along the dragon’s spine, and even more fun when it got spikier: 4 hours of up-to-5th-class scrambling at 12,000 ft elevation. Altitude affects me a lot, but thanks to the built-in breaks during roped climbing, I did okay. It was really cool to see that the mellow-looking ridge we had seen from Mt. Dana is actually made of giant spiky rocks with gaps in between. So occasionally we went around a rock tower or dropped lower for a better traverse; sometimes we walked over the ridge, balancing on thin rock edges; and other times we hopped over deep gaps between rocks, through which we could see the ground hundreds of feet below. I really enjoyed the variety in the climbing: traverses, foot rails, slabs, cracks, even a chimney. Compared to my previous 5.8 alpine climbing experience, I still prefer climbing, but this is really not far behind, and I’m glad to have learned to move through different terrain safely and efficiently.</p>

<p><img src="/images/conness-rappel.jpg" alt="We rappelled to get to the base of the summit" title="We rappelled to get to the base of the summit." />
We rappelled to reach the base of the summit block before our final push.</p>

<p>I expected this to be at the edge of my ability, given my recent ankle injury and lack of high-altitude training. Overall I’m really glad we pushed for the summit, even though it was a big day on entry-level alpine terrain. I’m extra glad I completed the climb without injury. We had enough food, headlamps, and the ability to filter water, so minimal suffering, really. But it still kicked my butt, because I’m solidly a noob alpine climber. Thanks to Paloma’s guiding, we were able to mostly focus on moving and enjoying being on Mt Conness, leaving her the hard part of navigation and route finding.</p>

<p>The last two hours became an exercise in not letting pain become suffering. My feet ached with each step, and every ten minutes felt an hour long. But I managed not to zone out and stayed with myself, and even managed to briefly pay attention to the beautiful forest we were in, a passing deer, and the Big Dipper above us.</p>

<p>I would like to keep experiencing the beauty of wilderness and pushing to the edge of my ability, but I feel very undertrained, especially compared to my lifting competition prep. It was tempting to do these kinds of big objectives without training. If I judged this climb solely on the outcome, as I used to, I probably would have chalked it up as a success. But since I’m now trying to focus on the journey and having a deep experience in each moment, doing things this way is not likely to lead to enjoyable ascents of higher, more technical peaks (e.g., climbing Mt. Whitney!).</p>

<p>Such is the learning from Mt Conness. It will take training to turn these intense one-off trips into consistent enjoyment. I will need to find a way to integrate these peaks into my life through regular training: ankle balancing, trail running, and cardio.</p>]]></content><author><name></name></author><category term="outdoors" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Towards Efficient Feature-Splitting</title><link href="/posts/towards-efficient-feature-splitting" rel="alternate" type="text/html" title="Towards Efficient Feature-Splitting" /><published>2024-08-31T09:06:03-07:00</published><updated>2024-08-31T09:06:03-07:00</updated><id>/posts/towards-efficient-feature-splitting</id><content type="html" xml:base="/posts/towards-efficient-feature-splitting"><![CDATA[<h2 id="summary">Summary</h2>
<p>This research establishes the possibility of efficiently splitting a single feature of interest in a narrow Sparse Autoencoder (SAE) into multiple features, without training an entirely new, wider SAE. The study focuses on feature 240 in a 512-feature SAE trained on the gelu-1l model’s MLP output. The idea was inspired by Neel Nanda’s Exciting Open Problems in Mech Interp v2.</p>

<h3 id="key-results">Key Results:</h3>
<ul>
  <li><strong>Successful Feature-Splitting</strong>: Achieved by filtering tokens based on feature 240 activation in the 512-feature SAE and using this targeted data to train a 4096-feature SAE.</li>
  <li><strong>Improved Training Efficiency</strong>: Initializing the 4096-feature SAE with weights and biases from the 512-feature SAE demonstrated faster convergence and lower loss.</li>
  <li><strong>Reproducibility</strong>: Succeeded repeatedly across multiple runs with different random seeds, showing robustness of the method.</li>
</ul>

<h2 id="details">Details</h2>

<h3 id="selecting-features-to-split">Selecting Features to Split</h3>
<p>While interpreting individual SAE features previously, I built a series of tools, including one built with TransformerLens, called <code class="language-plaintext highlighter-rouge">highest_activating_tokens</code>, that finds the tokens with the highest activations for any given feature. In the GELU-1L MLP-out SAE, I noticed a feature that activates on a hyphen (“-”) in many contexts.</p>
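<p>A minimal sketch of such a tool, assuming a TransformerLens-style model (with <code class="language-plaintext highlighter-rouge">run_with_cache</code> and a <code class="language-plaintext highlighter-rouge">blocks.0.hook_mlp_out</code> activation hook) and an SAE exposing <code class="language-plaintext highlighter-rouge">W_enc</code>, <code class="language-plaintext highlighter-rouge">b_enc</code>, and <code class="language-plaintext highlighter-rouge">b_dec</code>; the names here are illustrative, not the exact code I used:</p>

```python
import torch

def highest_activating_tokens(model, sae, tokens, feature_idx, k=10):
    """Find the k (batch, position) token slots whose MLP-out activations
    most excite one SAE feature. Hook and attribute names are illustrative."""
    _, cache = model.run_with_cache(tokens)
    mlp_out = cache["blocks.0.hook_mlp_out"]              # [batch, pos, d_model]
    # Standard SAE encoder: ReLU((x - b_dec) @ W_enc + b_enc)
    feature_acts = torch.relu(
        (mlp_out - sae.b_dec) @ sae.W_enc + sae.b_enc
    )[..., feature_idx]                                   # [batch, pos]
    top_vals, top_idx = feature_acts.flatten().topk(k)    # sorted descending
    n_pos = tokens.shape[1]
    return [
        (int(i // n_pos), int(i % n_pos), float(v))       # (batch, pos, activation)
        for i, v in zip(top_idx, top_vals)
    ]
```

<p>The returned (batch, position, activation) triples can then be mapped back to token strings for inspection.</p>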

<p><img src="/images/featuresplit-f1.png" alt="Tokens most activating feature 240" title="Tokens most activating feature 240." /></p>

<p>In contrast, I know that in some versions of the 4096-feature SAE (referred to as run-25), there is a feature that activates specifically on the “-” in contexts like “multi-disciplinary” or “single-spaced”, where the word preceding the “-” expresses a count. The feature’s activation meets the following requirements:</p>
<ol>
  <li>The token “-” in the prompt “I’m a multi-millionaire” activates this feature with the highest activation, and no other feature comes close in activation value (Figure 2);</li>
  <li>The tokens that most activate this feature follow the pattern (Figure 3);</li>
  <li>The top boosted logits for this feature are words that make sense following “multi-” (Figure 4);</li>
  <li>The “-” in related prompts, “that encapsulated the long-standing” and “A 3-point landing”, only weakly activates this feature (Figure 5).</li>
</ol>

<p>By modifying <code class="language-plaintext highlighter-rouge">highest_activating_tokens</code> slightly, wrapping the features in an additional dimension and passing that in as input, I was able to get the top features activated by each token in any given short prompt.
<img src="/images/featuresplit-f2.png" alt="Tokens most activating feature 240" title="Tokens most activating feature 240." /></p>

<p><img src="/images/featuresplit-f345.png" alt="Tokens most activating feature 240" title="Tokens most activating feature 240." /></p>
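<p>The modified version is essentially the transpose of the original tool: for each token position, return the top-activating features, instead of the top-activating tokens for one feature. A sketch, again assuming a TransformerLens-style model and an SAE with <code class="language-plaintext highlighter-rouge">W_enc</code>/<code class="language-plaintext highlighter-rouge">b_enc</code>/<code class="language-plaintext highlighter-rouge">b_dec</code> attributes (names illustrative):</p>

```python
import torch

def top_features_per_token(model, sae, tokens, k=5):
    """For every token position, return the k SAE features with the highest
    activations. Hook and attribute names are illustrative."""
    _, cache = model.run_with_cache(tokens)
    mlp_out = cache["blocks.0.hook_mlp_out"]                          # [batch, pos, d_model]
    acts = torch.relu((mlp_out - sae.b_dec) @ sae.W_enc + sae.b_enc)  # [batch, pos, d_sae]
    vals, idx = acts.topk(k, dim=-1)                                  # top features per position
    return vals, idx
```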

<p>This hyphen feature is similar to the base64 feature in <a href="https://transformer-circuits.pub/2023/monosemantic-features">Bricken et al., 2023</a>, where one feature in a smaller autoencoder (512 features) split into three in a larger autoencoder (4096 features), each representing a more specific aspect of the original feature yet still interpretable. This makes the hyphen feature (feature 240) a good candidate for feature splitting.</p>

<h3 id="feature-splitting-success-criteria">Feature Splitting Success Criteria</h3>
<p>The target feature 240 in the narrow 512-feature SAE has the following characteristics:</p>
<ul>
  <li><strong>Most activating tokens</strong>: “-“ in contexts like “single-camera”, “long-standing”, “8-speed”, “60-day”, “small-scale”, “multi-channel”, etc.</li>
  <li><strong>Top boosted logits</strong>: “digit”, “second”, “person”, “quarter”, “degree”, “lap”, “figure”, “season”, “credit”, “month”</li>
  <li><strong>Bottom logits</strong>: “duties”, “norms”, “ruins”, “responsibilities”, “differently”, “sorts”, “obsc”, “secrets”, “pains”</li>
</ul>

<h3 id="sae-training-process">SAE Training Process</h3>
<ol>
  <li><strong>Trained a 512-feature SAE on the gelu-1l MLP output to look for a feature to split.</strong> For training efficiency, I chose to train the SAE on the MLP output instead of the MLP hidden layer, even though the latter was demonstrated in <a href="https://transformer-circuits.pub/2023/monosemantic-features">Bricken et al., 2023</a>. In my case, because the model residual stream has 512 dimensions and the MLP hidden layer has 2048 dimensions, 75% of the MLP activation dimensions are in the nullspace of W_out (the MLP’s down-projection output weight), and thus do not matter for the model. This means 75% of the capacity in the autoencoder would be wasted representing MLP activations that are not consequential. Instead, an autoencoder trained on the MLP output after applying W_out is 4x smaller and takes 4x less compute.</li>
  <li><strong>Checked whether it is possible to consistently split a target feature.</strong> It is, but not with the standard training process: across several versions of the 4096-feature SAE trained the standard way, I was not able to split feature 240 and meet the success criteria above.</li>
  <li><strong>Experimented with various training configurations to efficiently split a target feature in a wider SAE without training a brand-new one.</strong> I explored two categories of training configurations:
    <ol>
      <li>Preserve as much as possible of the basic features the narrower SAE already had;</li>
      <li>Isolate the target feature in the new SAE and nudge it towards splitting: train the new SAE on the reconstruction error of the narrower SAE, minus the feature we want to split.</li>
    </ol>
  </li>
</ol>
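<p>Configuration I in the table below can be sketched as copying the narrow SAE’s trained weights into the first 512 features of the wide SAE and randomly initializing the rest. The attribute names and the Kaiming-style random initialization are illustrative assumptions, not the exact code I used:</p>

```python
import torch

def init_wide_from_narrow(narrow, d_wide, normalize=True):
    """Seed a wide SAE's parameters from a trained narrow SAE.

    Assumes narrow.W_enc is [d_in, d_narrow], narrow.W_dec is [d_narrow, d_in],
    narrow.b_enc is [d_narrow], and narrow.b_dec is [d_in]."""
    d_in, d_narrow = narrow.W_enc.shape
    W_enc = torch.empty(d_in, d_wide)
    W_dec = torch.empty(d_wide, d_in)
    b_enc = torch.zeros(d_wide)
    # Random init for the new features
    torch.nn.init.kaiming_uniform_(W_enc)
    torch.nn.init.kaiming_uniform_(W_dec)
    # Copy the narrow SAE into the first d_narrow features
    W_enc[:, :d_narrow] = narrow.W_enc
    W_dec[:d_narrow, :] = narrow.W_dec
    b_enc[:d_narrow] = narrow.b_enc
    if normalize:
        # Unit-norm decoder rows, the usual SAE decoder constraint
        W_dec = W_dec / W_dec.norm(dim=-1, keepdim=True)
    return W_enc, W_dec, b_enc, narrow.b_dec.clone()
```

<p>Whether to normalize the copied decoder weights is exactly the knob varied in configuration II.</p>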

<table>
  <thead>
    <tr>
      <th>Training Configuration</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>I. Initialization weights and biases</td>
      <td>Random vs. weights and biases from the narrow 512-feature SAE</td>
    </tr>
    <tr>
      <td>II. Normalization of weights and biases</td>
      <td>With and without normalization</td>
    </tr>
    <tr>
      <td>III. Data filtering</td>
      <td>Using all data vs. filtered data activating feature 240 (the target feature)</td>
    </tr>
    <tr>
      <td>IV. Loss calculation</td>
      <td>Standard vs. setting feature 240 activation to zero</td>
    </tr>
  </tbody>
</table>

<p>I experimented with each of the training configurations separately, as well as various combinations.</p>

<p>To my surprise, training configurations I and IV were ineffective on their own and even when used together. Data filtering (III) alone ensured feature-splitting success, and combining I + III improved training efficiency.</p>
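<p>The data filtering (III) itself is simple: run the narrow SAE over each batch and keep only the sequences whose maximum activation of the target feature exceeds a threshold. A hedged sketch, assuming a TransformerLens-style model and an SAE exposing <code class="language-plaintext highlighter-rouge">W_enc</code>/<code class="language-plaintext highlighter-rouge">b_enc</code>/<code class="language-plaintext highlighter-rouge">b_dec</code> (names illustrative):</p>

```python
import torch

def filter_by_feature(batches, model, sae, feature_idx=240, threshold=2.0):
    """Keep only sequences that activate `feature_idx` above `threshold`
    somewhere in the sequence. Hook and attribute names are illustrative."""
    kept = []
    for tokens in batches:                                # each: [batch, pos]
        _, cache = model.run_with_cache(tokens)
        mlp_out = cache["blocks.0.hook_mlp_out"]
        acts = torch.relu(
            (mlp_out - sae.b_dec) @ sae.W_enc + sae.b_enc
        )[..., feature_idx]                               # [batch, pos]
        mask = acts.max(dim=-1).values > threshold        # per-sequence max
        kept.append(tokens[mask])
    return torch.cat(kept)
```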

<h3 id="training-runs-overview">Training Runs Overview</h3>
<style scoped="">
table {
  font-size: 10px;
}
</style>

<table>
  <thead>
    <tr>
      <th>Run ID</th>
      <th>Experiment Purpose &amp; Training Configurations</th>
      <th>SAE Size</th>
      <th>Seed</th>
      <th>Init with Weights &amp; Biases</th>
      <th>Init Normalized</th>
      <th>Used Filtered Data</th>
      <th>Loss Calculation Change</th>
      <th>Target Feature Split Success</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>To experiment with the combination of: 1) initializing a 4096-feature SAE with a previously trained 512-feature SAE; 2) tweaking the loss function</td>
      <td>4096</td>
      <td>40</td>
      <td>512-feature SAE</td>
      <td>YES</td>
      <td>NO</td>
      <td>Set feature 240 activation to zero</td>
      <td>YES (not robust)</td>
    </tr>
    <tr>
      <td>2</td>
      <td>To reproduce run 1 with a different seed. FAILED</td>
      <td>4096</td>
      <td>10</td>
      <td>512-feature SAE</td>
      <td>YES</td>
      <td>NO</td>
      <td>Set feature 240 activation to zero</td>
      <td>NO</td>
    </tr>
    <tr>
      <td>3</td>
      <td>To experiment with initializing a 4096-feature SAE with a previously trained 512-feature SAE without normalizing W_dec, tweaking loss function</td>
      <td>4096</td>
      <td>10</td>
      <td>512-feature SAE</td>
      <td>NO</td>
      <td>NO</td>
      <td>Set feature 240 activation to zero</td>
      <td>NO</td>
    </tr>
    <tr>
      <td>4</td>
      <td>To experiment with using filtered data (those that activated feature 240 in the 512-feature SAE) to train a 4096-feature SAE</td>
      <td>4096</td>
      <td>10</td>
      <td>NO</td>
      <td>YES</td>
      <td>YES</td>
      <td>NO</td>
      <td>YES</td>
    </tr>
    <tr>
      <td>5</td>
      <td>To reproduce run 4 with a different seed. SUCCESS</td>
      <td>4096</td>
      <td>49</td>
      <td>NO</td>
      <td>YES</td>
      <td>YES</td>
      <td>NO</td>
      <td>YES</td>
    </tr>
    <tr>
      <td>6</td>
      <td>To reproduce run 4 with a different seed. SUCCESS</td>
      <td>4096</td>
      <td>42</td>
      <td>NO</td>
      <td>YES</td>
      <td>YES</td>
      <td>NO</td>
      <td>YES</td>
    </tr>
    <tr>
      <td>7</td>
      <td>To experiment with: 1) using filtered data (those that activated feature 240 in the 512-feature SAE); 2) initializing a 4096-feature SAE with a previously trained 512-feature SAE without normalization</td>
      <td>4096</td>
      <td>42</td>
      <td>512-feature SAE</td>
      <td>NO</td>
      <td>YES</td>
      <td>NO</td>
      <td>YES</td>
    </tr>
    <tr>
      <td>8</td>
      <td>To reproduce run 7 with a different seed. SUCCESS</td>
      <td>4096</td>
      <td>48</td>
      <td>512-feature SAE</td>
      <td>NO</td>
      <td>YES</td>
      <td>NO</td>
      <td>YES</td>
    </tr>
    <tr>
      <td>9</td>
      <td>To reproduce run 7 and gain insights on training efficiency</td>
      <td>4096</td>
      <td>43</td>
      <td>512-feature SAE</td>
      <td>NO</td>
      <td>YES</td>
      <td>NO</td>
      <td>YES</td>
    </tr>
    <tr>
      <td>10</td>
      <td>To experiment with the combination of: 1) using filtered data (those that activated feature 240 in the 512-feature SAE) to train a 4096-feature SAE; 2) initializing a 4096-feature SAE with a previously trained 512-feature SAE with normalization</td>
      <td>4096</td>
      <td>42</td>
      <td>512-feature SAE</td>
      <td>YES</td>
      <td>YES</td>
      <td>NO</td>
      <td>NO</td>
    </tr>
    <tr>
      <td>11</td>
      <td>To experiment with: 1) using filtered data (those that activated feature 240 in a 512-feature SAE) to train a 4096-feature SAE; 2) initializing a 4096-feature SAE with a previously trained 512-feature SAE with normalization; 3) tweaking the loss function</td>
      <td>4096</td>
      <td>43</td>
      <td>512-feature SAE</td>
      <td>NO</td>
      <td>YES</td>
      <td>Set feature 240 activation to zero</td>
      <td>2/3, no specific feature for “-” after a number</td>
    </tr>
    <tr>
      <td>12</td>
      <td>To experiment with using filtered data to train a 512-feature SAE</td>
      <td>512</td>
      <td>42</td>
      <td>NO</td>
      <td>YES</td>
      <td>YES</td>
      <td>NO</td>
      <td>NO</td>
    </tr>
  </tbody>
</table>

<h2 id="key-findings">Key Findings</h2>
<ol>
  <li><strong>Successful Feature-Splitting</strong>: Achieved by filtering tokens based on feature 240 activation (threshold &gt; 2) in the 512-feature SAE and using this targeted data (about 72% of the c4-code-tokenized-2b dataset) to train a 4096-feature SAE. Training the same size SAE (4096 features) using the c4-code-tokenized-2b dataset (without filtering data) did not consistently achieve successful feature-splitting for feature 240.</li>
  <li><strong>Improved Training Efficiency</strong>: Initializing the 4096-feature SAE with weights and biases from the 512-feature SAE demonstrated faster convergence and lower loss. For example, run “Lyric-aardvark-58” in Figure 6 reached lower loss much faster than runs with standard Kaiming uniform initialization.</li>
  <li><strong>Reproducibility</strong>: Success was demonstrated across multiple runs (e.g., Run IDs 4, 5, and 6) with different random seeds, showing the robustness of the method.</li>
  <li><strong>Data Efficiency</strong>: Successful feature-splitting was achieved with as few as 26.2 million tokens (Run ID 9), compared to 2 billion tokens in other runs, when using filtered data and weight initialization from the 512-feature SAE.</li>
</ol>
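The filtering step behind finding 1 can be sketched as follows. This is a hypothetical reconstruction, not the project’s actual code: the `feature_filter_mask` helper, the array shapes, and the toy activations are all assumptions; only the feature index (240) and the activation threshold (&gt; 2) come from the description above.

```python
import numpy as np

def feature_filter_mask(acts, feature_idx=240, threshold=2.0):
    """Boolean mask of tokens whose target SAE feature fires above threshold.

    `acts` stands in for per-token feature activations from the trained
    512-feature SAE, one row per token.
    """
    return acts[:, feature_idx] > threshold

# Toy stand-in activations; the real run filtered the c4-code-tokenized-2b
# dataset, keeping roughly 72% of tokens.
rng = np.random.default_rng(0)
acts = rng.uniform(0.0, 4.0, size=(1000, 512))

mask = feature_filter_mask(acts)
filtered = acts[mask]  # tokens kept for training the 4096-feature SAE
assert 0 < filtered.shape[0] < acts.shape[0]
assert filtered.shape[1] == 512
```

The kept rows would then feed the 4096-feature SAE’s training loop in place of the full dataset.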

<p><img src="/images/featuresplit-f6.png" alt="Improved Training Efficiency" title="Improved Training Efficiency" /></p>

<h2 id="next-steps">Next Steps</h2>
<ol>
  <li>To show that the success in feature-splitting feature 240 using targeted data can be generalized to other features.</li>
  <li>To assess how much additional data SAEs with standard initialization would require to match the feature-splitting in run 9, and to quantify the trade-off between data filtering cost and training efficiency gain.</li>
  <li>When training a 4096-feature SAE initialized with the weights and biases from a previously trained 512-feature SAE, I noticed that in the early batches, feature 240 still has the highest activation for the hyphen in words like “multi-purpose”, “long-term”, and “3-point”, even when its activation is set to zero during loss calculation. However, after a few batches, other features start to show higher activation than feature 240. Next, I plan to explore this phenomenon of “replacing” feature 240 in detail, and potentially speed it up by resetting feature 240’s decoder weights.</li>
  <li>If we can split any single feature of interest in a narrow Sparse Autoencoder (SAE) into multiple features without training an entirely new, wider SAE, how does it help with model circuit discovery? Does specificity of features matter in model circuits?</li>
</ol>

<h2 id="repo-and-models">Repo and Models</h2>
<ul>
  <li>Code and notebooks: <a href="https://github.com/wenxus/sae-feature-splitting">GitHub</a></li>
  <li>Models: <a href="https://huggingface.co/wenxus/sparse_autoencoder/upload/main">Hugging Face</a></li>
</ul>]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[Summary This research establishes the possibility of efficiently splitting a single feature of interest in a narrow Sparse Autoencoder (SAE) into multiple features, without training an entirely new, wider SAE. The study focuses on feature 240 in a 512-feature SAE trained on the gelu-1l model’s MLP output. The idea was inspired by Neel Nanda’s Exciting Open Problems in Mech Interp v2.]]></summary></entry><entry><title type="html">Understanding High Dimensional Tensor Multiplication</title><link href="/posts/loops-not-hoops" rel="alternate" type="text/html" title="Understanding High Dimensional Tensor Multiplication" /><published>2024-07-08T15:26:04-07:00</published><updated>2024-07-08T15:26:04-07:00</updated><id>/posts/loops-not-hoops</id><content type="html" xml:base="/posts/loops-not-hoops"><![CDATA[<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>

<script id="MathJax-script" async="" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>

<h1 id="exploring-ai-mechanistic-interpretability-with-for-loops">Exploring AI Mechanistic Interpretability with For Loops</h1>

<p>Recently, I have been learning about AI Mechanistic Interpretability. These AI tools, like ChatGPT, feel almost magical, and I can’t help but wonder how they really work. Over the years, I have learned that technology isn’t usually safe by default. I believe understanding a system fully is often the best way to guarantee its safety and security, a belief shaped by my experiences. If we’re deploying AI models across industries at scale, we should probably understand not just their outputs, but also the inner workings. It’s like looking under the hood to see what makes the engine purr before we put it on the racetrack.</p>

<h1 id="understanding-transformer-calculations">Understanding Transformer Calculations</h1>

<p>When I started implementing transformers, I was often faced with high dimensional tensor calculations like this:</p>

<p>\(W\) has shape \([num_{attnheads}, d_{model}, d_{head}]\)</p>

<p>Residual stream \(x\) has shape \([batch, sequence, d_{model}]\)</p>

<p>\(Q = Wx + b\) has shape \([batch, sequence, num_{attnheads}, d_{head}]\)</p>

<p>At first glance, these high-dimensional tensor operations seemed pretty confusing. But after weeks of chipping away at them, something clicked: there’s actually an intuitive way to think about these calculations, using the concept of “for loops.”</p>

<p><img src="/images/tensor-dimensions.png" alt="Visual aid for tensors with different dimensions" title="Tensors with different dimensions" /></p>

<p><em>Figure: Tensors with different dimensions</em></p>

<h1 id="breaking-down-the-residual-stream-x">Breaking Down the Residual Stream \(x\)</h1>

<p>In the case of transformers, the residual stream is our incoming data. For simplicity, let’s say each token in the data is a word—like “ am”. As input, each token is represented by a vector of numbers, and the length of that vector is called “d_model.” From there, all the other dimensions in the tensor are basically loops, repeating this transformation step by step. You could imagine breaking down the calculation for \(Q = Wx + b\) as the following pseudocode:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For each batch of data (e.g., a paragraph):
    For each sequence of tokens in the batch (e.g., a sentence):
        For each token (e.g., a word):
            Perform some operation on the token vector.
</code></pre></div></div>

<p>And that’s essentially what’s happening with \(x\)!</p>
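As a quick sanity check of this loop picture, indexing the two leading dimensions of a toy \(x\) leaves exactly one token vector of length d_model. The sizes here are made up for illustration:

```python
import numpy as np

# Toy residual stream: 2 "paragraphs" (batch), 5 tokens each, d_model = 8.
batch, seq, d_model = 2, 5, 8
x = np.arange(batch * seq * d_model, dtype=float).reshape(batch, seq, d_model)

# The leading dimensions are exactly the loops in the pseudocode:
# fixing a batch index and a token index leaves one d_model vector.
token_vector = x[0, 3]  # the 4th token of the 1st paragraph
assert token_vector.shape == (d_model,)
```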

<h1 id="understanding-the-weight-matrix-w">Understanding the Weight Matrix \(W\)</h1>

<p>Now, let’s look at \(W\). This part can feel more complex, but the for loop analogy still works. \(W\) is fundamentally a linear transformation that maps vectors of size \(d_{model}\) to vectors of size \(d_{head}\), doing this for each attention head. The equation \(Q = Wx + b\) can be imagined like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For each batch of data (e.g., a paragraph):
    For each sequence of tokens in the batch (e.g., a sentence):
        For each token embedding vector in the sequence (e.g., a word):
            For each attention head:
                Map the vector x_vector (size d_model) to a vector of size d_head.
</code></pre></div></div>

<p>The result of this is that \(Q\) ends up with a shape of \([batch, sequence, num_{attnheads}, d_{head}]\). Using this nested “for loop” structure helps me keep the dimensions and relationships clear in my head. Underneath it all, we’re really doing a simple linear transformation, but we’re doing it many, many times.</p>
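Here is a minimal sketch of that loop picture with made-up sizes, using NumPy as a stand-in for whatever tensor library you use. The nested loops mirror the pseudocode above, and a single `einsum` reproduces them exactly:

```python
import numpy as np

# Hypothetical small dimensions for illustration (not from the post).
batch, seq, n_heads, d_model, d_head = 2, 3, 4, 8, 2

rng = np.random.default_rng(0)
x = rng.standard_normal((batch, seq, d_model))       # residual stream
W = rng.standard_normal((n_heads, d_model, d_head))  # per-head projection
b = rng.standard_normal((n_heads, d_head))           # per-head bias

# Nested-for-loop version, mirroring the pseudocode.
Q_loop = np.zeros((batch, seq, n_heads, d_head))
for bi in range(batch):              # for each batch item
    for si in range(seq):            # for each token in the sequence
        for h in range(n_heads):     # for each attention head
            # map a d_model vector to a d_head vector
            Q_loop[bi, si, h] = x[bi, si] @ W[h] + b[h]

# Vectorized version: one einsum performs all the loops at once.
Q_vec = np.einsum("bsd,hde->bshe", x, W) + b

assert Q_vec.shape == (batch, seq, n_heads, d_head)
assert np.allclose(Q_loop, Q_vec)
```

The subscript string `"bsd,hde->bshe"` reads directly as the loop structure: iterate over batch, sequence, and head, and contract away the shared d_model axis.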

<h1 id="why-use-tensors-instead-of-for-loops">Why Use Tensors Instead of For Loops?</h1>

<p>Well, GPUs are incredibly efficient at computing products of high-dimensional tensors in parallel. For loops imply sequential calculations, which can be painfully slow. Considering we need to perform tons of matrix operations in each iteration of training—and do it for many iterations—using tensors is definitely the way to go. But conceptually, thinking in terms of nested loops helps me distinguish between the tensors that are manipulating data versus those that represent data or weights in the model.</p>
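A rough CPU-only illustration of the gap, with arbitrary sizes (on a GPU the difference is far larger). Both versions compute the same projection; only the loop is moved from Python into the tensor library:

```python
import time
import numpy as np

batch, seq, d_model, d_head = 8, 128, 64, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, seq, d_model))
W = rng.standard_normal((d_model, d_head))

# Sequential: visit every token vector one at a time in Python.
start = time.perf_counter()
out_loop = np.empty((batch, seq, d_head))
for bi in range(batch):
    for si in range(seq):
        out_loop[bi, si] = x[bi, si] @ W
loop_time = time.perf_counter() - start

# Vectorized: one matmul over the whole tensor (batched over batch/seq).
start = time.perf_counter()
out_vec = x @ W
vec_time = time.perf_counter() - start

assert np.allclose(out_loop, out_vec)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```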

<h1 id="summary-for-loops-as-a-mental-model">Summary: For Loops as a Mental Model</h1>

<p>So, in short, using “for loops” as a mental model can make these complex tensor operations more approachable—a bridge from traditional procedural programming into the tensor world. It has helped me not only understand the transformer algorithm but also check whether my outputs have the right shape. I’ve heard it’s common for ML engineers to spend a nontrivial amount of time lining up tensor shapes!</p>

<h1 id="using-this-mental-model-to-interpret-features-in-llms">Using this Mental Model to interpret features in LLMs</h1>
<p>Next up, I’ll show how this “for loop” approach can also be useful for interpreting features in sparse autoencoders. Stay tuned!</p>]]></content><author><name></name></author><category term="research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Bear Creek A-Spire</title><link href="/posts/bearcreekspire" rel="alternate" type="text/html" title="Bear Creek A-Spire" /><published>2023-09-12T19:09:14-07:00</published><updated>2023-09-12T19:09:14-07:00</updated><id>/posts/bearcreekspire</id><content type="html" xml:base="/posts/bearcreekspire"><![CDATA[<p><img src="/images/bearcreekspire.jpg" alt="Bear Creek Spire" title="Looking at Bear Creek Spire from the lake." /></p>

<p>During a climbing road trip with friends this summer, we stopped by Rocky Mountain National Park. My friends went up a spire called the Petit Grepon, which stands right next to a calm, turquoise, mirror-like alpine lake called Sky Pond. I had not heard of alpine climbing before; it combines hiking, climbing, and often camping, and it struck me as the best way to spend time in the alpine wilderness. I was only slightly regretful about missing the climb, and far more excited to explore this new mode of being in nature. Since then, I had wanted to get on an alpine climb in the fall season. Many alpine peaks are irresistible beauties. There are usually spectacularly beautiful hikes just to get to the base of them, and at the end of these hikes there are often lakes of beautiful colors—many shades of emerald far superior to any gemstone I have seen. A great source of water, too! After a water resupply from the lake, there’s a wonderful multi-pitch climb. What’s not to love about that?
Given my schedule this year, I set my mind on Bear Creek Spire in the eastern Sierra. September was the earliest I could travel after my wedding, so that was when I decided to make an attempt. Bear Creek Spire (BCS) is beginner friendly. It sits in the Little Lakes Valley at around 10,000 ft and goes up to 13,713 ft. The ascent consists of ~2,000 ft of hiking on good trails, a relatively short approach, and 2,000 ft of climbing. Just for reference, the approach to the east buttress of Mt Whitney, the tallest peak in the lower 48, consists of 4,300 ft of 3rd-class scramble. In the two months leading up to the climb, I trained as much as I could, learned about alpinism and altitude, and read anything I could find on the spire. Finally, as Labor Day rolled past, I packed my bags and headed northeast to the mountains!</p>

<p>I spent the first day after arriving in Mammoth climbing Crystal Crag (the north arete variation). It’s a mini alpine climb with an hour of very mild approach hiking and four pitches of really fun climbing, with views of beautiful mountains and alpine lakes nearby. At the top of the climb, I was delightfully surprised by an entire gully made completely of white quartz crystals. The crystals are a bluer white than the snow patches common in the eastern Sierra, and the rock reflects light more gently than snow, so it is easier to look at. As I climbed through it, I felt like I was passing through a rock palace, immersed in this beautiful gentle white light. The crystals felt smoother than granite but not too slick, perfect for climbing. It reminded me of walking through the Acropolis of Athens, except no one exerted effort to build it, and nothing was destroyed. It was just there, possibly for millions of years.
To prepare for BCS, we practiced ridge traversing with short-roping. It was probably the least protected version of roped climbing I had ever done, but it exhilarated me! I have always found it dissatisfying to simply walk up to the highest point of a mountain, take photos, and head back down. A ridge traverse seemed like the perfect way to engage more with a mountaintop. It’s fun to move along the ridge line, inspect higher rocks and use them as protection, and see different angles of the ridge as you move through it. The traverse moves that day were very manageable, though entirely exposed.
That night, to better acclimate, I slept at the trailhead for the BCS approach. The approach-hike day had perfect sunny weather. Before the hike, my guide Paloma went through an exercise with me that reminded me of what Buddhists call detachment. We went through every single item in my bag and talked explicitly about why we needed to carry it up the mountain. As a result, I swapped my usual personal safety gear, a Petzl PAS, for a lighter sling (I had grown very attached to it :) ), ditched my favorite climbing knife because the guide had a more petite version of it, and left behind many other things I thought I could not live without.</p>

<p>The approach hike made me wonder if there were elements of pilgrimage in what I was doing, where the eventual goal fuels a long, hard trek. With a heavy pack, the hike was not exactly fun, at least for the first 15 minutes. But every time my mind asked why I was doing this, I looked up at the snow-sprinkled spire and was instantly filled with so much awe that I was giddy; my doubt would dissipate, and a surge of energy would replace it. It felt like caffeine without the digestive side effects. It was a kind of high, sustained by the mountain beauty, the alpine air, and everything in the eastern Sierra surroundings. The hike was not really a problem after I adjusted to the heavy pack. Soon we passed a beautiful alpine lake. The guide said it had a generic name for such beauty: Long Lake. But it was the first Long Lake I had encountered, so not generic to me at all! After the lake, the trail led us to a steep talus field. I felt like a frog jumping from rock to rock while streams rushed underneath. The heavy pack made balancing hard, but after some repetitions of planting poles on good rock surfaces and finding the flattest spot for each foot, I started to get the hang of it. Above the talus, it was about 30 minutes of above-the-tree-line granite hiking. Finally we got high enough that Dade Lake appeared in the valley where it sits, and it was time to set up camp.</p>

<p>At this point, it was already getting windy. I should have checked the weather then, but instead I was completely enthralled by BCS, especially its north arete. In the preceding months, I had stared longingly at many photos of the north arete, and even painted it once. I had been dreaming of climbing it. But somehow this dream was not like other sources of motivation I have had. It was not a goal. By definition, dreaming of something means not already knowing what the experience would be like, so part of it didn’t feel real. It was as if I wanted to preserve the opportunity to be delightfully surprised if I achieved it. If this were a goal, there would be no such feeling, only success or failure. Maybe I was protecting myself against devastating disappointment, because in the mountains, a lot of things are outside my control. Deep down, I knew that if I didn’t climb BCS, it would still be okay. My happiness didn’t entirely depend on it.</p>

<p>I forget the exact order of events, which included piling up rocks to hold our tents in place, checking my satellite device for weather, and practicing traversing snow fields with crampons and ice axe. The latter must have come first. I remember feeling good on crampons and feeling the buzz of excitement about getting closer and closer to setting foot on the spire. Then all I remember next is using my rational thoughts to dam a huge wave of disappointment brought on by the forecast of devastatingly high wind on our summit day. “We’d be blown off the ridge if we went up,” the guide said. I remember going back to my tent, which was flapping loudly in the wind, and lying huddled in my sleeping bag. As it got dark outside, I kept telling myself, “It’s totally normal to run into unexpected weather in the mountains; this is real alpinism,” “Maybe the weather will be different tomorrow morning,” “I have enjoyed everything so far.” But finally the dam broke. “I was only tolerating the heavy packs so I could have a chance to summit.” “No, I would not have done just the backpacking portion of this trip without the climbing.” “Yes, summiting is the cherry on top of the journey of getting there, but I wanted that cherry very much!” In the midst of this disappointment, also feeling rejected, I caught myself trying to make sense of the turndown: “Did I do something wrong? Did I not train enough? Why else was I being punished?”</p>

<p>After quite a bit of intense self-dialogue, I eventually began to accept that the weather was not in my control. What I could do at that moment was choose whether or not to climb in this weather. It was really a decision between life and probable death (getting blown off the ridge), and it was a no-brainer to choose to live. There didn’t seem to be much point in questioning why the wind came so suddenly and so strongly. According to my guide Paloma, the weather in the high Sierra had been more erratic than usual. There had been unexpected snow, hail, and thunderstorms. “Non-stop” is how she described the fall season this year, even though fall is usually the calm and clear stretch of the year. Could this be the tangible impact of climate change affecting my specific summit day? Or was mountain weather just like this? The Rockies have always had more unpredictable mountain weather; maybe the dry and calm Sierra weather we had enjoyed for several years was the anomaly.</p>

<p>When my inner dialogue concluded, it was almost completely dark outside. Holding on to a last glimpse of hope, I reached an agreement with the guide to wake up at 4am and check the weather again. Then I took an ibuprofen, drank lots of electrolyte water to deal with my elevation-related headache, and drifted off to sleep.
That night my consciousness wandered in and out of deep sleep. After I felt enough time had passed, I pulled myself out of sleep and checked the clock: it was 3:30am. I felt decently rested. It turns out that when you eat dinner at 5 and fall asleep right at dusk, waking up around 4 isn’t so bad. But one thing was clear: the tent was getting tossed around by the wind. The weather forecast was dead-on this time—it was too windy to climb. Despite knowing this, I still dressed and walked outside to confer with the guide. Even though the anticipated no-go was not a happy decision, I was delightfully surprised by the extra-clear starry sky, the bright moon, and the Milky Way. At least I could see the stars. I looked up at the night sky for as long as I could stand in the high wind, and then went back to sleep.</p>

<p>The following morning was uneventful; we just hiked out. I carried the rope down, making my pack ~40 pounds. I thought I was not going to make it in the first 10 minutes, but after a while my body adjusted, and it was completely okay. We met many weekend backpackers and fishing crowds on the way down. One of the fishermen was an 88-year-old man who had been coming to Long Lake to fish his entire life. After he learned that we had attempted the most prominent peak in the valley, he called us “radical” and said, “You should come back when you are 88.” I gladly took his advice. What a wise perspective!</p>

<p>This is now my new dream, to be healthy and strong enough to climb BCS at 88!</p>]]></content><author><name></name></author><category term="outdoors" /><summary type="html"><![CDATA[]]></summary></entry></feed>