Condition interference in rats performing a choice task with switched variable- and fixed-reward conditions

Funamizu, Akihiro; Ito, Makoto; Doya, Kenji; Kanzaki, Ryohei; Takahashi, Hirokazu

doi:10.3389/fnins.2015.00027

ORIGINAL RESEARCH article

Front. Neurosci., 13 February 2015

Sec. Decision Neuroscience

Volume 9 - 2015 | https://doi.org/10.3389/fnins.2015.00027

This article is part of the Research Topic News from the Psychology, Neurobiology and Genetic Fields on Social and Economic Behavioral Studies


Akihiro Funamizu¹,²*

Makoto Ito¹

Kenji Doya¹

Ryohei Kanzaki²,³

Hirokazu Takahashi²,³

¹Neural Computation Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
²Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
³Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan

Because humans and animals encounter various situations, the ability to adaptively decide upon responses to any situation is essential. To date, however, decision processes and the underlying neural substrates have been investigated under specific conditions; thus, little is known about how various conditions influence one another in these processes. In this study, we designed a binary choice task with variable- and fixed-reward conditions and investigated neural activities of the prelimbic cortex and dorsomedial striatum in rats. Variable- and fixed-reward conditions induced flexible and inflexible behaviors, respectively; one of the two conditions was randomly assigned in each trial for testing the possibility of condition interference. Rats were successfully conditioned such that they could find the better reward holes of variable-reward-condition and fixed-reward-condition trials. A learning interference model, which updated expected rewards (i.e., values) used in variable-reward-condition trials on the basis of combined experiences of both conditions, better fit choice behaviors than conventional models which updated values in each condition independently. Thus, although rats distinguished the trial condition, they updated values in a condition-interference manner. Our electrophysiological study suggests that this interfering value-updating is mediated by the prelimbic cortex and dorsomedial striatum. First, some prelimbic cortical and striatal neurons represented the action-reward associations irrespective of trial conditions. Second, the striatal neurons kept tracking the values of variable-reward condition even in fixed-reward-condition trials, such that values were possibly interferingly updated even in the fixed-reward condition.

Introduction

The cortico-basal ganglia circuit is involved not only in movement control, but also in inference-, experience- and reward-based decision making (Hikosaka et al., 1999; Daw et al., 2005; Cohen et al., 2007; Doya, 2008; Ito and Doya, 2011). Many anatomical and functional studies suggest that this diverse set of functions is simultaneously implemented in parallel in the circuit [anatomy: (Haber, 2003; Voorn et al., 2004; Gruber and McDonald, 2012); function: (Tanaka et al., 2004; Balleine et al., 2007; Yamin et al., 2013)]. A typical example of this parallel circuit is the neural implementation of response-outcome (R-O) and stimulus-response (S-R) associations: the former association is driven by the medial part of the circuit, including the prelimbic cortex and the dorsomedial striatum, for producing flexible learning behaviors (Corbit and Balleine, 2003; Yin et al., 2005a,b), while the latter association is implemented in the infralimbic cortex and the dorsolateral striatum to execute inflexible behaviors (Yin et al., 2004; Balleine and Killcross, 2006).

In the parallel decision-making circuits, humans and animals select actions in various situations. The abilities to anticipate and store outcomes of options in any situation are crucial. Despite their importance in action learning, the decision processes and neural substrates involved in various situations are still unclear, partly because behavioral experiments have usually been designed to eliminate situational effects as far as possible, for the sake of simplicity. These past studies may implicitly assume that outcome estimation in each condition is processed independently; however, humans often cannot perform two tasks at once without interference (Monsell, 2003). This task-switching cost predicts that the cortico-basal ganglia circuit contains some conditional interferences.

Decision processes in the cortico-basal ganglia circuit are theoretically explained by the reinforcement learning framework (Corrado and Doya, 2007; O'Doherty et al., 2007; Doya, 2008). The framework has two steps for decisions: value updating, in which agents update the expected rewards (i.e., values) with past actions and rewards, and action selection, in which agents select actions based on the values (Sutton and Barto, 1998). Although task conditions are considered independently in classical reinforcement learning theories, we hypothesize that decision making under various conditions leads to some interference among conditions in value updating and/or action selection. Especially when interference occurs in value updating, its neural correlates may be observed in the striatum, because the striatum is known to represent and store action values (Samejima et al., 2005; Lau and Glimcher, 2008).

Using rats, we conducted a choice task with a random trial sequence of variable- and fixed-reward conditions to test whether rats had condition interference. Variable- and fixed-reward conditions were designed to investigate flexible and inflexible behaviors, respectively; reward probabilities in the variable-reward condition varied between blocks of trials, while they were fixed in the fixed-reward condition. Neural activities of the prelimbic cortex and dorsomedial striatum (i.e., a candidate flexible-behavior network) were electrophysiologically recorded to investigate neural substrates of condition interference. We used rats because their parallel cortico-basal ganglia circuits for decision making are well examined and established (Voorn et al., 2004; Balleine, 2005). Analyses with reinforcement learning models suggested that, although rats distinguish the trial conditions, they update values in a condition-interference manner. Some striatal neurons represented values required for the variable-reward condition even during fixed-reward-condition trials, suggesting that these representations caused the condition interference between flexible and inflexible behaviors.

Materials and Methods

All procedures were approved by the institutional committee at the University of Tokyo and performed in accordance with the “Guiding Principles for the Care and Use of Animals in the Field of Physiological Science” of the Japanese Physiological Society. We used five male Long-Evans rats (240–380 g); two rats performed both the behavioral and electrophysiological experiments, and the remaining three rats performed only the behavioral experiments. Food was provided after the task to maintain animal body weight at no less than 85% of the initial level. Water was supplied freely.

Behavioral Task

All experiments were conducted in a 36 × 36 × 37 cm experimental chamber (O'Hara & Co. Ltd.) placed in a sound-attenuating box (Funamizu et al., 2012). The experimental chamber had three nose-poke holes on one wall and a pellet dish on the opposite side of the chamber (Figure 1). Four white high-intensity light emitting diode (LED) lamps were placed above the center hole for light stimuli. A speaker was placed above the chamber for sound stimuli. The durations of nose pokes and the presence and consumption of pellets were captured with infrared sensors and recorded at a sampling rate of 1 kHz (Cyberkinetics Inc.; Cerebus Data Acquisition System).

FIGURE 1

Figure 1. Free choice task. Variable- and fixed-reward conditions were randomly assigned for each trial in a 70% / 30% ratio, respectively. In both conditions, each trial was initiated when a rat poked its nose into the center hole (C), after which it had to keep poking for 1600–2600 ms until a “go” tone sounded (Hold). During the fixed-reward condition only, a short light stimulus was presented during the center-hole poking to inform rats that the trial was a fixed-reward-condition trial. After the presentation of the “go” tone (Go), rats had to choose either a left (L) or a right (R) hole, and a reward of a food pellet was dispensed stochastically (D) (Choice). In the variable-reward condition, reward probabilities were selected randomly (90–50%, 50–10%, 10–50%, and 50–90% for left-right) and the reward setting changed based on the choice performance of the rat. In the fixed-reward condition, the reward probability was constant at either 90–50% or 50–90% for all the sessions, and the reward setting was pre-determined for each rat.

Our task had variable- and fixed-reward conditions; one of the conditions was randomly assigned for each trial in proportions of 70% and 30%, respectively (Figure 1). Only in fixed-reward-condition trials, a light stimulus was presented to inform rats of the trial condition. In each trial, rats first performed a nose-poke in the center hole, and they continued poking until a “go” tone with a frequency of 5 kHz, an intensity of 50 dB SPL (sound pressure level in decibels with respect to 20 μPa) and a duration of 500 ms was presented. In the fixed-reward condition, a light stimulus was presented for about 600 ms immediately after the initiation of center-hole poking. If rats failed to continue poking until the presentation of the “go” tone, an error tone was presented (1 kHz, 70 dB SPL, 50 ms), and the trial was scored as an error. After the presentation of the “go” tone, rats selected either the left or right hole within 15 s and received a reward of a food pellet (25 mg), presented stochastically. A reward tone (20 kHz, 70 dB SPL, 2000 ms) was presented immediately after the choice in a rewarded trial. In contrast, a no-reward tone (1 kHz, 70 dB SPL, 50 ms) was presented in a non-rewarded trial. If rats did not make a choice within 15 s of the presentation of the “go” tone, the error tone was also presented, and the trial was scored as an error.

In the variable-reward condition, the reward probability of each choice changed among four settings: 90–50%, 50–90%, 50–10%, and 10–50% in regard to left-right choices. Variable-reward-condition trials with the same reward-probability setting were referred to as a block; a block consisted of at least 20 trials. Subsequently, the block was changed when the rat selected the more rewarding hole in ≥80% of the last 20 variable-reward-condition trials (Ito and Doya, 2009; Funamizu et al., 2012). The block change was conducted so as to (i) include all four reward-probability settings in each set of four blocks and (ii) not repeat any of the settings. Each rat performed at least four blocks per day (i.e., per session) and any sessions consisting of fewer than four blocks were excluded from the analysis.
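The block-change rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name `should_switch_block` and its arguments are hypothetical, and the history is assumed to contain only successful variable-reward-condition trials of the current block.

```python
def should_switch_block(correct_history, min_trials=20, window=20, criterion=0.8):
    """Check the block-change rule on one block of variable-reward trials.

    correct_history: list of booleans, one per variable-reward-condition
    trial in the current block (True = the more-rewarding hole was chosen).
    The argument names and defaults are illustrative, not the authors' code.
    """
    # A block must contain at least `min_trials` trials.
    if len(correct_history) < min_trials:
        return False
    # Fraction of correct choices over the most recent `window` trials.
    recent = correct_history[-window:]
    return sum(recent) / len(recent) >= criterion
```

For example, 16 correct choices in the last 20 trials meet the 80% criterion, whereas 15 do not.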

In the fixed-reward condition, the reward probability was constant in all sessions, and was set to either 90–50% or 50–90% in the left-right choices for each rat. Rats were required to select the more-rewarding choice in more than 80% of the fixed-reward-condition trials of a session, and any sessions in which rats failed to meet this criterion were not used in the analysis.

Thus, our task required the rats to select the more-rewarding hole ≥80% of the time in both variable- and fixed-reward conditions. Therefore, the rats needed to distinguish the trial type in order to achieve the 80% correct-choice criterion, when the more-rewarding holes of variable- and fixed-reward conditions were different.

In both the variable- and fixed-reward conditions, we provided an extinction phase, in which no reward was presented for any choice, consisting of a random sequence of five variable-reward-condition trials and five fixed-reward-condition trials (i.e., 10 successive trials in total), to characterize the behaviors in variable- and fixed-reward conditions. The extinction phase was conducted when the reward-probability setting of the variable-reward-condition block was identical to that of the fixed-reward condition. In the extinction phase, we investigated the sensitivity to this treatment from the choice preferences of rats. Flexible behaviors should change choices under outcome extinction, whereas inflexible behaviors should retain them.

Surgery

After rats practiced the free choice task, they were anesthetized with sodium pentobarbital (50 mg/kg, i.p.) and placed in a stereotaxic frame (Narishige). Atropine sulfate (0.1 mg/kg) was also administered at the beginning of the surgery to reduce the viscosity of bronchial secretions (Takahashi et al., 2011; Funamizu et al., 2013). The cranium and dura over recording sites were removed and four small craniotomies were made for anchoring screws. The screws were used for the ground electrode in electrophysiology. Two drivable parallel electrode bundles were inserted into the prelimbic cortical site in the right hemisphere (2.5 mm in anterior-posterior (AP) and 0.55 mm in medio-lateral (ML) from the bregma with a depth of 2.5 mm from the surface of the brain). Three electrode bundles were inserted into the dorsomedial striatum site in the right hemisphere (0.2 mm in AP, 2.0–3.0 mm in ML with a depth of 3.4 mm) (Stalnaker et al., 2010; Wang et al., 2013). Each electrode bundle was lowered 125 μm after each session such that we could record new neurons in every session (Ito and Doya, 2009). Each bundle was composed of seven or eight Formvar-insulated nichrome wires with a bare diameter of 25 μm (A-M Systems). The wires were inserted into a stainless-steel guide cannula with an outer diameter of 0.3 mm. The tip of each wire was electroplated with gold to obtain an impedance of 100–200 kΩ at 1 kHz. In total, five electrode bundles were inserted in the brain, and 14 and 24 wires were inserted in the prelimbic cortex and dorsomedial striatum, respectively.

Electrophysiological Recording

During the choice task, recorded neural signals were amplified and stored with a 62-ch multiplexer neural-recording system (Triangle BioSystems International; TBSI) and a Cerebus data acquisition system (Cyberkinetics Inc.) with an amplifier gain of 1000, a band-pass filter of 0.3–7500 Hz, and a sampling frequency of 30 kHz. We then applied an offline digital high-pass filter of 200 Hz (Matlab; The Mathworks). When the signal fell below −5.5 times or rose above +5.5 times its root mean square (RMS), it was defined as spike activity (Torab et al., 2011). Offline spike sorting was conducted using Spike 2 (CED), with which spike waveforms were classified into several groups based on template matching. Groups of waveforms that appeared to be action potentials were accepted, while all others were discarded.
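The amplitude criterion for spike detection can be illustrated with a minimal sketch. It assumes the input has already been high-pass filtered and keeps only the first sample of each excursion beyond 5.5 × RMS; that onset-only rule and the name `detect_threshold_crossings` are assumptions for illustration, not details given in the text.

```python
import math


def detect_threshold_crossings(signal, factor=5.5):
    """Return sample indices where |signal| first exceeds factor * RMS.

    `signal` is a sequence of (already filtered) voltage samples.
    Reporting only the onset of each excursion is an illustrative choice.
    """
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    threshold = factor * rms
    crossings = []
    above = False
    for i, x in enumerate(signal):
        now_above = abs(x) > threshold  # two-sided: below -thr or above +thr
        if now_above and not above:
            crossings.append(i)
        above = now_above
    return crossings
```

The detected events would then still need waveform-based sorting, as described above, before being treated as single-unit activity.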

Histology

After electrophysiological recording, rats were anesthetized with sodium pentobarbital (50 mg/kg, i.p.), and a positive current of 10 μA was passed for 10–20 s through one or two electrodes of each bundle to mark the final recording positions (Ito and Doya, 2009). Rats were perfused with 10% formalin containing 3% potassium hexacyanoferrate (II), and the brain was carefully removed from the cranial bone. Sections were cut at 90 μm with a vibratome (DTK-2000, D.S.K.) and stained with cresyl violet. The position of each recorded neuron was estimated from the final position and the distance that the bundle was moved. If the position was outside the prelimbic cortex or dorsomedial striatum, the data were discarded.

Behavioral Analysis

In the analyses of behaviors during the choice task, error trials (in which rats failed to keep poking in the center hole, or took more than 15 s to select the left or right hole) were removed, and the remaining sequences of successful trials (in which rats successfully made a left or right choice) were used.

Model-free analysis

We first analyzed choice preferences during the extinction phase to identify whether rats had flexible or inflexible behaviors in the variable- and fixed-reward conditions. We then assessed the interference of variable- and fixed-reward conditions in the choice behaviors. We compared conditional choice probabilities between two trial sequences: repeated sequences [e.g., variable-reward-condition trial to variable-reward-condition trial (Var. – Var.)], in which probabilities were calculated based on the action-outcome experience in the last trial with the same condition; and interleaved sequences (e.g., Var. – Fix. – Var.), in which probabilities were calculated based on the experience in the next-to-last trial with the same condition, so that the last different-condition trial was ignored (Figure 4Bi). If the choice of each condition was independently learned and the interleaved trial caused no interference, conditional probabilities in the two trial sequences should be the same.
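The comparison of repeated and interleaved sequences can be sketched as follows. As a simplification, the sketch summarizes the conditional choice probability as the probability of repeating the reference action (a "stay" probability); that summary, the tuple layout, and the name `stay_probability` are assumptions made for illustration.

```python
def stay_probability(trials, condition, rewarded, interleaved):
    """Probability of repeating the action of a same-condition reference trial.

    trials: list of (condition, action, reward) tuples in task order,
            e.g. ("V", "L", 1). The layout is hypothetical.
    condition: "V" or "F", the condition of the current and reference trial.
    rewarded: 1 or 0, the outcome of the reference trial to condition on.
    interleaved: False -> reference is trial t-1 (e.g. Var. - Var.);
                 True  -> reference is trial t-2 with a different-condition
                          trial in between (e.g. Var. - Fix. - Var.).
    Returns None when no matching sequence exists.
    """
    lag = 2 if interleaved else 1
    stays = total = 0
    for t in range(lag, len(trials)):
        cond_t, act_t, _ = trials[t]
        cond_ref, act_ref, rew_ref = trials[t - lag]
        if cond_t != condition or cond_ref != condition or rew_ref != rewarded:
            continue
        # In interleaved sequences the middle trial must be the other condition.
        if interleaved and trials[t - 1][0] == condition:
            continue
        total += 1
        stays += int(act_t == act_ref)
    return stays / total if total else None
```

Under independent learning, the values returned for `interleaved=False` and `interleaved=True` would be expected to match for the same condition and outcome.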

Model-based analysis

We analyzed choice behaviors of rats with reinforcement learning models and a fixed-choice model to test (i) whether interference occurred in choice learning, and (ii) whether it occurred in the value updating or action selection phase. We denoted the action as a ∈ {L (left), R (right)}, the reward as r ∈ {1, 0} and the condition as C ∈ {V (variable), F (fixed)}. We assumed that rats predicted the expected reward of each choice (i.e., action value) in each condition, Q_{a, C}: rats had four action values in total. A choice probability was predicted with the following soft-max equation based on the action values:

\begin{matrix} P(a(t) = L) = \dfrac{1}{1 + \exp\left[ Q_{R, C(t)}(t) - Q_{L, C(t)}(t) + G_{C(t)} \left\{ Q_{R, \bar{C}}(t) - Q_{L, \bar{C}}(t) \right\} \right]}, & (1) \end{matrix}

where C(t) and \bar{C} were the trial and non-trial conditions, i.e., \bar{C} ≠ C(t); for example, when the presented trial was a variable-reward condition, \bar{C} was the fixed-reward condition. G_{C(t)} was a free parameter depending on the trial condition. This parameter adjusted the contribution of action values of the non-trial condition in the choice prediction.
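As a minimal sketch of Equation (1), the soft-max choice probability can be computed as below. The dictionary layout (action values keyed by `(action, condition)`, G keyed by condition) and the function name `p_left` are hypothetical choices for illustration.

```python
import math


def p_left(q, trial_cond, g):
    """Probability of a left choice following Equation (1).

    q: dict of action values keyed by (action, condition),
       e.g. q[("L", "V")]; this layout is an illustrative assumption.
    trial_cond: "V" or "F", the condition of the current trial.
    g: dict of the interference weight G keyed by trial condition.
    """
    other = "F" if trial_cond == "V" else "V"  # non-trial condition
    own_diff = q[("R", trial_cond)] - q[("L", trial_cond)]
    other_diff = q[("R", other)] - q[("L", other)]
    return 1.0 / (1.0 + math.exp(own_diff + g[trial_cond] * other_diff))
```

With G = 0 the non-trial values drop out and the expression reduces to an ordinary two-option soft-max on the trial condition's own values.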

A fixed-choice model had the action value as a free parameter, assuming a constant value in all trials:

\begin{matrix} \begin{array}{l} Q_{R, C} = q_{C} \\ Q_{L, C} = 1 - q_{C}, \end{array} & (2) \end{matrix}

where q_C was a free parameter depending on the value condition. If the fixed-choice model fit a choice behavior, that behavior involved neither learning nor condition interference in value updating.

Figure 2 shows the scheme of proposed reinforcement learning models. We updated the action value in each condition, Q_{a, C}, in accordance with Ito and Doya (2009):

\begin{matrix} \begin{array}{l} Q_{a, V}(t + 1) = \begin{cases} (1 - \alpha_{1, C(t), V}) Q_{a, V}(t) + \alpha_{1, C(t), V} k_{1} & \text{if } a = a(t),\ r(t) = 1 \\ (1 - \alpha_{1, C(t), V}) Q_{a, V}(t) - \alpha_{1, C(t), V} k_{2} & \text{if } a = a(t),\ r(t) = 0 \\ (1 - \alpha_{2, C(t), V}) Q_{a, V}(t) & \text{if } a \neq a(t) \end{cases} \\ Q_{a, F}(t + 1) = \begin{cases} (1 - \alpha_{1, C(t), F}) Q_{a, F}(t) + \alpha_{1, C(t), F} k_{1} & \text{if } a = a(t),\ r(t) = 1 \\ (1 - \alpha_{1, C(t), F}) Q_{a, F}(t) - \alpha_{1, C(t), F} k_{2} & \text{if } a = a(t),\ r(t) = 0 \\ (1 - \alpha_{2, C(t), F}) Q_{a, F}(t) & \text{if } a \neq a(t), \end{cases} \end{array} & (3) \end{matrix}

where a(t), r(t) and C(t) were the action, reward, and condition at trial t, respectively. Action values of both variable- and fixed-reward conditions were updated every trial, irrespective of the trial condition. α₁, α₂, k₁, and k₂ were free parameters. α₁ showed the learning rate in the chosen option, and α₂ showed the forgetting rate in the un-chosen option. k₁ and k₂ indicated the strengths of reinforcers in reward and non-reward outcomes, respectively. α₁ and α₂ depended on the trial condition, C(t), and the action-value condition, C, to capture differences in (i) learning of variable- and fixed-reward conditions, and (ii) learning by its own condition and by the other condition. Equation (3) had 10 parameters in total.
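The update rule of Equation (3) can be written as one function that updates all four action values after every trial. This is a sketch under assumed data structures: values keyed by `(action, condition)` and rates keyed by `(trial condition, value condition)`; the name `update_values` is hypothetical.

```python
def update_values(q, action, reward, trial_cond, alpha1, alpha2, k1, k2):
    """One-trial update of all action values following Equation (3).

    q: dict keyed by (action, value_condition), e.g. q[("L", "V")].
    alpha1, alpha2: dicts keyed by (trial_condition, value_condition),
        holding the learning rate (chosen option) and forgetting rate
        (unchosen option). The key layout is an illustrative assumption.
    k1, k2: reinforcer strengths for reward and non-reward outcomes.
    Returns a new dict; the input is left unchanged.
    """
    new_q = {}
    for (a, value_cond), value in q.items():
        if a == action:
            rate = alpha1[(trial_cond, value_cond)]
            target = k1 if reward == 1 else -k2
            new_q[(a, value_cond)] = (1 - rate) * value + rate * target
        else:
            rate = alpha2[(trial_cond, value_cond)]
            new_q[(a, value_cond)] = (1 - rate) * value
    return new_q
```

Because the rates are indexed by both the trial condition and the value condition, a rewarded choice in a variable-reward trial can also move the fixed-reward values whenever the corresponding cross-condition rate is non-zero.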

FIGURE 2

Figure 2. Interference reinforcement learning model. Our reinforcement learning models assumed that rats estimated expected rewards (values) of left and right choices in both variable- and fixed-reward conditions, i.e., Q_V and Q_F; models had four action values in total. All action values were updated both in variable-reward-condition (A) and fixed-reward-condition (B) trials. α was the learning rate or forgetting rate in the selected or unselected option, respectively; α depended on the trial condition and value condition. k was the reinforcer strength of the outcome. A choice probability was predicted with a soft-max equation based on all values. The soft-max equation had a free parameter, G, which adjusted the contribution of action values from the non-trial condition in the choice prediction.

Equation (3) could express a variety of updating rules depending on which parameters were used, so that the updating rules for the values of variable- and fixed-reward conditions (the upper and lower part of Equation 3, respectively) could be different. When we set α₂ = k₂ = 0, the equation became a standard Q-learning (Q-learning) (Watkins and Dayan, 1992; Sutton and Barto, 1998). We referred to the equation with α₁ = α₂ as a forgetting Q-learning (FQ-learning), and we referred to the full-parameter equation as a differential forgetting Q-learning (DFQ-learning) (Ito and Doya, 2009).

When we set α_{1, \bar{C}, C} = α_{2, \bar{C}, C} = 0, where \bar{C} ≠ C, in value updating (Equation 3) and G_C = 0 in action selection (Equation 1), the equations dealt with variable- and fixed-reward conditions independently; we referred to the model as an independent model. When we set only α_{1, \bar{C}, C} = α_{2, \bar{C}, C} = 0 in Equation (3), the model independently updated action values of each condition, but interferingly predicted the choices; we referred to it as an action interference model. Also, when we set only G_C = 0 in Equation (1), the model interferingly updated action values of the variable- and fixed-reward conditions; we referred to it as a learning interference model.
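The three model variants differ only in which parameters are clamped to zero, which can be sketched as a constraint step applied to a full parameter set. The model-name strings, the dictionary layout, and the function name `apply_model_constraints` are hypothetical and serve only to illustrate the definitions above.

```python
def apply_model_constraints(alpha1, alpha2, g, model):
    """Return copies of the parameters with a model's constraints applied.

    alpha1, alpha2: dicts keyed by (trial_condition, value_condition).
    g: dict keyed by trial condition.
    model: "independent", "action_interference", or "learning_interference"
           (labels chosen here for illustration).
    """
    known = ("independent", "action_interference", "learning_interference")
    if model not in known:
        raise ValueError(f"unknown model: {model}")
    alpha1, alpha2, g = dict(alpha1), dict(alpha2), dict(g)
    if model in ("independent", "action_interference"):
        # No cross-condition value updating: rates are zero when the trial
        # condition differs from the value condition.
        for trial_cond, value_cond in list(alpha1):
            if trial_cond != value_cond:
                alpha1[(trial_cond, value_cond)] = 0.0
                alpha2[(trial_cond, value_cond)] = 0.0
    if model in ("independent", "learning_interference"):
        # No cross-condition contribution in action selection.
        for cond in list(g):
            g[cond] = 0.0
    return alpha1, alpha2, g
```

Fitting would then optimize only the parameters left free by the chosen variant.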

Initial action values for reinforcement learning models were 0.5 in the left and right choices of the variable-reward condition (i.e., the average reward probability of the four reward-probability settings), and were 0.9 and 0.5 in the optimal and non-optimal choices of the fixed-reward condition.

Model comparison

We employed the normalized likelihood to test how well the models fit the choice behaviors of rats (Ito and Doya, 2009; Funamizu et al., 2012). The normalized likelihood, Z, was defined as follows:

\begin{matrix} Z = {[\prod_{t = 1}^{N} z (t)]}^{\frac{1}{N}}, & (4) \end{matrix}

where N and z(t) were the number of trials and the likelihood at trial t, respectively. The likelihood, z(t), was defined as follows, with the predicted left choice probability P(a(t) = L):

\begin{matrix} z(t) = \begin{cases} P(a(t) = L) & \text{if } a(t) = L \\ 1 - P(a(t) = L) & \text{if } a(t) = R. \end{cases} & (5) \end{matrix}
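Equations (4) and (5) amount to the geometric mean of the per-trial likelihoods. A minimal sketch, assuming the predicted left-choice probabilities and the observed actions are given as parallel sequences (the name `normalized_likelihood` and the "L"/"R" coding are illustrative):

```python
import math


def normalized_likelihood(p_left_seq, actions):
    """Geometric mean of per-trial likelihoods (Equations 4 and 5).

    p_left_seq: predicted P(a(t) = L) for each trial, strictly in (0, 1).
    actions: observed choices, "L" or "R", one per trial.
    """
    log_sum = 0.0
    for p, a in zip(p_left_seq, actions):
        z = p if a == "L" else 1.0 - p  # Equation (5)
        log_sum += math.log(z)          # sum of logs avoids underflow
    return math.exp(log_sum / len(actions))  # Equation (4)
```

A model that predicts 0.5 on every trial yields Z = 0.5, so values above 0.5 indicate predictions better than chance for a binary choice.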

We conducted a 2-fold cross validation for model comparison. In the cross validation, all sessions analyzed were divided into two equal groups. One group provided the training data, and the other group provided the validation data. The free parameters of each model were determined such that the normalized likelihood of the training data was maximized. With the determined parameters, the normalized likelihood of each session in the validation data was analyzed. Then, we switched the roles of the two datasets and repeated the same procedure to obtain normalized likelihoods in all sessions. Cross-validation analysis implicitly took into account the penalty of the number of free parameters (Bishop, 2006).
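The 2-fold procedure can be sketched generically, with the fitting and scoring steps passed in as functions. How sessions were assigned to the two groups is not specified above, so splitting by order is an assumption here, as are the names `two_fold_cross_validate`, `fit`, and `score`.

```python
def two_fold_cross_validate(sessions, fit, score):
    """Return one validation score per session, in the original order.

    sessions: list of per-session data objects.
    fit(list_of_sessions) -> parameters maximizing the training likelihood.
    score(parameters, session) -> normalized likelihood for that session.
    Splitting the list at its midpoint is an illustrative assumption.
    """
    half = len(sessions) // 2
    first, second = sessions[:half], sessions[half:]
    params_from_first = fit(first)
    params_from_second = fit(second)
    # Each session is scored with parameters fitted on the other group.
    return ([score(params_from_second, s) for s in first]
            + [score(params_from_first, s) for s in second])
```

Because every score comes from parameters fitted on held-out data, models with more free parameters gain no automatic advantage in this comparison.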

Neural Analysis

Striatal neurons have often been classified into phasically and tonically active neurons (Lau and Glimcher, 2008; Kim et al., 2009); however, our recording could not find clear criteria to support the classification, partly because the number of neurons recorded was too small. The following analyses were performed without the classification.

To test how neural activities in the prelimbic cortex and dorsomedial striatum were modulated during the task, we employed a stepwise multiple regression analysis (Matlab; Mathworks). Regression analysis was used to investigate neural correlates with actions, rewards, conditions, and associations. The analysis also detected neural correlates with the variables in a reinforcement learning model (Samejima et al., 2005; Ito and Doya, 2009). When the analysis was applied sequentially with a time window of 600 ms, advanced with a time step of 300 ms, we could capture the temporal dynamics of neural coding (Kim et al., 2009; Sul et al., 2011). The regression analysis was defined as follows:

\begin{matrix} \begin{array}{l} y(t) = \beta_{0} + \beta_{1} C(t) + \beta_{2} a(t) + \beta_{3} r(t) + \beta_{4\text{--}23} X(t) \\ \quad + \beta_{24\text{--}28} M_{C = C(t)}(t) + \beta_{29\text{--}33} M_{C = V}(t) \\ \quad + \beta_{34} C(t - 1) + \beta_{35} a(t - 1) + \beta_{36} r(t - 1) \\ \quad + \beta_{37} T(t), \end{array} & (6) \end{matrix}

where β_{0–37} were regression coefficients. y(t) was a spike count with a time window of 600 ms at trial t. C(t), a(t), and r(t) were the trial condition (a dummy variable of 1 or −1 for the variable- or fixed-reward condition, respectively), action (1 or −1 for the right or left choice), and reward (1 or −1 for the reward or non-reward outcome) at trial t, respectively. These variables at trial t − 1 were also included in the regression analysis as C(t − 1), a(t − 1), and r(t − 1). X(t) showed their interactions [i.e., C(t) × a(t), C(t) × r(t), a(t) × r(t), C(t) × a(t) × r(t)] with a dummy variable of 1 or −1; the four interactions had 4, 4, 4, and 8 combinations, respectively, for a total of 20 combinations. When a neuron represented at least one combination of an interaction, we defined the neuron as an interaction- or association-coding neuron. For example, when a neuron represented a combination of action and reward, i.e., a(t) × r(t), we defined the neuron as action-reward association coding. M_{C = C(t)} were the five model variables for the presented-trial condition, consisting of the action values (Q_{L, C(t)}, Q_{R, C(t)}), state value [P(a(t) = L) × Q_{L, C(t)} + (1 − P(a(t) = L)) × Q_{R, C(t)}], chosen value (Q_{a(t), C(t)}) and policy (Q_{L, C(t)} − Q_{R, C(t)}) (Lau and Glimcher, 2008; Ito and Doya, 2009; Sul et al., 2011). M_{C = V} were also model variables, but for the variable-reward condition. M_{C = V} were assumed to be tracked both in the variable-reward-condition and fixed-reward-condition trials in our reinforcement learning models (Equation 3). In contrast, values for the fixed-reward condition did not appear in the regression analysis, because the values turned out to be constant and were difficult to capture with the analysis (see Results). T(t) was the trial number for detecting a slow drift of firing rate.
When Equation (6) had significant regression coefficients (two-sided Student's t-test, p < 0.01), the neuron was defined as encoding the corresponding variables. In the model variables (i.e., M_{C = C(t)} and M_{C = V}), we could not get enough neurons encoding each individual variable, because of our sparse recording. Thus, we defined neurons as value coding when they encoded at least one of the five model variables. Model variables were derived from the proposed reinforcement-learning model in which free parameters were set to achieve the maximum likelihood in each session.
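The five model variables entering the regression follow directly from the action values and the predicted choice probability. A minimal sketch for a single trial and a single condition, with hypothetical function and key names:

```python
def model_variables(q_left, q_right, p_left, action):
    """Compute the five model variables for one trial and one condition.

    q_left, q_right: action values Q_L and Q_R for the condition of interest.
    p_left: predicted probability of a left choice, P(a(t) = L).
    action: the observed choice, "L" or "R".
    The returned key names are illustrative.
    """
    return {
        "q_left": q_left,
        "q_right": q_right,
        # Expected value under the current choice probabilities.
        "state_value": p_left * q_left + (1 - p_left) * q_right,
        # Value of the option actually chosen on this trial.
        "chosen_value": q_left if action == "L" else q_right,
        # Difference of action values, left minus right.
        "policy": q_left - q_right,
    }
```

Passing the trial condition's values gives M_{C = C(t)}, while always passing the variable-reward values gives M_{C = V}.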

First, to investigate neural correlates of actions (i.e., responses: R), rewards (i.e., outcomes: O) and R-O associations, regression analysis was conducted only with neural activities during variable-reward-condition trials. By reducing the condition terms at trial t, Equation (6) became as follows:

\begin{matrix} \begin{array}{l} y(t) = \beta_{0} + \beta_{1} a(t) + \beta_{2} r(t) + \beta_{3\text{--}6} X(t) + \beta_{7\text{--}11} M_{C = V}(t) \\ \quad + \beta_{12} C(t - 1) + \beta_{13} a(t - 1) + \beta_{14} r(t - 1) + \beta_{15} T(t). \end{array} & (7) \end{matrix}

Second, to investigate neural correlates of conditions (i.e., stimuli: S) and S-O associations, we extracted trials in which rats selected the optimal side of fixed-reward condition. By focusing on the optimal side, we excluded a potential bias caused by the choice asymmetry in the fixed-reward condition in which rats mainly selected the optimal side. By reducing the action terms at trial t, Equation (6) became as follows:

\begin{matrix} \begin{array}{l} y(t) = \beta_{0} + \beta_{1} C(t) + \beta_{2} r(t) + \beta_{3\text{--}6} X(t) + \beta_{7\text{--}10} M_{C = C(t)}(t) \\ \quad + \beta_{11\text{--}14} M_{C = V}(t) + \beta_{15} C(t - 1) + \beta_{16} a(t - 1) \\ \quad + \beta_{17} r(t - 1) + \beta_{18} T(t). \end{array} & (8) \end{matrix}

In Equation (8), model variables had 4 terms because the chosen value became identical to the action value in either a left or right choice. Third, to investigate value-coding neurons, the regression analysis of Equation (6) was applied to neural activities in all trials. Value-coding neurons were also investigated in fixed-reward-condition trials; in this case, Equation (7) was applied for fixed-reward-condition trials.
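The dependent variable y(t) in Equations (6)–(8) is a spike count in a 600 ms window advanced in 300 ms steps. A minimal sketch of that windowing for one trial, assuming spike times in milliseconds relative to some task event (the alignment event, the half-open window convention, and the name `sliding_spike_counts` are assumptions):

```python
def sliding_spike_counts(spike_times_ms, start_ms, stop_ms,
                         window_ms=600, step_ms=300):
    """Count spikes in overlapping windows covering [start_ms, stop_ms).

    spike_times_ms: spike times of one neuron in one trial, in ms.
    Each window is half-open, [t, t + window_ms); only windows that fit
    entirely before stop_ms are returned.
    """
    counts = []
    t = start_ms
    while t + window_ms <= stop_ms:
        counts.append(sum(1 for s in spike_times_ms if t <= s < t + window_ms))
        t += step_ms
    return counts
```

Running the regression separately on the counts from each window position would then trace how the coded variables change over the course of a trial.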