# Árvores de Decisão (Decision Trees)

### 📌 Classificação

* **Categoria:** Aprendizado de Máquina / Supervisionado
* **Tipo:** Classificação e Regressão
* **Criado por:** Vários pesquisadores (ID3: Quinlan, 1986; C4.5: Quinlan, 1993; CART: Breiman, 1984)
* **Complexidade:** O(n × m × log n) para treino, O(log n) para predição
* **Status:** ✅ Clássico, interpretável, amplamente utilizado

***

### 🎯 Descrição Geral

**Árvores de Decisão** são modelos de aprendizado de máquina que usam uma estrutura hierárquica (árvore) para tomar decisões. Cada nó interno representa um teste em um atributo, cada ramo representa o resultado do teste, e cada folha representa uma classe (classificação) ou valor (regressão).

```
                    ┌─────────────┐
                    │  Idade?     │
                    │  > 30?      │
                    └──────┬──────┘
                           │
            ┌──────────────┼──────────────┐
            ▼              ▼              ▼
      ┌──────────┐   ┌──────────┐   ┌──────────┐
      │ < 18     │   │ 18-30    │   │ > 30     │
      │ Recusar  │   │ Renda?   │   │ Aprovar  │
      └──────────┘   └────┬─────┘   └──────────┘
                          │
                    ┌─────┴─────┐
                    ▼           ▼
              ┌──────────┐ ┌──────────┐
              │ < 5000   │ │ >= 5000  │
              │ Recusar  │ │ Aprovar  │
              └──────────┘ └──────────┘
```

#### Analogia

> *"Uma árvore de decisão é como um fluxograma de perguntas e respostas: 'Tem diploma? Sim → Tem experiência? Sim → Contratar. Não → Treinar.' Cada resposta leva a uma nova pergunta até chegar a uma decisão final."*

***

### ⚙️ Conceitos Fundamentais

#### Estrutura da Árvore

| Componente     | Descrição                                 | Exemplo              |
| -------------- | ----------------------------------------- | -------------------- |
| **Nó Raiz**    | Primeiro teste (atributo mais importante) | "Idade > 30?"        |
| **Nó Interno** | Teste em um atributo                      | "Renda > 5000?"      |
| **Ramo**       | Resultado do teste                        | "Sim", "Não"         |
| **Nó Folha**   | Decisão final                             | "Aprovar", "Recusar" |

#### Métricas de Impureza (Escolha do Melhor Split)

| Métrica                   | Fórmula              | Faixa     | Melhor Valor |
| ------------------------- | -------------------- | --------- | ------------ |
| **Gini**                  | 1 - Σ p\_i²          | \[0, 0.5] | 0 (puro)     |
| **Entropia**              | -Σ p\_i × log₂(p\_i) | \[0, 1]   | 0 (puro)     |
| **Erro de Classificação** | 1 - max(p\_i)        | \[0, 0.5] | 0 (puro)     |

#### Ganho de Informação (Information Gain)

```
Ganho = Impureza(pai) - Σ (|filho|/|pai| × Impureza(filho))
```

***

### 🌳 Algoritmos de Árvore de Decisão

| Algoritmo | Ano  | Tipo                    | Divisão             | Suporte a Regressão |
| --------- | ---- | ----------------------- | ------------------- | ------------------- |
| **ID3**   | 1986 | Classificação           | Ganho de Informação | ❌ Não               |
| **C4.5**  | 1993 | Classificação           | Gain Ratio          | ❌ Não               |
| **CART**  | 1984 | Classificação/Regressão | Gini                | ✅ Sim               |
| **CHAID** | 1980 | Classificação           | Chi-quadrado        | ❌ Não               |

***

### 📝 Implementação

#### Implementação Manual (Classificação)

```python
import numpy as np
from collections import Counter

class DecisionTreeNode:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # Índice da feature para split
        self.threshold = threshold  # Valor do threshold
        self.left = left            # Subárvore esquerda (≤ threshold)
        self.right = right          # Subárvore direita (> threshold)
        self.value = value          # Valor da folha (classe)

class DecisionTreeClassifier:
    def __init__(self, max_depth=None, min_samples_split=2, criterion='gini'):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.criterion = criterion  # 'gini' ou 'entropy'
        self.root = None
    
    def fit(self, X, y):
        """Treina a árvore de decisão"""
        self.n_classes = len(np.unique(y))
        self.n_features = X.shape[1]
        self.root = self._grow_tree(X, y, depth=0)
    
    def _gini(self, y):
        """Calcula o índice Gini"""
        proportions = np.bincount(y) / len(y)
        return 1 - np.sum(proportions ** 2)
    
    def _entropy(self, y):
        """Calcula a entropia"""
        proportions = np.bincount(y) / len(y)
        return -np.sum([p * np.log2(p) for p in proportions if p > 0])
    
    def _impurity(self, y):
        """Calcula a impureza baseada no critério escolhido"""
        if self.criterion == 'gini':
            return self._gini(y)
        else:
            return self._entropy(y)
    
    def _best_split(self, X, y):
        """Encontra o melhor split para os dados"""
        m, n = X.shape
        best_gain = -1
        best_feature = None
        best_threshold = None
        
        current_impurity = self._impurity(y)
        
        for feature in range(n):
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                # Dividir os dados
                left_mask = X[:, feature] <= threshold
                right_mask = ~left_mask
                
                if np.sum(left_mask) < self.min_samples_split or np.sum(right_mask) < self.min_samples_split:
                    continue
                
                y_left = y[left_mask]
                y_right = y[right_mask]
                
                # Calcular impureza ponderada após o split
                impurity_left = self._impurity(y_left)
                impurity_right = self._impurity(y_right)
                weighted_impurity = (len(y_left) * impurity_left + len(y_right) * impurity_right) / len(y)
                
                # Calcular ganho de informação
                gain = current_impurity - weighted_impurity
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        
        return best_feature, best_threshold, best_gain
    
    def _grow_tree(self, X, y, depth):
        """Cresce a árvore recursivamente"""
        # Verificar critérios de parada
        if len(np.unique(y)) == 1:  # Todos da mesma classe
            return DecisionTreeNode(value=y[0])
        
        if depth == self.max_depth:  # Profundidade máxima atingida
            return DecisionTreeNode(value=Counter(y).most_common(1)[0][0])
        
        if len(y) < self.min_samples_split:  # Poucas amostras
            return DecisionTreeNode(value=Counter(y).most_common(1)[0][0])
        
        # Encontrar melhor split
        feature, threshold, gain = self._best_split(X, y)
        
        if feature is None:  # Não encontrou split válido
            return DecisionTreeNode(value=Counter(y).most_common(1)[0][0])
        
        # Dividir os dados
        left_mask = X[:, feature] <= threshold
        right_mask = ~left_mask
        
        X_left, y_left = X[left_mask], y[left_mask]
        X_right, y_right = X[right_mask], y[right_mask]
        
        # Crescer subárvores
        left = self._grow_tree(X_left, y_left, depth + 1)
        right = self._grow_tree(X_right, y_right, depth + 1)
        
        return DecisionTreeNode(feature=feature, threshold=threshold, left=left, right=right)
    
    def predict(self, X):
        """Faz predições para novos dados"""
        return np.array([self._predict_sample(x, self.root) for x in X])
    
    def _predict_sample(self, x, node):
        """Predição para uma amostra"""
        if node.value is not None:
            return node.value
        
        if x[node.feature] <= node.threshold:
            return self._predict_sample(x, node.left)
        else:
            return self._predict_sample(x, node.right)

# Exemplo de uso
if __name__ == "__main__":
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    
    # Carregar dados
    iris = load_iris()
    X, y = iris.data, iris.target
    
    # Dividir treino/teste
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Treinar árvore
    tree = DecisionTreeClassifier(max_depth=5, criterion='gini')
    tree.fit(X_train, y_train)
    
    # Predizer
    y_pred = tree.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Acurácia: {accuracy:.2f}")
```

***

### 💻 Implementação com Scikit-learn

#### Classificação

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Carregar dados
iris = load_iris()
X, y = iris.data, iris.target

# Criar e treinar modelo
clf = DecisionTreeClassifier(
    criterion='gini',      # 'gini' ou 'entropy'
    max_depth=5,           # Profundidade máxima
    min_samples_split=2,   # Mínimo amostras para split
    min_samples_leaf=1,    # Mínimo amostras por folha
    random_state=42
)

clf.fit(X, y)

# Visualizar árvore
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()

# Predizer
predicoes = clf.predict(X)
probabilidades = clf.predict_proba(X)  # Probabilidades das classes

# Importância das features
print("Importância das features:")
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

#### Regressão

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import numpy as np

# Gerar dados de regressão
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Criar e treinar modelo
regressor = DecisionTreeRegressor(
    max_depth=5,
    min_samples_split=5,
    random_state=42
)
regressor.fit(X, y)

# Predizer
X_test = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
y_pred = regressor.predict(X_test)

# Visualizar
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Dados')
plt.plot(X_test, y_pred, 'r-', label='Árvore de Decisão')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
```

***

### 📊 Exemplos Detalhados

#### Exemplo 1: Classificação de Risco de Crédito

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Criar dados fictícios
data = pd.DataFrame({
    'idade': [25, 35, 45, 22, 50, 30, 28, 40, 55, 32],
    'renda': [3000, 5000, 8000, 2000, 10000, 4500, 3500, 6000, 12000, 4000],
    'divida': [500, 1000, 2000, 800, 500, 1500, 600, 1800, 3000, 900],
    'aprovado': [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]  # 0: recusado, 1: aprovado
})

# Features e target
X = data[['idade', 'renda', 'divida']]
y = data['aprovado']

# Dividir dados
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Treinar árvore
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Avaliar
y_pred = clf.predict(X_test)
print("Matriz de Confusão:")
print(confusion_matrix(y_test, y_pred))
print("\nRelatório de Classificação:")
print(classification_report(y_test, y_pred))

# Interpretar a árvore
from sklearn.tree import export_text
print("\nEstrutura da Árvore:")
print(export_text(clf, feature_names=['idade', 'renda', 'divida']))
```

#### Exemplo 2: Predição de Sobrevivência no Titanic

```python
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# Carregar dados do Titanic
titanic = sns.load_dataset('titanic')

# Preparar dados
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']
X = titanic[features].copy()
y = titanic['survived']

# Tratar valores faltantes
X['age'].fillna(X['age'].median(), inplace=True)

# Codificar variáveis categóricas
le = LabelEncoder()
X['sex'] = le.fit_transform(X['sex'])  # male=1, female=0

# Dividir
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Treinar
clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)

print(f"Acurácia: {clf.score(X_test, y_test):.2f}")

# Mostrar regras
print("\nRegras da Árvore:")
print(export_text(clf, feature_names=features))
```

***

### 🎨 Visualização e Interpretação

#### Regras de Decisão

```python
from sklearn.tree import export_text

# Extrair regras como texto
rules = export_text(clf, feature_names=list(X.columns))
print(rules)

# Exemplo de saída:
# |--- sex <= 0.50
# |   |--- age <= 9.50
# |   |   |--- class: 1
# |   |--- age > 9.50
# |   |   |--- pclass <= 2.50
# |   |   |   |--- class: 1
# |   |   |--- pclass > 2.50
# |   |   |   |--- class: 0
# |--- sex > 0.50
# |   |--- fare <= 26.00
# |   |   |--- class: 0
# |   |--- fare > 26.00
# |   |   |--- class: 1
```

#### Importância das Features

```python
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(8, 5))
plt.title("Importância das Features")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), [X.columns[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()
```

***

### 🎯 Vantagens e Desvantagens

#### Vantagens

| Vantagem                    | Descrição                        |
| --------------------------- | -------------------------------- |
| **Interpretabilidade**      | Fácil de entender e explicar     |
| **Pouco pré-processamento** | Não precisa normalizar dados     |
| **Lida com dados mistos**   | Numéricos e categóricos          |
| **Não linear**              | Captura relações complexas       |
| **Feature importance**      | Identifica variáveis importantes |

#### Desvantagens

| Desvantagem          | Descrição                                  | Mitigação                 |
| -------------------- | ------------------------------------------ | ------------------------- |
| **Overfitting**      | Árvores profundas memorizam ruído          | Podar árvore (max\_depth) |
| **Instabilidade**    | Pequenas mudanças criam árvores diferentes | Random Forest             |
| **Viés**             | Prefere features com muitos valores        | Balanceamento             |
| **Otimização local** | Splits gulosos (não global)                | Ensemble methods          |

***

### 🔧 Hyperparâmetros

| Parâmetro               | Valores Típicos   | Efeito                                |
| ----------------------- | ----------------- | ------------------------------------- |
| **max\_depth**          | 3-10              | Controla complexidade (↓ overfitting) |
| **min\_samples\_split** | 2-20              | Mínimo para split (↑ generalização)   |
| **min\_samples\_leaf**  | 1-10              | Mínimo por folha (↑ suavidade)        |
| **max\_features**       | 'sqrt', 'log2'    | Features consideradas por split       |
| **criterion**           | 'gini', 'entropy' | Métrica de impureza                   |

#### Otimização com GridSearch

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Melhores parâmetros: {grid_search.best_params_}")
print(f"Melhor acurácia: {grid_search.best_score_:.3f}")
```

***

### 🌲 Poda (Pruning) de Árvores

#### Pre-pruning (Durante a construção)

```python
# Limitar profundidade
clf = DecisionTreeClassifier(max_depth=5)

# Limitar amostras por folha
clf = DecisionTreeClassifier(min_samples_leaf=10)

# Limitar amostras para split
clf = DecisionTreeClassifier(min_samples_split=20)
```

#### Post-pruning (Após construção - Cost Complexity)

```python
from sklearn.tree import DecisionTreeClassifier

# Treinar árvore completa
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Obter valores de alpha para poda
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Treinar árvores com diferentes alphas
clfs = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    clf.fit(X_train, y_train)
    clfs.append(clf)

# Escolher o melhor alpha (validação)
# ...
```

***

### 📈 Performance vs Interpretabilidade

```
Interpretabilidade
       ▲
       │
Alta   │  Árvores de Decisão (regras simples)
       │
Média  │  Random Forest (difícil interpretar)
       │
Baixa  │  Redes Neurais (caixa preta)
       │
       └────────────────────────────────────► Performance
                Baixa              Alta
```

***

### 🔗 Links Úteis

* [Scikit-learn Decision Trees](https://scikit-learn.org/stable/modules/tree.html)
* [Interactive Decision Tree Visualization](https://www.cs.ubc.ca/~murphyk/Software/forest/decisionTree.html)

***

### 📚 Livros Recomendados

| Livro                                | Autor              | Foco            |
| ------------------------------------ | ------------------ | --------------- |
| The Elements of Statistical Learning | Hastie, Tibshirani | Teoria avançada |
| Introduction to Machine Learning     | Ethem Alpaydin     | Fundamentos     |
| Machine Learning with Scikit-Learn   | Géron              | Prático         |


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://0xmorte.gitbook.io/bibliadopentestbr/conceitos/algoritmos/aprendizado-de-maquina-machine-learning/arvores-de-decisao-decision-trees.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.