Course Assignment 1

Author

Yang Zhao - Karolinska Institutet

Published

2025-09-27

Task 1 - Literature

2a. What is the medically relevant insight from the article?

The article provides a significant medical insight into Systemic Lupus Erythematosus (SLE) by identifying two distinct gene expression patterns in the immune cells of SLE patients, including disease-state and disease-activity signature. Furthermore, the disease-activity signature is clinically relevant to personalized treatment.

2b. Which genomics technology/technologies were used?

The primary genomics technology used in this study was bulk RNA sequencing (RNA-seq). The study also integrated data from other genomics technologies, including genome-wide association studies (GWAS), expression quantitative trait loci (eQTL) and whole-genome sequencing (WGS).

3a. List and explain at least three questions/ hypotheses you can think of that extend the analysis presented in the paper.

  • The “disease-activity” signature can be used as a predictive biomarker for the response to a broader range of SLE therapies.
  • The cell-type-specific “disease-activity” signatures may differ significantly across diverse ethnic populations.
  • Long non-coding RNAs (lncRNAs) maybe key regulators of the “disease-activity” signature in specific immune cell types.

Task 4 - R basic operations

1. What is the square root of 10?

sqrt(10)
[1] 3.162278

2. What is the logarithm of 32 to the base 2?

log2(32)
[1] 5

3. What is the sum of the numbers from 1 to 1000?

sum(1:1000)
[1] 500500

4. What is the sum of all even numbers from 2 to 1000?

sum(seq(2, 1000, by = 2))
[1] 250500

5. How many pairwise comparisons are there for 100 genes?

choose(100, 2)
[1] 4950

6. And how many ways to arrange 100 genes in triples?

choose(100, 3)
[1] 161700

Task 5 - Using R example datasets

1. Use the R internal CO2 dataset (“data(CO2)”).

data(CO2)

2. Describe briefly the content of the CO2 dataset using the help function.

help(CO2)

CO2 uptake of six plants in different CO2 concentration from Quebec and Mississippi.

3. What is the average and median CO2 uptake of the plants from Quebec and Mississippi?

summary(CO2$uptake)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.70   17.90   28.30   27.21   37.12   45.50 

The average of the whole dataset is 27.21 and median is 28.30.

suppressMessages({library(tidyverse)})

CO2 %>% filter(Type == "Quebec") %>% pull(uptake) %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.30   30.32   37.15   33.54   40.15   45.50 
CO2 %>% filter(Type == "Mississippi") %>% pull(uptake) %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.70   13.88   19.30   20.88   28.05   35.50 

The average for Quebec is 33.54 and median is 37.15. The average for Mississippi is 20.88 and median is 19.30.

Task 6 - R Functions

1. Write a function that calculates the ratio of the mean and the median of a given vector.

func <- function(v) {
  return(
    as.numeric(summary(v)["Mean"] / summary(v)["Median"]) %>%
      `names<-`("Ratio of the mean and the median")
  ) 
}

func(CO2$uptake)
Ratio of the mean and the median 
                       0.9615935 

2. Write a function that ignores the lowest and the highest value from a given vector and calculate the mean.

fuc <- function(v) {
  v <- sort(v)
  v <- v[-c(1,length(v))]
  return(mean(v))
}
fuc(c(2, 1, 100, 3))
[1] 2.5

3. Read about piping from here:https://r4ds.had.co.nz/pipes.html#pipes (you don’t have to learn everything, a basic understanding of the usage is enough). Write a short (max. 300 characters, no spaces) explanation of why, how, and when not to use pipes.

The pipe %>% in R improves readability by chaining operations left to right, reducing nested function calls. Use it when transforming data step by step. Avoid it for very simple code, deeply nested functions, or debugging, as pipes can obscure intermediate results and error sources.

4. Familiarize yourself with the apply-family of functions (apply, lapply, sapply etc.) http://uc-r.github.io/apply_family Write a short explanation (max. 300 characters, no spaces) of why they could be useful in your work.

The apply-family functions make it possible to apply a function over elements of vectors, lists, or rows/columns of matrices without explicit loops, making code cleaner, faster, and more efficient for data analysis, paticularly when we have more than one samples, count matrix, data frames, etc.

Task 7 - Basic visualization with R

suppressMessages({library(nummenmaa)})

1. Compare the distributions of the body heights of the two species from the ‘magic_guys.csv’ dataset graphically

a. using the basic ‘hist’ function

magic_guys <- read_csv("magic_guys.csv", show_col_types = FALSE)
jedi <- magic_guys %>% filter(species == "jedi")
sith <- magic_guys %>% filter(species == "sith")


hist(jedi$length, 
     main = "Height Distribution of Jedi",
     xlab = "Height", 
     col = "lightblue")

hist(sith$length, 
     main = "Height Distribution of ",
     xlab = "Height", 
     col = "lightblue")

hist(jedi$length, 
     main = "Height Distribution of Jedi",
     breaks = seq(min(jedi$length), max(jedi$length), length.out = 10),
     xlab = "Height", 
     col = "lightblue")

hist(sith$length, 
     main = "Height Distribution of Sith",
     breaks = seq(min(sith$length), max(sith$length), length.out = 10),
     xlab = "Height", 
     col = "lightblue")

b. using ggplot2-package

p1 <- ggplot(magic_guys, aes(x = species, y = length, fill = species)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 2) +
  labs(title = "Height Distribution by Species",
       x = "Species",
       y = "Height") +
  theme_test() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("lightblue", "lightcoral"))

p2 <- ggplot(magic_guys, aes(x = species, y = length, fill = species)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.5, size = 1.5) +
  labs(title = "Height Distribution with Individual Data Points",
       x = "Species",
       y = "Height") +
  theme_test() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("lightblue", "lightcoral"))

print(p1) ; print(p2)

c. Save the plots

# pdf("xxxxxxxxx", height = 5, width = 4) # save as pdf
# dev.off()
# ggsave("xxxxxxxxx", height = 5, width = 4) # save as pdf/svg/png

PDFs are best for high-quality vector graphics, PNG for web use or raster images, and SVG for scalable graphics in interactive or web contexts.

2. Load the gene expression data matrix from the ‘microarray_data.tab’ dataset provided in the shared folder, it is a big tabular separated matrix.

a. How big is the matrix in terms of rows and columns?

microarray <- read_tsv("microarray_data.tab", show_col_types = FALSE)
dim(microarray)
[1]  551 1000

The matrix is 551 rows and 1000 columns.

b. Count the missing values per gene and visualize this result.

data <- microarray %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "column", values_to = "missing_count") %>%
  mutate(missing_pct = (missing_count / nrow(.)) * 100)

ggplot(data, aes(y = missing_count)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  labs(title = "Missing values per gene",
       x = "",
       y = "Missing count") +
  theme_test()

c. Find the genes for which there are more than X% (X=10%, 20%, 50%) missing values.

map(c(10, 20, 50), ~{
  data %>% filter(missing_pct > .x) %>% pull(column)
})
[[1]]
  [1] "g1"   "g2"   "g12"  "g14"  "g18"  "g21"  "g22"  "g24"  "g25"  "g26" 
 [11] "g29"  "g39"  "g40"  "g41"  "g46"  "g48"  "g49"  "g51"  "g52"  "g53" 
 [21] "g54"  "g55"  "g56"  "g57"  "g58"  "g59"  "g60"  "g61"  "g62"  "g63" 
 [31] "g64"  "g65"  "g66"  "g67"  "g68"  "g69"  "g70"  "g71"  "g72"  "g73" 
 [41] "g74"  "g75"  "g76"  "g77"  "g78"  "g79"  "g80"  "g81"  "g82"  "g83" 
 [51] "g84"  "g85"  "g86"  "g87"  "g88"  "g89"  "g90"  "g91"  "g92"  "g93" 
 [61] "g94"  "g95"  "g96"  "g97"  "g98"  "g99"  "g100" "g101" "g102" "g103"
 [71] "g104" "g105" "g106" "g107" "g108" "g109" "g110" "g111" "g112" "g113"
 [81] "g114" "g115" "g116" "g117" "g118" "g119" "g120" "g130" "g131" "g132"
 [91] "g133" "g134" "g135" "g136" "g137" "g138" "g139" "g140" "g142" "g147"
[101] "g148" "g151" "g152" "g153" "g154" "g155" "g156" "g157" "g158" "g159"
[111] "g160" "g165" "g171" "g172" "g173" "g174" "g175" "g176" "g177" "g178"
[121] "g179" "g180" "g196" "g200" "g204" "g210" "g233" "g252" "g260" "g290"
[131] "g296" "g297" "g301" "g321" "g329" "g332" "g333" "g334" "g335" "g344"
[141] "g351" "g352" "g353" "g354" "g355" "g356" "g357" "g358" "g359" "g360"
[151] "g361" "g362" "g363" "g364" "g365" "g366" "g367" "g368" "g369" "g370"
[161] "g379" "g381" "g382" "g383" "g384" "g385" "g386" "g387" "g388" "g389"
[171] "g390" "g391" "g396" "g400" "g401" "g402" "g403" "g404" "g405" "g406"
[181] "g407" "g408" "g409" "g410" "g415" "g417" "g418" "g431" "g432" "g433"
[191] "g434" "g435" "g436" "g437" "g438" "g439" "g440" "g445" "g450" "g455"
[201] "g461" "g462" "g463" "g464" "g465" "g466" "g467" "g468" "g469" "g470"
[211] "g476" "g479" "g491" "g492" "g493" "g494" "g495" "g496" "g497" "g498"
[221] "g499" "g500" "g510" "g513" "g515" "g518" "g519" "g520" "g522" "g527"
[231] "g531" "g532" "g533" "g534" "g535" "g536" "g537" "g538" "g539" "g540"
[241] "g544" "g547" "g558" "g561" "g569" "g570" "g571" "g572" "g573" "g574"
[251] "g575" "g576" "g577" "g578" "g579" "g580" "g583" "g585" "g586" "g591"
[261] "g595" "g599" "g611" "g612" "g613" "g614" "g615" "g616" "g617" "g618"
[271] "g619" "g620" "g650" "g657" "g663" "g669" "g689" "g691" "g694" "g707"
[281] "g711" "g744" "g751" "g752" "g753" "g754" "g755" "g756" "g757" "g758"
[291] "g759" "g760" "g761" "g762" "g763" "g764" "g765" "g766" "g767" "g768"
[301] "g769" "g770" "g781" "g782" "g783" "g784" "g785" "g786" "g787" "g788"
[311] "g789" "g790" "g801" "g802" "g803" "g804" "g805" "g806" "g807" "g808"
[321] "g809" "g810" "g814" "g822" "g831" "g832" "g833" "g834" "g835" "g836"
[331] "g837" "g838" "g839" "g840" "g849" "g850" "g851" "g854" "g861" "g862"
[341] "g863" "g864" "g865" "g866" "g867" "g868" "g869" "g870" "g872" "g891"
[351] "g892" "g893" "g894" "g895" "g896" "g897" "g898" "g899" "g900" "g909"
[361] "g910" "g919" "g926" "g931" "g932" "g933" "g934" "g935" "g936" "g937"
[371] "g938" "g939" "g940" "g947" "g951" "g970" "g971" "g972" "g973" "g974"
[381] "g975" "g976" "g977" "g978" "g979" "g980" "g985" "g989"

[[2]]
  [1] "g18"  "g21"  "g29"  "g41"  "g48"  "g51"  "g52"  "g53"  "g55"  "g58" 
 [11] "g59"  "g60"  "g62"  "g63"  "g64"  "g66"  "g73"  "g74"  "g75"  "g76" 
 [21] "g77"  "g79"  "g82"  "g83"  "g84"  "g88"  "g89"  "g90"  "g91"  "g92" 
 [31] "g93"  "g94"  "g96"  "g97"  "g98"  "g99"  "g100" "g101" "g104" "g105"
 [41] "g106" "g107" "g108" "g109" "g112" "g113" "g114" "g115" "g116" "g118"
 [51] "g119" "g120" "g131" "g132" "g135" "g137" "g138" "g153" "g157" "g165"
 [61] "g171" "g172" "g174" "g176" "g178" "g179" "g204" "g252" "g260" "g290"
 [71] "g301" "g329" "g351" "g352" "g353" "g355" "g356" "g357" "g358" "g359"
 [81] "g360" "g362" "g363" "g364" "g367" "g368" "g370" "g379" "g381" "g382"
 [91] "g383" "g384" "g385" "g386" "g387" "g388" "g389" "g390" "g391" "g396"
[101] "g401" "g404" "g405" "g406" "g410" "g417" "g418" "g431" "g432" "g433"
[111] "g436" "g437" "g438" "g439" "g440" "g461" "g462" "g465" "g466" "g469"
[121] "g470" "g491" "g496" "g497" "g498" "g515" "g518" "g519" "g527" "g531"
[131] "g532" "g535" "g536" "g537" "g538" "g547" "g558" "g561" "g569" "g571"
[141] "g572" "g573" "g575" "g576" "g577" "g585" "g612" "g613" "g614" "g615"
[151] "g617" "g618" "g619" "g650" "g657" "g663" "g669" "g689" "g751" "g753"
[161] "g754" "g756" "g758" "g759" "g760" "g763" "g765" "g766" "g768" "g782"
[171] "g783" "g784" "g785" "g786" "g787" "g788" "g789" "g801" "g802" "g804"
[181] "g806" "g807" "g808" "g809" "g810" "g832" "g833" "g836" "g838" "g851"
[191] "g854" "g862" "g863" "g864" "g865" "g867" "g869" "g870" "g892" "g893"
[201] "g896" "g897" "g898" "g899" "g900" "g919" "g932" "g933" "g934" "g936"
[211] "g937" "g939" "g971" "g972" "g973" "g974" "g975" "g976" "g979" "g980"

[[3]]
 [1] "g18"  "g55"  "g58"  "g79"  "g83"  "g99"  "g135" "g137" "g138" "g329"
[11] "g352" "g368" "g389" "g390" "g417" "g431" "g498" "g519" "g527" "g531"
[21] "g538" "g576" "g577" "g615" "g751" "g753" "g788" "g802" "g838" "g864"
[31] "g892" "g893" "g919" "g971"

d. Replace the missing values by the average expression value for the particular gene.

microarray <- microarray %>%
  mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

3.Visualize the data in the CO2 dataset in a way that gives you a deeper understanding of the data. What do you see?

suppressMessages({library(patchwork)})
p1 <- ggplot(CO2, aes(x = uptake)) +
  geom_histogram(bins = 20, fill = "lightblue", alpha = 0.7, color = "black") +
  labs(title = "Distribution of CO2 Uptake Rates",
       x = "CO2 Uptake Rate (μmol/m²·s)",
       y = "Frequency") +
  theme_minimal()

p2 <- ggplot(CO2, aes(x = uptake, fill = Type)) +
  geom_histogram(bins = 15, alpha = 0.7, position = "identity") +
  facet_grid(Treatment ~ Type) +
  labs(title = "CO2 Uptake Distribution by Origin and Treatment",
       x = "CO2 Uptake Rate (μmol/m²·s)",
       y = "Frequency") +
  theme_minimal() +
  theme(legend.position = "bottom")

p3 <- ggplot(CO2, aes(x = conc, y = uptake)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "loess", se = TRUE, color = "red", linewidth = 1.2) +
  labs(title = "CO2 Uptake vs Concentration (Overall Pattern)",
       x = "CO2 Concentration (mL/L)",
       y = "CO2 Uptake Rate (μmol/m²·s)") +
  theme_minimal()

p4 <- ggplot(CO2, aes(x = conc, y = uptake, color = Type)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "loess", se = FALSE, linewidth = 1.2) +
  facet_wrap(~Treatment) +
  labs(title = "CO2 Uptake vs Concentration by Origin and Treatment",
       x = "CO2 Concentration (mL/L)",
       y = "CO2 Uptake Rate (μmol/m²·s)") +
  theme_minimal() +
  scale_color_manual(values = c("Quebec" = "blue", "Mississippi" = "red")) +
  theme(legend.position = "bottom")

p5 <- ggplot(CO2, aes(x = conc, y = uptake, color = Type)) +
  geom_line(aes(group = Plant), alpha = 0.7, linewidth = 0.8) +
  geom_point(alpha = 0.5, size = 1.5) +
  facet_grid(Treatment ~ Type) +
  labs(title = "Individual Plant CO2 Response Curves",
       x = "CO2 Concentration (mL/L)",
       y = "CO2 Uptake Rate (μmol/m²·s)",
       subtitle = "Each line represents one plant") +
  theme_minimal() +
  theme(legend.position = "none")

p6 <- ggplot(CO2, aes(x = Treatment, y = uptake, fill = Treatment)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
  facet_wrap(~Type) +
  labs(title = "CO2 Uptake Comparison: Chilled vs Non-chilled",
       x = "Treatment",
       y = "CO2 Uptake Rate (μmol/m²·s)") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("nonchilled" = "orange", "chilled" = "lightblue"))

p7 <- ggplot(CO2, aes(x = Type, y = uptake, fill = Type)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
  facet_wrap(~Treatment) +
  labs(title = "CO2 Uptake Comparison: Quebec vs Mississippi Origins",
       x = "Plant Origin",
       y = "CO2 Uptake Rate (μmol/m²·s)") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("Quebec" = "darkblue", "Mississippi" = "darkred"))

p8 <- ggplot(CO2, aes(x = factor(conc), y = uptake, fill = interaction(Type, Treatment))) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "CO2 Uptake at Different Concentration Levels",
       x = "CO2 Concentration (mL/L)",
       y = "CO2 Uptake Rate (μmol/m²·s)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title = element_blank()) +
  scale_fill_manual(values = c("Quebec.nonchilled" = "lightblue",
                              "Quebec.chilled" = "blue",
                              "Mississippi.nonchilled" = "lightcoral",
                              "Mississippi.chilled" = "red"))

interaction_data <- CO2 %>%
  group_by(Type, Treatment, conc) %>%
  summarise(mean_uptake = mean(uptake),
            se_uptake = sd(uptake) / sqrt(n()),
            .groups = 'drop')
p9 <- ggplot(interaction_data, aes(x = conc, y = mean_uptake, color = Type, linetype = Treatment)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_uptake - se_uptake, ymax = mean_uptake + se_uptake), 
                width = 20, alpha = 0.7) +
  labs(title = "Mean CO2 Uptake Response with Standard Errors",
       x = "CO2 Concentration (mL/L)",
       y = "Mean CO2 Uptake Rate (μmol/m²·s)",
       subtitle = "Error bars represent standard error of the mean") +
  theme_minimal() +
  scale_color_manual(values = c("Quebec" = "blue", "Mississippi" = "red")) +
  theme(legend.position = "bottom")

wrap_plots(p1, p2, p3, p4, p5, p6, p7, p8, p9, nrow = 2)

Task 8

library(tidybiology)
data("chromosome")
data("proteins")

a. Extract summary statistics (mean, median and maximum) for the following variables from the ‘chromosome’ data: variations, protein coding genes, and miRNAs. Utilize the tidyverse functions to make this as simply as possible.

# Extract summary statistics using tidyverse functions
chromosome_summary <- chromosome %>%
  summarise(
    # Variations statistics
    variations_mean = mean(variations, na.rm = TRUE),
    variations_median = median(variations, na.rm = TRUE),
    variations_max = max(variations, na.rm = TRUE),
    
    # Protein coding genes statistics
    protein_codinggenes_mean = mean(protein_codinggenes, na.rm = TRUE),
    protein_codinggenes_median = median(protein_codinggenes, na.rm = TRUE),
    protein_codinggenes_max = max(protein_codinggenes, na.rm = TRUE),
    
    # miRNAs statistics
    miRNAs_mean = mean(mi_rna, na.rm = TRUE),
    miRNAs_median = median(mi_rna, na.rm = TRUE),
    miRNAs_max = max(mi_rna, na.rm = TRUE)
  )

# Display results in a more readable format
summary_long <- chromosome_summary %>%
  pivot_longer(
    cols = everything(),
    names_to = c("variable", "statistic"),
    names_sep = "_(?=[^_]+$)",  # Split on the last underscore
    values_to = "value"
  ) %>%
  pivot_wider(
    names_from = statistic,
    values_from = value
  ) %>%
  mutate(
    variable = case_when(
      variable == "variations" ~ "Variations",
      variable == "protein_codinggenes" ~ "Protein Coding Genes",
      variable == "miRNAs" ~ "miRNAs"
    ),
    across(c(mean, median, max), ~round(.x, 2))
  )
print(summary_long)
# A tibble: 3 × 4
  variable                  mean  median      max
  <chr>                    <dbl>   <dbl>    <dbl>
1 Variations           6484572.  6172346 12945965
2 Protein Coding Genes     850.      836     2058
3 miRNAs                    73.2      75      134

b. How does the chromosome size distribute? Plot a graph that helps to visualize this by using ggplot2 package functions.

p1 <- ggplot(chromosome, aes(x = length_mm)) +
  geom_histogram(bins = 15, fill = "steelblue", alpha = 0.7, color = "black") +
  geom_vline(aes(xintercept = mean(length_mm)), 
             color = "red", linetype = "dashed", size = 1) +
  geom_vline(aes(xintercept = median(length_mm)), 
             color = "orange", linetype = "dashed", size = 1) +
  labs(title = "Distribution of Human Chromosome Sizes",
       x = "Chromosome Length (mm)",
       y = "Frequency",
       subtitle = "Red line: Mean, Orange line: Median") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 10))

p2 <- ggplot(chromosome, aes(y = length_mm)) +
  geom_boxplot(fill = "lightcoral", alpha = 0.7, width = 0.5) +
  geom_jitter(aes(x = 0), width = 0.1, alpha = 0.6, size = 2, color = "darkred") +
  labs(title = "Chromosome Size Distribution (Boxplot)",
       y = "Chromosome Length (mm)") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

p3 <- chromosome %>%
  arrange(desc(length_mm)) %>%
  mutate(id = factor(id, levels = id)) %>%
  ggplot(aes(x = id, y = length_mm)) +
  geom_col(fill = "darkorchid", alpha = 0.8) +
  labs(title = "Individual Chromosome Sizes",
       x = "Chromosome",
       y = "Length (mm)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

wrap_plots(p1, p2, p3, nrow = 1)

c. Does the number of protein coding genes or miRNAs correlate with the length of the chromosome? Make two separate plots to visualize these relationships.

cor_protein_genes <- cor(chromosome$length_mm, chromosome$protein_codinggenes, use = "complete.obs")
cor_mirnas <- cor(chromosome$length_mm, chromosome$mi_rna, use = "complete.obs")

p4 <- ggplot(chromosome, aes(x = length_mm, y = protein_codinggenes)) +
  geom_point(size = 3, alpha = 0.7, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.2) +
  labs(title = "Protein Coding Genes vs Chromosome Length",
       x = "Chromosome Length (mm)",
       y = "Number of Protein Coding Genes",
       subtitle = paste("Correlation coefficient:", round(cor_protein_genes, 3))) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12))

p5 <- ggplot(chromosome, aes(x = length_mm, y = mi_rna)) +
  geom_point(size = 3, alpha = 0.7, color = "darkblue") +
  geom_smooth(method = "lm", se = TRUE, color = "orange", linewidth = 1.2) +
  labs(title = "miRNAs vs Chromosome Length",
       x = "Chromosome Length (mm)",
       y = "Number of miRNAs",
       subtitle = paste("Correlation coefficient:", round(cor_mirnas, 3))) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12))

wrap_plots(p4, p5, nrow = 1)

d. Calculate the same summary statistics for the ‘proteins’ data variables length and mass. Create a meaningful visualization of the relationship between these two variables by utilizing the ggplot2 package functions.

proteins_summary <- proteins %>%
  summarise(
    across(c(length, mass), 
           list(mean = ~mean(.x, na.rm = TRUE),
                median = ~median(.x, na.rm = TRUE),
                max = ~max(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}")
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("variable", "statistic"),
    names_sep = "_",
    values_to = "value"
  ) %>%
  pivot_wider(
    names_from = statistic,
    values_from = value
  ) %>%
  mutate(
    variable = case_when(
      variable == "length" ~ "Protein Length",
      variable == "mass" ~ "Protein Mass"
    ),
    across(c(mean, median, max), ~round(.x, 2))
  )

proteins_summary
suppressMessages({library(scales)})

cor_length_mass <- cor(proteins$length, proteins$mass, use = "complete.obs")

p6 <- proteins %>%
  filter(!is.na(length), !is.na(mass)) %>%
  filter(length < quantile(length, 0.99, na.rm = TRUE),
         mass < quantile(mass, 0.99, na.rm = TRUE)) %>%
  ggplot(aes(x = length, y = mass)) +
  geom_hex(alpha = 0.8, bins = 30) +
  geom_smooth(method = "lm", se = TRUE, color = "red", linewidth = 1.5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Count") +
  labs(title = "Relationship Between Protein Length and Mass",
       x = "Protein Length (amino acids)",
       y = "Protein Mass (Da)",
       subtitle = paste("Correlation coefficient:", round(cor_length_mass, 3))) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "darkblue"),
    plot.subtitle = element_text(hjust = 0.5, size = 12, color = "gray40"),
    axis.title = element_text(size = 12, face = "bold"),
    legend.position = "right",
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "gray98", color = NA),
    plot.background = element_rect(fill = "white", color = NA)
  ) +
  scale_x_continuous(labels = comma_format()) +
  scale_y_continuous(labels = comma_format())

p7 <- proteins %>%
  filter(!is.na(length), !is.na(mass)) %>%
  filter(length < quantile(length, 0.95, na.rm = TRUE),
         mass < quantile(mass, 0.95, na.rm = TRUE)) %>%
  ggplot(aes(x = length, y = mass)) +
  geom_point(alpha = 0.3, size = 0.8, color = "steelblue") +
  geom_density_2d_filled(alpha = 0.6, contour_var = "ndensity") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linewidth = 2) +
  scale_fill_viridis_d(name = "Density", option = "plasma") +
  labs(title = "Protein Length vs Mass Relationship",
       x = "Protein Length (amino acids)",
       y = "Protein Mass (Da)",
       subtitle = "Density contours show data distribution patterns") +
  theme_dark() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "white"),
    plot.subtitle = element_text(hjust = 0.5, size = 11, color = "gray80"),
    axis.title = element_text(size = 12, face = "bold", color = "white"),
    axis.text = element_text(color = "white"),
    legend.title = element_text(color = "white"),
    legend.text = element_text(color = "white"),
    panel.grid.major = element_line(color = "gray30", size = 0.3),
    panel.grid.minor = element_blank(),
    legend.background = element_rect(fill = "gray10", color = NA),
    plot.background = element_rect(fill = "gray10", color = NA)
  ) +
  scale_x_continuous(labels = comma_format()) +
  scale_y_continuous(labels = comma_format())

wrap_plots(p6, p7, nrow = 1)