Time20.07Day 1
Day 2
Day 3
Day 4
Day 5
Day 6
Day 7
Statistics for data scienceBasics of machine learning (ML)Prompting techniques and LLM APIsTime series: Forecasting, XAI, and databasesLLMs-based data processingLM-based agents for data scienceData pipeline scheduling on the computing continuumDepartures
Linear models for classificationData-driven decision makingGraph data managementConversational AIFindable, Accessible, Interoperable, Reusable (FAIR) dataOperationalizing data and ML pipelines
Unlocking Business Value with Data Science – OMV Petrom’s Approach
ArrivalsLunch breakLunch break
Lunch break
Statistical learningApplied Deep LearningCausal AIGraph data analyticsSocial eventBest practices in data sharingDeployment, orchestration, monitoring of data and ML pipelines
Basics of Large Language Models (LLMs)Time series analysis and forecastingData enrichmentHigh performance data processing
Intro eventFree time

Detailed information about the activities

Statistics for data science (Dan Nicolae)

  • A data science case study
  • Foundations of data analysis
  • Statistical inference with resampling methods

Statistical learning (Dan Nicolae)

  • Probability and simulations
  • Regression models and inference
  • Model Complexity
  • Prediction and classification

Basics of machine learning (ML) (Razvan Bunescu)

  • Feature vector representations
  • Occam’s razor for ML, intelligence, and science
  • Overfitting, underfitting, generalization, and regularization
  • ML experiments: training, validation, and testing

Linear models for classification (Razvan Bunescu)

  • Logistic regression, softmax, and temperature
  • ML algorithms in Python: the sklearn library
  • Linear vs. non-linear classification and deep learning

Applied Deep Learning (Gabriel Terejanu)

  • From linear models to neural networks
  • Why deep neural networks?
  • What is an embedding? 
  • How to make use of pre-trained models?

Basics of Large Language Models (LLMs) (Razvan Bunescu)

  • Subword tokenization, word embeddings, and neural language models (LMs)
  • Encoder, encoder-decoder, and decoder LMs
  • Pre-training and fine-tuning

Prompting techniques and LLM APIs (Razvan Bunescu)

  • Zero-shot and few-shot in-context learning
  • Chain-of-thought prompting, retrieval augmented generation, and ReAct
  • The chat completion API and LangChain

Data-driven decision making (Gabriel Terejanu)

  • Introduction to A/B testing for decision making
  • Designing effective A/B tests
  • Analytical techniques in A/B testing

Causal AI (Gabriel Terejanu)

  • Importance of causality in AI
  • What is a causal model?
  • What is an intervention?
  • How to estimate causal effects?

Time series analysis and forecasting (Jože Rožanec)

  • Introduction to time series
  • Analysis tools and real-world examples
  • Time series forecasting 

Time series: Forecasting, XAI, and databases (Jože Rožanec)

  • Using network models to represent and forecast time series
  • Introduction to explainability methods
  • Introduction to time series databases

Graph data management (Dumitru Roman)

  • Intro to graph data structure
  • Knowledge Graphs
  • Graph data management (graph databases with Noe4j, graph data model, graph construction and querying)

Graph data analytics (Daniel Schroeder, Dumitru Roman)

  • Complex Network and Graph Analysis in IGraph
  • Intro to Graph Neural Networks in PyG
  • Graph data visualization

Data enrichment (Roberto Avogadro, Dumitru Roman)

  • Data linking
  • Tabular data enrichment 
  • Human-in-the-Loop (HITL) for data enrichment

LLMs-based data processing (Ioan Toma)

  • Introduction to LLM-based Data Processing
  • Knowledge Extraction using LLMs
  • Document Classification, Summarization and Comparison using LLMs

Conversational AI (Ioan Toma)

  • Conversational AI setup and designing a chatbot interface
  • Semantic Knowledge Graphs and their role in Conversational AI
  • Building a chatbot using Onlim Conversational AI framework

LLM-based agents for data science (Hui Song)

  • Use of ChatGPT Data Analyst to process data files, generate data processing code
  • Development of LLM-based agents for multi-phase data processing tasks
  • Multi-agents for complex and collaborative data processing

Findable, Accessible, Interoperable, Reusable (FAIR) data (Anna Fensel)

  • Introduction to FAIR data. Examples from agri-food and health domains
  • How to make data FAIR? Open data, closed data and everything in between
  • Research data infrastructures

Best practices in data sharing (Anna Fensel)

  • Legal compliance (GDPR, AI Act, Data Act)
  • Consent, contracts and licenses, empowered with knowledge graphs
  • Incentivising data sharing

High performance data processing (Radu Prodan)

  • Parallel computing architectures
  • Multiprocessing
  • Parallel algorithms
  • Parallel computing for AI and data science

Data pipeline scheduling on the computing continuum (Nikolay Nikolov)

  • Introduction to pipeline scheduling in the context of big data and distributed applications 
  • The importance and challenges of pipeline scheduling
  • Solutions and practical approaches to pipeline scheduling

Operationalizing data and ML pipelines (Wiktor Sowinski-Mydlarz)

  • Contemporary Data Processing
  • GATE Institute Data Platform
  • Alternatives and Decisions
  • The Lifecycle

Deployment, orchestration, monitoring of data and ML pipelines (Wiktor Sowinski-Mydlarz)

  • Data Spaces: Decentralized Supply and Consumption of Data Services
  • Private Cloud for Big Data Processing
  • Platform support
  • Resources: Free products and software bibles

Software (preliminary): Software tools/services to be used during the sessions include: